Teaching stats and software

12/03/2009

Forestry deals with variability, and variability is the province of statistics. The use of statistics permeates forestry: we use sampling for inventory purposes, we use all sorts of complex linear and non-linear regression models to predict growth, linear mixed models are the bread and butter of the analysis of experiments, etc.

I think it is fair to expect foresters to be at least acquainted with basic statistical tools, and we have two courses covering ANOVA and regression. In addition, we are supposed to introduce/reinforce statistical concepts in several other courses. So far so good, until we reach the issue of software.

During the first year of study, it is common to use MS Excel. I am not a big fan of Excel, but I can tolerate its use: people do not require much training to (ab)use it and it has a role in introducing students to some of the ‘serious/useful’ functions of a computer; that is, beyond gaming. However, one can hit Excel’s limits fairly quickly, which (together with the lack of an audit trail for the analyses and the need to repeat all the pointing and clicking every time we need an analysis) makes looking for more robust tools very important.

Our current robust tool is SAS (mostly BASE and STAT, with some sprinkles of GRAPH), which is introduced in second year during the ANOVA and regression courses. SAS is a fine product, however:

  • We spend a very long time explaining how to write simple SAS scripts. Students forget the syntax very quickly.
  • SAS’s graphical capabilities are fairly ordinary and not at all conducive to exploratory data analysis.
  • SAS is extremely expensive, and it is doubtful that we could afford to add the point-and-click module.
  • SAS tends to define the subject; I mean, it adopts new techniques very slowly, so there is the tendency to do only what SAS can do. This is unimportant for undergrads, but it is relevant for postgrads.
  • Users tend to store data in SAS’s own format, which introduces another source of lock-in.

In my research work I use mostly ASReml (for specialized genetic analyses) and R (for general work), although I am moving towards using ASReml-R (an R library that interfaces with ASReml) to have a consistent work environment. For teaching I use SAS to be consistent with second year material.

Considering the previously mentioned barriers for students, I have started playing with R-commander, a cross-platform GUI for R created by John Fox (the writer of some very nice statistics books, by the way). As I see it:

  • Using it in command mode is no more difficult than using SAS.
  • We can get R-commander to start working right away with simple(r) methods, while maintaining the possibility of moving to more complex methods later by typing commands or programming.
  • It is free, so our students can load it onto their laptops and keep on using it when they are gone. This is particularly relevant for international students: many of them will never see SAS again in their home countries.
  • It allows an easy path to data exploration (pre-requisite for building decent models) and high quality graphs.
  • R is open source and easily extensible.

I think that R would be an excellent fit for teaching; nevertheless, there would be a few drawbacks, mostly when dealing with postgrads:

  • There are restrictions on the size of datasets (they have to fit in memory), although there are ways to deal with some of the restrictions. On the other hand, I have hit the limits of PROC GLM and PROC MIXED before, and that is where ASReml shines.
  • Some people have an investment in SAS and may not like the idea of using different software.

We will see how it goes because, as someone put it many years ago, there is always resistance to change:

It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage, than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who would gain by the new ones. (Niccolò Machiavelli, The Prince, Chapter 6)

Filed in software, statistics

There are 6 comments in this article:

  1. 12/03/2009, Tore says:

    Nice overview, and you are hitting the nail on the head; we basically use SAS for data administration, though we pay for a lot of facilities we never use. (Read it easily in my RSSOwl; that blackish is mainly for younger eyes, isn’t it?)

  2. 12/03/2009, Luis says:

    Hi Tore. I am just venting my frustration at the status quo and trying to convince people to take a look at some very nice open source alternatives.

    Concerning the template for the blog, I think it looks elegant, although I will concede that it is not the easiest design on the eye. These days I am doing most of my blog reading through Google News (another RSS aggregator). Do you run RSSOwl in Linux? If so, which distro are you using?

  3. 13/03/2009, Tore says:

    The owl sits on my WinXP toolbar at Skogforsk and I’ve thought about introducing it on my Kubuntu (Hardy) laptop at home, but since KDE has a decent reader (Akregator) integrated with Kmail etc., I use that. By the way, I would rather recommend Ubuntu and Gnome for anyone since it’s nicer, even if a little less advanced than Kubuntu. And the Ubuntu family also contains server setups for almost any purpose. And there is a clever community if you get stuck, like people usually do; the others may well choose Debian of course, the mother of Linux!

    (thanks btw for that blogtip regarding GeSHi, have made an asreml.php file, use it with an Italian Txp plugin, worked at first attempt!)

  4. 13/03/2009, Luis says:

    Great! I will soon be updating the asreml cookbook to include two things: i) new features and syntax of version 3 and ii) asreml-R, so one can use asreml directly from R.

    In my office I use a laptop (a Mac) and a desktop (currently running Windows). I am thinking of running the latter with Linux, so Ubuntu sounds like a good choice. Then I would have a full Unix office!

  5. 19/03/2009, Will Dwinnell says:

    I don’t know if you’re soliciting suggestions, but MATLAB, while not free, is certainly much cheaper than SAS. Also, student versions are available.

  6. 19/03/2009, Luis says:

    Thanks for the suggestion, Will. Several years ago I used MATLAB a lot for personal (meaning non-teaching) applications, mostly for simulation and writing algorithms. Since then I have moved to Python + numerical routines (and then to R) for doing so. I like MATLAB’s language a lot, but it feels to me that stats is something that has been ‘bolted on’ rather than being an integral part of the language.

    Anyway, it seems that my comments have sparked people’s interest in our university, at least to question why we are using some languages today. Do our choices really reflect today’s needs or do they, in fact, reflect historical decisions which are no longer relevant?
