Archive of articles classified as “statistics”


Default to R

8/09/2009

Last March I posted an explanation of the issues behind getting R accepted in our School for teaching statistics.

At the School level, I needed to spend substantial time compiling information to prove that R could satisfy my colleagues’ statistical needs. Good selling points were lme4, lattice/ggplot and pointing my most statistically inclined colleagues to CRAN. Another important issue was the ability to have a GUI (Rcmdr) that could be adapted to our specific needs. We will develop an extra menu item to fit non-linear models for growth models used in forestry. Our School has now adopted R as the default software for teaching any statistical content during the four years of the curriculum.

At the university level, my questions to the department of Mathematics and Statistics sparked a lot of internal discussion, which resulted in R being adopted as the standard software for some of the second year courses (it was already the standard for most courses in 3rd and 4th year). The decision was not unanimous, particularly because for statisticians knowing SAS is one of those ‘must be in the CV’ skills, but they went for change. The second year courses are offered across colleges, which makes the change very far reaching. These changes will also imply that in the near future many computers in the university will come with R preinstalled.

It is nice to see interesting changes once in a while.

Filed in software, statistics No Comments

Using R for genetic analyses

11/06/2009

As some people know, I have been using ASReml for genetic analyses for quite a few years and even keep the ASReml Cookbook. I was quite happy to see the development of asreml-R, a package that makes available most of ASReml’s functionality from R. This made my life easier: I still use plain-vanilla ASReml for very big jobs, but I can access a much more comprehensive statistical system for fairly substantial jobs.

One of my main problems with asreml-R is that it is not available for OS X (Mac). Yes, I can dual-boot or use a virtual machine, but both options are a bit of a pain. I would rather use my computer with its primary operating system and no strange overheads. I have asked several times for a Mac version. It seems that the code can be compiled without problems, but the license management software is not available for the Mac.

I then started looking for options to run genetic analyses. nlme was designed around hierarchical models and fitting experimental designs did not feel right. lme4 is looking good; the main issue was fitting pedigrees, a matter at least partially solved by the pedigreemm package. I then came across the MCMCglmm package, which has some nice features: it makes Bayesian analyses accessible, has ready support for pedigrees and uses a syntax not that different from asreml-R.

After playing with the MCMCglmm library, I found that I could not use pedigrees with parents acting both as males and females. I modified the code (line 26 of inverseA.R) to print a warning rather than to stop, and then compiled the library again. Voilà! It works (the beauty of having access to the source).


R CMD INSTALL /Users/lap44/Downloads/MCMCglmm --library=/Users/lap44/Library/R/2.9/library

By the way, ASReml is still my primary tool at the moment, but I enjoy having good alternatives.

Filed in genetics, mac, software, statistics 2 Comments

Teaching stats and software

12/03/2009

Forestry deals with variability and variability is the province of statistics. The use of statistics permeates forestry: we use sampling for inventory purposes, we use all sorts of complex linear and non-linear regression models to predict growth, linear mixed models are the bread and butter of the analysis of experiments, etc.

I think it is fair to expect foresters to be at least acquainted with basic statistical tools, and we have two courses covering ANOVA and regression. In addition, we are supposed to introduce/reinforce statistical concepts in several other courses. So far so good, until we reach the issue of software.

During the first year of study, it is common to use MS Excel. I am not a big fan of Excel, but I can tolerate its use: people do not require much training to (ab)use it and it has a role in introducing students to some of the ‘serious/useful’ functions of a computer; that is, beyond gaming. However, one can hit Excel’s limits fairly quickly, which–together with the lack of an audit trail for the analyses and the need to repeat all the pointing and clicking every time we need an analysis–makes looking for more robust tools very important.

Our current robust tool is SAS (mostly BASE and STAT, with some sprinkles of GRAPH), which is introduced in second year during the ANOVA and regression courses. SAS is a fine product, however:

  • We spend a very long time explaining how to write simple SAS scripts. Students forget the syntax very quickly.
  • SAS’s graphical capabilities are fairly ordinary and not at all conducive to exploratory data analysis.
  • SAS is extremely expensive, and it is dubious that we could afford to add the point and click module.
  • SAS tends to define the subject: it adopts new techniques very slowly, so there is a tendency to do only what SAS can do. This is unimportant for undergrads, but it is relevant for postgrads.
  • Users tend to store data in SAS’s own format, which introduces another source of lock-in.

In my research work I use mostly ASReml (for specialized genetic analyses) and R (for general work), although I am moving towards using ASReml-R (an R library that interfaces ASReml) to have a consistent work environment. For teaching I use SAS to be consistent with second year material.

Considering the previously mentioned barriers for students I have started playing with R-commander, a cross-platform GUI for R created by John Fox (the writer of some very nice statistics books, by the way). As I see it:

  • Its use in command mode is no more difficult than SAS’s.
  • We can get R-commander to start working right away with simple(r) methods, while maintaining the possibility of moving to more complex methods later by typing commands or programming.
  • It is free, so our students can load it onto their laptops and keep on using it after they leave. This is particularly true for international students: many of them will never see SAS again in their home countries.
  • It allows an easy path to data exploration (pre-requisite for building decent models) and high quality graphs.
  • R is open source and easily extensible.

I think that R would be an excellent fit for teaching; nevertheless, there would be a few drawbacks, mostly when dealing with postgrads:

  • There are restrictions to the size of datasets (they have to fit in memory), although there are ways to deal with some of the restrictions. On the other hand, I have hit the limits of PROC GLM and PROC MIXED before and that is where ASReml shines.
  • Some people have an investment in SAS and may not like the idea of using different software.

We will see how it goes because–as someone put it many years ago–there is always resistance to change:

It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage, than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who would gain by the new ones.—Niccolò Machiavelli, The Prince, Chapter 6.

Filed in software, statistics 6 Comments

Multivariate simulation for a breeding program

13/01/2009

If you haven’t found something strange during the day, it hasn’t been much of a day—John Archibald Wheeler.

No one can retell the plot of a Cortázar story; each one consists of determined words in a determined order. If we try to summarize them, we realize that something precious has been lost—Jorge Luis Borges

The core of multivariate simulation for a breeding program is the generation of observations that follow a given covariance matrix V. Using the Cholesky decomposition (so V = C′C) one can easily generate the desired distribution. I use the R function core.sim as the basic building block for creating base populations and progeny tests.

# core.sim generates n.obs observations, which follow a
# n.vars multivariate normal distribution with mean 0
# and variance C′C. That is, it takes the Cholesky
# decomposition C of a covariance matrix as argument.
# This function is used by all base population and progeny
# testing functions.
core.sim <- function(C, n.obs, n.vars) {
    N <- matrix(data = rnorm(n.obs * n.vars),
                nrow = n.vars, ncol = n.obs)
    S <- t(C %*% N)
    return(S)
}
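As an aside, the same trick can be sketched in Python with NumPy (an illustrative translation of core.sim, not code from this post; note that numpy.linalg.cholesky returns the lower-triangular factor L, with V = LL′, so here the factor is applied as L rather than its transpose):

```python
import numpy as np

def core_sim(C, n_obs, n_vars, rng=None):
    """Generate n_obs rows from N(0, C @ C.T), where C is a
    lower-triangular Cholesky factor of the target covariance."""
    rng = np.random.default_rng() if rng is None else rng
    N = rng.standard_normal((n_vars, n_obs))
    return (C @ N).T  # rows = observations, columns = traits

# Target covariance for two correlated traits
V = np.array([[1.0, 0.5],
              [0.5, 2.0]])
L = np.linalg.cholesky(V)          # V = L @ L.T
S = core_sim(L, n_obs=200_000, n_vars=2)
print(np.cov(S, rowvar=False))     # close to V
```

With 200,000 draws the sample covariance should reproduce V to a couple of decimal places, which is an easy sanity check for this kind of simulator.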

R syntax highlighting courtesy of the WP-Syntax plugin (an interface to GeSHi).

Filed in research, software, statistics No Comments

My favorite statistics books

2/11/2008

A large proportion of my work involves using statistics, mostly analyzing progeny trials or other smaller experimental designs. In terms of statistical techniques, this means generalized linear mixed models (GLMM) with at least one of the random effects having non-independent observations (a pedigree). Variations on the topic include multivariate analyses, longitudinal analyses and spatial analyses. Most of the time the traits are normally distributed, but sometimes I end up with binary or count traits.

I have checked quite a few books and have some favorites:

  • Regression with Graphics: A Second Course in Applied Statistics by Lawrence Hamilton. I have a soft spot for this ‘newbie’ book, which I think presents ideas in a very easy to follow way, always emphasizing understanding the data by using graphics. It is a good beginner’s book.
  • Regression Modeling Strategies by Frank Harrell. This is currently my favorite ‘advanced regression modeling’ book. It contains plenty of practical advice, as well as S+/R code examples to fit almost anything. I learned quite a bit on logistic regression from here, and his example modeling probability of surviving the sinking of the Titanic is a classic. I use a simplified version of this example when teaching.
  • Experimental Design and Data Analysis for Biologists by Gerry Quinn and Michael Keough. This is one of those ‘statistics for biologists’ books, but with a big difference: it assumes that the reader is intelligent. It covers a lot of ground, always presenting a (relatively) modern approach to design and analysis. It covers statistical power for ecological experiments in a very satisfactory (and clear) way. In my opinion, it is much more useful than Zar’s and Sokal’s ‘stats for biologists’ books.
  • Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill. This book–as Hamilton’s–comes from the Social Sciences. It is really enjoyable, has interesting examples and tackles modern approaches, including Bayesian stats. I am currently trying to read it from cover to cover. If one has some background in stats I would recommend starting with this one, as it is quickly becoming one of my favorite books on linear models (my review).
  • Matrix Algebra Useful for Statistics (Wiley Series in Probability and Statistics) by Shayle Searle. I have the original hard cover edition and read it from cover to cover during my Ph.D. I think it gives a very good background if one wants to understand what is going on in ANOVA and regression, particularly when trying to figure out statistical software output (my review).
  • Linear Models for the Prediction of Animal Breeding Values (Cabi Publishing) by R. (what does the R stand for?) Mrode. I have the first edition, which provides a good–although a bit terse–introduction to BLUP (Best Linear Unbiased Prediction) as used in quantitative genetics models.

I will keep on expanding this list when I remember other titles.

Updated: 2008-11-05.

Filed in books, statistics No Comments

Without assumptions

7/03/2008

I do not much like writing on a white background (my eyes hurt). Therefore I have changed the default colours for TeXShop’s editor using the following commands in a Terminal window:


# These lines change background colour (in RGB proportions)
defaults write TeXShop background_R .1
defaults write TeXShop background_G .1
defaults write TeXShop background_B .1
# These lines change the font colour for comments
defaults write TeXShop commentred .99
defaults write TeXShop commentgreen .96
defaults write TeXShop commentblue .90

It is also possible to change transparency of the editor window with something like defaults write TeXShop SourceWindowAlpha 0.85.
I have been working a lot on statistical analyses this week, looking at issues of data relatedness, connectedness between sites and overall data quality measurements. A quote to remember:

Without assumptions there can be no conclusions — John Tukey

Filed in quotes, statistics No Comments

Mental chaos

11/09/2006

The eleventh of September again, but I am not thinking of 2001—which was terrible, I agree—but of 1973. I always stop for a minute to remember, as an old political campaign said: without hate, without fear, without violence. I was six and still remember.

I am preparing lectures, really busy. I took over the students after more than a semester and most of them appear to be clueless. I am going over a general review, again, because they have to learn. I always listen to energetic music when preparing lectures or running analyses. This time it is Rammstein’s ‘Sehnsucht’. But, anyway, I also like to go back a few years, and Yes’s ‘Close to the Edge’ (yes, from 1972) is now coming from the headphones.

I start writing most of the ideas in Writeroom and then copy the lot to TeXShop. Packages included in the document preamble for the notes:

%! program = pdflatex
\documentclass[11pt]{article}
\usepackage{pslatex}
\usepackage{amsbsy}
\usepackage{graphicx}
\usepackage{geometry}
\geometry{a4paper}

Although the Statistics course uses Mendenhall & Sincich’s ‘Regression Analysis’ as the basic text, I am using quite a few other references:

  • Quinn and Keough’s ‘Experimental design and data analysis for biologists’.
  • Searle’s ‘Linear models’.
  • Harrell’s ‘Regression modeling strategies’.
  • Steel and Torrie’s ‘Principles and procedures of statistics’ (the first edition!).
  • Hamilton’s ‘Regression with graphics’.
  • Neter and Wasserman’s ‘Applied linear statistical models’.

In addition I am also preparing grant applications, so if you try to contact me please be understanding: I will not read my email (or act upon any emails) until Wednesday next week (around September 20th).

Moved on to King Crimson’s ‘Discipline’. Now listening to ‘Elephant Talk’… Talk is only talk…

Filed in books, miscellanea, music, statistics, writing No Comments

Current obsessions

21/03/2006

I have slightly changed the focus of my attention during the last month or so. My current obsessions are:

  • Optimisation of breeding programs, for which I am learning to use AMPL, with Fritz’s help. I will need to create and format some data to include in some simulations analysed by AMPL and I think I will use Python to prototype them. If I run into speed bottlenecks I will reimplement numerically intensive processes in either C++ or Fortran 95.
  • Genetics of wood properties. After some early forays into wood properties—which finished when Carolyn changed jobs—I am back at it. John has been very welcoming and we are trying to put a couple of projects together. We should have some early results by the middle of next year.

There are a few bits and pieces that do not fall in these two broad areas, but they will converge pretty soon.

Family-wise

Working with Marcela and Orlando in the veggie patch. I have never had much of a green thumb, but I am really trying. We sowed coriander, parsley and chervil, and planted bok choi, onions, dill, lemon balm and capsicum. Apart from the capsicum seedlings that are struggling (a drainage problem is my guess) everything is doing fine.

Marcela and Orlando checking worm farm

Marcela’s worm farm is the old-new addition. We used to have a worm farm in Australia, but due to quarantine issues, we decided to leave it there. So we needed to get a new one plus order the first batch of worms by mail.

Filed in gardening, genetics, miscellanea, photos, research, statistics No Comments

LIDAR ground truthing

1/08/2005


In the previous post I was hopeful that our simple LIDAR data processing would be able to locate the trees, but I did not have any evidence to confirm that what we saw in the computer was not a fluke.

After postponing field work due to bad weather, last Thursday we had a field crew assessing six GPS-located plots. Today David overlaid the plot data, after differentially correcting it, and the position of the trees estimated with LIDAR and ta-da! The trees pretty much match. There are some random differences (of up to 2 meters) in position, but considering that the stem is not necessarily under the top of the tree and that there was quite a bit of data processing and smoothing, the results seem extremely promising.

We now need to formally estimate the association between positions, explain any errors of detection (trees in the ground but missing in LIDAR or vice versa) and prepare a short report.

Sometimes work is sweet, particularly when reality and models match one another.

Filed in forestry, statistics 1 Comment

First look at LIDAR

21/07/2005

This has been a very busy and intense week. Deviations from routine are — at least most of the time — welcome, and particularly so in this case.

LIDAR is a fascinating technology, which until last week I always thought to be very overrated. People were always presenting pretty pictures, but nothing really useful. Last Friday we had a meeting where we discussed using LIDAR, aerial photography, QuickBird and field measurements in some native forest to obtain stand-level attributes and make thinning decisions. Finally a practical application!

On Monday David, Jamie and I got the text files (with x, y and z coordinates, and intensity). The files were huge, with a total of 150,000 points to describe 10 ha. I first read them with Splus and R, which clearly didn’t much like the size of the matrices. Maybe Splus 7 Developer (with the new pipeline architecture) works better, but I do not have a copy yet.

At the same time, I imported the data into MayaVi (which uses VTK) using an adaptation of this Python script, producing the following visualisation to get a feel for the site. It is easy to interact with the picture, and one can see the road and the difference between old trees and regeneration. However, it is still pretty useless.

LIDAR data in MayaVi

Trying to work with individual (non-grid) points proved fruitless. The size of the problem is huge, and trying to use any of the spatial statistics in the R package spatstat was hopeless. Anything that requires a large distance matrix (150,000 × 150,000) is guaranteed to fail.
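A back-of-the-envelope calculation (my own arithmetic, assuming double-precision storage) shows why a dense distance matrix is out of the question:

```python
# Memory needed for a dense n x n matrix of 8-byte doubles
n = 150_000
bytes_needed = n * n * 8
print(f"{bytes_needed / 1e9:.0f} GB")  # 180 GB
```

That is roughly 180 GB for the matrix alone, far beyond any workstation of the time (and still unpleasant today), so sparse or tiled approaches are the only way forward.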

Meanwhile, Jamie helped David to obtain crown and terrain surfaces using Kriging in SAGA.

LIDAR height surface

After transforming them into grids, and subtracting them to get ‘true’ tree heights, David and I came up with a naive (slow but apparently effective) way of finding the tops of the trees. After a few problems with the implementation and Simon’s help with a SQL query in Manifold, we were able to identify the tops and their respective heights. From there to diameter/height and volume equations is a small step.
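For illustration, a naive top-finder of this flavour could be sketched in Python as below (a hypothetical rendering of the general idea, assuming a gridded canopy height model; the real job used a SQL query in Manifold, and the window size and height threshold here are made-up values):

```python
import numpy as np

def tree_tops(height, window=1, min_height=2.0):
    """Naive tree-top finder: a cell is a top if it is the unique
    maximum of its (2*window+1)^2 neighbourhood and tall enough.
    Slow (explicit loops) but easy to follow."""
    rows, cols = height.shape
    tops = []
    for i in range(window, rows - window):
        for j in range(window, cols - window):
            patch = height[i - window:i + window + 1,
                           j - window:j + window + 1]
            top = height[i, j]
            if (top >= min_height and top == patch.max()
                    and np.count_nonzero(patch == top) == 1):
                tops.append((i, j, float(top)))
    return tops

# Tiny synthetic canopy height model: two bumps on flat ground
chm = np.zeros((7, 7))
chm[2, 2] = 10.0   # first tree top
chm[5, 4] = 8.0    # second tree top
print(tree_tops(chm))  # [(2, 2, 10.0), (5, 4, 8.0)]
```

On a real crown-minus-terrain grid one would tune the window to the expected crown diameter and the threshold to separate regeneration from mature trees.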

Top of trees in small section

Now we need to move to ground truthing and thinning strategies.

PS. 2005-08-01. We are processing data from ground truthing and our work is looking even more promising.

Filed in forestry, software, statistics 2 Comments