12/02/2004
It has been a while since writing the previous post. Most of my ‘web time’ has been devoted to put together the Forestry in Tasmania site (where I am still doing some work). Anyway, this week I imported an Excel spreadsheet into Splus, where I had defined a few columns as time. I usually do not use time formats in Excel, but this was not my project and I needed the times in Splus. Surprise! All times were there, but with weird dates attached. Then I tried importing dates and, surprise again, dates had this 00:00:00 attached. To be more precise, only the time showed in the column when using the data viewer.
A few visits to the online language reference and to the discussion group archives where useful. Times and dates are stored using the timeDate class, and it is possible to query the variable containing time and date using the functions: hours, minutes and seconds for time, and days, weekdays, months, quarters, years, and yeardays (which gives the number of day within a year) for dates. These functions return vectors. In addition, it is possible to create data frames where each column contains an integer with the following functions: hms (hour, minutes and seconds), mdy (month, day and year) and wdydy (weekday, yearday and year).
timeDate can also be used to transform strings to objects of class timeDate. For example:
x <- timeDate(c("1/1/1998 3:05:23.4",
"5/10/2005 2:15:11.234 PM"))
which then can be queried as:
mdy(x)
hms(x)
wdydy(x)
hours(x)
etc
Filed in programming, statistics
15/01/2004
I wrote a small script in Splus to test the effect of different sampling intensities for mammal browsing in plantations. The function included a call to levelplot, a trellis function that represents three dimensional data using different colours for the z variable. You can see an example below:

The function would run with a normal ‘plot’ function, but it would not produce any output if placed inside my function. However, a call to levelplot in the command line would produce the desired results. Quick email to the Splus discussion group and in ten minutes I had the answer: “Trellis functions don’t actually plot anything, but return an object; printing that object actually creates the plot. Outside of a function the ‘printing’ happens automatically, but inside the function you have to do it explicitly” (from an email by Tim Hesterberg, Insightful Corp.). Thus, something like print(levelplot(…)) does the trick.
Filed in programming, statistics
6/12/2003
This week I spent three days in Melbourne, attending a meeting of the Research Working Group on Forest Modelling and Information Systems (RWG2). It was my first experience in this group, having participated in meetings of RWG1 and RWG7 in the past. One of the nice things of working in a new area is that there is no sense of respect for the ‘big names’. There is no ‘they were my lecturers’ or ‘I have read their papers’, so meetings and relationships are much freer and relaxed.
People were very nice and welcoming, and I was able to put faces to names like Cris Brack (Australian National University), Ian Ferguson (University of Melbourne), Chris Goulding (New Zealand Forest Research), Valerie LeMay (University of British Columbia) and Jerry Vanclay (Southern Cross University). There were many other ’small names’ (including mine) and we have quite a bit of fun and exchanged lots of ideas. Thanks guys!
I always have lots of fun presenting and this time was no exception, talking about spatial models for modelling mammal browsing in plantations. This is one of my current pet topics (pun intended), I mean spatial models, not only for browsing, and I hope to have something in publishable form soon.
Filed in environment, forestry, geocoded, research, statistics
31/10/2003
I did some work in analysis of longitudinal data around 1998 and 1999, mostly modelling variance structures for forest progeny tests (see my articles 7, 8 & 10 in the equations part of this site). However, I never was interested in fixed effects like treatments, so did not read much about it. Well, times are different, and I am analysing a couple of trials that do provide longitudinal data and where the main focus is the treatment effect.
Back to the books, and I am revisiting an old favourite of mine: Analysis of Longitudinal data by Peter Diggle, Kung-Yee Liang and Scott Zegger. This is an excellent book for refreshing concepts, and not only covers hard-core analysis, but also exploratory approaches and a few interesting examples (I have the first edition, so there could be some changes on the second edition).
Yesterday, I wanted to repeat the analysis of the Sitka Spruce dataset, just to check if I was properly translating the concepts from book to software. First problem, I needed the dataset. The good news: there is a file (lda.dat) in Peter Diggle’s site containing all the datasets used in the book (see under public-domain longitudinal and time series datasets). The bad news: all datasets are in one file without any reasonable structure. Solution: the ugly news. I wrote the crummiest Python script ever (here for a laugh — right click to save it) to recover the Sitka Spruce data and reformat it, obtaining sitka.csv (23 Kb, a Comma Separated Values) file that can be read into anything. Now is time to analyse the data.
P.S. Four hours later, I found Chiung-Yu Huang’s web site. There are some notes for a course he runs on longitudinal analysis, and includes the datasets for the book, although in a ‘raw form’ (just chopped lda.dat). I should have a look at the notes, though.
P.S.2Now three hours later I found Francesca Dominici’s course notes, that seem to cover most of the book (including Stata, SAS and Splus code for the examples). Great!
Filed in genetics, statistics
28/10/2003
One of the “style” changes when moving from quantitative genetics to biometrics was the idea of crossvalidation: subsetting datasets to test model fit. Given the expense of obtaining data it did not make any sense to me; the trade off between some independent comfort and not using all data was… well, disturbing. This morning I was reading “Does cross validation provide additional information in the evaluation of regression models?” by Kozak and Kozak (Can. J. For. Res. 33: 976-987). At last, someone showing some sense: These results suggest that cross validation by data splitting and double cross validation provide little, if any, additional information in the process of evaluating regression models
.
Now I can continue evaluating models without the need to justify why did not I split the dataset. Some papers just make me smile.
Filed in forestry, statistics
25/10/2003
Scamander, Tasmania
I spent a few days based in Scamander, North Eastern Tasmania. The drive up North with Rebecca and Andrew was quite pleasant; only three hours and we reached the first stopover. Note: this was written last Tuesday, but there was no way to connect to internet from the hotel.
In the last couple of months I came across two papers about the analysis of progeny trials that require some comment: one on diallels and one on clonal data. Both of them expand on the tricks and contortions required to run those analysis using SAS, specifically using proc mixed and proc iml.
I was quite surprised about the transformation of computational limitations of a specific piece of software (namely SAS) into research papers. Both problems can be tackled in one line of model specification in ASReml. Potentially, one could write a macro in MSWord for BLUP analyses, but is it worth doing it? So what is going on? It seems that being provincial and sticking to old school models pays on terms of boosting publication records. Please guys, just visit the ASReml cookbook and get a life!
Filed in geocoded, research, statistics
20/10/2003
I was conducting some preliminary analysis of regeneration data: counts of seedlings in a series of plots. The distribution was certainly non-normal, with a strong ‘Poisson look’, so I went for a generalised linear model with a Poisson family and a log link. Lately, I’ve been running analysis in parallel using Splus and Genstat. I was quite surprised with the results: the coefficients of the model were different but the predicted values were identical. After meditating about the wonders of statistical software (read #@$) for a while, I did remember about the default Helmert contrasts in Splus. In Helmert contrasts, the jth linear combination is the difference between the j+1st and the average of the first j levels. However, most statistical software (including Genstat and SAS) reports the results using Treatment contrasts.
Note to self: use treatment contrasts as default, so results from Splus are comparable with the ones obtained with other software. Change the defaults using the following code:
options(contrasts = c(factor="contr.treatment",
+ ordered="contr.poly"))
Filed in statistics
14/10/2003
One of the big changes coming from quantitative genetics is the emphasis on fixed over random effects. In genetics the idea is to partition observed variance in different effects that can be related to “causes” (additivity, dominance, etc). Now variances are mostly nuisance parameters, and the main objective is to estimate the effect of different treatments (e.g. fertilisers or silviculture) on growth. For now I am using the typical Likelihood Ratio Test for the variances and Wald’s tests for the fixed effects. This works OK under Restricted (or residual) Maximum Likelihood (REML), but the problem is how to run multiple comparisons or specific contrasts, where I need to recur to approximations. This seems to be the price to pay for the flexibility of working with unbalance and accessing many different covariance structures.
So, what are the approaches that I am using at the moment? In Splus, I use lme to fit the models and then create whatever contrast matrix I want and use the L option in anova to test that contrast. In GenStat, I use reml with vkeep to save some of the output to then use pairtest. I need to figure out how to save the level of replication for each level and then I could use allpairwise. It seems that using these approximations is not a capital sin, but just a peccadillo.
Filed in statistics
13/10/2003
It has been a while, months actually, since the last note. It hasn’t been lack of motivation, but actually many things going on at the same time. Starting on mid-July and going through August there were quite a few field trips across Tasmania, participating in the measurement calendar of trials for this year. Some days were cold, windy and rainy (even with some snow), while others were full of sunshine.
It was a time for programming too, switching from my old, trusty Python to Visual Basic 6 (we will later switch to VB.net). As a language, I think that Python is much better, but it is much more difficult to deploy than Visual Basic, especially if I needed a decent GUI. At the same time, I decided to learn VB.net, which seems to be a much more civilised language. I purchased a copy of Programming Microsoft Visual Basic .NET (Core Reference) (link to Amazon.com) that, with over 1,600 pages, does quite a good job at explaining the intricacies of VB.net. When reading the comments of users in Amazon, it appears to be a good choice for C# programmers too.
I have also been quite active running longitudinal analyses for designed experiments using Splus and GenStat.
P.S. 2008. At the end I did not need to get into VB.net.
Filed in miscellanea, programming, statistics
1/08/2003
Yesterday I wanted to generate a virtual forest with aggregated spatial distribution (clusters of trees) to test different sampling schemes. I started with the number of clusters defined by a Poisson distribution and then randomly assigning trees to each cluster with probability p and not clustering them with probability 1 – p. The position around the clusters was determined by a random allocation using polar coordinates; i.e., randomly generating an angle (between 0 and 2*PI) and a proportion of a maximum radius rand()*R. Adrian pointed out to me that the procedure was not uniformly random in the circle, with higher density near the centre than further away.
If you like analogies, tracing a large number of lines with a pencil at various angles, but always though the centre of a circle, leaves more gray near the centre of the circle than near the border of it. Same thing happens with the spokes of a wheel. To make a long story short, this can be corrected taking the square root of the uniform random number that is used to multiply by the radius. Then, the proportion of the maximum radius is sqrt(rand())*R. Note to self: always take the square root when randomly assigning points in a circle.
Filed in programming, statistics