## Friday, October 4, 2013

### The Correlation Coefficient R and the reduction of range.

Those who followed my series on climate change might remember that I would take an area I could define as a rectangle in longitude and latitude and track its average temperature by year, comparing equivalent seasons. (To be precise, if a longitude/latitude area includes either the North or South Pole, it is more like a slice of pie than a rectangle.) I then split the time span from 1955 to 2010 into four eras based on the El Niño/La Niña cycle, each era starting and ending in a strong La Niña year.

This particular region and season, Siberia in Spring, has a clearly increasing trend of the median temperature, the dotted red line moving up in four separate steps. The lowest temperature registered only moves upward twice and the maximum average temperature takes a step down in the era of 1999-2010, as the highest temperature was registered back in the 1990s.

Another method would be to add a line of regression, also known as a trendline, a predictor line or the line of least squares. Excel has an option which gives the equation of the line and the coefficient of determination R². R is called the correlation coefficient and it varies between -1 and 1. R² must be between 0 and 1 and is often read as the proportion of the variance that the line explains. In this case, .3655 would be the proportion we would assign to the general increase we see in the temperatures, while the fluctuations account for the remaining .6345.
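For readers who would like to see the arithmetic behind a trendline, here is a minimal sketch in Python of an ordinary least-squares fit and its R². The function name `fit_trendline` and the six data points are made up for illustration; they are not the Siberian temperatures and not anything taken from Excel.

```python
def fit_trendline(xs, ys):
    """Return (slope, intercept, r_squared) for the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, R^2 is the squared correlation of x and y.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up numbers for illustration only.
years = [0, 1, 2, 3, 4, 5]
temps = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
slope, intercept, r2 = fit_trendline(years, temps)
print(f"y = {slope:.4f}x + {intercept:.4f}, R^2 = {r2:.4f}")
```

The number multiplying x is the slope, and `1 - r2` is the share of the variation left to the fluctuations around the line.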

This doesn't sound very convincing, but if R² = .3655, R in this case would be +.6046, a value that nearly every standard of correlation considers high, though it doesn't meet the rule-of-thumb threshold for very high, which would be over .8.
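Going from R² back to R is just a square root, with the sign taken from the slope of the line. A small sketch, where the helper name `r_from_r_squared` is my own invention and the inputs are the two values quoted for the full 1955 to 2010 span:

```python
import math


def r_from_r_squared(r_squared, slope):
    """Recover R from R^2, giving it the same sign as the trendline slope."""
    return math.copysign(math.sqrt(r_squared), slope)


# R^2 = .3655 with an upward slope of .0417 per year.
r = r_from_r_squared(0.3655, 0.0417)
print(round(r, 4))  # 0.6046
```

A downward sloping line with the same R² would give -.6046 instead.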

This is one of the many reasons I don't love using the predictor line and the correlation coefficient. The statements of confidence seem arbitrary: I know of three different systems, and they disagree radically on whether a given R score is strong. There is also a way to cherry pick data in both directions, to show either more correlation or less.

Generally, though not always, taking a subset of a sample by restricting the range will reduce R and R². For example, if we look at the first Consistent Oceanic Niña Interval from 1955 to 1975, we see somewhat less overall increase (the slope, the number multiplying x, went from .0417 to .0346) and a drastic drop in R² from .3655 to .05097. Here I can say without fear of contradiction that the correlation is not impressive.
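The effect of restricting the range can be reproduced on a toy series. The sketch below is an assumption-laden illustration, not the Siberian data: it builds an artificial record that rises a steady .04 per year with a fixed up-and-down wobble of one degree, then compares R² for all 56 years against R² for the first 21 years alone.

```python
def r_squared(xs, ys):
    """Squared correlation between xs and ys (R^2 of the least-squares line)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Artificial series: same underlying trend everywhere, wobble of +1 / -1
# in alternating years. Nothing here is measured data.
years = list(range(1955, 2011))
temps = [0.04 * (y - 1955) + (1.0 if (y - 1955) % 2 == 0 else -1.0)
         for y in years]

full = r_squared(years, temps)            # all 56 years
early = r_squared(years[:21], temps[:21])  # 1955 through 1975 only
print(f"full range R^2 = {full:.3f}, restricted range R^2 = {early:.3f}")
```

The trend built into the toy series never changes, yet the shorter window gives a much smaller R², because the wobble stays the same size while the total rise across the window shrinks.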

In our second interval, R² is stronger at .19237, but still well below the larger set's value of .3655. It could be considered moderately strong by some measures, but notice that here the trend shows the region cooling. (It really is a coincidence that the first year of this era shows a large jump in temperature over the previous year.)

In the third interval, we again see a downward sloping trendline and an extremely weak R² value of .01592, which is to say nearly no correlation.

In the fourth interval, yet again, a small downward slope and a low R² value.

In my view, the problem is cherry picking in both directions. People who wish to downplay or deny warming temperatures can take smaller samples, but when they do, the R² value will often give little confidence in the trend they try to show. On the other hand, people wanting to show strong evidence of warming have a natural advantage of generally higher R² scores in larger data sets. To be fair, in this particular set it is impossible to create a subset longer than thirty years that doesn't show a warming trend, though the trend can be minimized, and so can the correlation coefficient.
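A claim like the one about subsets longer than thirty years can be checked by brute force: fit a line to every run of consecutive years of at least a given length and look at the smallest slope found. The sketch below shows one way to do that. The function names are hypothetical and the series is an artificial stand-in (a .04 per year rise with an alternating wobble), so it says nothing about Siberia; it only shows the method.

```python
def slope(xs, ys):
    """Slope of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def lowest_slope_window(xs, ys, min_points):
    """Try every run of consecutive points with at least min_points entries
    and return (slope, first_x, last_x) for the run with the lowest slope."""
    best = None
    n = len(xs)
    for start in range(n - min_points + 1):
        for end in range(start + min_points, n + 1):
            s = slope(xs[start:end], ys[start:end])
            if best is None or s < best[0]:
                best = (s, xs[start], xs[end - 1])
    return best


# Artificial series for illustration only, not measured temperatures.
years = list(range(1955, 2011))
temps = [0.04 * (y - 1955) + (1.0 if (y - 1955) % 2 == 0 else -1.0)
         for y in years]

lowest = lowest_slope_window(years, temps, 31)  # runs of more than 30 years
print(f"lowest slope {lowest[0]:.4f} for {lowest[1]} to {lowest[2]}")
```

If the lowest slope over all such windows is still positive, no choice of start and end years of that length can turn the trend into cooling.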

If anyone is coming to the blog for the first time, you should know that I am not a denier of the general warming trend in temperatures around the globe in my lifetime, which started in the Strong La Niña year of 1955. What I hope for is a discussion where both sides can agree on terms and methods and avoid cherry picking at all costs. I realize this hope may very well be in vain, and yet I hold on to it.