Wednesday, 8 October 2014

A framework for benchmarking of homogenisation algorithm performance on the global scale - Paper now published

By Kate Willett reposted from the Surface Temperatures blog of the International Surface Temperature Initiative (ISTI).

The ISTI benchmarking working group have just had their first benchmarking paper accepted at Geoscientific Instrumentation, Methods and Data Systems:

Willett, K., Williams, C., Jolliffe, I. T., Lund, R., Alexander, L. V., Brönnimann, S., Vincent, L. A., Easterbrook, S., Venema, V. K. C., Berry, D., Warren, R. E., Lopardo, G., Auchmann, R., Aguilar, E., Menne, M. J., Gallagher, C., Hausfather, Z., Thorarinsdottir, T., and Thorne, P. W.: A framework for benchmarking of homogenisation algorithm performance on the global scale, Geosci. Instrum. Method. Data Syst., 3, 187-200, doi:10.5194/gi-3-187-2014, 2014.

Benchmarking, in this context, is the assessment of homogenisation algorithm performance against a set of realistic synthetic worlds of station data where the locations and size/shape of inhomogeneities are known a priori. Crucially, these inhomogeneities are not known to those performing the homogenisation, only those performing the assessment. Assessment of both the ability of algorithms to find changepoints and accurately return the synthetic data to its clean form (prior to addition of inhomogeneity) has three main purposes:

1) quantification of uncertainty remaining in the data due to inhomogeneity
2) inter-comparison of climate data products in terms of fitness for a specified purpose
3) providing a tool for further improvement in homogenisation algorithms

Here we describe what we believe would be a good approach to a comprehensive homogenisation algorithm benchmarking system. This includes an overarching cycle of benchmark development; release of formal benchmarks; assessment of homogenised benchmarks; and an overview of where we can improve next time around (Figure 1).

Figure 1: Overview of the ISTI comprehensive benchmarking system for assessing performance of homogenisation algorithms. (Fig. 3 of Willett et al., 2014)

There are four components to creating this benchmarking system.

Creation of realistic clean synthetic station data
Firstly, we must be able to synthetically recreate the 30000+ ISTI stations such that they have the same variability, auto-correlation and inter-station cross-correlations as the real data but are free from systematic error. In other words, they must contain a realistic seasonal cycle and features of natural variability (e.g., ENSO and volcanic eruptions). There must be realistic month-to-month persistence in each station and geographically across nearby stations.
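The paper does not prescribe an implementation; as a rough illustration of the ingredients named above, here is a minimal Python sketch of a single clean station series built from a seasonal cycle plus AR(1) noise for month-to-month persistence. All function names and parameter values here are my own choices, not from the paper, and a real benchmark would also need inter-station cross-correlations.

```python
import numpy as np

def synthetic_station(n_months=1200, seasonal_amp=10.0, ar1=0.3,
                      noise_sd=1.0, seed=0):
    """One clean synthetic monthly temperature series (100 years):
    a 12-month seasonal cycle plus AR(1) noise for persistence."""
    rng = np.random.default_rng(seed)
    months = np.arange(n_months)
    # Seasonal cycle: a single harmonic with a 12-month period
    seasonal = seasonal_amp * np.sin(2 * np.pi * months / 12)
    # AR(1) noise: each month's anomaly partly persists into the next
    noise = np.zeros(n_months)
    eps = rng.normal(0.0, noise_sd, n_months)
    for t in range(1, n_months):
        noise[t] = ar1 * noise[t - 1] + eps[t]
    return seasonal + noise

series = synthetic_station()
```

A full global recreation would additionally draw spatially correlated noise across nearby stations, for example from a covariance model fitted to the real ISTI network.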

Creation of realistic error models to add to the clean station data
The added inhomogeneities should cover all known types of inhomogeneity in terms of their frequency, magnitude and seasonal behaviour. For example, inhomogeneities could be any or a combination of the following:

- geographically or temporally clustered due to events which affect entire networks or regions (e.g. change in observation time);
- close to end points of time series;
- gradual or sudden;
- variance-altering;
- combined with the presence of a long-term background trend;
- small or large;
- frequent;
- seasonally or diurnally varying.
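As a toy illustration of the simplest element of such an error model, the sketch below (my own construction, not the paper's) adds a sudden step change of known size and location to a clean series, recording the truth so an assessor can later score detections against it.

```python
import numpy as np

def add_break(clean, breakpoint, size):
    """Add a sudden step change of `size` to all values before
    `breakpoint` (e.g. mimicking a station relocation), and return
    the corrupted series plus the true changepoint for assessment."""
    corrupted = clean.copy()
    corrupted[:breakpoint] += size  # shift the earlier segment
    return corrupted, breakpoint

clean = np.zeros(100)
bad, cp = add_break(clean, breakpoint=60, size=-0.5)
```

Gradual, variance-altering, seasonally varying or network-wide inhomogeneities would each need their own generator in the same spirit.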

Design of an assessment system
Assessment of the homogenised benchmarks should be designed with the three purposes of benchmarking in mind. Both the ability to correctly locate changepoints and the ability to adjust the data back to its homogeneous state are important. Assessment can be split into four levels:

- Level 1: The ability of the algorithm to restore an inhomogeneous world to its clean world state in terms of climatology, variance and trends.

- Level 2: The ability of the algorithm to accurately locate changepoints and detect their size/shape.

- Level 3: The strengths and weaknesses of an algorithm against specific types of inhomogeneity and observing system issues.

- Level 4: A comparison of the benchmarks with the real world in terms of detected inhomogeneity both to measure algorithm performance in the real world and to enable future improvement to the benchmarks.
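To make Levels 1 and 2 concrete, here is a hedged sketch of two possible scores: a root-mean-square error between the homogenised and clean series (Level 1), and a hit/false-alarm count for detected changepoints within a tolerance window of the true ones (Level 2). These particular metrics are my own illustration; the paper leaves the specific error measures open.

```python
import numpy as np

def level1_rmse(clean, homogenised):
    """Level 1: how close is the homogenised series to the clean world?"""
    return float(np.sqrt(np.mean((homogenised - clean) ** 2)))

def level2_hits(true_cps, found_cps, window=12):
    """Level 2: count detections within `window` months of a true
    changepoint (hits) and detections near no true changepoint
    (false alarms)."""
    hits = sum(any(abs(f - t) <= window for f in found_cps)
               for t in true_cps)
    false_alarms = sum(all(abs(f - t) > window for t in true_cps)
                       for f in found_cps)
    return hits, false_alarms

# Toy check: a perfect Level 1 score, and one hit plus one false alarm
rmse = level1_rmse(np.array([0.0, 1.0]), np.array([0.0, 1.0]))
hits, fa = level2_hits(true_cps=[100, 300], found_cps=[105, 512])
```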

The benchmark cycle
This should all take place within a well laid out framework to encourage people to take part and make the results as useful as possible. Timing is important. Too long a cycle will mean that the benchmarks become outdated. Too short a cycle will reduce the number of groups able to participate.

Producing the clean synthetic station data on the global scale is a complicated task that has taken several years, but we are close to completing a version 1. We have collected a list of known region-wide inhomogeneities and a comprehensive understanding of the many different types of inhomogeneity that can affect station data. We have also considered a number of assessment options and decided to focus on Levels 1 and 2 for assessment within the benchmark cycle. Our benchmarking working group is aiming for release of the first benchmarks by January 2015.

4 comments:

Gregor Vertacnik said...

I've just read the paper. It seems to be a pretty interesting project and a chance for homogenisers to compete a little bit again :))

I would like to ask whether the final results ought to be interpolated (for missing values) or whether only homogenised values are going to be compared?

The paper mentions that a wrong trend sign could be problematic, but I would stress that this is the case only if the true and the homogenised trends have opposite signs and are both statistically significant. On the other hand, if both trends are insignificant, the sign doesn't matter very much.

I'm looking forward to seeing the benchmark dataset :)

Regards,

Gregor

Victor Venema said...

Gregor, does that mean that you would like to compete? That would be great! Everyone is invited.

I had thought that a global dataset is a bit too large to be homogenized with Craddock, though. :-) We will also select some smaller regions where people with less automatic and robust methods can show off their skills.

Filling (and later gridding) will not be studied in this first cycle of the ISTI, is my current understanding. That is a pity, but we are quite limited in manpower, it is basically a volunteer project. Funding agencies find impact studies for cauliflower agriculture more important.

You are right, the sign of the trend in the inhomogeneities does not matter. That should not have slipped through. :-(

Gregor Vertacnik said...

I intend (not sure yet if I will manage to do it) to compete with HOMER, not Craddock, to get some more hints about the reliability of our results regarding Slovenian climate time series.

I didn't mean the sign of the trend in the inhomogeneities, but the trend in the data itself, i.e. the trend in a homogenised time series vs. the trend in the clean-world time series for the same station. For example, you get -0.3 °C/century for the clean-world time series of Ljubljana and +0.2 °C/century after homogenisation. If the uncertainty at the 5 % level is, let's say, +/- 1 °C/century (both trends insignificant), this is very different from an uncertainty of +/- 0.1 °C/century (both trends significant). This applies to Willett et al. (2014), Figure 2. You may consider statistics counting hits or faults regarding the sign and statistical significance of trends (e.g. a trend is either positive significant, negative significant or insignificant).
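[Editor's note: the three-way classification Gregor proposes could be sketched as follows, using an ordinary least-squares slope and its p-value; this is purely an illustration, with names and thresholds of my own choosing, and is not taken from the paper.]

```python
import numpy as np
from scipy import stats

def classify_trend(series, alpha=0.05):
    """Classify a series as 'positive', 'negative' or 'insignificant'
    from the OLS slope of value against time and its p-value."""
    t = np.arange(len(series))
    result = stats.linregress(t, series)
    if result.pvalue >= alpha:
        return "insignificant"
    return "positive" if result.slope > 0 else "negative"

# A clearly warming toy series: +0.01 per month plus small noise
rng = np.random.default_rng(1)
warming = 0.01 * np.arange(240) + rng.normal(0.0, 0.1, 240)
label = classify_trend(warming)
```

A hit/fault statistic would then compare these labels between each homogenised series and its clean-world counterpart.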

Best regards,

Gregor

Victor Venema said...

Yes, the trend in the data itself is also not relevant (for relative homogenization methods).

I looked at Figure 2 again and think I now understand what you want to say. Yes, you need to take the uncertainty into account and because we work world wide, we probably also have to take the spatial variability in the climate and in the non-climatic changes into account. The results will depend on such considerations.

We have not worked much on the validation part. We did formulate some principles, but no specific error measures yet.