This page contains details of the algorithms used in deriving analysis results posted on this blog. In general the algorithms used are very simple. This simplicity has several benefits, but it is important to understand the limitations and accuracy of the algorithms.
An overview is given first, followed by algorithm details.
The following figure, showing synthetic data, illustrates the stages in the procedure followed in reconstructing regional temperature histories from raw station data:
The procedure is based on the following decomposition of each temperature datum:
- Temperature = (typical-value) + deviation
where the “typical-value” applies at the time of the temperature datum in question: it is a moving average over a period nominally centred on that time, except near the boundaries of records, or near periods identified by the procedure as having anomalous changes in temperature.
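The decomposition above can be sketched as follows. This is a minimal illustration rather than the blog's actual code, and the 11-sample window length is a hypothetical choice (the post only says the average is taken over a period nominally centred on the time in question):

```python
import numpy as np

def typical_values(temps, window=11):
    """Centred moving average used as the 'typical-value' series.

    temps: 1-D array of monthly (or annual) temperatures, NaN = missing.
    window: averaging period in samples; 11 is a hypothetical choice,
    the post does not state the period used.  Near record boundaries
    the window is truncated rather than centred, as described in the text.
    """
    temps = np.asarray(temps, dtype=float)
    n = len(temps)
    half = window // 2
    out = np.full(n, np.nan)
    for i in range(n):
        seg = temps[max(0, i - half):min(n, i + half + 1)]
        if np.isfinite(seg).any():      # avoid nanmean's all-NaN warning
            out[i] = np.nanmean(seg)
    return out
```

The deviations then follow by simple subtraction, as `temps - typical_values(temps)`.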
Once a series of typical-values has been obtained for each station in a region, the deviations follow by simple subtraction from the raw data. The resulting deviations are robust against non-climatic influences, as long as those influences are not changing rapidly with time. Further robustness is obtained by combining the individual station deviations into a regional average using their median, which suppresses the influence of outliers, typically resulting from defective data and transient perturbations.
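The median combination of station deviations might look like this (a sketch; the array layout is an assumption):

```python
import numpy as np

def regional_deviation(deviations):
    """Median across stations of the deviation series at each time step.

    deviations: 2-D array, shape (n_stations, n_times), NaN = missing.
    The median suppresses outliers caused by defective data or
    transient local perturbations.
    """
    return np.nanmedian(np.asarray(deviations, dtype=float), axis=0)
```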
Subtracting the regional average deviations from the raw temperature data for a station allows the visual detection of periods of anomalous warming or cooling, aided by any available metadata on the station history of changes in location/equipment/procedures. Initially the histories of RAW regional average and station typical-values are used in visual detection, but these histories gradually evolve into better estimates as periods of anomalous change are detected and marked (manually) for omission from the regional averaging process.
At intermediate stages, and as the final output, a regional average of typical-values is obtained by combining periods of such data from each individual station that survive the cull of periods of anomalous change. The regional averaging process is perfectly democratic (after any manual rejection of outliers): temperature variations from year to year are estimated by integrating the average (across stations) of interannual temperature differences. There are no explicit shifts applied to station temperatures to compensate for anomalous changes, but such shifts are implicit within the averaging process.
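The first-difference style of averaging described above can be illustrated as follows; this is a sketch assuming annual series, not the blog's own code:

```python
import numpy as np

def first_difference_average(series):
    """Regional history from station series via averaged differences.

    series: 2-D array (n_stations, n_years), NaN where a station has no
    data, so its differences drop out of the cross-station average.
    Year-to-year differences are averaged across stations and then
    integrated (cumulative sum).  The result is defined only up to an
    additive constant; here it is anchored at zero in the first year.
    """
    series = np.asarray(series, dtype=float)
    diffs = np.diff(series, axis=1)            # interannual differences, per station
    mean_diff = np.nanmean(diffs, axis=0)      # simple average across stations
    mean_diff = np.where(np.isfinite(mean_diff), mean_diff, 0.0)  # bridge years with no data
    return np.concatenate([[0.0], np.cumsum(mean_diff)])
```

Because only differences enter the average, a station carrying a constant offset does not shift the regional history; that is the sense in which compensating shifts are implicit within the averaging process.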
The following figure shows some synthetic monthly average temperature data, and several statistical measures that could be calculated to represent the “average” (g-median is the one used, and is defined below):
The core algorithm used for climate reconstruction on this blog is the computation of temperature averages in moving windows of monthly average temperatures, which can be Tmax (daily maximum), Tmin (daily minimum) or their average or difference. If monthly data are not available then annual averages can be used.
What is the “average” of a set of temperature measurements, such as the monthly average Tmax for January from 1960 to 1975? There is no single right answer to that question; it depends on what you are trying to convey. A thermodynamicist may want to calculate a simple mean, but the mean would not give the “typical” temperature if there were exceptionally hot or cold spells, or defective data with large errors. The median is much better at providing the typical temperature, but it has problems for certain datasets; for example, the median of an odd number of data that switch back and forth between (say) 18 and 22 °C can only ever be one of those two values, never 20, which may well be the desired result. I use an extension of the median, which I call the “g-median” (generalised median), defined as follows:
- Sort the data
- Average (mean) a central subset of the sorted data, typically the central 50% portion
The g-median gets some of the benefit of averaging, but at the same time reduces the influence of outliers.
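The two steps above can be sketched as follows (the central 50% is the typical portion mentioned in the text; other fractions are possible):

```python
import numpy as np

def g_median(data, central_fraction=0.5):
    """Generalised median: mean of the central portion of the sorted data.

    central_fraction = 0.5 averages the central 50% of the sorted
    values; as it shrinks towards zero this approaches the ordinary
    median, and at 1.0 it is the plain mean.
    """
    x = np.asarray(data, dtype=float)
    x = np.sort(x[np.isfinite(x)])       # drop missing values, then sort
    n = len(x)
    keep = max(1, int(round(n * central_fraction)))
    lo = (n - keep) // 2                 # symmetric trim from both ends
    return float(x[lo:lo + keep].mean())
```

Trimming symmetrically means an isolated defective value at either extreme never enters the average, while the mean over the retained portion can still take values between the data points, unlike the ordinary median.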
An important characteristic of any algorithm used to estimate the temperature averages is consistency between stations. One common problem that can upset this consistency is the occurrence of defective data, typically anomalous spikes or dips. The influence of defective data on the temperature averages is reduced considerably by subtracting regional average temperature fluctuations before averaging, which makes defective measurements more likely to stand out as outliers.
Automatic In-filling of Missing Data
The overall procedure for temperature reconstruction is quite resilient to missing data, but there are circumstances where it causes problems. This section describes the algorithm used to in-fill missing data. Once you have an automatic in-filler you can also deal with defective data simply by deleting it (setting it to NaN) before the in-filler operates.
The following figure shows how missing temperature data (in blue) are estimated from the regional average weather variations (in mauve):
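A minimal sketch of such an in-filler follows. The function name and the exact formula are assumptions based on the decomposition given earlier (typical-value plus deviation); the post does not spell out the formula used:

```python
import numpy as np

def infill(station, regional, typical):
    """Fill missing station values from regional weather variations.

    A missing datum is estimated as the station's typical-value plus
    the regional average deviation at that time (an assumption; see
    the lead-in).  All inputs are 1-D arrays of equal length, with NaN
    marking missing values.  Defective data can be handled the same
    way, by setting it to NaN before calling this.
    """
    station = np.asarray(station, dtype=float).copy()
    regional = np.asarray(regional, dtype=float)
    typical = np.asarray(typical, dtype=float)
    missing = ~np.isfinite(station)
    station[missing] = typical[missing] + regional[missing]
    return station
```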
Regional Average Temperature Deviations
More to follow shortly …