In a previous post I introduced the data set provided by Euskalmet, the Basque Agency of Meteorology. Here, we use the same database to assess the following conjecture:
Some of the record high temperature values observed in recent years are likely to be caused by an increase in the number of meteorological stations rather than by an actual increase in temperature.
In recent years, the local media have often reported that record high temperatures have been observed. Academic research has also found that extreme temperature values have become more frequent around the world in the last decade.
The table below reports the number of stations available in the database for each year and the number of stations added to the database in each year. There are gaps in the data for some stations: some years may be missing, or a station may no longer be active after a certain year. Nevertheless, we may roughly say that the number of stations, and with it the amount of temperature data, has increased in the last five years of the sample.
#TS is the total number of stations (old and new) available in the database for each year.
#NS is the number of new stations introduced in each year.
This makes me wonder whether the observed temperature highs are truly record highs or just a side effect of an increase in the amount of data. Imagine that a new record high is observed in 2012 at a particular station, say 41 degrees Celsius at station 'B'. If station 'B' was not recording data in the past, say in 2004, how do we know that it is really a record high? If we had data for station 'B' in 2004, a value of 43 degrees could have been observed and, hence, we wouldn't be talking about a record break; or the highest value could have been 40 degrees, in which case we would indeed have a new highest value. The point is that, since we do not know what the maximum temperature was in 2004 at the new station 'B', we cannot be certain whether the 41 degrees observed at 'B' is just a new highest value in the database or really a record break.
Next, we will do an exercise to assess to what extent the increase in the amount of data may lead to false record breaks. First, let me sketch the idea behind the exercise:
Let 'A' be the set of stations that started running before 2008 and 'B' the set of stations that started running after 2007. Assume that a new maximum is observed at some station during the period 2008-2012 (i.e., 'A' + 'B' 2008-2012 > 'A' 2003-2007). We identify two situations:
- if 'A' 2008-2012 > 'A' 2003-2007, then we say that there is a new record high.
- if 'B' 2008-2012 > 'A' 2008-2012 but 'A' 2008-2012 < 'A' 2003-2007 (which implies 'B' 2008-2012 > 'A' 2003-2007), then we say that there is a false record break.
In the former case a new highest value is observed within the set of stations that are available in both periods and, hence, we have a record break. In the latter case, a new highest value is observed at some station that became available after 2007. We can therefore say that we have observed a new highest value, but we cannot be certain whether it is a new record break because there are no earlier data to compare with ('B' was introduced after 2007).
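The two situations can be written down as a small function. This is only a sketch of the rule stated above, not code taken from the scripts used later; the function name, the argument names and the labels it returns are my own, and treating a tie in set 'A' as "no record" is an assumption, since the text only covers strict inequalities.

```python
def classify(a_old, a_new, b_new):
    """Classify the 2008-2012 maximum against the 2003-2007 maximum.

    a_old: maximum over the set 'A' in 2003-2007
    a_new: maximum over the set 'A' in 2008-2012
    b_new: maximum over the set 'B' in 2008-2012
    (hypothetical names, introduced here for illustration)
    """
    if a_new > a_old:
        # The old stations themselves beat their previous maximum.
        return "record high"
    if b_new > a_old:
        # Only a station added after 2007 exceeds the old maximum.
        return "false record break"
    return "no new maximum"


# Made-up values, for illustration only.
print(classify(a_old=40.0, a_new=41.0, b_new=39.0))
print(classify(a_old=40.0, a_new=39.0, b_new=41.0))
```

The second condition only needs to compare 'B' with the old maximum: once 'A' 2008-2012 is known not to exceed 'A' 2003-2007, any 'B' value above the old maximum is also above 'A' 2008-2012.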
If I have managed to make my point, you may turn the reasoning around and say: well, we could also talk about false non-record breaks; we may say that no new record highs are observed when the real reason is that some of the old stations are no longer active. That's true. However, I believe it is more relevant to look at it the other way around; after all, more stations are added to the database than are removed from it. In addition, I find it more exciting to inspect record breaks than to check that a record remains untouched. This somewhat twisted reasoning may help to see my point anyway.
Now, let's get to work. The database consists of 980 zipped files (each one containing 12 xml files, one for each month) that add up to 900 Megabytes, so there is quite some work ahead of us. The action is about to start,... get ready!
Here is a sample xml file that contains the data for one month at a given station: C020_2008_1.xml. The data are recorded every 10 minutes, so for each day in a given month there are 144 records. The whole database is available here.
In order to explore the data and assess our hypothesis we will use the Hadoop framework, which provides an implementation of MapReduce suitable for our relatively large data set. Hadoop arranges the computation of the tasks in parallel and manages the distribution of the data.
Basically, the MapReduce algorithm consists of two steps: the Map, which performs some filtering and sorting of the data, and the Reduce, which performs a summary operation and gathers the results. In our case, the mapper and the reducer do the following work.
- Mapper: for each year and month it reads the xml files and extracts the value of the air temperature variable. Then it outputs the name of the station, year, month and temperature value in a tab separated string (144 strings for each day given a station and year).
- Reducer: it reads the output created by the mapper and keeps track of the maximum temperature value for each station, year and month. It outputs the station, year, month and the corresponding maximum temperature value (1 string for each month given a station and year).
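To make the reducer step concrete, here is a minimal sketch of the idea. It is not the reducer.py mentioned in this post; the function name is made up, and it assumes that each input line holds exactly four tab separated fields (station, year, month, temperature) and that the lines arrive already sorted by the first three fields.

```python
def reduce_max(lines):
    """Yield 'station<TAB>year<TAB>month<TAB>max' for each run of equal keys.

    Assumption: `lines` is sorted by station, year and month.
    """
    current_key = None
    current_max = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed lines
        key = tuple(fields[:3])
        try:
            value = float(fields[3])
        except ValueError:
            continue  # skip non-numeric temperature values
        if key != current_key:
            # A new (station, year, month) starts: emit the previous one.
            if current_key is not None:
                yield "\t".join(current_key + (str(current_max),))
            current_key, current_max = key, value
        elif value > current_max:
            current_max = value
    # Emit the last group.
    if current_key is not None:
        yield "\t".join(current_key + (str(current_max),))


# Made-up sample lines in the format described above.
sample = [
    "C020\t2008\t1\t10.5\n",
    "C020\t2008\t1\t12.5\n",
    "C020\t2008\t2\t8.0\n",
]
for out in reduce_max(sample):
    print(out)
```

Only one key and one running maximum are kept in memory at any time, which is why this style of reducer relies on its input being sorted.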
I implemented the mapper and the reducer in python: mapper.py | reducer.py. The comments in the scripts explain some issues to be taken into account. These scripts can be passed to hadoop in streaming mode. The input for the mapper is the summary.txt text file. Each of its rows contains the name of a station and the first and last year that are available for it in the database.
Actually, we can run the MapReduce process from the command line outside the Hadoop framework. For example, in Linux we may type:
$ cat summary.txt | ./mapper.py | sort -k1,3 | ./reducer.py
Notice that before passing the output from the mapper to the reducer we must first sort the data by the first three fields, otherwise the loop that we defined in the reducer would not work correctly. Hadoop arranges the sorting operation by default. Running the above process took around 20 minutes on my computer. As mentioned above, Hadoop manages the whole process more efficiently. Although we would need a multiple-node cluster to take full advantage of how Hadoop operates, we already observed some improvement on a dual-core single-node cluster: the time taken to run our mapper and reducer in Hadoop decreased to around 15 minutes.
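The need for the sort can be seen in a small experiment. The sketch below is not part of the scripts used here; it uses itertools.groupby from the Python standard library, which, like a loop that compares each key with the previous one, only merges adjacent rows with equal keys. The records and the station name C030 are made up for illustration.

```python
from itertools import groupby

# Made-up records: (station, year, month, temperature).
records = [
    ("C020", 2008, 1, 10.5),
    ("C030", 2008, 1, 9.0),
    ("C020", 2008, 1, 12.5),
]


def max_per_run(rows):
    """Maximum temperature for each run of adjacent rows sharing a key."""
    return [
        (key, max(row[3] for row in group))
        for key, group in groupby(rows, key=lambda row: row[:3])
    ]


unsorted_result = max_per_run(records)
sorted_result = max_per_run(sorted(records, key=lambda row: row[:3]))

print(len(unsorted_result))  # C020 is split into two separate groups
print(sorted_result)         # one group per station, year and month
```

Without the sort, station C020 shows up twice with two different "maxima"; after sorting, each station, year and month appears once with its true maximum.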
At this point we have moved from a 900 MB database to a single text file of less than 180 KB. Here is the output file. Now we have the data in a format that is much easier to handle and in which we can check our hypothesis. All that remains is to summarize the results, splitting the data returned by the MapReduce process by period (2003-2007 and 2008-2012) and month. We must also distinguish whether each station belongs to the old set of stations or was added after 2007. I did this operation with the R script compare-stations-periods.R. The following summary table is what we were pursuing and what we get:
A: Maximum temperature in period 2003-2007.
B: Maximum temperature in period 2008-2012 within the set of stations that were available before 2008.
C: Maximum temperature in period 2008-2012 within the set of stations that became available after 2007.
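For readers who prefer Python, the way the columns A, B and C are built can be sketched as follows. This is not a translation of compare-stations-periods.R; the function name, the input layout (tuples of station, year, month and monthly maximum, plus a dictionary with the first year of each station) and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict


def summarize(monthly_max, first_year):
    """Build {month: {"A": ..., "B": ..., "C": ...}} from monthly maxima.

    monthly_max: iterable of (station, year, month, tmax)
    first_year:  dict mapping station -> first year in the database
    A value of None means that no data fell into that cell.
    """
    table = defaultdict(lambda: {"A": None, "B": None, "C": None})
    for station, year, month, tmax in monthly_max:
        old_station = first_year[station] < 2008
        if 2003 <= year <= 2007:
            col = "A"
        elif 2008 <= year <= 2012:
            col = "B" if old_station else "C"
        else:
            continue  # outside both periods
        row = table[month]
        if row[col] is None or tmax > row[col]:
            row[col] = tmax
    return dict(table)


# Made-up stations and values, for illustration only.
rows = [
    ("S1", 2005, 7, 38.0), ("S1", 2010, 7, 37.0), ("S2", 2010, 7, 39.5),
    ("S1", 2005, 1, 18.0), ("S1", 2010, 1, 19.0),
]
table = summarize(rows, first_year={"S1": 2003, "S2": 2009})

# Months where only the new stations exceed the old maximum.
candidates = [
    month for month, r in table.items()
    if None not in r.values() and r["C"] > r["A"] and r["B"] < r["A"]
]
print(table)
print(candidates)
```

The last step picks out the months that fit the "false record break" situation described earlier: C above A while B stays below A.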
What about our hypothesis, is it supported? Hummm,... I'm somewhat puzzled by the output. Was a temperature of 50 degrees Celsius really observed in the middle of the winter? It does not seem reliable at all. Yet, looking at the source database, that is what was recorded.
I was looking for cases such that C > A and B < A. July is the only month where this is observed; therefore, the hypothesis that more records may be in part due to more data is not supported. Nevertheless, since some of the temperature values do not appear reliable, before giving a conclusion it would be necessary either to get more information about what the variables in the database represent or to do some debugging in the database in order to remove anomalous observations. I already included a threshold that omits values greater than 55 degrees Celsius. If some debugging is necessary it should be done by the maintainers of the database, who know the details of the data. We didn't reach a strong conclusion, but it was a nice exercise to get some practice with Hadoop and see its capabilities.