javi | May 8, 2025, 9:07 a.m.
Euskalmet is the Basque Agency of Meteorology. It collects daily meteorological data at several stations. A database containing this data for the years 2003 to 2012 is provided through the service Open Data Euskadi. This is a rich source of information that is worth exploring.
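The database is too large to be digested in one go. It contains data about air direction, temperature and visibility, among other variables, all recorded every ten minutes throughout each day. Here, I show how to handle the two main operations required to make the most of the database: merging the source XML files into a single series, done with the Python script below, and extracting summary statistics from that series, done with the R script that follows it.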
from lxml import etree
import zipfile
from datetime import date, time, datetime, timedelta
import calendar

# user defined variables
station = "C040"
y0 = 2008
yN = 2012

# define the sequence of times (every 10 minutes)
# to be used with potential missing files and data
allHours = [''] * 144
tlast = datetime.combine(date.today(), time(0, 0))
allHours[0] = tlast.strftime("%H:%M")
for i in range(1, 144):
    tlast += timedelta(minutes = 10)
    allHours[i] = tlast.strftime("%H:%M")

# main loop
fout = open('%s.csv' % station, 'w')
for yy in range(y0, yN + 1):
    print(yy)
    for imonth in range(1, 13):
        zipf = zipfile.ZipFile("../%s/%s_%s.zip" % (yy, station, yy))
        isNA = False
        try:
            xmlf = zipf.open("%s/%s_%s_%s.xml" % (station, station, yy, imonth))
        except KeyError:
            try: # first try if month '1' is denoted '01'
                xmlf = zipf.open("%s/%s_%s_0%s.xml" % (station, station, yy, imonth))
            except KeyError: # the file does not exist (missing values)
                isNA = True
        temp = ''
        if not isNA: # if file exists
            xmlData = etree.parse(xmlf)
            monthDays = xmlData.findall(".//dia")
            print(len(monthDays))
            for day in monthDays:
                dayLabel = day.attrib['Dia']
                hours = day.findall("hora")
                for hour in hours:
                    temp = hour.findtext("Meteoros/Tem.Aire._a_620cm")
                    #print("%s; %s; %s" % (dayLabel, hour.attrib['Hora'], temp))
                    #print(hour.find("Meteoros/Tem.Aire._a_620cm"))
                    fout.write("%s;%s;%s\n" % (dayLabel, hour.attrib['Hora'], temp))
            xmlf.close()
        else: # if file does not exist (missing data for that year and month)
            ndays = calendar.monthrange(yy, imonth)[1]
            monthDays = range(1, ndays + 1)
            print("%s (missing)" % len(monthDays))
            for day in monthDays:
                d = "0%s" % day if day < 10 else day
                month = "0%s" % imonth if imonth < 10 else imonth
                dayLabel = "%s-%s-%s" % (yy, month, d)
                for hour in allHours:
                    temp = '' # NA
                    fout.write("%s;%s;%s\n" % (dayLabel, hour, temp))
        zipf.close()
fout.close()
library("zoo")

a <- read.csv(file = "C040.csv", header = FALSE, sep = ";",
  colClasses = c("character", "character", "numeric"))
d0 <- strsplit(a[1,1], "-")[[1]]
h0 <- strsplit(a[1,2], ":")[[1]]
d0 <- ISOdate(year = d0[1], month = d0[2], day = d0[3],
  hour = h0[1], min = h0[2], sec = 0)
dN <- strsplit(a[nrow(a),1], "-")[[1]]
hN <- strsplit(a[nrow(a),2], ":")[[1]]
dN <- ISOdate(year = dN[1], month = dN[2], day = dN[3],
  hour = hN[1], min = hN[2], sec = 0)
dates <- seq(d0, dN, by = "10 min")
x <- zoo(a[,3], dates)

# user defined parameters
yy1 <- 2008
mm1 <- 1
dd1 <- 1
h1 <- 0
m1 <- 00
yy2 <- 2011
mm2 <- 12
dd2 <- 31
h2 <- 23
m2 <- 50
stat <- "mean"
tscl <- "daily"
p <- c(yy1, mm1, dd1, h1, m1, yy2, mm2, dd2, h2, m2, stat, tscl)

# initial and end dates
d0 <- ISOdate(year=p[1], month=p[2], day=p[3], hour=0, min=0, sec=0)
dN <- ISOdate(year=p[6], month=p[7], day=p[8], hour=23, min=50, sec=0)
a <- window(x, start = d0, end = dN)
a0 <- a

# time interval
times <- as.numeric(format(time(a), "%H.%M"))
t0 <- paste(p[4], p[5], sep =".")
tN <- paste(p[9], p[10], sep = ".")
ft <- cut(times,
  breaks = unique(as.numeric(c(0, as.numeric(t0), as.numeric(tN) + 0.05, 23.55))),
  include.lowest = TRUE, right = FALSE)
#table(ft)
length(ft) == length(a)
if (as.numeric(t0) == 0) {
  a <- split(a, f = ft)[[1]]
} else {
  a <- split(a, f = ft)[[2]]
}

# statistic and time scale
switch(p[11],
  "mean" = FUN <- mean,
  "median" = FUN <- median,
  "variance" = FUN <- var,
  "maximum" = FUN <- max,
  "minimum" = FUN <- min)
switch(p[12],
  "hourly" = freq <- "60 mins",
  "daily" = freq <- "1 days",
  "weekly" = freq <- "1 weeks",
  "monthly" = freq <- "1 months")
fd <- cut(time(a), breaks = freq,
  include.lowest = TRUE, right = FALSE)
#table(fd)

# split the series and obtain the statistic
la <- split(a, f = fd)
if (p[11] == "min-max")
{
  la.min <- lapply(X = la, FUN = min, na.rm = TRUE)
  la.max <- lapply(X = la, FUN = max, na.rm = TRUE)
  xout1 <- zoo(unlist(la.min), as.Date(names(la.min)))
  xout1[is.nan(xout1)] <- NA
  xout1[!is.finite(xout1)] <- NA
  xout2 <- zoo(unlist(la.max), as.Date(names(la.max)))
  xout2[is.nan(xout2)] <- NA
  xout2[!is.finite(xout2)] <- NA
  fout <- matrix(nrow = length(xout1), ncol = 3)
  colnames(fout) <- c("date", "minimum", "maximum")
  fout[,1] <- gsub("-", "", time(xout1))
  fout[,2] <- round(xout1, 2)
  fout[,3] <- round(xout2, 2)
} else {
  la.stat <- lapply(X = la, FUN = FUN, na.rm = TRUE)
  xout <- zoo(unlist(la.stat), as.Date(names(la.stat)))
  xout[is.nan(xout)] <- NA
  fout <- matrix(nrow = length(xout), ncol = 2)
  colnames(fout) <- c("date", p[11])
  fout[,1] <- gsub("-", "", time(xout))
  fout[,2] <- round(xout, 2)
}
write.csv(fout, file = "tmp.csv", row.names = FALSE, quote = FALSE)
The Python script assumes that the source files are in the parent directory of the one where the Python file is located. The zip file for each year must already be unzipped, while the files containing the data for each station remain zipped. The script deals with potentially missing files and with the different naming conventions used in the source files; for example, the month of January is sometimes denoted '1' and other times '01'. Thus, all that is needed is to define the variables 'station' (name of the meteorological station), 'y0' and 'yN' (first and last year in the sample, respectively) at the top of the script.
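The fallback between the two month-naming conventions, and the placeholders written for a missing month, can each be isolated in a small helper. The following is a minimal sketch rather than the script's own code: the function names `open_month` and `missing_rows` are hypothetical, the archive layout is the one the script already assumes, and the in-memory zip file only stands in for a real station archive.

```python
import calendar
import io
import zipfile


def open_month(zipf, station, year, month):
    """Return the open XML member for a month, or None if it is missing.

    Tries the unpadded month label first ('1') and then the zero-padded
    one ('01'), mirroring the two naming conventions found in the archives.
    """
    for label in ("%d" % month, "%02d" % month):
        name = "%s/%s_%s_%s.xml" % (station, station, year, label)
        try:
            return zipf.open(name)
        except KeyError:  # ZipFile.open raises KeyError for unknown members
            continue
    return None


def missing_rows(year, month):
    """Build placeholder rows (date;time;empty value) for a missing month."""
    ndays = calendar.monthrange(year, month)[1]
    # 144 ten-minute slots per day, formatted as HH:MM
    slots = ["%02d:%02d" % divmod(10 * i, 60) for i in range(144)]
    return ["%d-%02d-%02d;%s;" % (year, month, day, slot)
            for day in range(1, ndays + 1) for slot in slots]


# Small in-memory archive standing in for a real station file (assumption)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("C040/C040_2008_01.xml", "<mes/>")

with zipfile.ZipFile(buf) as zf:
    found = open_month(zf, "C040", 2008, 1)    # stored under the '01' label
    absent = open_month(zf, "C040", 2008, 2)   # no file for February
    if found is not None:
        found.close()

rows = missing_rows(2008, 2)  # one row per ten-minute slot of the month
```

Returning `None` instead of setting a flag keeps the two outcomes (file found, file missing) in a single value that the caller can test directly.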
A useful feature of the R script is that it returns the values observed within a given time interval of the day. For example, we may be interested in temperature averages observed during daytime, nighttime or at a particular interval, e.g., from 8:00 AM to 10:00 AM. In addition to the mean, other relevant statistics such as the median, variance, and minimum and maximum values can be obtained from the database. The user-defined parameters are, in the order shown in the script: the starting year, month, day, hour and minute; the ending year, month, day, hour and minute; the statistic; and the time scale.
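To make the filter-then-aggregate logic concrete, here is a minimal Python sketch of the same idea using only the standard library. It is an illustration, not a port of the R script: the function name `daily_stat` and the sample readings are hypothetical, and missing observations are assumed to be represented as `None`. Observations outside the time-of-day window are dropped first, and the chosen statistic is then computed for each day.

```python
from collections import OrderedDict
from datetime import datetime, time
from statistics import mean


def daily_stat(records, start, end, stat=mean):
    """Aggregate (timestamp, value) pairs per day within a time-of-day window.

    Keeps observations with start <= time <= end, skips missing values
    (None), and applies `stat` to each day's remaining values. Days whose
    window contains no valid observation yield None.
    """
    by_day = OrderedDict()
    for stamp, value in records:
        if not (start <= stamp.time() <= end):
            continue
        values = by_day.setdefault(stamp.date(), [])
        if value is not None:
            values.append(value)
    return OrderedDict((day, stat(vals) if vals else None)
                       for day, vals in by_day.items())


# Hypothetical sample: a few ten-minute readings on two days
records = [
    (datetime(2009, 1, 1, 7, 50), 3.0),   # before the window, ignored
    (datetime(2009, 1, 1, 8, 0), 4.0),
    (datetime(2009, 1, 1, 8, 10), None),  # missing value, skipped
    (datetime(2009, 1, 1, 10, 0), 6.0),
    (datetime(2009, 1, 2, 9, 0), None),   # only a missing value that day
]

result = daily_stat(records, time(8, 0), time(10, 0))
for day, value in result.items():
    print(day, value)
```

Passing another function as `stat`, for example `max` or `statistics.median`, changes the statistic without touching the filtering step.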
The form below illustrates the output provided by these scripts for one of the series: temperature observed at the station labelled C040 from 2009 to 2011. Daily averages for this station and subsample are shown in the plot below. The source XML files are already merged; submitting the form runs the R script and displays the output series.
The form below is currently not available.
Station C040: daily average temperature 2009-2011 (°C)
This blog is part of jalobe's website.
jalobe.com
At present not all posts from jalobe's blog are available.