The main enhancements are the following:

- An environment option for R chunks. Dependencies are established only across those variables that are evaluated in the same environment. In this way, variable names can be reused without generating false dependencies. For example, two chunks may use the same variable named x and still be independent of each other. This avoids having to come up with new names, such as x2, and remember all of them.
- The R function all.vars is now used. This facilitates the detection of the variables created and used by each chunk, which is in turn used to determine dependencies across files.
- Output is interleaved with the R input chunk. When a chunk prints the value of more than one object, it may not be clear what each output refers to in the code chunk. Now, output is arranged by default in between the input code, in the same order as it would be printed on the console.
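For reference, this is what the base R function does on a parsed chunk; a minimal example (the chunk text is made up):

```r
# all.vars() lists every variable appearing in an unevaluated expression,
# including the assignment target, which is how the variables created and
# used by a chunk can be detected
expr <- parse(text = "y <- 2 * x + z")[[1]]
all.vars(expr)
# returns "y", "x" and "z"; all.names(expr) would additionally report
# the functions involved ("<-", "+", "*")
```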
These are the arguments that can be passed to the new command latexr:

- make: generate/update and run the Makefile. The following options can be passed to it: doc=x, the name of the file without extension (assumed '.rnw'); Rdir=x, the name of the directory where the R chunks will be stored; figDir=x, the name of the directory where the figures will be placed.
- clean: delete LaTeX and R auxiliary and output files.
- reload: reload .Rprofile (if any).
- clear: delete the .R and .Rout files evaluated in a given environment.
- help: brief description of available arguments and options.

A sample rnw file and its output are available here. The output PDF file is built by means of the following command (after installing latexr): latexr make.
At the moment the documentation is a bit short. Don't hesitate to contact me if you need some clarifications on the usage of these scripts.
Given an ARIMA model fitted to the observed data, the procedures implemented in the package extract estimates of the trend and seasonal signals. The methodology is developed and described, among others, in Burman (1980) and Hillmer and Tiao (1982). An introduction to the methodology and the package in the form of a vignette is available here: https://www.jalobe.com/doc/tsdecomp.pdf.
The development of the package tsdecomp started as a pedagogical exercise. The current version has become relatively mature and provides interfaces that can be used by users not necessarily familiar with the methodology. It is nonetheless far from the capabilities of other software tools that have long been used by statistical offices, most notably SEATS and X13-ARIMA-SEATS.
If you try the package, feel free to contact me and raise any issues or questions. This feedback would help me to improve future versions of the package.
The functions that implement these tests have been coded from scratch in order to include the following new features:
The calculation of p-values based on the response surface method follows Díaz-Emparanza and Moral (2013) [3] and Díaz-Emparanza (2014) [4], respectively for the CH and the HEGY tests. I ported the code provided by the authors in the Gretl software package for econometric analysis. For the bootstrapped HEGY test, I followed Burridge and Taylor (2004) [5].
The main motivation was to experiment with an NVIDIA GPU that I purchased and with the CUDA language. My first exercise was to replicate the simulations carried out in Díaz-Emparanza (2014) by means of a parallel implementation that runs most of the calculations on the GPU. It was amazing to see how well the GPU performed, even though the operations involved in the simulations have nothing to do with graphics. I therefore decided to reuse the same code for bootstrapping the test statistics and eventually ended up with a new version of the package.
I have updated this post, which introduces the usage of uroot for obtaining the CH test statistics. I hope to finish soon a document with further details about the package and the implementation.
[2] Hylleberg, S., Engle, R., Granger, C. and Yoo, B. (1990) 'Seasonal integration and cointegration'. Journal of Econometrics, 44(1), pp. 215-238. DOI: 10.1016/0304-4076(90)90080-D
[3] Díaz-Emparanza, I. and Moral, M. Paz (2013). Seasonal Stability Tests in gretl. An Application to International Tourism Data. Working paper: Biltoki D.T. 2013.03. URL: https://addi.ehu.es/handle/10810/10577. Gretl package.
[4] Díaz-Emparanza, I. (2014) 'Numerical distribution functions for seasonal unit root tests'. Computational Statistics and Data Analysis, 76, pp. 237-247. DOI: 10.1016/j.csda.2013.03.006. Gretl package.
[5] Burridge, P. and Taylor, R. (2004) 'Bootstrapping the HEGY seasonal unit root tests'. Journal of Econometrics, 123(1), pp. 67-87. DOI: 10.1016/j.jeconom.2003.10.029.
There are a lot of websites giving money-saving tips for travelling, as well as research studying how the prices of flight tickets are determined. For example, this study published in the Economic Journal concludes that the best moment to buy flight tickets is eight weeks before departure. The mathematical formulation supporting this conclusion drew the attention of The Economist and The Guardian.
According to this article in The Telegraph, it is still possible to get cheaper prices in last minute bookings, although this requires being flexible with the departure and return dates. This BBC News article also argues why waiting a few days to purchase the tickets may result in a cheaper price.
There are several reasons that explain the rises and falls of prices at different points in time. But what about real data? What would we see if we observed actual flight ticket prices? Some time ago I recorded daily prices for some routes. After several months I had gathered a relatively large database. It does not contain enough information to draw strong conclusions or rules, but it is a nice source of information to illustrate graphically how prices evolved. That is what I show in the animation below. These graphics display prices observed daily for the route KL1704 Madrid-Amsterdam, over a period of 100 days before each of several departure dates.
In general, buying 60-70 days before the departure date leads to cheaper prices. In the dates immediately before departure, prices tend to undergo sharper increases.
This article at cheapair.com summarizes some conclusions after observing millions of prices. kayak.com provides information on whether a price is expected to increase or decrease; this is also based on millions of flight queries.
In order to get reliable forecasts, further information would be necessary in addition to historical prices; for example, the number of tickets sold at each date, dates of holidays, or oil prices. The database that I collected contains prices recorded daily from January 2013 to July 2014 for the route Madrid-Amsterdam, and it can be useful for a preliminary analysis and view of this kind of data. It is not easy to find prices of a given flight over a span of 100 days before the departure date. If you are interested in this data set, send me an email and I will be happy to send you the data.
In this post I summarize the results of a simulation exercise where I replicate a small part of the simulations described in Díaz-Emparanza (2014) [1]. I give the details of the exercise in this draft document, which I am planning to complete and update. The code is still a development version; it is available upon request.
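For reference, the structure of such a simulation can be sketched in plain R (a sequential toy version, not the actual GSL/CUDA code; the Dickey-Fuller style regression and the iteration count are simplified for illustration):

```r
# Toy version of the Monte Carlo structure that the parallel codes accelerate:
# each iteration simulates a series under the null hypothesis of a unit root,
# fits a regression and stores the test statistic
set.seed(1)
niter <- 1000                                # the actual exercise uses 100,000
n <- 100
stats <- numeric(niter)
for (i in seq_len(niter)) {
  x <- cumsum(rnorm(n))                      # random walk under the null
  fit <- lm(diff(x) ~ 0 + x[-n])             # Dickey-Fuller type regression
  stats[i] <- coef(summary(fit))[1, "t value"]
}
quantile(stats, c(0.01, 0.05, 0.10))         # approximate critical values
```

The quantiles of the stored statistics are the empirical critical values; the response surface approach then smooths such quantiles across sample sizes.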
The table below reports the timings in different settings. The kind of implementation is labelled in the first column; it is based either on the GSL interface to fit a linear regression model, on OpenMP with 2, 4 or 8 threads, or on CUDA to program the GPU.
The platforms (processor or environment) where the program was run are: a PC with an Intel Pentium(R) G2030 processor @3.00GHz with two cores and one thread per core; a computer with an Intel(R) i7-2760QM processor @2.40GHz with four cores and two threads per core; the GeForce GTX 660 GPU (installed on the PC with the G2030 processor).
Code | Platform | Time |
---|---|---|
GSL | G2030 | 12m 39s |
GSL-OpenMP-2 | G2030 | 6m 28s |
GSL-OpenMP-4 | G2030 | 6m 26s |
CUDA | GTX 660 | 2m 04s |
GSL-OpenMP-1 | i7-2760QM | 11m 31s |
GSL-OpenMP-4 | i7-2760QM | 3m 59s |
GSL-OpenMP-8 | i7-2760QM | 2m 53s |
On the G2030 platform, the CUDA implementation is the fastest. It is around six times faster than the sequential version and around three times faster than the best performance that can be achieved on this CPU. The kernel was launched with 5 blocks and 200 threads per block; each thread carries out 100 iterations out of the total 5 x 200 x 100 = 100,000 iterations.
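The decomposition of the iterations across threads can be mimicked in base R with the parallel package (a sketch of the idea only, not the CUDA code; the worker count is illustrative and mc.cores falls back to 1 where forking is unavailable):

```r
library(parallel)

total_iter <- 100000
workers <- 4
# assign each iteration to a worker, just as each GPU or OpenMP thread
# carries out its own share of the iterations
chunks <- split(seq_len(total_iter),
                rep_len(seq_len(workers), total_iter))
cores <- if (.Platform$OS.type == "unix") workers else 1
res <- mclapply(chunks, length, mc.cores = cores)
sum(unlist(res))  # 100000: every iteration is covered exactly once
```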
The i7-2760QM processor can handle up to eight threads. In that case, the timings for the version based on OpenMP (Code=GSL-OpenMP-8) are close to the timings observed for the GPU.
In a more powerful environment, for example the Arina cluster, increasing the number of threads run on the CPU could probably beat the GPU. But a more powerful GPU, such as the Tesla K20, would perhaps reduce the timings of the CUDA programme as well.
Conclusion
The parallel structure of the GPU was useful for the simulations involved in this exercise. Although the syntax is simple, it required adapting the whole process and the algorithm used to obtain the desired test statistics (details are given in the draft document linked above).
References:
[1] Díaz-Emparanza, I. (2014). 'Numerical Distribution Functions for Seasonal Unit Root Tests'. Computational Statistics and Data Analysis, 76, pp. 237-247. DOI: 10.1016/j.csda.2013.03.006.
Linear regression model with lagged dependent variables
ARIMA time series model with exogenous regressors
Although, at first glance, the above formulation could be understood as an AR(p) model with regressors, this model is actually specified as follows (and, to my knowledge, this is how it is implemented in software packages):
To save space, I included only one regressor. A model with an MA component can be viewed as a regression model as well (Hannan and Rissanen procedure), but in order to simplify things here, I did not include an MA term.
In practice, the explanatory variables are usually observed variables, e.g. age, temperature, gross domestic product, that are deemed helpful to explain or predict the dependent variable. Sometimes we are interested in measuring the effect of some particular event, such as a marketing campaign or a policy change. In these cases, we know the time of the event, but its effect on the dependent variable is not directly observed. A time series model with an intervention variable can be employed in this case.
ARMA model with intervention
where $\phi(L)$ and $\theta(L)$ are polynomials in the lag operator $L$, respectively for the AR and MA parts of the model. The intervention is an indicator or dummy variable representing a unit impulse, level shift or transitory change.
Premultiplying both sides of the previous equation by $\pi(L)$, we obtain a series of residuals that can be used to measure the effect of the intervention.

where $\hat{\pi}(L)$ and $\hat{\epsilon}_t$ are, respectively, the estimate of $\pi(L)$ and the residuals from the ARMA model fitted to the original series in a previous step. For further details, see for example this document that describes Chen and Liu's method for the detection of outliers (which can be regarded as a kind of intervention at unknown times) in time series.
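When the timing of the intervention is known, its effect can also be estimated directly by passing the intervention dummy through the xreg argument of stats::arima(). A small sketch on simulated data (the AR coefficient, the shift date and the shift size below are made up for illustration):

```r
set.seed(123)
n <- 120
# AR(1) series with a level shift of size 2 from observation 61 onwards
ls <- as.numeric(seq_len(n) >= 61)          # level-shift intervention dummy
x <- arima.sim(n = n, model = list(ar = 0.5)) + 2 * ls
# estimate the intervention effect jointly with the ARMA parameters
fit <- arima(x, order = c(1, 0, 0), xreg = ls)
tail(coef(fit), 1)                          # estimated effect, close to 2
```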
Dynamic regression models bear some resemblance to the models introduced above, since they involve lags of the dependent variable and also lags of the explanatory variables. These models are often motivated as a representation of delays or adjustments that may be expected in practice between the dependent and explanatory variables. For example, in economics one can argue that there is some delay between changes in interest rates and consumption or investment. Changes in interest rates or other economic variables may not be perceived by economic agents at the time when they occur, or adjusting behaviour to those changes may be costly or require some time. A discussion of these models would take us far from the starting point of this post. For some details see for example my comments here.
As pointed out by Prof. Pollock, the advantage of frequency-domain filters is that they can achieve clear separations of components of the data that reside in adjacent frequency bands, in a way that conventional time-domain methods cannot. By conventional time-domain methods, the author refers to filters such as those of Baxter and King and of Christiano and Fitzgerald, which are often applied to extract the business cycle from macroeconomic data.
Following the notation of the above-mentioned document, frequency-domain filters are based on the Fourier coefficients:
Each coefficient is related to a cycle at a given frequency. Thus, a natural way to filter the desired frequencies is to set equal to zero those coefficients related to frequencies that do not belong to the target component (e.g. seasonal) and then synthesise the target component by means of the inverse transform:
Let's see some examples. (The UK consumption data employed in the first example is available here.)
First, we load the data and remove a linear trend; the Fourier filter will be applied to the detrended series to extract a cycle.
load("UKconsumption.rda")
n <- length(UKconsumption)
fit <- lm(log(UKconsumption) ~ seq_along(UKconsumption))
y <- ts(residuals(fit))
tsp(y) <- tsp(UKconsumption)
Following Pollock, we choose a cut-off frequency to extract a cycle, which involves a low-pass filter that passes the cycles in the low frequency range (whose periodicity ranges from infinity down to 15.9 quarters).
It is convenient to define the coefficients as , with:
Then, the Fourier coefficients can be obtained in R as follows (a direct translation of the summation terms above into explicit loops would work as well, but the vectorized approach followed below is slightly faster):
cutOffInt <- 10
seqnm1 <- 2 * pi * seq.int(0, n-1) / n
tmp <- outer(seqnm1, seq_len(cutOffInt), function(a, j) a * j)
alpha <- c(mean(y, na.rm = TRUE), colSums(c(y) * cos(tmp) * 2/n))
beta <- c(0, colSums(c(y) * sin(tmp) * 2/n))
Given the first 10 Fourier coefficients, it is straightforward to synthesise the corresponding trend-cycle component:
seqCutOffInt <- 2 * pi * seq_len(cutOffInt) / n
seqnm1 <- seq.int(0, n-1)
tmp <- outer(seqCutOffInt, seq.int(0, n-1), function(a, j) a * j)
x <- colSums(alpha[-1] * cos(tmp) + beta[-1] * sin(tmp))
x <- ts(x + alpha[1])
tsp(x) <- tsp(y)
Similarly, a seasonal component can be obtained by computing the Fourier coefficients for the seasonal and nearby cycles.
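The same idea can be written compactly with the fast Fourier transform: zero out all coefficients except those at (and next to) the seasonal frequency and invert. A sketch on a simulated quarterly series (the series itself and the bandwidth of one neighbouring coefficient on each side are illustrative choices, not part of the Pollock example):

```r
# Extract a seasonal component with the FFT by keeping only the
# coefficients around the seasonal frequency 2*pi/4 of quarterly data
set.seed(1)
n <- 96                                  # 24 years of quarterly data
tt <- seq_len(n)
seas <- 0.8 * cos(2 * pi * tt / 4)       # seasonal cycle of period 4
y <- seas + rnorm(n, sd = 0.3)
f <- fft(y)
keep <- rep(FALSE, n)
j <- n/4 + 1                             # 1-based index of frequency 2*pi/4
keep[(j-1):(j+1)] <- TRUE
keep[n + 2 - which(keep)] <- TRUE        # keep the conjugate coefficients too
f[!keep] <- 0
seas_hat <- Re(fft(f, inverse = TRUE)) / n
cor(seas_hat, seas)                      # close to 1
```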
A potential downside of frequency-domain filters is that macroeconomic data may exhibit some peculiarities (e.g., damping trends, periodic integration, time-varying parameters) that may be better captured by a time series model. Nevertheless, I find this approach appealing and helpful for the following reasons:
References:
D.S.G. Pollock. IDEOLOG: A Program for Filtering Econometric Data. A Synopsis of Alternative Methods. URL: ideolog.pdf
Canova and Hansen (1995) [1] proposed a test statistic for the null hypothesis that the seasonal pattern is stable. The test statistic can be formulated in terms of seasonal dummies or seasonal cycles. The former allows us to identify seasons (e.g. months or quarters) that are not stable, while the latter tests the stability of seasonal cycles (e.g. cycles of period 2 and 4 quarters in quarterly data).
In this post, I show how to obtain the test statistic by means of the uroot R package. I apply the test to the data set of quarterly macroeconomic series used in Table 7 of the original paper. I have recently improved the code and fixed some issues with the Newey and West covariance matrix. I have also added the facility to compute the p-values described in Díaz-Emparanza and Moral (2013) [2].
Given the options mentioned in the original paper (lag truncation equal to 5 and including a first order lag as regressor), the stability of seasonal dummies for the series investments
can be tested as follows:
require("uroot")
x <- diff(log(ch.data$ifix))
res <- ch.test(x = x, type = "dummy", lag1 = TRUE, NW.order = 5)
res
The following result is obtained:
         statistic pvalue   
Quarter1    0.7410 0.0073 **
Quarter2    0.2853 0.1567   
Quarter3    0.8277 0.0039 **
Quarter4    0.4481 0.0540 . 
joint       1.9773 0.0100 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Test type: seasonal dummies
NW covariance matrix lag order: 5
First order lag: yes
Other regressors: no
P-values: based on response surface regressions
Information about the model fitted to obtain the test statistic is available in the element fitted.model.
R> summary(res$fit)

Coefficients:
          Estimate Std. Error t value Pr(>|t|)    
xreglag1  0.304707   0.072953   4.177 4.95e-05 ***
xregSD1  -0.109497   0.006289 -17.411  < 2e-16 ***
xregSD2   0.174379   0.009782  17.826  < 2e-16 ***
xregSD3  -0.014516   0.012310  -1.179     0.24    
xregSD4   0.008329   0.006468   1.288     0.20    
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03846 on 153 degrees of freedom
Multiple R-squared: 0.8509, Adjusted R-squared: 0.846
F-statistic: 174.6 on 5 and 153 DF, p-value: < 2.2e-16
The test for the stability of seasonal cycles can be obtained by setting the argument type = "trigonometric".
The script table7.R applies both versions of the statistic to all the series in the data set. The results are summarized below.
Series | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4 |
---|---|---|---|---|
ifix | 0.741 ** | 0.285 | 0.828 ** | 0.448 . |
ifixr | 0.620 * | 0.095 | 0.892 ** | 0.050 |
ifixnr | 0.315 | 0.271 | 0.275 | 0.341 |
ifixnrs | 0.407 . | 0.357 . | 0.432 . | 0.571 * |
ifixnrpd | 0.376 . | 0.435 . | 0.203 | 0.290 |
cns | 1.934 *** | 0.915 ** | 0.605 * | 1.492 *** |
cdur | 0.242 | 0.116 | 0.345 | 0.313 |
cnd | 1.183 *** | 1.437 *** | 1.033 *** | 1.525 *** |
cser | 1.296 *** | 0.638 * | 1.098 *** | 0.932 ** |
gnp | 0.994 ** | 1.033 *** | 0.526 * | 0.881 ** |
imports | 0.157 | 0.434 . | 0.097 | 0.285 |
exports | 0.191 | 0.177 | 0.401 . | 0.22 |
finsale | 1.813 *** | 0.239 | 0.172 | 1.352 *** |
cpi | 0.421 . | 0.660 * | 0.295 | 0.290 |
tbill | 0.359 . | 0.218 | 0.100 | 0.073 |
businv | 0.238 | 1.382 *** | 0.342 | 0.550 * |
m1 | 0.075 | 2.134 *** | 0.137 | 0.188 |
unemp | 1.131 *** | 1.255 *** | 0.454 . | 0.130 |
labfor | 1.348 *** | 0.566 * | 0.374 . | 0.462 * |
empl | 0.368 . | 0.286 | 0.098 | 0.569 * |
monbase | 0.536 * | 0.316 | 0.217 | 0.225 |
monmult | 0.524 * | 1.036 *** | 0.193 | 0.408 . |
hours | 0.317 | 0.397 . | 0.158 | 0.340 |
wage | 0.068 | 0.716 ** | 0.273 | 0.694 ** |
Cells report the CH test statistic and significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1. |
The values of the statistics are not exactly the same as those reported in Table 7 of the original paper. They should be identical, as the same options were used. Unfortunately, the authors and the journal did not provide code that reproduces the results of the paper. I checked my implementation against the one available in the Gretl software [2]; running some examples, I obtained the same results in both programs. The authors of the original paper may have used some variation in the computation of the Newey and West matrix, or options other than those eventually reported in the paper. There may also be some error in the transcription of results in the original paper (e.g., the results for the series hours are close to those obtained here for the series wage, and the other way around).
Regardless of the reproducibility issues, the overall conclusion based on the table above is the same as in the original paper. The seasonal pattern in most of the series is not stable across all the seasons. Yet, there are some series for which a stable seasonal pattern is not rejected at the 5% significance level: non-residential investments, non-residential producer durables, consumption durables, imports, exports, consumer price index, treasury bill and hours. According to the dummy version of the test, changes in the seasonal pattern occur a bit more often in the first and second quarters.
The Canova and Hansen test is an interesting tool that can be used as a complement to the more common unit root tests. Knowing the seasons that cause the instability in the seasonal pattern is helpful in order to find an explanation for evolving patterns, for example, changes in consumption habits, holidays or other socio-economic phenomena.
References:
[1] Canova, F. and Hansen, Bruce E. (1995) 'Are Seasonal Patterns Constant over Time? A Test for Seasonal Stability'. Journal of Business & Economic Statistics, 13(3), pp. 237-252. DOI: 10.1080/07350015.1995.10524598. Paper available from one of the author's website. Data set.
[2] Díaz-Emparanza, I. and Moral, M. Paz (2013). Seasonal Stability Tests in gretl. An Application to International Tourism Data. Working paper: Biltoki D.T. 2013.03. URL: https://addi.ehu.es/handle/10810/10577. Gretl package.
It is illuminating to plot the gain of the corresponding ARIMA filter. The gain reveals the frequencies of those cycles that are captured by the filter. A gain close to zero indicates that the cycle at that frequency is captured by the ARIMA filter, whereas a peak in the gain implies that the corresponding cycles are overlooked by the filter.
The left-hand-side plot in the figure below displays the gain of the following seasonal ARIMA(1,1,0)(1,0,0) model for a monthly series:
where $L$ is the lag operator, such that $L^k y_t = y_{t-k}$.
Vertical dotted lines point to the seasonal frequencies, which for monthly series are by definition $2\pi k/12$, $k=1,\dots,6$. We can see that the cycles related to the seasonal frequencies are not touched by this filter. The right-hand-side plot displays the gain for the same model but with a negative seasonal AR coefficient. In this case, the model captures the cycles related to the seasonal frequencies. If the second filter were applied to a monthly series, the seasonal cycles would be filtered and removed from the original series. In the first model, the seasonal cycles escape the filter.
One interpretation in the context of time series analysis is that, if the model selected for the data is the first one, then there is no major seasonal pattern, as the selected model is not related to any seasonal cycle. The reverse applies to the model with a negative seasonal AR coefficient.
When ARIMA models are used to obtain forecasts, we may not care much about the implications of the parameters of the fitted model. We can simply choose and fit the model that gives the best forecasts according to the mean squared error or some other measure of accuracy. In general, we may expect a seasonal ARIMA model to perform better with monthly or quarterly series, but that's all.
If the purpose of the analysis is to extract a seasonal component or to explore whether a seasonal pattern is present in the data, then we must pay attention to the properties of the filter defined by the chosen ARIMA model.
Concluding that there is a seasonal pattern in the data because the seasonal AR coefficient is significant is not appropriate. We have seen two seasonal ARIMA models with very different features. Displaying the gain of the filter is helpful to study the properties of the filter and the frequency of the cycles that explain the variability in the observed data.
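The gain itself is easy to compute by evaluating the modulus of the filter polynomial on the unit circle. A sketch for the monthly ARIMA(1,1,0)(1,0,0) filter discussed above, with made-up coefficients (0.5 and ±0.8 are illustrative, not the values behind the plots in the figure):

```r
# gain of the filter (1 - phi*L)(1 - L)(1 - Phi*L^12) at frequencies w
gain <- function(w, phi = 0.5, Phi = 0.8) {
  z <- exp(-1i * w)
  Mod((1 - phi * z) * (1 - z) * (1 - Phi * z^12))
}
w <- seq(0, pi, length.out = 500)
plot(w, gain(w), type = "l")             # positive seasonal AR coefficient
lines(w, gain(w, Phi = -0.8), lty = 2)   # negative seasonal AR coefficient
abline(v = 2 * pi * (1:6) / 12, lty = 3) # seasonal frequencies
gain(0)  # 0: the (1 - L) factor annihilates the zero frequency (the trend)
```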
An autoregressive model is an interesting model with which to discuss the meaning, consequences and solutions to be considered when some of the classical assumptions are not met. A stationary autoregressive process of order $p$, AR(p), is defined as follows:
where the roots of the autoregressive polynomial lie outside the unit circle. This model departs from the classical framework in two points:
The first issue requires investigating the relation between the error term in the model and the stochastic predictors, which are lagged values of the dependent variable and are therefore driven by the same error term, $\epsilon_t$. As discussed in my answer to this question, the predictors and the error term are not independent. However, as they are contemporaneously uncorrelated, the properties of the OLS estimator and the distribution of the usual test statistics remain valid asymptotically, i.e., for large samples.
The consequence of the second issue is that standard errors may be larger when the predictors are related to each other. The intuition is that related predictors contain some redundant information, and it is difficult to discern to which variable the influence of this common component should be attributed. This reduces precision and increases the standard errors.
The above are some of the reasons that can be argued against using $t$-tests or $F$-tests for the selection of the lag order, $p$, in an AR model. As discussed in this post by Rob J. Hyndman, other tools, such as the AIC, are generally a better approach for that end.
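As an illustration of the AIC route, base R's ar() chooses the order by minimising the AIC by default (the simulated coefficients below are arbitrary):

```r
set.seed(2)
# simulate an AR(2) process and let ar() pick the order via the AIC
x <- arima.sim(n = 500, model = list(ar = c(0.6, -0.3)))
fit <- ar(x, order.max = 10)   # aic = TRUE is the default
fit$order                      # typically recovers the true order, 2
```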
I complemented this answer with a small simulation exercise to assess how serious these consequences are when carrying out an $F$-test in an AR model. Surprisingly, I found that for a sample size of 100 observations the standard test statistic (in its asymptotic $\chi^2$ version) performs fairly well, with an empirical level of 0.047 for a nominal size of 5%.
For different reasons, the consequences of both points above are related to those caused by small samples. This makes me think that bootstrapping can be a useful strategy to cope with both issues. Here, I extend the simulation exercise by bootstrapping the test statistic. Although in this small exercise there is not much room for improvement, I would at least like to check whether the bootstrap works as expected.
The R code below requires the two code chunks shown here.
set.seed(125)
pvals <- rep(NA, niter)
breps <- 400  # number of bootstrap replicates

for (i in seq_len(niter))
{
  # results stored in the first two code chunks from the link
  # given above can be reused here;
  # compute the bootstrapped p-value:
  # generate data from the model under the null hypothesis
  # using the resampled residuals as the innovations
  counter <- 0
  for (j in seq_len(breps))
  {
    xb <- arima.sim(n = n, model = list(ar = coefs[i,]),
      innov = sample(resids[,i], size = n, replace = TRUE))
    fit1 <- lm(xb[5:n] ~ xb[4:(n-1)] + xb[3:(n-2)])
    fit2 <- lm(xb[5:n] ~ xb[4:(n-1)] + xb[3:(n-2)] + xb[2:(n-3)] + xb[1:(n-4)])
    Fb <- anova(fit1, fit2)$F[2]
    if (Fb > chisq.stats[i]/2)
      counter <- counter + 1
  }
  pvals[i] <- counter / breps
  # print tracing information
  print(c(i, sum(pvals[seq_len(i)] < 0.05) / i))
}
sum(pvals < 0.05) / niter
# [1] 0.046
We find that in this small exercise bootstrapping works well, yielding an empirical level equal to 0.046, which is close to the nominal level of 0.05. For this model, I found that the bootstrap becomes more reliable than the asymptotic test in samples as small as 20-30 observations.
Although these results do not reveal any major problem in the usage of the test, this statistic is probably not the best way to decide on the lags to be included in an AR model and, in general, in an ARMA model. One reason is that it is the whole AR polynomial that determines the underlying cycles captured by the model. All the AR terms may play a role even if they are not individually significant. This post shows an example where, after removing a non-significant term, the resulting model cannot capture some of the features of the data.