Under the framework of the classical linear regression model, the ordinary least squares (OLS) estimator has good properties: it is unbiased, it has minimum variance among linear unbiased estimators, and the usual test statistics follow known distributions.

An autoregressive model is a useful case for discussing the meaning and consequences of violating some of the classical assumptions, and the remedies that can be considered. A stationary autoregressive model of order $$p$$, AR(p), is defined as follows:

$$y_t = \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \epsilon_t \,, \quad \epsilon_t \sim NID(0, \sigma^2) \,,$$

where the roots of the autoregressive polynomial lie outside the unit circle. This model departs from the classical framework in two respects:

- Explanatory variables or predictors are not deterministic or fixed: as the predictors are lags of the dependent variable $$y_t$$, they are clearly stochastic variables that depend on past values of the random variable $$\epsilon_t$$.
- Collinearity between the predictors: as they follow the same model, they are likely to be correlated with each other.
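
Both points can be seen in a short simulation. The sketch below is only illustrative: the AR(2) coefficients (0.6 and 0.2), the sample size and the seed are my own assumptions, not values from the exercise discussed later. For these coefficients the theoretical first-order autocorrelation is $$\phi_1 / (1 - \phi_2) = 0.75$$, so the two lagged predictors should be strongly correlated.

```r
# Illustrative sketch; coefficients, sample size and seed are assumptions
set.seed(1)
n <- 500
x <- arima.sim(n = n, model = list(ar = c(0.6, 0.2)))

# response and the two lagged predictors of an AR(2) regression
y  <- x[3:n]
x1 <- x[2:(n-1)]  # first lag
x2 <- x[1:(n-2)]  # second lag

# the predictors are lags of the same series, hence correlated
lag_cor <- cor(x1, x2)
print(lag_cor)

# OLS estimates of the intercept and the two AR coefficients
fit <- lm(y ~ x1 + x2)
print(coef(fit))
```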

The first issue requires investigating the relation between the error term in the model, $$\epsilon_t$$, and the stochastic predictors, which in this model are driven by past values of that same error process. As discussed in my answer to this question, the predictors and the error term are not independent. However, as they are contemporaneously uncorrelated, the properties of the OLS estimator and the distribution of the usual test statistics are valid asymptotically, i.e., for large samples.
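
The asymptotic nature of this result can be illustrated with a rough Monte Carlo sketch. The AR(1) coefficient (0.5), the two sample sizes, the number of replications and the seed are assumptions chosen for illustration only; the helper `ols_ar1` is a hypothetical name. The idea is to compare the average OLS estimate with the true coefficient in a small and in a large sample.

```r
# Illustrative sketch; phi, sample sizes, replications and seed are assumptions
set.seed(4)
phi <- 0.5

# simulate an AR(1) series of length n and return the OLS slope estimate
ols_ar1 <- function(n) {
  x <- arima.sim(n = n, model = list(ar = phi))
  unname(coef(lm(x[2:n] ~ x[1:(n-1)]))[2])
}

# average estimation error over 1000 replications for each sample size
bias_small <- mean(replicate(1000, ols_ar1(20))) - phi
bias_large <- mean(replicate(1000, ols_ar1(500))) - phi
print(c(n20 = bias_small, n500 = bias_large))
```

The bias in the small sample should be visibly larger in absolute value than in the large sample, where it should be close to zero.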

The consequence of the second issue is that standard errors may be larger than when the predictors are unrelated. Some intuition for this is that if the predictors are related to each other, they contain some redundant information and it is difficult to discern to which variable the influence of this common component should be attributed. This reduces the precision of the estimates and increases the standard errors.
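
One standard way to make this intuition precise, for a regression that includes an intercept and conditional on the observed predictors, is the variance inflation expression. The notation is mine: $$R_j^2$$ is the coefficient of determination from regressing the $$j$$-th predictor on the remaining predictors, and $$x_{jt}$$ is the value of that predictor at time $$t$$.

$$\operatorname{Var}(\hat\beta_j) = \frac{\sigma^2}{(1 - R_j^2) \sum_t (x_{jt} - \bar{x}_j)^2} \,.$$

When the predictors are unrelated, $$R_j^2$$ is close to zero; as $$R_j^2$$ approaches one the denominator shrinks and the variance of $$\hat\beta_j$$ grows.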

The above are some of the reasons that can be argued against using $$t$$-tests or $$F$$-tests for the selection of the lag order, $$p$$, in an AR model. As discussed in this post by Rob J. Hyndman, other tools such as the AIC are generally a better approach to that end.
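
As a minimal sketch of AIC-based order selection, the function `ar` in base R can fit AR models up to a maximum order and keep the one with the lowest AIC. The simulated AR(2) series, the maximum order of 6 and the seed are assumptions made only for this illustration.

```r
# Illustrative sketch; the simulated series, order.max and seed are assumptions
set.seed(2)
x <- arima.sim(n = 200, model = list(ar = c(0.6, 0.2)))

# fit AR models of order 0 to 6 by OLS and keep the order with the lowest AIC
fit <- ar(x, order.max = 6, aic = TRUE, method = "ols")
print(fit$order)

# difference between the AIC of each candidate order and the minimum AIC
print(fit$aic)
```

The element `fit$aic` reports differences with respect to the best model, so the selected order is the one with a value of zero.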

I complemented this answer with a small simulation exercise to assess how serious these consequences are when carrying out an $$F$$-test in an AR model. Surprisingly, I found that for a sample size of 100 observations the standard $$\chi^2$$-test statistic (the asymptotic $$F$$-test) performs fairly well, with an empirical level of 0.047 for a nominal size of 5%.
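
For reference, a single test of this kind could look like the sketch below. It is not the code of the original simulation: the data-generating AR(2) model and the seed are assumptions, and I take the null hypothesis to be an AR(2) against an AR(4) alternative, so that there are two restrictions and the asymptotic $$\chi^2$$ statistic is twice the $$F$$ statistic.

```r
# Illustrative sketch; the data-generating AR(2) model and seed are assumptions
set.seed(3)
n <- 100
x <- arima.sim(n = n, model = list(ar = c(0.6, 0.2)))

# restricted model: AR(2); unrestricted model: AR(4), both on the same sample
fit1 <- lm(x[5:n] ~ x[4:(n-1)] + x[3:(n-2)])
fit2 <- lm(x[5:n] ~ x[4:(n-1)] + x[3:(n-2)] + x[2:(n-3)] + x[1:(n-4)])

# F statistic for the null that the coefficients on lags 3 and 4 are zero
Fstat <- anova(fit1, fit2)$F[2]

# asymptotic chi-square version: q * F, with q = 2 restrictions
q <- 2
chisq.stat <- q * Fstat
pval <- pchisq(chisq.stat, df = q, lower.tail = FALSE)
print(c(F = Fstat, chisq = chisq.stat, p.value = pval))
```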

For different reasons, the consequences of both points above are related to those caused by small samples. This makes me think that bootstrapping can be a useful strategy to cope with both issues. Here, I extend the simulation exercise by bootstrapping the $$F$$-test statistic. Although there is not much room for improvement in this small exercise, I would at least like to check whether the bootstrap works as expected.

The R code below requires the two code chunks shown here.

    set.seed(125)
    pvals <- rep(NA, niter)
    breps <- 400 # number of bootstrap replicates
    for (i in seq_len(niter))
    {
      # results stored in the first two code chunks from the link
      # given above can be reused here;

      # computed bootstrapped p-value:
      # generate data from the model under the null hypothesis
      # using the resampled residuals as the innovations
      counter <- 0
      for (j in seq_len(breps))
      {
        xb <- arima.sim(n = n, model = list(ar = coefs[i,]),
          innov = sample(resids[,i], size = n, replace = TRUE))
        fit1 <- lm(xb[5:n] ~ xb[4:(n-1)] + xb[3:(n-2)])
        fit2 <- lm(xb[5:n] ~ xb[4:(n-1)] + xb[3:(n-2)] + xb[2:(n-3)] + xb[1:(n-4)])
        Fb <- anova(fit1, fit2)$F[2]
        if (Fb > chisq.stats[i]/2)
          counter <- counter + 1
      }
      pvals[i] <- counter / breps
      # print tracing information
      print(c(i, sum(pvals[seq_len(i)] < 0.05) / i))
    }
    sum(pvals < 0.05) / niter
    # [1] 0.046

We find that in this small exercise bootstrapping works well, yielding an empirical level equal to 0.046, which is close to the nominal level 0.05. For this model I found that the bootstrap becomes more reliable than the $$F$$-test in samples as small as 20-30 observations.

Although these results do not reveal any major problem in the use of the $$F$$-test, this statistic is probably not the best way to decide on the lags to be included in an AR model and, in general, in an ARMA model. One reason is that it is the whole AR polynomial that determines the underlying cycles captured by the model. All the AR terms may play a role even if they are not individually significant. This post shows an example where, after removing a non-significant term, the resulting model cannot capture some of the features of the data.