The comparison of an autoregressive model with exogenous regressors and the linear regression model is a recurrent question at Cross Validated. The question often arises when an autoregressive model with exogenous variables is fitted as a linear regression model with lags of the dependent variable. Some of these questions are this, this, this or this. In this post, I discuss how these approaches differ.
Linear regression model with lagged dependent variables
$$y_t = \beta_0 + \beta_1 x_{1,t} + \cdots + \beta_k x_{k,t} + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \epsilon_t \,, \quad \epsilon_t \sim NID(0, \sigma^2) \,.$$
- The coefficient $$\beta_1$$ measures how the dependent variable $$y_t$$ changes when there is a unit change in $$x_1$$, holding the other regressors and the lagged values of $$y_t$$ fixed.
- The role of the lagged dependent variables is usually to whiten the residuals, i.e. remove serial correlation in the disturbance term in order to gain efficiency in the Ordinary Least Squares estimates. This is for example used in the so-called augmented Dickey-Fuller regression or the HEGY regression.
- $$\beta_0$$ is an intercept, the expected value of $$y_t$$ when the regressors and the lagged values of the dependent variable are zero.
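As a minimal sketch of this setup, assuming a single regressor, one lag ($$p = 1$$) and hypothetical parameter values, the model can be simulated and fitted by ordinary least squares with NumPy. The variable names and the simulated data are illustrative only.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, phi1 = 1.0, 2.0, 0.5
n = 2000

rng = np.random.default_rng(0)
x = rng.normal(size=n)    # exogenous regressor x_{1,t}
eps = rng.normal(size=n)  # disturbance term, NID(0, 1)

# Simulate y_t = beta0 + beta1 * x_t + phi1 * y_{t-1} + eps_t
y = np.zeros(n)
for t in range(1, n):
    y[t] = beta0 + beta1 * x[t] + phi1 * y[t - 1] + eps[t]

# Design matrix: constant, x_t and the lagged dependent variable y_{t-1}
X = np.column_stack([np.ones(n - 1), x[1:], y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
b0_hat, b1_hat, phi1_hat = coef

print("intercept:", b0_hat, "beta1:", b1_hat, "phi1:", phi1_hat)
```

The lagged dependent variable enters the design matrix as just another column, which is why this model can be fitted with any linear regression routine.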
ARIMA time series model with exogenous regressors
Although, at first glance, the above formulation could be understood as an AR(p) model with regressors, this model is actually specified as follows (and, to my knowledge, this is how it is implemented in software packages):
$$(y_t - \beta_0 - \beta_1 x_{1,t}) = \sum_{i=1}^p \phi_i (y_{t-i} - \beta_0 - \beta_1 x_{1,t-i}) + \epsilon_t \,, \quad \epsilon_t \sim NID(0, \sigma^2) \,.$$
To save space, I included only one regressor, $$x_1$$. A model with an MA component can be viewed as a regression model as well (Hannan and Rissanen procedure), but in order to simplify things here, I did not include an MA term.
- The role of the lagged variables (and of the moving average terms of a general ARMA model) is to capture the overall dynamics observed in the data, as seen, for example, in the autocorrelation function.
- In the absence of exogenous regressors, $$\beta_0$$ is not an intercept as in the regression model above, it is instead the mean of $$y_t$$. With explanatory variables, the mean of the series is not constant and changes with $$x_t$$.
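To see how the two specifications differ, it may help to expand the time series model for $$p = 1$$:
$$y_t = \beta_0 (1 - \phi_1) + \beta_1 x_{1,t} - \phi_1 \beta_1 x_{1,t-1} + \phi_1 y_{t-1} + \epsilon_t \,.$$
Compared with the linear regression model with a lagged dependent variable, the lagged regressor $$x_{1,t-1}$$ now appears with the restricted coefficient $$-\phi_1 \beta_1$$, and the constant is $$\beta_0 (1 - \phi_1)$$ rather than $$\beta_0$$. The sketch below checks this numerically under assumed parameter values: it simulates a regression with AR(1) errors and fits the unrestricted linear regression on a constant, $$x_{1,t}$$, $$x_{1,t-1}$$ and $$y_{t-1}$$. It uses plain NumPy rather than any ARIMA routine, so it only illustrates the algebra.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, phi1 = 1.0, 2.0, 0.5
n = 2000

rng = np.random.default_rng(0)
x = rng.normal(size=n)
eps = rng.normal(size=n)

# AR(1) process for the deviations u_t = y_t - beta0 - beta1 * x_t
u = np.zeros(n)
for t in range(1, n):
    u[t] = phi1 * u[t - 1] + eps[t]
y = beta0 + beta1 * x + u

# Unrestricted regression of y_t on a constant, x_t, x_{t-1} and y_{t-1}
X = np.column_stack([np.ones(n - 1), x[1:], x[:-1], y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
const_hat, bx_hat, bxlag_hat, phi1_hat = coef

print("constant   :", const_hat, "expected about", beta0 * (1 - phi1))
print("x_t        :", bx_hat, "expected about", beta1)
print("x_{t-1}    :", bxlag_hat, "expected about", -phi1 * beta1)
print("y_{t-1}    :", phi1_hat, "expected about", phi1)
```

Omitting $$x_{1,t-1}$$ from the regression, as the first model does, therefore fits a different model rather than the same one in another parameterization.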
In practice, the explanatory variables $$x_i$$ are usually observed variables, e.g. age, temperature, gross domestic product,… that are deemed helpful to explain or predict $$y_t$$. Sometimes we are interested in measuring the effect of some particular event such as a marketing campaign, a policy change,… In these cases, we know the time of the event but its effect on the dependent variable is not directly observed. A time series model with an intervention, $$I_t$$, can be employed in this case.
ARMA model with intervention
$$y_t = \frac{\theta(L)}{\phi(L)} \epsilon_t + \beta I_t \,, \quad \epsilon_t \sim NID(0, \sigma^2) \,,$$
where $$\phi(L)$$ and $$\theta(L)$$ are polynomials in the lag operator $$L$$ for the AR and MA parts of the model, respectively. The intervention $$I_t$$ is an indicator or dummy variable representing a unit impulse, level shift or transitory change.
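The three kinds of intervention variable can be built as simple arrays. The following is a sketch only: the helper name `intervention`, the zero-based time index `t0`, and the geometric decay rate `delta` used for the transitory change (with its default value) are assumptions of this example, not part of the model definition above.

```python
import numpy as np

def intervention(n, t0, kind, delta=0.7):
    """Return the regressor I_t (length n) for an event at index t0.

    kind: "impulse", "level_shift" or "transitory_change".
    delta: assumed geometric decay rate of the transitory change.
    """
    I = np.zeros(n)
    if kind == "impulse":
        I[t0] = 1.0  # effect at time t0 only
    elif kind == "level_shift":
        I[t0:] = 1.0  # permanent effect from t0 onwards
    elif kind == "transitory_change":
        I[t0:] = delta ** np.arange(n - t0)  # effect that dies out
    else:
        raise ValueError(f"unknown intervention type: {kind}")
    return I

print(intervention(6, 2, "impulse"))
print(intervention(6, 2, "level_shift"))
print(intervention(6, 2, "transitory_change", delta=0.5))
```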
Premultiplying both sides of the previous equation by $$\hat{\pi}(L)$$, we obtain a series of residuals, $$\hat{e}$$, which can be used to measure the effect of the intervention:
$$\hat{\pi}(L) y_t \equiv \hat{e} = \hat{\pi}(L) \beta I_t + \xi_t \,, \quad \hbox{with} \quad \hat{\pi}(L)=\frac{\hat{\phi}(L)}{\hat{\theta}(L)} \,,$$
where $$\hat{\pi}(L)$$ and $$\hat{e}$$ are, respectively, the estimate of $$\pi(L)$$ and the residuals from the ARMA model fitted to the original series in a previous step. For further details, see for example this document that describes Chen and Liu’s method for the detection of outliers (which can be regarded as a kind of intervention at unknown times) in time series.
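The two-step idea can be sketched as follows, assuming a pure AR(1) model (so that $$\theta(L) = 1$$ and $$\pi(L) = 1 - \phi_1 L$$), a level shift at a known time and hypothetical parameter values. It only illustrates the regression of the filtered series on the filtered intervention; it is not the full Chen and Liu procedure.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
phi1, beta, sigma = 0.5, 1.0, 1.0
n, t0 = 2000, 1000

rng = np.random.default_rng(1)
eps = rng.normal(scale=sigma, size=n)

# AR(1) noise: u_t = eps_t / (1 - phi1 * L)
u = np.zeros(n)
for t in range(1, n):
    u[t] = phi1 * u[t - 1] + eps[t]

# Level shift intervention at a known time t0
I = np.zeros(n)
I[t0:] = 1.0
y = u + beta * I

# Step 1: fit an AR(1) model (no constant) to the original series
phi1_hat = (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])

# Step 2: apply the filter pi_hat(L) = 1 - phi1_hat * L to y_t and to I_t
e_hat = y[1:] - phi1_hat * y[:-1]
x_filt = I[1:] - phi1_hat * I[:-1]

# Step 3: regress the residuals on the filtered intervention (no constant)
beta_hat = (x_filt @ e_hat) / (x_filt @ x_filt)

print("phi1_hat:", phi1_hat, "beta_hat:", beta_hat)
```

Because the AR coefficient in the first step is estimated from the series that still contains the level shift, `phi1_hat` tends to overstate the true autocorrelation in this setup, which is one reason such procedures are typically iterated.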
Dynamic regression models resemble the models introduced above, since they involve lags of the dependent variable and also lags of the explanatory variables. These models are often motivated as a representation of delays or adjustments that may be expected in practice between the dependent and explanatory variables. For example, in economics it can be argued that there is some delay between changes in interest rates and changes in consumption or investment. Changes in interest rates or other economic variables may not be perceived by economic agents at the time they occur, or adjusting behaviour to those changes may be costly or require some time. A discussion about these models would take us far from our starting point in this post. For some details see for example my comments here.