Detection of outliers in time series with R

This post is an abridged version of this document. Here, I will focus on the non-technical content and will reproduce some examples in order to introduce the purpose and usage of the tsoutliers package of R.

Time series data often undergo sudden changes that affect the dynamics of the data transitory or permanently. Sometimes, the practitioner may have some a priori information about the existence of these effects. A new regulation, strike periods, exceptional natural phenomena or a change in the methodology used to collect the data are some examples of events that may alter the overall dynamics of the data. In other cases, there is no such clear-cut information available.

Detecting and correcting the effect of outliers is important because they have an impact on the selection of the model, the estimation of parameters and, consequently, on forecasts and other results pursued by the analysis such as seasonal adjustment. In practice, we observe that many time series are affected by outliers and they are more common than intervention variables, especially when a large an heterogeneous set of time series is analysed, as it is the case of the analyses carried out by National Institutes of Statistics and Central Banks.

The specialised software for automatic detection of outliers in time series are X-12-ARIMA (developed by the US Census Bureau) and TRAMO (developed by the Bank of Spain).

A procedure to detect outliers in time series is available in the R package tsoutliers. Considering the number of years that X-12-ARIMA and TRAMO have been around, the first versions of tsoutliers are not intended to be used in production with large and heterogeneous data sets. The package can nonetheless be used in small applications in a semi automatic way, i.e., running the automatic procedure with different options.

A relevant role of the package tsoutliers is as a research tool. Compared to the previous software, tsoutliers enjoys some advantages. The design of the package allows the user to run the procedure step by step. Since it is written in R, it is relatively easier to modify or extend the baseline methodology.

As an example of enhancements, the package includes the seasonal type of outlier. As regards possible extensions, the package provides the grounds to extend the methodology described in the literature to the context of structural time series.

Next, I show the usage of the package with an application to the monthly Italian industrial production index in the period 1981:1 to 1996:12, gipi. These data are used in Kaiser and Maravall (1999).

Below, we load the data and take logarithms to the series. Calendar effects (by default trading day and Easter) are defined by means of the function calendar.effects. The ARIMA(0,1,1)(0,1,1) model is specified and the same critical value used in the reference paper is used for the detection of outliers, cval=3.5. The following types of outliers are considered: AO, additive outliers; LS, level shifts; TC, temporary changes and SLS, seasonal level shifts.

gipi <- log(bde9915$gipi)
ce <- calendar.effects(gipi)
res1 <- tso(y = gipi, xreg = ce, cval = 3.5, 
  types = c("AO", "LS", "TC", "SLS"), maxit = 1, 
  remove.method = "bottom-up", tsmethod = "arima", 
  args.tsmethod = list(order = c(0, 1, 1), 
    seasonal = list(order = c(0, 1, 1))))

The following result is obtained:

Series: gipi 

          ma1     sma1  trading-day   Easter    SLS28    AO44    
      -0.5756  -0.9265       0.0085  -0.0273  -0.0620  0.1068  
s.e.   0.0603   0.1176       0.0005   0.0080   0.0156  0.0213   
        SLS49   SLS92  SLS164
      -0.0606  0.0594  0.0704
s.e.   0.0121  0.0118  0.0141

sigma^2 estimated as 0.0005178:  log likelihood=388.13
AIC=-756.26   AICc=-754.95   BIC=-724.38

  type ind    time  coefhat  tstat
1  SLS  28 1983:04 -0.06198 -3.975
2   AO  44 1984:08  0.10679  5.022
3  SLS  49 1985:01 -0.06062 -4.994
4  SLS  92 1988:08  0.05945  5.024
5  SLS 164 1994:08  0.07043  4.995

Calendar effects are significant at the 5% significance level since the ratio of the estimated coefficients and their standard errors are large in absolute value (greater than 3). Several seasonal level shifts are found. The outliers detected in the reference paper cited before are AO 1984:08, AO 1987:01, SLS 1994:08 and SLS 1988:08. Two more SLS and one less AO are found with the options chosen above. It must noticed that here we specified the ARIMA model beforehand while in the reference paper a model selection procedure is used. Hence, despite the ARIMA(0,1,1)(0,1,1) is eventually chosen, different models may have been used at intermediate stages of the procedure.

Running the procedure along with a model selection strategy implemented in the forecast package the ARIMA(1,0,1)(2,1,1) is chosen and a SLS at 1985:01 is found.

res2 <- tso(y = gipi, xreg = ce, 
  types = c("AO", "LS", "TC", "SLS"), 
  maxit = 1, remove.method = "bottom-up", tsmethod = "auto.arima", 
  args.tsmethod = list(allowdrift = FALSE, ic = "bic"))

yields (for brevity only the selected model and the set of outliers is printed here):

  type ind    time  coefhat  tstat
1  SLS  49 1985:01 -0.06517 -3.664

The method used to remove outliers from a potential set outliers detected in a previous stage may have an effect on the results. Although not reported here, using remove.method="en-masse", the model ARIMA(0,1,1)(0,0,2) was chosen and a SLS at 1988:08 was detected.

Changing the method to discard outliers leads also to a different set of outliers with the ARIMA(0,1,1)(0,1,1) model.

res3 <- tsoutliers(y = gipi, xreg = ce, cval = 3.5, 
  types = c("AO", "LS", "TC", "SLS"), 
  maxit = 1, remove.method = "en-masse", tsmethod = "arima", 
  args.tsmethod = list(order = c(0, 1, 1), 
    seasonal = list(order = c(0, 1, 1))))
  type ind    time  coefhat  tstat
1   AO  44 1984:08  0.08984  4.010
2  SLS  49 1985:01 -0.06376 -3.592
3   AO 176 1995:08  0.09631  4.080

In the next example the number of seasonal differences is fixed to one, D=1. Despite the same model as in res2 shown above is chosen and the same method to remove outliers is used, here two additional SLS are found. The reason for that is that although the final model is the same in both cases, different models may probably have been considered during the process, in particular a different order of integration may have been used, leading to a different set of outliers obtained at each step.

res4 <- tso(y = gipi, xreg = ce, 
  types = c("AO", "LS", "TC", "SLS"), 
  maxit = 1, remove.method = "bottom-up", tsmethod = "auto.arima", 
  args.tsmethod = list(allowdrift = FALSE, D = 1, ic = "bic"))
  type ind    time  coefhat  tstat
1  SLS  49 1985:01 -0.06002 -4.500
2  SLS  92 1988:08  0.04349  3.668
3  SLS 176 1995:08  0.07729  4.092

Despite using different options we found some differences in the results, a SLS at 1985:01 was found in the four cases discussed above. This outlier seems to be a robust result. However, this outlier was not identified in the reference paper. The SLS at 1988:08 that they find is detected in some of the examples shown above. Our examples may differ from the application shown in the reference paper in some issues such as the possible inclusion of a regressor variable defining national holidays or other details not mentioned in the reference paper.

In these examples we have seen the flexibility of the tsoutliers interface to set and choose different options that may be relevant in practice. The examples suggested also that pursuing a fully automatic procedure may become too cumbersome. Therefore, alternative manual runs of the automatic procedure are advisable when possible.

This entry was posted in outliers, R contributed package, time series and tagged . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *