The Curious Behavior of diffseries()

This is the story of a subtle error that, to my opinion, is a nice example of the special challenges of statistical programming. One of my main research interests is time series with long memory. These are often modeled by fractionally integrated models, where

(1-L)^d X_t=\varepsilon_t.

Here X_t is the time series, d is between zero and one, L is the lag operator defined so that LX_t=X_{t-1} and \varepsilon_t\sim N(0,\sigma^2). Details on fractional differences can be found on Wikipedia or in the original paper of Hosking (1981). Essentially the fractional differencing filter defines an infinite order autoregressive model where the coefficients are a function of the memory parameter d.

To generate a fractionally integrated series, we can bring the fractional differencing filter to the other side:

X_t=(1-L)^{-d}\varepsilon_t=\sum_{i=0}^\infty \theta_i \varepsilon_{t-i}.

An important special case of this is the random walk model where d=1. In this case \theta_i=1 for all i and it is usually assumed that $\varepsilon_t=0$ for all t<0. Then the model reduces to

X_t=X_{t-1}+\varepsilon_t=\sum_{i=0}^t \varepsilon_{t-i}.

Using the same stream of random number \varepsilon_t, it should therefore be possible to generate the exact same random walk in R using cumsum() from the base package as well as diffseries() from the package fracdiff with d=-1. But, as a former colleague demonstrated to me a while ago, this is not the case.

rm(list=ls())
set.seed(54321)
library(fracdiff)

series<-rnorm(1000)
a<-diffseries(series,d=-1)
b<-cumsum(series)

ts.plot(a, ylim=c(min(a,b),max(a,b)))
lines(b, col=2)
legend(x="bottomleft", col=c(1,2), lty=c(1,1), bty="n", legend=c("RW generated with cumsum()", "RW generated with diffseries()"))

fracdiffplot1

As one can see from the graph above, the two series a and b that should be identical diverge faster from each other the longer the series becomes.

Recently I was reminded of this curious behavior when I was trying to implement some new statistical procedures for a research paper and they refused to work until I replaced diffseries() with the function fast_fracdiff(), that was proposed as a faster alternative for fractional differencing by Jensen and Nielsen (2014) and makes use of the convolution theorem.

fast_fracdiff <- function(x, d){
  iT <- length(x)
  np2 <- nextn(2*iT - 1, 2)
  k <- 1:(iT-1)
  b <- c(1, cumprod((k - d - 1)/k))
  dx <- fft(fft(c(b, rep(0, np2 - iT))) * fft(c(x, rep(0, np2 - iT))), inverse = T) / np2;
  return(Re(dx[1:iT]))
}
c<-fast_fracdiff(series, d=-1)

ts.plot(a, ylim=c(min(a,b),max(a,b)))
lines(b, col=2)
lines(c, col=3)
legend(x="bottomleft", col=c(1,2,3), lty=c(1,1,1), bty="n", legend=c("RW generated with cumsum()", "RW generated with diffseries()", "RW generated with fast_fracdiff()"))

fracdiffplot2

As one can see, using fast_fracdiff() which can be found on the university webpage of Morten Nielsen with d=-1 produces exactly the same series as cumsum(), as one would have expected from the beginning. The green lines representing the random walk generated using fast_fracdiff() lies directly above the red one that represents the series obtained using cumsum(), so that the latter is not visible. Now why does diffseries() behave differently? Lets have a look at the function.

diffseries <- function(x, d)
{
  x <- as.data.frame(x)
  names(x) <- "series"
  x  1)
    stop("only implemented for univariate time series")
  if (any(is.na(x)))
    stop("NAs in x")
  n = 2)
  x <- x - mean(x)
  PI <- numeric(n)
  PI[1] <- -d
  for (k in 2:n) {
    PI[k] <- PI[k-1]*(k - 1 - d)/k
  }
  ydiff <- x
  for (i in 2:n) {
    ydiff[i] <- x[i] + sum(PI[1:(i-1)]*x[(i-1):1])
  }
  ## return numeric!
  ydiff
}

As one can see above, in the line x<-x-mean(x) the inpust series is de-meaned prior to the fractional differentiation. This is because in a model with non-zero means, we have

(1-L)^{d}(X_t-\mu)=\varepsilon_t,

where \mu is the expected value of X_t. However, this produces some unwanted behavior, since the fractionally differenced series returned by diffseries() always has a mean of zero.

x<-fracdiff.sim(n=1000, d=0.4)$series+10
mean(x)
## [1] 9.702761
y<-diffseries(x,0.4)
z<-fast_fracdiff(x,0.4)
mean(y)
## [1] -0.004302338
mean(z)
## [1] 0.6807775

But what causes the behavior in the first graph above? Why do the series drift apart? Here the input series is the series of innovations \varepsilon_t. Since these are standard normal, we have \bar \varepsilon_t\sim N(0,1/T). That means \bar \varepsilon_t has a standard deviation of 1/\sqrt{T}.

If we use diffseries() to integrate the white noise sequence, these are demeaned before they are integrated.

X_t=(1-L)^{-d} \varepsilon_t-\bar \varepsilon_t =\sum_{i=0}^t( \varepsilon_{t-i} -\bar \varepsilon_t)=-t \bar \varepsilon_t +\sum_{i=0}^t \varepsilon_{t-i}.

As you can see from the last equation, this produces a random walk with drift, where the drift parameter is given by the mean of the of the innovations.

What do we lean from this? The de-meaning in diffseries() will probably not cause problems in most use cases. However, I certainly think it is problematic since the expected value \mu in the formula above is the expected value after fractional differentiation. Therefore using fast_fracdiff() instead of diffseries() seems to be advantageous beyond the speed gains.

But there is another point to make here. It is obvious that diffseries() is written with the differentiation in mind. Using it to integrate – even though it should be theoretically possible – goes beyond the use cases that the developers had in mind. Errors caused by using the function this way are hard to spot, since the generated series is still a random walk and the slight drift is hard to spot unless it is compared with the cumsum() function. They could go unnoticed for years. This highlights very nicely the extra degree of care that is necessary for statistical programming.

Refrences

Hosking, J. R. (1981). Fractional differencing. Biometrika, 165-176.

Jensen, A. N., & Nielsen, M. Ø. (2014). A fast fractional difference algorithm. Journal of Time Series Analysis, 35(5), 428-436.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s