Visualizing MonteCarlo Simulation Results: Mean vs Median

Simulation studies are used in a wide range of areas, from risk management to epidemiology and, of course, statistics. The MonteCarlo package provides tools to automate the design of this kind of simulation study in R. The user only has to specify the random experiment to be conducted and the number of replications. The rest is handled by the package.

So far, the main tool for analyzing the results was to look at LaTeX tables generated using the MakeTable() function. Now, the new package version 1.0.5 contains the function MakeFrame(), which represents the simulation results in the form of a dataframe. This makes it very easy to visualize the results using standard tools such as dplyr and ggplot2.

Here, I will demonstrate some of these concepts for a simple example that could be part of an introductory statistics course: the comparison of the mean and the median as estimators for the expected value. For an introduction to the MonteCarlo package, see the introductory post or consult the package vignette.

For a symmetric distribution, both the mean and the median are consistent estimators of the expected value. For normally distributed data, however, the mean is the maximum likelihood estimator and is therefore more efficient. On the other hand, it is usually stressed that, in contrast to the mean, the median is not sensitive to outliers.

To demonstrate this, and to explore the relative magnitude of these effects depending on the sample size, we formulate a suitable random experiment.

library(MonteCarlo)
set.seed(123)

mean_vs_median<-function(n, out){

  # generate sample
  sample<-rnorm(n, 0, 1)

  # add outlier
  sample[1]<-sample[1]+out

  # calculate estimators
  mean_sample<-mean(sample)
  median_sample<-median(sample)

  # return results
  return(list("mean"=mean_sample, "median"=median_sample))
}


The mean_vs_median() function generates a sample of n standard normal distributed random variables and contaminates the first observation with a deterministic outlier of size out. We then calculate both the mean and the median and return them as elements of a list. Note that the function has two arguments – the sample size n and the size of the outlier out.
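
To see what a simulation package takes off our hands, it helps to look at what the same study would require by hand. The following is only a minimal base-R sketch under my own assumptions (the names grid, res and manual_summary are made up for illustration, and this is not how the MonteCarlo package is implemented internally): loop over every combination of n and out, repeat the experiment, and store the estimates in an array.

```r
# Same experiment as above, repeated so that this block runs on its own.
mean_vs_median <- function(n, out) {
  x <- rnorm(n, 0, 1)
  x[1] <- x[1] + out
  list(mean = mean(x), median = median(x))
}

set.seed(123)
nrep <- 1000

# One row per combination of sample size and outlier size (hypothetical helper).
grid <- expand.grid(n = c(5, 25, 250), out = c(0, 3, 5))

# Storage: parameter combinations x estimators x repetitions.
res <- array(NA_real_, dim = c(nrow(grid), 2, nrep),
             dimnames = list(NULL, c("mean", "median"), NULL))

for (g in seq_len(nrow(grid))) {
  for (r in seq_len(nrep)) {
    res[g, , r] <- unlist(mean_vs_median(grid$n[g], grid$out[g]))
  }
}

# Average estimate per parameter combination.
manual_summary <- cbind(grid, apply(res, c(1, 2), mean))
manual_summary
```

Even for this small example the bookkeeping (grid, storage array, nested loops) is longer than the experiment itself, and it would still have to be extended for parallelization.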

To analyze the effect of these two parameters on the relative performance of the two estimators, we use the MonteCarlo package. All we have to do is pass the mean_vs_median function to the MonteCarlo() function and specify the parameter values we are interested in. In the example below, we look at outliers of size 0, 3, or 5 in samples of size 5, 25, or 250, and the experiment is repeated 1000 times.

erg_mean_median<-MonteCarlo(func=mean_vs_median, nrep=1000, param_list=list("n"=c(5, 25, 250), "out"=c(0, 3, 5)))
summary(erg_mean_median)
## Simulation of function:
##
## function(n, out){
##
##   # generate sample
##   sample<-rnorm(n, 0, 1)
##
##   # add outlier
##   sample[1]<-sample[1]+out
##
##   # calculate estimators
##   mean_sample<-mean(sample)
##   median_sample<-median(sample)
##
##   # return results
##   return(list("mean"=mean_sample, "median"=median_sample))
## }
##
## Required time: 2 secs for nrep = 1000  repetitions on 1 CPUs
##
## Parameter grid:
##
##    n : 5 25 250
##  out : 0 3 10
##
##
## 2 output arrays of dimensions: 3 3 1000

This is all the programming that is required to run the simulation study. To be able to generate plots from the results we call the MakeFrame() function on the object returned by the simulation.

df<-MakeFrame(erg_mean_median)
head(df)
##     n out        mean      median
## 1   5   0  0.85512352  1.28055488
## 2  25   0 -0.19633118 -0.09100042
## 3 250   0 -0.07376411 -0.06749168
## 4   5   3  0.59734761  0.59949205
## 5  25   3  0.13122706  0.28618942
## 6 250   3  0.07257960  0.12357383

As one can see, the result is a dataframe. Each row contains information on a single repetition of the random experiment. The first two columns contain the values of the parameters and the other two columns contain the estimates of the mean and the median.

To manipulate the dataframe and to make plots, we load the tidyverse package and convert the dataframe to a tibble.

library(tidyverse)
tbl <- as_tibble(df)

To compare the efficiency of the estimates in the absence of an outlier, we focus on the cases where out=0 and then compare density estimates of the distribution of the estimators for different sample sizes. Each sample size gets a different color, and the mean and the median can be distinguished by their linetypes.

ggplot(filter(tbl, out==0)) +
geom_density(aes(x=mean, col=factor(n), linetype="mean")) +
geom_density(aes(x=median, col=factor(n), linetype="median"))

Both the distribution of the mean and that of the median are centered around the true expected value of zero, which is what we expect from unbiased estimators. However, the distribution of the mean is more concentrated than that of the median. The mean has a smaller variance and is therefore more efficient.

This can also be seen if we calculate the corresponding summary statistics. Using the tibble and the dplyr package, this can be done with a single line of code.

tbl %>% filter(out==0) %>% group_by(n) %>% summarise(across(everything(), sd)) %>% round(digits=2)
## # A tibble: 3 x 4
##       n   out  mean median
##   <dbl> <dbl> <dbl>  <dbl>
## 1     5     0  0.44   0.52
## 2    25     0  0.19   0.24
## 3   250     0  0.06   0.08
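
These numbers can be compared with what theory predicts for standard normal data. The standard deviation of the sample mean is exactly 1/sqrt(n), while the standard deviation of the sample median is approximately sqrt(pi/(2n)), which is an asymptotic result. The following sketch just evaluates these two formulas; the object name theory is my own choice.

```r
# Theoretical benchmarks for standard normal data (sigma = 1).
# sd of the mean: exact. sd of the median: large-sample approximation.
n <- c(5, 25, 250)
theory <- data.frame(
  n = n,
  sd_mean = 1 / sqrt(n),
  sd_median_asymptotic = sqrt(pi / (2 * n))
)
round(theory, 2)
```

For n=25 and n=250 the formulas agree closely with the simulated values. For n=5 the asymptotic formula overstates the spread of the median, which is not surprising since it is a large-sample approximation.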

We now consider the effect of outliers on the two estimators. To do so, we generate a similar plot, but now we keep the sample size constant at n=25 and different colors represent outliers of different magnitudes.

ggplot(filter(tbl, n==25)) +
geom_density(aes(x=mean, col=factor(out), linetype="mean")) +
geom_density(aes(x=median, col=factor(out), linetype="median"))

The dashed lines representing the distribution of the medians are clearly centered around the true mean of zero, irrespective of the size of the outlier. The distribution of the means, on the other hand, shifts further to the right the larger the outlier becomes. This shows that the median is robust to outliers, whereas the mean is not.

Finally, we want to explore the interaction of the effect of the size of the outlier with the sample size. We therefore focus on the mean. Different colors now represent different sizes of the outlier and different linetypes represent different sample sizes.

ggplot(tbl) +
geom_density(aes(x=mean, col=factor(out), linetype=factor(n)))


For an outlier of a given size, we can observe that its impact decreases as the sample size increases. While the effect of the outlier can be quite dramatic for n=5, the effect basically vanishes for n=250. The sensitivity of the mean to outliers is therefore a finite sample property that is less important in larger samples.
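
The size of this effect is easy to work out: adding a deterministic outlier of size out to one of n observations shifts the expected value of the sample mean by exactly out/n. A small sketch that tabulates this shift for the parameter grid used here (the name bias is my own):

```r
# Expected shift of the sample mean caused by one outlier: out / n.
n_grid <- c(5, 25, 250)
out_grid <- c(0, 3, 5)

bias <- outer(out_grid, n_grid, "/")
dimnames(bias) <- list(out = as.character(out_grid), n = as.character(n_grid))
bias
```

Rows correspond to the outlier size and columns to the sample size, so reading along a row shows how quickly the shift shrinks as n grows.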

As can be seen from this example, conducting simulation studies requires minimal effort if a package such as MonteCarlo or one of its competitors such as simsalapar or SimDesign is used. The programming required to produce this analysis should be simple enough that simulations are not restricted to being a research tool, but can even be used for teaching at an undergraduate level.

Introducing the MonteCarlo Package

My first R package has been released on CRAN recently. It is named MonteCarlo and aims to make simulation studies as easy as possible – including parallelization and the generation of tables.

What Are Simulation Studies Good For?

Monte Carlo simulations are an essential tool in statistics and related disciplines. They are routinely used to determine distributional properties, where no analytical results are available. A typical example is to study the finite sample properties of a new statistical procedure. Finite sample properties are often unknown, since theoretical results usually rely on asymptotic theory that only applies to infinite sample sizes. This may be a good approximation in practice if the sample size is large, but it may also be bad if the sample is small or if the rate of convergence is very slow.

Simulation studies are also extremely useful to compare the performance of alternative procedures for the same task. Imagine you want to test whether your sample comes from a Gaussian distribution and you have to decide whether to use the Jarque-Bera test or the Kolmogorov Smirnov test. Both appear to be legitimate choices. A simulation study that is tailored so that it reflects the situation at hand might uncover that one of the procedures is much more powerful than the other.

Finally, small scale simulation studies are an essential tool for statistical programming. Testing is an essential part of programming and software development. Usually, if one writes a function, it is good practice to let it run on some test cases. This is easy if the outcome is deterministic. But if your function is a statistical test or an estimator, the output depends on the sample and is stochastic. Therefore, the only way to test whether the implementation is correct is to generate a large number of samples and to see whether it has the expected statistical properties. For example, one might test whether the mean squared error of an estimator converges to zero as the sample size increases, or whether the test has the correct alpha error.
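
As a minimal sketch of such a check, assume the estimator under test is simply the sample mean of standard normal data, so that the true value is zero and the mean squared error should be close to 1/n. The helper mse_of_mean() is a made-up name for this illustration.

```r
set.seed(1)

# Simulated mean squared error of the sample mean for a given sample size.
# The true expected value is 0, so the squared estimate is the squared error.
mse_of_mean <- function(n, nrep = 2000) {
  estimates <- replicate(nrep, mean(rnorm(n)))
  mean(estimates^2)
}

n_grid <- c(10, 100, 1000)
mse <- sapply(n_grid, mse_of_mean)

# Compare the simulated MSE with the theoretical value 1/n.
cbind(n = n_grid, simulated = mse, theory = 1 / n_grid)
```

If the simulated values did not shrink roughly in line with 1/n, this would point to a bug in the estimator or in the simulation code.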

Therefore, writing Monte Carlo simulations is an everyday task in many areas of statistics. This comes with considerable effort. It is not unusual that the number of lines of code required to produce a simulation study is a multiple of that needed to implement the procedure of interest. As a consequence, simulations are also one of the main sources of errors. On top of this, the large computational effort often requires parallelization, which brings additional complications and programming effort. These efforts can often be prohibitive, especially for less advanced users.

The MonteCarlo Package

The MonteCarlo package streamlines the process described above. It allows the user to create simulation studies and to summarize their results in LaTeX tables quickly and easily.

There are only two main functions in the package:

1. MonteCarlo() runs a simulation study for a user defined parameter grid. It handles the generation of loops over this parameter grid and parallelizes the computation on a user specified number of CPU units.
2. MakeTable() creates LaTeX tables from the output of MonteCarlo(). It stacks high dimensional output arrays into tables with a user specified ordering of rows and columns.

To run a simulation study, the user has to nest both the generation of a sample and the calculation of the desired statistics from this sample in a single function. This function is passed to MonteCarlo(). No additional programming is required. This approach is not only very versatile, it is also very intuitive. The user formulates the experiment as if only a single draw were of interest.

The aim of this approach is to allow the user full control and flexibility with regard to the design of the Monte Carlo experiment, but to take away all the effort of setting up the technical part of the simulation.

A Simple Example: The t-test

Suppose we want to evaluate the performance of a standard t-test for the hypothesis that the mean is equal to zero. We are interested to see how the size and power of the test change with the sample size (n), the distance from the null hypothesis (loc for location) and the standard deviation of the distribution (scale). The sample is generated from a normal distribution.

To conduct this analysis, we proceed as follows. First, we load the MonteCarlo package.

library(MonteCarlo)


Then define the following function.

#########################################
##      Example: t-test

# Define function that generates data and applies the method of interest

ttest<-function(n,loc,scale){

# generate sample:
sample<-rnorm(n, loc, scale)

# calculate test statistic:
stat<-sqrt(n)*mean(sample)/sd(sample)

# get test decision:
decision<-abs(stat)>1.96

# return result:
return(list("decision"=decision))
}


As discussed above, ttest() is formulated in a way as if we only want to generate a single test decision. The arguments of the function are the parameters we are interested in. ttest() carries out 4 steps:

1. Draw a sample of n observations from a normal distribution with mean loc and standard deviation scale.
2. Calculate the t-statistic.
3. Determine the test decision.
4. Return the desired result in form of a list.
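
Steps 2 and 3 can be checked against R's built-in t.test(). The sketch below uses an arbitrary sample of my own choosing and shows that the hand-computed statistic coincides with the one reported by t.test() for the null hypothesis of a zero mean. Note that 1.96 is the critical value of the normal approximation, whereas t.test() works with the t-distribution with n-1 degrees of freedom.

```r
set.seed(42)
x <- rnorm(50, mean = 0.2, sd = 1)

# t-statistic computed by hand, as in the function above
stat_manual <- sqrt(length(x)) * mean(x) / sd(x)

# t-statistic from the built-in one-sample t-test of H0: mean = 0
stat_builtin <- unname(t.test(x, mu = 0)$statistic)

all.equal(stat_manual, stat_builtin)

# exact two-sided 5% critical value, for comparison with 1.96
crit_exact <- qt(0.975, df = length(x) - 1)
crit_exact
```

For the sample sizes considered here the difference between the two critical values is small, which is why the normal approximation is a reasonable shortcut.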

We then define the combinations of parameters that we are interested in and collect them in a list. The elements of the list must have the same names as the parameters for which we want to supply grids.

# define parameter grid:

n_grid<-c(50,100,250,500)
loc_grid<-seq(0,1,0.2)
scale_grid<-c(1,2)

# collect parameter grids in list:
param_list=list("n"=n_grid, "loc"=loc_grid, "scale"=scale_grid)


To run the simulation, the function ttest() and the parameter grid (param_list) are passed to MonteCarlo(), together with the desired number of Monte Carlo repetitions (nrep=1000).

# run simulation:

MC_result<-MonteCarlo(func=ttest, nrep=1000, param_list=param_list)


There is no further coding required. All the mechanics of the Monte Carlo experiment are handled by the MonteCarlo() function.

Calling summary() prints a short overview of the simulation.

  summary(MC_result)

## Simulation of function:
##
## function(n,loc,scale){
##
##   # generate sample:
##     sample<-rnorm(n, loc, scale)
##
##   # calculate test statistic:
##     stat<-sqrt(n)*mean(sample)/sd(sample)
##
##   # get test decision:
##     decision<-abs(stat)>1.96
##
##   # return result:
##     return(list("decision"=decision))
## }
##
## Required time: 13.38 secs for nrep = 1000  repetitions on 1 CPUs
##
## Parameter grid:
##
##      n : 50 100 250 500
##    loc : 0 0.2 0.4 0.6 0.8 1
##  scale : 1 2
##
##
## 1 output arrays of dimensions: 4 6 2 1000


As one can see from the summary, the simulation results are stored in an array of dimension c(4,6,2,1000), where the Monte Carlo repetitions are collected in the last dimension of the array.
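
To get a feeling for this layout, consider a stand-in array of the same dimensions. The entries below are random placeholders that I generate for illustration. They are not the simulation output, and the sketch makes no claim about how the object returned by MonteCarlo() is organized internally. Averaging over the last dimension yields one rejection frequency per parameter combination.

```r
# Stand-in array with the dimensions reported in the summary: 4 x 6 x 2 x 1000.
# The 0/1 entries are random placeholders, NOT the actual test decisions.
set.seed(1)
dims <- c(4, 6, 2, 1000)
decisions <- array(rbinom(prod(dims), size = 1, prob = 0.05), dim = dims)

# Average over the repetitions (the last dimension) to obtain one
# rejection frequency for each combination of n, loc and scale.
rejection_rate <- apply(decisions, 1:3, mean)
dim(rejection_rate)
```

The result is a three-dimensional array, which is why an additional step is needed to arrange it as a two-dimensional table.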

To summarize the results in a reasonable way and to include them as a table in a paper or report, we have to represent them in a matrix. This is handled by the MakeTable() function that stacks the submatrices collected in the array in the rows and columns of a matrix and prints the result in the form of code to generate a LaTeX table.

To determine in which order the results are stacked in rows and columns, we supply the function arguments rows and cols. These are vectors of the names of the parameters in the order in which we want them to appear in the table (sorted from the inside to the outside).

# generate table:

MakeTable(output=MC_result, rows="n", cols=c("loc","scale"), digits=2, include_meta=FALSE)

## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrrrrrrrr }
## \hline\hline\\\\
##  scale && \multicolumn{ 6 }{c}{ 1 } &  & \multicolumn{ 6 }{c}{ 2 } \\
## n/loc &  & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 &  & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\
##  &  &  &  &  &  &  &  &  &  &  &  &  &  &  \\
## 50 &  & 0.05 & 0.30 & 0.83 & 0.98 & 1.00 & 1.00 &  & 0.05 & 0.10 & 0.28 & 0.55 & 0.79 & 0.94 \\
## 100 &  & 0.05 & 0.51 & 0.98 & 1.00 & 1.00 & 1.00 &  & 0.07 & 0.16 & 0.53 & 0.84 & 0.98 & 1.00 \\
## 250 &  & 0.05 & 0.89 & 1.00 & 1.00 & 1.00 & 1.00 &  & 0.05 & 0.35 & 0.90 & 1.00 & 1.00 & 1.00 \\
## 500 &  & 0.05 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 &  & 0.06 & 0.58 & 1.00 & 1.00 & 1.00 & 1.00 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ decision  }
## \end{table}


To change the ordering, just change the vectors rows and cols.

# generate table:

MakeTable(output=MC_result, rows=c("n","scale"), cols="loc", digits=2, include_meta=FALSE)

## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrr }
## \hline\hline\\\\
## scale & n/loc &  & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\
##  &  &  &  &  &  &  &  &  \\
## \multirow{ 4 }{*}{ 1 } & 50 &  & 0.05 & 0.30 & 0.83 & 0.98 & 1.00 & 1.00 \\
##  & 100 &  & 0.05 & 0.51 & 0.98 & 1.00 & 1.00 & 1.00 \\
##  & 250 &  & 0.05 & 0.89 & 1.00 & 1.00 & 1.00 & 1.00 \\
##  & 500 &  & 0.05 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 \\
##  &  &  &  &  &  &  &  &  \\
## \multirow{ 4 }{*}{ 2 } & 50 &  & 0.05 & 0.10 & 0.28 & 0.55 & 0.79 & 0.94 \\
##  & 100 &  & 0.07 & 0.16 & 0.53 & 0.84 & 0.98 & 1.00 \\
##  & 250 &  & 0.05 & 0.35 & 0.90 & 1.00 & 1.00 & 1.00 \\
##  & 500 &  & 0.06 & 0.58 & 1.00 & 1.00 & 1.00 & 1.00 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ decision  }
## \end{table}


Now we can simply copy the code and add it to our paper, report or presentation. That is all. Only make sure that the package multirow is included in the header of the .tex file.

Parallelised Simulation

If the procedure you are interested in is not so fast or you need a large number of replications to produce very accurate results, you might want to use parallelized computation on multiple cores of your computer (or cluster). To achieve this, simply specify the number of CPUs by supplying a value for the argument ncpus of MonteCarlo() as shown below. Of course, your machine should actually have at least the specified number of units.

# run simulation:

MC_result<-MonteCarlo(func=ttest, nrep=1000, param_list=param_list, ncpus=4)


This automatically sets up a snow cluster, including the export of functions and the loading of packages. The user does not have to take care of anything.
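
If you are unsure how many units your machine has, the parallel package that ships with R can tell you. The following sketch is my own suggestion and not part of the MonteCarlo package: it caps the requested number of CPUs at the number of detected cores.

```r
library(parallel)

# Number of cores reported by the operating system (may be NA on some platforms).
cores <- detectCores()
if (is.na(cores)) cores <- 1L

# Never request more CPUs than are available, and at least one.
ncpus <- max(1L, min(4L, cores))
ncpus
```

The resulting value can then be supplied as the ncpus argument.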

Further Information

This is an introduction to get you up and running with the MonteCarlo package as quickly as possible. Therefore, I only included a short example. However, the functions MonteCarlo() and particularly MakeTable() provide much more functionality. This is described in more detail in the package vignette, which also provides additional examples.

The Case Against Seasonal Unit Roots

There are several ways to model seasonality in a time series. Traditionally, trend-cycle decomposition such as the Holt-Winters procedure has been very popular. To this day, applied researchers also often try to account for seasonality by using seasonal dummy variables. But of course, in a stochastic process it seems unreasonable to assume that seasonal effects are purely deterministic. Therefore, in a time series context seasonal extensions of the classical ARMA model are very popular. One of these extensions is the seasonal unit root model

$(1-L^S)X_t=u_t,$

where $LX_t=X_{t-1}$ is the usual lag operator and $S$ is the period length of the seasonality, such as $4$ or $12$ for a yearly cycle in quarterly or monthly data. The term $u_t$ is some short run component such as an $iid$ innovation term or a SARMA(p,q)-(P,Q) model.

I have always been puzzled about the popularity of this process. Probably it is due to the obvious conceptual simplicity. It also seems to be a natural extension of the usual non-seasonal integrated ARIMA model. However, the model generates a feature that we will hardly ever observe in an actual time series: as time progresses, the difference between consecutive values of the series will become infinitely large.

To see this consider the following example. To generate seasonal unit root processes we first define a function that generates seasonal sums.

seasonal_sum<-function(data,S){
out<-data
for(t in (S+1):length(data)){out[t]<-data[t]+out[(t-S)]}
out
}

We then generate a sample of 250 observations from the process and look at its plot and its autocorrelation function. We choose a period of $S=12$, so that the example resembles a yearly cycle in monthly data.

series<-seasonal_sum(rnorm(250),S=12)
acf(series)

ts.plot(series, ylab="series", xlab="t")

From the autocorrelation function (ACF) it can be seen that there is a pronounced seasonal behavior with a spike in the ACF at each lag that is an integer multiple of $S$. However, the plot of the series shows a curious behavior. As $t$ increases, we see that the difference between two consecutive observations $\Delta X_t=X_t-X_{t-1}$ increases. This behavior becomes even more pronounced if we increase the sample size to 2500.

ts.plot(seasonal_sum(rnorm(2500),S=12), ylab="series")

To understand this feature consider the usual unit root model with an $iid$ innovation $\varepsilon_t$ with variance $\sigma_\varepsilon^2$. This can be expressed as the sum over all past innovations.

$X_t=(1-L)^{-1}\varepsilon_t=\sum_{i=0}^t \varepsilon_{t-i}.$

From this representation it is easy to show that the variance of the process is given by

$Var(X_t)=t \sigma_\varepsilon^2,$

so that the variance becomes infinite as $t$ approaches infinity. This is a property that seems to apply to many economic and financial time series and is therefore completely reasonable.

Now, the seasonal unit root model can be expressed in a similar way, but with an important twist. To see this, denote the $s$th innovation in the $i$th repetition of the cycle of length $S$ by $\eta_{i}^{(s)}$. This means that if you have monthly observations the innovation in the first January in the sample is $\eta_1^{(1)}$ and the innovation in the second January in the sample is $\eta_2^{(1)}$. By the same principle the innovation in the $4$th December in the sample would be $\eta_4^{(12)}$. Therefore, any observation $X_t=X_{n}^{(s)}$ that falls into season $s=1,\dots,S$ of the $n$th repetition of the cycle can be represented as

$X_n^{(s)}=\sum_{i=1}^n \eta_{i}^{(s)}$.

The important thing to note here is that for two consecutive observations within the $n$th repetition of the cycle we have $X_t=X_n^{(s)}=\sum_{i=1}^n \eta_{i}^{(s)}$ and $X_{t-1}=X_n^{(s-1)}=\sum_{i=1}^n \eta_{i}^{(s-1)}$. Since $\eta_{i}^{(s)}$ and $\eta_{i}^{(s-1)}$ are independent streams of random numbers this means that $X_n^{(s)}$ and $X_n^{(s-1)}$ are independent random walks! Consequently, the difference of the process is given by

$\Delta X_t=X_t-X_{t-1}=X_{n}^{(s)}-X_{n}^{(s-1)}=\sum_{i=1}^n \left(\eta_i^{(s)}-\eta_i^{(s-1)}\right),$

so that

$Var(\Delta X_t)= 2n Var(\eta_i^{(s)}).$
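
This variance formula can be checked with a small simulation. The sketch below is my own illustration: it simulates many independent paths with standard normal innovations, so that $Var(\eta_i^{(s)})=1$, and compares the variance of $\Delta X_t$ across paths with $2n$ at a few time points that lie in the second season of cycles 2, 11 and 20. The names paths, var_diff and t_check are chosen for this example only.

```r
# Same recursion as seasonal_sum() above, repeated so the block runs on its own.
seasonal_sum <- function(data, S) {
  out <- data
  for (t in (S + 1):length(data)) out[t] <- data[t] + out[t - S]
  out
}

set.seed(2024)
S <- 12
n_obs <- 240
nrep <- 2000

# Each column is one simulated path of length n_obs.
paths <- replicate(nrep, seasonal_sum(rnorm(n_obs), S = S))

# diff() works column by column; row j holds X_{j+1} - X_j.
var_diff <- apply(diff(paths), 1, var)

# Time points in season s = 2 of cycles n = 2, 11 and 20.
t_check <- c(14, 122, 230)
cycle <- (t_check - 1) %/% S + 1

cbind(t = t_check, simulated = var_diff[t_check - 1], theory = 2 * cycle)
```

The simulated variances should grow roughly linearly with the cycle index, in line with the formula.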

Since $n$ goes to infinity as $t$ goes to infinity, so does the variance of the changes. Has anybody ever seen a series that exhibits such a feature? Of course in reality we would expect that the innovations are not $iid$ but show some kind of dependence structure, so that the random walks are not completely independent anymore. However, if the dependence is weak – such as that of an ARMA process – they are still asymptotically independent for large lags. Therefore, the same issue arises, as can be seen from the example below.

sarima_sim<-function(T, S, arma_model){
arma_series<-arima.sim(n=T, model=arma_model)
seasonal_sum(data=arma_series, S=S)
}

sarima_series<-sarima_sim(T=250, S=12, arma_model=list(ar=c(0.5,0.3)))

acf(sarima_series)

ts.plot(sarima_series, ylab="series")

ts.plot(sarima_sim(T=2500, S=12, arma_model=list(ar=c(0.5,0.3))), ylab="series")

So what is the conclusion from all this? The seasonal unit root process seems to be ill suited to model most behavior that we observe in practice. However, it is well known that it often generates a good fit. Especially in shorter time series the drawbacks of the seasonal unit root process do not have to become visible. Nevertheless, I think it is fair to say that one could envision a more satisfactory model. One avenue that seems very useful in this context is that of seasonal long memory processes that are able to combine some persistence in the cyclical fluctuations with a finite variance.

Another important conclusion is that we have to be careful with seemingly direct extensions of standard models such as the ARIMA. The fact that the ARIMA is extremely successful in modelling the non-seasonal component, does not necessarily mean that the SARIMA is a good model for the seasonal applications that we have in mind, too.

The Curious Behavior of diffseries()

This is the story of a subtle error that, in my opinion, is a nice example of the special challenges of statistical programming. One of my main research interests is time series with long memory. These are often modeled by fractionally integrated models, where

$(1-L)^d X_t=\varepsilon_t.$

Here $X_t$ is the time series, $d$ is between zero and one, $L$ is the lag operator defined so that $LX_t=X_{t-1}$ and $\varepsilon_t\sim N(0,\sigma^2)$. Details on fractional differences can be found on Wikipedia or in the original paper of Hosking (1981). Essentially the fractional differencing filter defines an infinite order autoregressive model where the coefficients are a function of the memory parameter $d$.

To generate a fractionally integrated series, we can bring the fractional differencing filter to the other side:

$X_t=(1-L)^{-d}\varepsilon_t=\sum_{i=0}^\infty \theta_i \varepsilon_{t-i}.$

An important special case of this is the random walk model where $d=1$. In this case $\theta_i=1$ for all $i$ and it is usually assumed that $\varepsilon_t=0$ for all $t<0$. Then the model reduces to

$X_t=X_{t-1}+\varepsilon_t=\sum_{i=0}^t \varepsilon_{t-i}.$
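
The coefficients $\theta_i$ of $(1-L)^{-d}$ can be computed with the standard recursion $\theta_0=1$ and $\theta_i=\theta_{i-1}(i-1+d)/i$. The following sketch uses a helper name of my own choosing, ma_coefs(), and confirms that all coefficients equal one for $d=1$, whereas they decay slowly for a fractional $d$.

```r
# Coefficients theta_0, ..., theta_m of the filter (1 - L)^(-d),
# computed via theta_i = theta_{i-1} * (i - 1 + d) / i.
ma_coefs <- function(d, m) {
  k <- seq_len(m)
  c(1, cumprod((k - 1 + d) / k))
}

ma_coefs(d = 1, m = 5)    # random walk case
ma_coefs(d = 0.4, m = 5)  # long memory case
```

In the random walk case every past innovation enters with weight one, which is exactly the cumulative sum in the last equation.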

Using the same stream of random numbers $\varepsilon_t$, it should therefore be possible to generate the exact same random walk in R using cumsum() from the base package as well as diffseries() from the package fracdiff with $d=-1$. But, as a former colleague demonstrated to me a while ago, this is not the case.

rm(list=ls())
set.seed(54321)
library(fracdiff)

series<-rnorm(1000)
a<-diffseries(series,d=-1)
b<-cumsum(series)

ts.plot(a, ylim=c(min(a,b),max(a,b)))
lines(b, col=2)
legend(x="bottomleft", col=c(1,2), lty=c(1,1), bty="n", legend=c("RW generated with diffseries()", "RW generated with cumsum()"))


As one can see from the graph above, the two series a and b, which should be identical, drift further apart the longer the series becomes.

Recently I was reminded of this curious behavior when I was trying to implement some new statistical procedures for a research paper and they refused to work until I replaced diffseries() with the function fast_fracdiff(). It was proposed as a faster alternative for fractional differencing by Jensen and Nielsen (2014) and makes use of the convolution theorem.

fast_fracdiff <- function(x, d){
  iT <- length(x)
  np2 <- nextn(2*iT - 1, 2)
  k <- 1:(iT-1)
  b <- c(1, cumprod((k - d - 1)/k))
  dx <- fft(fft(c(b, rep(0, np2 - iT))) * fft(c(x, rep(0, np2 - iT))), inverse = TRUE) / np2
  return(Re(dx[1:iT]))
}
c<-fast_fracdiff(series, d=-1)

ts.plot(a, ylim=c(min(a,b),max(a,b)))
lines(b, col=2)
lines(c, col=3)
legend(x="bottomleft", col=c(1,2,3), lty=c(1,1,1), bty="n", legend=c("RW generated with diffseries()", "RW generated with cumsum()", "RW generated with fast_fracdiff()"))
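
The comparison can also be made numerically instead of visually. The sketch below repeats the definition of fast_fracdiff() so that it runs on its own and then checks two special cases with all.equal(): $d=-1$ should reproduce cumsum() and $d=1$ should reproduce ordinary first differences (with the first observation kept as it is). The object names same_as_cumsum and same_as_diff are my own.

```r
# fast_fracdiff() as given above, repeated so the block is self-contained.
fast_fracdiff <- function(x, d) {
  iT <- length(x)
  np2 <- nextn(2 * iT - 1, 2)
  k <- 1:(iT - 1)
  b <- c(1, cumprod((k - d - 1) / k))
  dx <- fft(fft(c(b, rep(0, np2 - iT))) * fft(c(x, rep(0, np2 - iT))),
            inverse = TRUE) / np2
  Re(dx[1:iT])
}

set.seed(54321)
x <- rnorm(1000)

# d = -1: integration, should agree with the cumulative sum
same_as_cumsum <- all.equal(fast_fracdiff(x, d = -1), cumsum(x))

# d = 1: first differences, with the first value left unchanged
same_as_diff <- all.equal(fast_fracdiff(x, d = 1), c(x[1], diff(x)))

c(same_as_cumsum, same_as_diff)
```

Both comparisons allow for the small numerical error that the fast Fourier transform introduces.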


As one can see, using fast_fracdiff(), which can be found on the university webpage of Morten Nielsen, with $d=-1$ produces exactly the same series as cumsum(), as one would have expected from the beginning. The green line representing the random walk generated using fast_fracdiff() lies directly on top of the red one that represents the series obtained using cumsum(), so that the latter is not visible. Now why does diffseries() behave differently? Let's have a look at the function.

diffseries <- function(x, d)
{
  x <- as.data.frame(x)
  names(x) <- "series"
  x <- x$series
  if (NCOL(x) > 1)
    stop("only implemented for univariate time series")
  if (any(is.na(x)))
    stop("NAs in x")
  n <- length(x)
  stopifnot(n >= 2)
  x <- x - mean(x)
  PI <- numeric(n)
  PI[1] <- -d
  for (k in 2:n) {
    PI[k] <- PI[k-1]*(k - 1 - d)/k
  }
  ydiff <- x
  for (i in 2:n) {
    ydiff[i] <- x[i] + sum(PI[1:(i-1)]*x[(i-1):1])
  }
  ## return numeric!
  ydiff
}


As one can see above, in the line x<-x-mean(x) the input series is de-meaned prior to the fractional differencing. This is because in a model with non-zero means, we have

$(1-L)^{d}(X_t-\mu)=\varepsilon_t,$

where $\mu$ is the expected value of $X_t$. However, this produces some unwanted behavior, since the fractionally differenced series returned by diffseries() always has a mean of zero.

x<-fracdiff.sim(n=1000, d=0.4)$series+10
mean(x)

## [1] 9.702761

y<-diffseries(x,0.4)
z<-fast_fracdiff(x,0.4)
mean(y)

## [1] -0.004302338

mean(z)

## [1] 0.6807775


But what causes the behavior in the first graph above? Why do the series drift apart? Here the input series is the series of innovations $\varepsilon_t$. Since these are standard normal, we have $\bar \varepsilon_t\sim N(0,1/T)$. That means $\bar \varepsilon_t$ has a standard deviation of $1/\sqrt{T}$.

If we use diffseries() to integrate the white noise sequence, the innovations are de-meaned before they are integrated:

$X_t=(1-L)^{-1}(\varepsilon_t-\bar \varepsilon_t) =\sum_{i=0}^t( \varepsilon_{t-i} -\bar \varepsilon_t)=-(t+1) \bar \varepsilon_t +\sum_{i=0}^t \varepsilon_{t-i}.$

As you can see from the last equation, this produces a random walk with drift, where the drift parameter is given by the negative of the mean of the innovations.
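
The drift can be reproduced in base R without fracdiff. The sketch below is my own illustration of the mechanism and not a call to diffseries(): it integrates the de-meaned innovations with cumsum() and compares the result with the ordinary random walk. Because R counts observations from 1, the drift term after $t$ observations is $-t\bar\varepsilon$ here.

```r
set.seed(54321)
eps <- rnorm(1000)
t_idx <- seq_along(eps)

rw <- cumsum(eps)                       # ordinary random walk
rw_demeaned <- cumsum(eps - mean(eps))  # innovations de-meaned first

# The two differ by a deterministic linear drift of -t * mean(eps).
all.equal(rw_demeaned, rw - t_idx * mean(eps))

# By construction, the de-meaned walk returns to zero at the end of the sample.
tail(rw_demeaned, 1)
```

The second result explains why the gap between the two series in the graph widens linearly over time: at the last observation it equals the final value of the ordinary random walk.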

What do we learn from this? The de-meaning in diffseries() will probably not cause problems in most use cases. However, I certainly think it is problematic since the expected value $\mu$ in the formula above is the expected value after fractional differencing. Therefore using fast_fracdiff() instead of diffseries() seems to be advantageous beyond the speed gains.

But there is another point to make here. It is obvious that diffseries() is written with differencing in mind. Using it to integrate, even though this should be theoretically possible, goes beyond the use cases that the developers had in mind. Errors caused by using the function this way are hard to spot, since the generated series is still a random walk and the slight drift only becomes apparent when the result is compared with the output of cumsum(). They could go unnoticed for years. This highlights very nicely the extra degree of care that is necessary for statistical programming.

References

Hosking, J. R. M. (1981). Fractional differencing. Biometrika, 68(1), 165-176.

Jensen, A. N., & Nielsen, M. Ø. (2014). A fast fractional difference algorithm. Journal of Time Series Analysis, 35(5), 428-436.