Introducing the MonteCarlo Package

My first R package has been released on CRAN recently. It is named MonteCarlo and aims to make simulation studies as easy as possible – including parallelization and the generation of tables.

What Are Simulation Studies Good For?

Monte Carlo simulations are an essential tool in statistics and related disciplines. They are routinely used to determine distributional properties, where no analytical results are available. A typical example is to study the finite sample properties of a new statistical procedure. Finite sample properties are often unknown, since theoretical results usually rely on asymptotic theory that only applies to infinite sample sizes. This may be a good approximation in practice if the sample size is large, but it may also be bad if the sample is small or if the rate of convergence is very slow.

Simulation studies are also extremely useful to compare the performance of alternative procedures for the same task. Imagine you want to test whether your sample comes from a Gaussian distribution and you have to decide whether to use the Jarque-Bera test or the Kolmogorov Smirnov test. Both appear to be legitimate choices. A simulation study that is tailored so that it reflects the situation at hand might uncover that one of the procedures is much more powerful than the other.

Finally, small scale simulation studies are an essential tool for statistical programming. Testing is an essential part of programming and software development. Usually, if one writes a function, it is good practice to let it run on some test cases. This is easy, if the outcome is deterministic. But if your function is a statistical test or an estimator, the output depends on the sample and is stochastic. Therefore, the only way to test whether the implementation is correct, is to generate a large number of samples and to see whether it has the expected statistical properties. For example one might test whether the mean squared error of an estimator converges to zero as the sample size increases, or whether the test has the correct alpha error.

Therefore, writing Monte Carlo simulations is an everyday task in many areas of statistics. This comes with considereable effort. It is not unusual that the required lines of code to produce a simulation study are a multiple of that needed to implement the procedure of interest. As a consequence of that they are also one of the main sources for errors. On top of this, the large computational effort often requires parallelization which brings additional complications and programming efforts. This efforts can often be prohibitive – especially for less advanced users.

The MonteCarlo Package

The MonteCarlo package streamlines the process described above. It allows to create simulation studies and to summarize their results in LaTeX tables quickly and easily.

There are only two main functions in the package:

  1. MonteCarlo() runs a simulation study for a user defined parameter grid. It handles the generation of loops over these parameter grid and parallelizes the computation on a user specified number of CPU units.
  2. MakeTable() creates LaTeX tables from the output of MonteCarlo(). It stacks high dimensional output arrays into tables with a user specified ordering of rows and columns.

To run a simulation study, the user has to nest both – the generation of a sample and the calculation of the desired statistics from this sample – in a single function. This function is passed to MonteCarlo(). No additional programming is required. This approach is not only very versatile – it is also very intuitive. The user formulates his experiment as if he/she was only interested in a single draw.

The aim of this approach is to allow the user full control and flexibility with regard to the design of the Monte Carlo experiment, but to take away all the effort of setting up the technical part of the simulation.

A Simple Example: The t-test

Suppose we want to evaluate the performance of a standard t-test for the hypothesis that the mean is equal to zero. We are interested to see how the size and power of the test change with the sample size (n), the distance from the null hypothesis (loc for location) and the standard deviation of the distribution (scale). The sample is generated from a normal distribution.

To conduct this analysis, we proceed as follows. First, we load the MonteCarlo package.

library(MonteCarlo)

Then define the following function.

#########################################
##      Example: t-test

# Define function that generates data and applies the method of interest

ttest<-function(n,loc,scale){

  # generate sample:
    sample<-rnorm(n, loc, scale)

  # calculate test statistic:
    stat<-sqrt(n)*mean(sample)/sd(sample)

  # get test decision:
    decision <- stat > 1.96

  # return result:
    return(list("decision"=decision))
}

As discussed above, ttest() is formulated in a way as if we only want to generate a single test decision. The arguments of the function are the parameters we are interested in. ttest() carries out 4 steps:

  1. Draw a sample of n observations from a normal distribution with mean loc and standard deviation scale.
  2. Calculate the t-statistic.
  3. Determine the test decision.
  4. Return the desired result in form of a list.

We then define the combinations of parameters that we are interested in and collect them in a list. The elements of the lists must have the same names as the parameters for which we want to supply grids.

# define parameter grid:

  n_grid<-c(50,100,250,500)
  loc_grid<-seq(0,1,0.2)
  scale_grid<-c(1,2)

# collect parameter grids in list:
  param_list=list("n"=n_grid, "loc"=loc_grid, "scale"=scale_grid)

To run the simulation, the function ttest() and the parameter grid (param_list) are passed to MonteCarlo(), together with the desired number of Monte Carlo repetitions (nrep=1000).

# run simulation:

  MC_result<-MonteCarlo(func=ttest, nrep=1000, param_list=param_list)

There is no further coding required. All the mechanics of the Monte Carlo experiment are handled by the MonteCarlo() function.

Calling summary produces a short information on the simulation.

  summary(MC_result)
## Simulation of function: 
## 
## function(n,loc,scale){
##   
##   # generate sample:
##     sample<-rnorm(n, loc, scale)
##   
##   # calculate test statistic:
##     stat<-sqrt(n)*mean(sample)/sd(sample)
##   
##   # get test decision:
##     decision1.96
##   
##   # return result:
##     return(list("decision"=decision))
## }
## 
## Required time: 13.38 secs for nrep = 1000  repetitions on 1 CPUs 
## 
## Parameter grid: 
## 
##      n : 50 100 250 500 
##    loc : 0 0.2 0.4 0.6 0.8 1 
##  scale : 1 2 
## 
##  
## 1 output arrays of dimensions: 4 6 2 1000

As one can see from the summary, the simulation results are stored in an array of dimension c(4,6,2,1000), where the Monte Carlo repetitions are collected in the last dimension of the array.

To summarize the results in a reasonable way and to include them as a table in a paper or report, we have to represent them in a matrix. This is handled by the MakeTable() function that stacks the submatrices collected in the array in the rows and columns of a matrix and prints the result in the form of code to generate a LaTeX table.

To determine in which order the results are stacked in rows and columns, we supply the function arguments rows and cols. These are vectors of the names of the parameters in the order in which we want them to appear in the table (sorted from the inside to the outside).

# generate table:

  MakeTable(output=MC_result, rows="n", cols=c("loc","scale"), digits=2, include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrrrrrrrr }
## \hline\hline\\\\
##  scale && \multicolumn{ 6 }{c}{ 1 } &  & \multicolumn{ 6 }{c}{ 2 } \\ 
## n/loc &  & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 &  & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\ 
##  &  &  &  &  &  &  &  &  &  &  &  &  &  &  \\ 
## 50 &  & 0.05 & 0.30 & 0.83 & 0.98 & 1.00 & 1.00 &  & 0.05 & 0.10 & 0.28 & 0.55 & 0.79 & 0.94 \\ 
## 100 &  & 0.05 & 0.51 & 0.98 & 1.00 & 1.00 & 1.00 &  & 0.07 & 0.16 & 0.53 & 0.84 & 0.98 & 1.00 \\ 
## 250 &  & 0.05 & 0.89 & 1.00 & 1.00 & 1.00 & 1.00 &  & 0.05 & 0.35 & 0.90 & 1.00 & 1.00 & 1.00 \\ 
## 500 &  & 0.05 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 &  & 0.06 & 0.58 & 1.00 & 1.00 & 1.00 & 1.00 \\ 
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ decision  }
## \end{table}

To change the ordering, just change the vectors rows and cols.

# generate table:

  MakeTable(output=MC_result, rows=c("n","scale"), cols="loc", digits=2, include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrr }
## \hline\hline\\\\
## scale & n/loc &  & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\ 
##  &  &  &  &  &  &  &  &  \\ 
## \multirow{ 4 }{*}{ 1 } & 50 &  & 0.05 & 0.30 & 0.83 & 0.98 & 1.00 & 1.00 \\ 
##  & 100 &  & 0.05 & 0.51 & 0.98 & 1.00 & 1.00 & 1.00 \\ 
##  & 250 &  & 0.05 & 0.89 & 1.00 & 1.00 & 1.00 & 1.00 \\ 
##  & 500 &  & 0.05 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 \\ 
##  &  &  &  &  &  &  &  &  \\ 
## \multirow{ 4 }{*}{ 2 } & 50 &  & 0.05 & 0.10 & 0.28 & 0.55 & 0.79 & 0.94 \\ 
##  & 100 &  & 0.07 & 0.16 & 0.53 & 0.84 & 0.98 & 1.00 \\ 
##  & 250 &  & 0.05 & 0.35 & 0.90 & 1.00 & 1.00 & 1.00 \\ 
##  & 500 &  & 0.06 & 0.58 & 1.00 & 1.00 & 1.00 & 1.00 \\ 
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ decision  }
## \end{table}

Now we can simply copy the code and add it to our paper, report or presentation. That is all. Only make sure that the package multirow is included in the header of the .tex file.

Parallelised Simulation

If the procedure you are interested in is not so fast or you need a large number of replications to produce very accurate results, you might want to use parallelized computation on multiple cores of your computer (or cluster). To achive this, simply specify the number of CPUs by supplying a value for the argument ncpus of MonteCarlo as shown below. Of course you should actually have at least the specified number of units.

# run simulation:

  MC_result<-MonteCarlo(func=ttest, nrep=1000, param_list=param_list, ncpus=4)

This automatically sets up a snow cluster, including the export of functions and the loading of packages. The user does not have to take care of anything.

Further Information

This is an introduction to get you up and running with the MonteCarlo package as quickly as possible. Therefore, I only included a short example. However, the functions MonteCarlo() and particularly MakeTable() provide much more functionality. This is described in more detail in the package vignette, that also provides additional examples.

34 thoughts on “Introducing the MonteCarlo Package

  1. Trying to run your test exemples with the following result 😦 :
    > # collect parameter grids in list:
    > param_list=list(“n”=n_grid, “loc”=loc_grid, “scale”=scale_grid)
    >
    > # run simulation:
    > MC_result summary(MC_result)
    Fejl i summary(MC_result) : objekt ‘MC_result’ blev ikke fundet
    >
    > # generate table:
    >
    > MakeTable(output=MC_result, rows=”n”, cols=c(“loc”,”scale”), digits=2, include_meta=FALSE)
    Fejl i MakeTable(output = MC_result, rows = “n”, cols = c(“loc”, “scale”), :
    objekt ‘MC_result’ blev ikke fundet

    Like

  2. Hi, thanks for the package! It sounds very exciting. However, the code below take 1.38 secs when ncpus=1, but 28.5 secs when ncpus=2. This is on Windows 10 64 bit. Any ideas why? Someone else had the same observation: https://twitter.com/uri_sohn/status/873205000545218561

    library(MonteCarlo)

    test_func<-function(n,loc,scale){
    sample<-rnorm(n, loc, scale)
    stat<-sqrt(n)*mean(sample)/sd(sample)
    decision1.96
    return(list(“decision”=decision))
    }

    # Example without parallization
    n_grid<-c(50,100,250,500)
    loc_grid<-seq(0,1,0.2)
    scale_grid<-c(1,2)

    param_list=list("n"=n_grid, "loc"=loc_grid, "scale"=scale_grid)
    erg<-MonteCarlo(func=test_func, nrep=250, param_list=param_list, ncpus=1)
    summary(erg)

    rows<-c("n")
    cols<-c("loc","scale")
    MakeTable(output=erg, rows=rows, cols=cols, digits=2)

    Liked by 1 person

    • Hi! I am very happy thank you find the package exciting ;)! Parallelization is not necessarily faster. The reason for this behavior is that the computer needs time to initialize the cluster (start R instances, load packages there, export things from your workspace to these new instances, etc…). If this setup time exceeds the time saved by computing on more than one CPU the performance gets worse not better. The t-test simulated in this example is very quick to calculate. That is why you do not save very much time if you parallize it. This should change if you consider something bigger – for example nrep=5000 and n=2500, or if you evaluate more complicate functions that contain loops or numerical optimization steps. Best regards!

      Like

    • Hi Daniel:
      Indeed, parallel computing can be slower–sometimes much slower. To see benefits, you need many many conditions that can overcome the overhead time. For simpler simulation studies (conditions < 100) , I usually use one core when using frequetist methods.

      Donny

      Like

    • >
      rm(list=ls())
      library(MonteCarlo)

      slow_mean<-function(data){
      sum_est<-0
      for(i in 1:length(data)){sum_est<-sum_est+data[i]}
      sum_est/length(data)
      }

      test_func<-function(n,loc,scale){
      sample<-rnorm(n, loc, scale)
      stat<-sqrt(n)*slow_mean(sample)/sd(sample)
      decision1.96
      return(list("decision"=decision))
      }

      n_grid<-c(1000,2500,5000)
      loc_grid<-1
      scale_grid<-c(1,2)

      param_list=list("n"=n_grid, "loc"=loc_grid, "scale"=scale_grid)
      erg1<-MonteCarlo(func=test_func, nrep=5000, param_list=param_list, ncpus=1)
      erg2 erg1$meta$time
      Time difference of 55.06 secs
      > erg2$meta$time
      Time difference of 34.71 secs

      Like

      • Sorry, some of the code had disappeared after copying and pasting. The version below should work.

        rm(list=ls())
        library(MonteCarlo)

        slow_mean<-function(data){
        sum_est<-0
        for(i in 1:length(data)){sum_est<-sum_est+data[i]}
        sum_est/length(data)
        }

        test_func<-function(n,loc,scale){
        sample<-rnorm(n, loc, scale)
        stat<-sqrt(n)*slow_mean(sample)/sd(sample)
        decision<-stat<1.96
        return(list("decision"=decision))
        }

        n_grid<-c(1000,2500,5000)
        loc_grid<-1
        scale_grid<-c(1,2)
        param_list=list("n"=n_grid, "loc"=loc_grid, "scale"=scale_grid)
        erg1<-MonteCarlo(func=test_func, nrep=5000, param_list=param_list, ncpus=1)
        erg2<-MonteCarlo(func=test_func, nrep=5000, param_list=param_list, ncpus=2)

        erg1$meta$time
        erg2$meta$time

        Like

      • That is indeed puzzling. On my machine the version with 2 cpus is about twice as fast as the version with 1. Maybe yours is much faster. What happens if you set the number of repetitions to 50000 each?

        Like

      • First of all now it took a veeeeery long time caculating if the rep=50000
        The result were quite different: 15.98 and 8.27, respectively

        Like

      • Ah great. So it works. It es nearly twice as fast with two cpus instead of one. If parallelization makes sense really depends on the magnitude of the problem at hand. If the things you usually do only take a few microseconds to compute you are probably fine without it, but you see if you have simulations that take hours or days to compute it is a real time saver. As I said, I ll see that I provide a more detailed discussion soon.

        Like

      • I am going to write a longer post about the pros and cons of parallel computation and the factors that influence overhead time and time savings. Also, I am considering to add a second parallel mode that works better for very large numbers of CPUs, and large parameter grids with relatively simple calculations. The current implementation is geared very much towards situations with a lower number of cores (say <10) and situations where a single repetition is time consuming (say it takes several seconds or even minutes). It might take a couple of weeks though until I find some time for that.

        Like

      • Hi, the package if it can work for my problem, then i will be the happiest man on earth. However, i intend to install the package on my R studio and getting an information that there is no package called MonteCarlo . Please, should i do.

        Like

  3. Hi, thanks for putting this together. The MakeFrame() function in your post doesn’t seem to work as the latter doesn’t exist. Any ideas?

    Like

  4. I am having a hard time understanding how this line in your code is correct, but I am very new to R and MonteCarlo simulations. Is it a misprint?

    # get test decision:
    decision1.96

    how does “decision1.96” do anything? I tried “decision <- 1.96" but that obviously isn't what needs to happen – doesn't it need to set decision equal to something though depending on the value of the t statistic?
    Thanks,
    Mark

    Like

    • Hi!

      Sorry, I don’t know how, but a “>” got lost there when I copied the code into wordpress… I updated the code above. Thank you for pointing me to the issue. 1.96 is the critical value for the test decision based on the standard normal distribution. We therefore have to compare the test statistic with this value.

      Like

      • yep, i recognized the 1.96 as that critical value. Glad I could help! Now I need to try to program my own simulation in R but I’m pretty new to it. Having this package is really helpful starting point though. Thanks!
        Mark

        Like

  5. Actually, when I run the code now it’s giving me an error saying that “object ‘decision’ not found. I don’t see in the code where you initialize that value?

    Thanks, Mark

    Like

  6. Hi interesting package, just the kind of thing I’m going to need soon – I’m just learning about R.

    Am I correct in understanding that it will run nrep simulations for each combination of the input grid elements?

    Does this make sense as an extension of the package:
    – give each input grid element a weight (with weights auto normalized to sum to 1) – so each grid becomes a discrete probability distribution
    – each “combination of grid elements” === “point in parameter space” would thus have a probability (product of the probability of the component elements if they were independent, but I guess one could think about correlations also…)
    – then one picks points in parameter space at random weighted by their probability and run a single repetition on it – so the most likely combinations get more “runs”
    – nreps would be interpreted as “total reps” whereas now total reps = “number of points in parameter space” * nreps
    – I guess one could extend this concept to continuous probability distribution parameters…

    I’m only asking because I was thinking of doing something like I describe and yours has exactly the same “shape” and I have zero R so far.

    Like

    • Hi! Sorry for the late reply. I don’t think what that would be useful for. What is the application you have in mind? I think there is probably a way to do something like that using the functionality of the package the way it is. The random sampling of parameters from a probability distribution could be done within the function that is supplied to MonteCarlo().

      Like

  7. I’m getting the following error message when I try to run a monte carlo simulation in parallel.

    Error in setDefaultClusterOptions(type = .sfOption$type) :
    could not find function “setDefaultClusterOptions”

    Is there a reason why this happens?

    Thank you

    Like

    • Sorry for that. There is some problem with the automatic loading of the dependencies at the moment. I am trying to work that out. A simple workaround is to load the snowfall package manually after loading MonteCarlo (so: library(MonteCarlo);library(snowfall)). That should solve the problem.

      Like

Leave a comment