Increasing the speed of {raster} processing with R: Part 2/3 Parallelisation

Hey rasteR people!

Welcome to the second part of my tutorial on how to increase the processing speed of the R {raster} package. In the previous part I showed you how to speed up R by increasing the maxmemory limit. This part will be a little more “sophisticated”, since we will have a look at how to parallelise R processes.

Parallelise? Wait, what?

“Parallelise” is such a fancy word, but what does it mean? Well, you probably know that your CPU (unless you got stuck in the ’90s) has multiple cores to process your requests. My laptop, for example, has 4 cores and my desktop PC has 8. If you use R for your calculations, usually only one core handles the calculation while the others are basically sleeping or dealing with overhead operations like copying data and making sure your other programs keep running properly. The following picture shows the CPU usage across my 4 cores during a typical R session:

[Image: CPU usage across all four cores during a single-core R session]

You can see that only one core (in this case it’s core 3) is being used and the others are basically at 0% usage.
Imagine we have a big processing task that we have to perform over and over again (for example inside a for loop). This is what R does in this situation: it activates only one core and lets it handle the iterations of the loop, step by step, one iteration at a time. One core does all the work while the other cores do (almost) nothing, because R uses single-core processing by default. What a waste, right? Why not use all cores for the loop? For example, core 1 processes iteration 1, while core 2 processes iteration 2, while core 3 processes iteration 3, and so on. This is essentially what parallelisation means: using multiple cores at the same time (in parallel) for repetitive tasks. So how do we implement this in R?
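
Side note (my addition): if you want to check how many cores your own machine has, the parallel package, which ships with base R, can tell you:

#check available cores (parallel ships with base R)
library(parallel)
detectCores()                 #number of logical cores
detectCores(logical = FALSE)  #number of physical cores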

Parallelisation – Step by Step

Example without parallelisation

Let’s say we have a folder with eight layerstacks and we would like to calculate the NDVI for every stack. This is how I would calculate the NDVI using a for-loop without parallel processing:

#load raster package
library(raster)

#path to directory with folders
path       <- "Downloads/Stacks/"

#get file names using list.files() function
stack_list <- list.files(path, pattern="\\.tif$", full.names=TRUE)

#Loop over every stack in filelist
for(i in 1:length(stack_list)) {  
  
#stack image i
img  <- stack(stack_list[i])

#calc ndvi for image i
ndvi <- (img[[4]]-img[[3]]) / (img[[4]]+img[[3]])

#construct filename
outname <- sub(pattern     = "\\.tif$",
               replacement = "_ndvi.tif", 
               x           = stack_list[i])

#export raster
writeRaster(ndvi, 
            filename  = outname,
            overwrite = TRUE)
}

This code will work just fine and takes approximately two minutes to execute for smaller rasters (4000x4000 px). However, imagine you have 100 rasters to process. This is where using multiple cores will save you a lot of time. And this is how you do it:

Example with parallelisation

library(raster)
library(doParallel)  #Foreach Parallel Adaptor 
library(foreach)     #Provides foreach looping construct

#Define how many cores you want to use
UseCores <- detectCores() -1

#Register CoreCluster
cl       <- makeCluster(UseCores)
registerDoParallel(cl)

path       <- "Downloads/Stacks/"
stack_list <- list.files(path, pattern="\\.tif$", full.names=TRUE)

#Use foreach loop and %dopar% command
foreach(i=1:length(stack_list)) %dopar% {
  #the raster package must be loaded inside the loop (see note below)
  library(raster)

  #stack image i and calculate its NDVI
  img  <- stack(stack_list[i])
  ndvi <- (img[[4]]-img[[3]]) / (img[[4]]+img[[3]])

  #construct filename
  outname <- sub(pattern     = "\\.tif$",
                 replacement = "_ndvi.tif",
                 x           = stack_list[i])

  #export raster
  writeRaster(ndvi,
              filename  = outname,
              overwrite = TRUE)
}

#end cluster
stopCluster(cl)

The code looks very similar and only differs in a few places:

First, you need to load the doParallel and foreach packages. They enable R to perform parallel processing.

In a second step you have to define how many cores you want to use for your calculation:

#Define how many cores you want to use
UseCores <- detectCores() -1

#Register CoreCluster
cl       <- makeCluster(UseCores)
registerDoParallel(cl)  

The function detectCores() returns the number of cores that are available. Note: you should always use one core less than available, so there is one core left to handle overhead tasks like copying data. Otherwise your PC might become unresponsive. With makeCluster() and registerDoParallel() you register your cores for the parallelisation.
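
If you want to verify that the workers were registered correctly, the foreach package provides getDoParWorkers(); a quick check (assuming the cluster was set up as above):

#confirm how many workers are registered
library(foreach)
getDoParWorkers()   #should return the value of UseCores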

The last change is the foreach() loop instead of a for loop:

foreach(i=1:length(stack_list)) %dopar% {
    library(raster)
    #calculation here
}

Note:

  • The syntax inside the parentheses differs from a conventional for loop: instead of i in 1:length(stack_list) you write i=1:length(stack_list).
  • After the foreach() statement you have to write %dopar%, which stands for “do parallel”.
  • You have to load the libraries that are needed for your calculations inside the loop! In this case we need the raster package, so we load it with library(raster) INSIDE the loop statement. (An alternative is shown in the sketch after this list.)

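As an alternative to calling library() inside the loop, foreach() also accepts a .packages argument that loads the required packages on every worker. A minimal sketch:

#load the raster package on each worker via .packages
foreach(i=1:length(stack_list), .packages = "raster") %dopar% {
    #calculation here, no library() call needed
}
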
In the end we close the parallelisation by calling stopCluster().
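
One extra habit I can recommend (my addition, not part of the original workflow): wrap the loop in tryCatch() with a finally clause, so the cluster is stopped even if an iteration fails:

#stop the cluster even if the loop throws an error
cl <- makeCluster(UseCores)
registerDoParallel(cl)
tryCatch({
  foreach(i=1:length(stack_list), .packages = "raster") %dopar% {
    #calculation here
  }
}, finally = stopCluster(cl))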

Benchmark

By using the parallelisation code, I managed to cut my processing time by roughly 50% when using three cores instead of one. This is how my CPU usage looks during single-core processing (upper right corner):

[Image: CPU usage during single-core processing]

You can see that two cores are working, one is doing the calculation and the second one is handling the overhead.

This is how my CPU usage looks when using all available cores (upper right corner):

[Image: CPU usage during multicore processing]
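
If you want to benchmark the speed-up on your own machine, a simple approach (my sketch, reusing stack_list and the registered cluster from above) is to wrap both versions of the loop in system.time() and compare the elapsed times:

#time the sequential loop
t_seq <- system.time({
  for(i in 1:length(stack_list)) {
    img  <- stack(stack_list[i])
    ndvi <- (img[[4]]-img[[3]]) / (img[[4]]+img[[3]])
  }
})

#time the parallel loop (cluster registered as shown above)
t_par <- system.time({
  foreach(i=1:length(stack_list), .packages = "raster") %dopar% {
    img  <- stack(stack_list[i])
    ndvi <- (img[[4]]-img[[3]]) / (img[[4]]+img[[3]])
  }
})

#compare elapsed times (in seconds)
t_seq["elapsed"]
t_par["elapsed"]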

The next tutorial will show how to use the clusterR() function, which enables multicore processing for some raster functions.

See you soon!


About This Author

Martin was born in the Czech Republic and studied at the University of Natural Resources and Life Sciences, Vienna. He currently works at an Earth Observation company in Austria that specialises in land monitoring. His main interests are open-source applications like R, (geospatial) statistics and data management, web mapping and visualisation. He loves travelling, geocaching, photography and TV series.
