Hey rasteR people !
Welcome to the second part of my tutorial on how to increase the processing speed of the R {raster} package. In the previous part I showed you how to speed up R by increasing the maxmemory limit. This part will be a little more “sophisticated” since we will have a look on how to parallelise R processes.
Parallelise? Wait, what?
“Paralellise” is such a fancy word, but what does it mean? Well, you probably know that your CPU (if you didn’t get stuck in the 90ties) has multiple cores to process your requests. My laptop for example has 4 cores and my dektop PC has 8. If you use R for you calculations, usually only one core is used to handle this calculation and the other ones are basically sleeping or handling some overhead operations like copying data and making sure your other programs are running properly. The following picture shows the CPU usuage across my 4 cores, during a typical R session:
You can see that only one core (in this case it’s core 3) is being used and the other ones are basically at 0% usage.
Imagine we have a big processing task that we have to perform over and over again (for example inisde a for loop). This is what R is doing in this situation: It activates only one core and lets it handle the iterations of the loop, step by step, one iteration at a time. This means that one core will be doing all the work and the other core will esentially be doing (almost) nothing, because R on default uses single core processing. What a waste, right? Why not use all cores for the loop? For example, core one processes iteration 1, while core 2 processes iteration 2, while core 3 processes iteration 3,… etc. This is esentially what parallelisation means: Using multiple cores, at the same time (parallel) for repetitive tasks. So how do we implement this in R?
Parallelisation – Step by Step
Example without parallelisation
Let’s say we have a folder with eight layerstacks and we would like to calculate the NDVI for every stack. This is how I would calculate the NDVI using a for-loop without parallel processing:
#load raster package library(raster) #path to directory with folders path <- "Downloads/Stacks/" #get file names using list.files() function stack_list <- list.files(path, pattern=".tif$", full.names=T) #Loop over every stack in filelist for(i in 1:length(stack_list)) { #stack image i img <- stack(stack_list[i]) #calc ndvi for image i ndvi <- (img[[4]]-img[[3]]) / (img[[4]]+img[[3]]) #construct filename outname <- sub(pattern = ".tif", replacement = "_ndvi.tif", x = stack_list[i]) #export raster writeRaster(ndvi, filename = outname, overwrite = T) }
This code will work just fine and will need approximately 2min to execute for smaller rasters (4000x4000px). However, imagine you have 100 rasters you would like to process. Here using multiple cores will save you a lot of time. And this is how you do it:
Example with parallelisation
library(raster) library(doParallel) #Foreach Parallel Adaptor library(foreach) #Provides foreach looping construct #Define how many cores you want to use UseCores <- detectCores() -1 #Register CoreCluster cl <- makeCluster(UseCores) registerDoParallel(cl) path <- "Downloads/Stacks/" stack_list <- list.files(path, pattern=".tif$", full.names=T) #Use foreach loop and %dopar% command foreach(i=1:length(stack_list)) %dopar% { library(raster) img <- stack(stack_list[i]) ndvi <- (img[[4]]-img[[3]]) / (img[[4]]+img[[3]]) outname <- sub(pattern = ".tif", replacement = "_ndvi.tif", x = stack_list[i]) writeRaster(ndvi, filename = outname, overwrite = T) } #end cluster stopCluster(cl)
The code looks very similar and only differs in some parts:
First, you need to load the doParallel and foreach packages. They enable R to perform parallel processing.
In a second step you have define how many core you want to use for you calculation:
#Define how many cores you want to use UseCores <- detectCores() -1 #Register CoreCluster cl <- makeCluster(UseCores) registerDoParallel(cl)
The function detectCores() returns the number of cores that are available. Note: You should always use one core less, so there will be one core available that handles the overhead tasks like copying etc. Otherwise your PC might crash. With makeCluster() and registerDoParallel() you register your cores for the parallelisation.
The last change is the foreach() loop instead of a for loop:
foreach(i=1:length(stack_list)) %dopar% { library(raster) #calulation here }
Note:
- The syntax inside the brackets differ from the conventional for loop where you would write i in 1:length(stack_list)), here you write i=1:length(stack_list)).
- After the loop brackets you have to write %dopar% which stand for “do parallel”.
- You have to load the libraries that are needed for your calculations inside the loop! In this case we will need the raster package, so we load the raster package using library(raster) INSIDE the loop statement.
In the end we close the parallelisation by calling stopCluster().
Benchmark
By using the parallelisation code, I managed to increase my processing time by 50% when I use three cores instead of one. This is how my CPU usage looks like using a single core processing (upper right corner):
You can see that two cores are working, one is doing the calculation and the second one is handling the overhead.
This is how my CPU usage looks like using all available cores (upper right corner):
The next tutorial will be on how to use the clusteR() function that enables the use of multicore processing of some raster functions.
See you soon!
13 Comments
You can post comments in this post.
Very useful post.
Thanks for the simple explanation.
Diego 7 years ago
You are welcome. Thanks for the feedback.
Martin 7 years ago
Yes, thank you for this post! Most comprehensive, easy to understand example I’ve come across of how to parallelize raster processing with R. Much appreciated.
Seth G. 6 years ago
Thanks for the nice feedback! It’s very appreciated!!
Martin 6 years ago
Thanks for the post, Martin. I tried using your approach to parallelize dismo::predict(). I want to predict a fitted model for multiple RasterStacks (i.e., alternative future scenarios). Using example below, I get the following error — ‘Error in { : task 1 failed – “missing layers (or wrong names)”‘?
predict() works if I specify the RasterStack (i.e., ‘future5’), but not if I point to a list (i.e., ‘stack_list[i]’)?
Any suggestions?
# where ‘m’ = maxent model derived in dismo
foreach(i=1:length(stack_list)) %dopar% {
library(dismo)
p <- predict(m, stack_list[i])
}
Jason M. 6 years ago
Hey!
Try stack_list[[i]] instead of stack_list[i]?
Good luck!
Martin 6 years ago
Thanks for the quick response! Unfortunately, still get same error (pasted below)?
Error in { : task 1 failed – “missing layers (or wrong names)”
3. stop(simpleError(msg, call = expr))
2. e$fun(obj, substitute(ex), parent.frame(), e$data)
1. foreach(i = 1:length(stack_list)) %dopar% {
library(dismo)
pc <- predict(m, stack_list[[i]])
}
Jason M. 6 years ago
mhm.. does it work if you use a simple for loop instead?
Furthermore, can you post str(stack_list) ?
cheers
Martin 6 years ago
Sorry about slow response, I went on leave for a few days… I think this is nearly solved, but there is still a bug? Any help much appreciated!
I figured out that I need to create a list of rasterStacks (i.e., ‘stacks <- as.list(c(base, s01, s02, s03))'), rather than a vector of character strings (i.e., 'stacks pc
class : RasterStack
nlayers : 0
Any suggestions?
#################################
library(dismo)
library(rgdal)
library(readr)
# description of objects()
# ‘species’ = list of species
# ‘me’ = maxent model for individual species
# ‘t’ = threshold for individual species
# ‘base’ = rasterStack for baseline climate (1990-2009)
# ‘s01’ – ‘s12’ = rasterStacks for each of 12 future climate scenarios
# ‘stacks’ = list of stack names
# ‘pc’ = rasterStack of predictions (continuous)
# import rasterStacks for multiple climate scenarios
l_0 <- list.files(path="Q:/refugia/climate/epoch_1990-2009", pattern='.tif',
full.names=T )
base <- stack(l_0[c(2,4,5,6,13,14,15)])
l_1 <- list.files(path="Q:/refugia/climate/epoch_2020-2039/future01",
pattern='.tif', full.names=T )
s01 <- stack(l_1[c(2,4,5,6,13,14,15)])
l_2 <- list.files(path="Q:/refugia/climate/epoch_2020-2039/future02",
pattern='.tif', full.names=T )
s02 <- stack(l_2[c(2,4,5,6,13,14,15)])
l_3 <- list.files(path="Q:/refugia/climate/epoch_2020-2039/future03",
pattern='.tif', full.names=T )
s03 <- stack(l_3[c(2,4,5,6,13,14,15)])
# create list of species
setwd("Q:/refugia/observations/")
spp <- read_csv("species_obs.csv", col_names=TRUE,
col_types=cols(speciesModel=col_character()))
species <- unique(spp$speciesModel)
# create list of climate rasterStacks
stacks <- as.list(c(base, s01, s02, s03))
# create list of scenario names for raster outputs
scenario <- c('current', 'future1', 'future2', 'future3')
for (j in 1:length(species)) {
pc <- stack()
path <- paste('Q:/refugia/maxent/', species[j], '_modelfit.rdata', sep = '')
attach(path); me <- me;
detach(paste('file:',path,sep=''), character.only=TRUE)
library(doParallel)
library(foreach)
UseCores <- detectCores() -1
cl <- makeCluster(UseCores)
registerDoParallel(cl)
foreach(i=1:4) %dopar% {
library(dismo)
pc <- predict(me, stacks[[i]])
setwd('Q:/refugia/test')
writeRaster(pc[[i]], filename=paste(species[j], "_", scenario[i], "_prediction.tif",
sep=""), format="GTiff", overwrite=T, progress='text')
}
stopCluster(cl)
rm(path,me,pc)
}
Jason M. 6 years ago
Previous post corrupted when CATPCHA Code was not successful? Sorry about line breaks in code (looked OK in comments window), but top paragraph should read…
I figured out that I need to create a list of rasterStacks (i.e., ‘stacks <- as.list(c(base, s01, s02, s03))'), rather than a vector of character strings (i.e., 'stacks <- c('base', 's01', 's02', 's03')'). The code (pasted below) now runs — all my processors engage with similar/parallel performance curves. I think the dismo::predict() step is running successfully (i.e., my processors all spin for ~20 minutes = ~expected time), but then the script falls over trying to write predictions as a rasterStack. Any help much appreciated!
Jason M. 6 years ago
embarrasing… here is the error…
Error in { : task 2 failed – “not a valid subset”
> pc
class : RasterStack
nlayers : 0
Jason M. 6 years ago
[…] Increasing the speed of {raster} processing with R: Part 2/3 Parallelisation […]
Raster Parallelization – Avian Ecologist 4 years ago
Hey, thanks for the interesting posts.
Do you have insights regarding the settings of rasterOptions when running parallel processes?
In particular it is not clear to me whether the fraction of memory allocated to one process (memfrac option) should account for the number of cores registered (i.e. make sure that the number of cores used times the fraction of memory < 1)?
Best,
Valentin
Valentin 4 years ago
Post A Reply