Benchmark: R vs. Python Rasterio vs. GDAL

Hey there!
Today I would like to do a small comparsion between three raster processing tools:
R, Python (rasterio) and GDAL.
When I started with GIS and raster processing, I didn’t really pay much attention to the performance of the scripts I wrote and tools I used. But now, working with bigger data, higher spatial resolution, a fast processing of my raster files is the key to success. Therefore I will compare the performance of three raster processing tools for you, namely on a simple NDVI calculation using a Landsat 8 subset:

Data:

First, lets have a look at the data I’ve used for the benchmark: It’s a standard Landsat 8 image downloaded from the USGS EarthExplorer Webpage.
It has spatial resolution of 30mx30m and the dimensions of the subset are 3485×4606 pixels. The scene depicts the Lake Baikal in Siberia, Russia:
Landsat 8 subset showing Lake Baikal in Siberia, Russia

Tested processing tools:

The processing tools I have tested in this benchmark are the following: R, Python rasterio and GDAL.

R

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. Here I focused on the {raster} package that is widely used for the handling of raster files in R. [read more here]

I tested two algorithms:

  • The first using the overlay() function from the raster package. This functions enables users to perform raster algebra with two or more input bands. This is necessary for the calculation of the NDVI.
  • A second one, using standard R syntax.

You can have a look at both code snippets below:

library(raster)

#----------1. ALGORITHM - RASTER PACKAGE with OVERLAY FUNCTION -------------
ras <- stack("baikal_subset.tif")

ndvi <- overlay(ras, fun=function(x){(x[1]-x[2])/(x[1]+x[2])})

writeRaster(ndvi, "ndvi_subset_raster.tif", 
            datatype="FLT8S",
            options=c("compress=lzw"),
            overwrite=T)

#---------- 2. ALGORITHM - RASTER with STANDARD R SYNTAX -----------------
ras <- stack("baikal_subset.tif")

b1 <- ras[[1]]
b2 <- ras[[2]]

ndvi = (b1-b2)/(b1+b2)

writeRaster(ndvi, "ndvi_subset_R.tif", 
            datatype="FLT8S",
            options=c("compress=lzw"),
            overwrite=T)

 

Python Rasterio

Rasterio employs GDAL under the hood for file I/O and raster formatting. Its functions typically accept and return Numpy ndarrays. Rasterio is designed to make working with geospatial raster data more productive and more fun.  [read more here]

You can have a look at the code below:

import rasterio
import numpy

with rasterio.drivers():
    with rasterio.open('baikal_subset.tif') as src:
        b1, b2, b3, b4, b5 = src.read()

        profile = src.profile
        profile.update(
            dtype=rasterio.float64,
            count=1,
            compress='lzw')

    ndvi = numpy.zeros(b1.shape)
    ndvi = (b1-b2)/(b1+b2)

    with rasterio.open('ndvi_python.tif', 'w', **profile) as dst:
        dst.write(ndvi.astype(rasterio.float64), 1)

GDAL

The last tool I used is GDAL and the function gdal_calc.py:

GDAL is a translator library for raster and vector geospatial data formats that is released under an X/MIT style Open Source license by the Open Source Geospatial Foundation. As a library, it presents a single raster abstract data model and vector abstract data model to the calling application for all supported formats. It also comes with a variety of useful command line utilities for data translation and processing. The NEWS page describes the January 2016 GDAL/OGR 2.0.2 release. [read more here]

Benchmark Results

For the benchmark I ran all the scripts described above 5 times each and got the following very interesting result:

The fastest processing was done by GDAL with a mean value of 5.8seconds. The python rasterio package was slightly slower with an average processing time of 7.95 seconds. Unfortunately my favourite software R was the slowest: using the standard R syntax an average processing time of 15.3 seconds was recorded and what was shocking (!) to me: the overlay() function from the raster package needed 105.84 seconds on average to calculate the NDVI from the Landsat 8 subset.

Note: This benchmark was carried out on a MacBook Air with 1,8 GHz Intel Core i5 and 8GB 1600 MHz DDR3 with OS X El Capitan. Here my version numbers: R 3.2.3, Python 2.7, GDAL 1.11.3 and Rasterio 0.31.0.

Let me know if you have any questions.

Cheers

Martin

About This Author

Martin was born in Czech Republic and studied at the University of Natural Resources and Life Sciences, Vienna. He is currently working at an Earth Observation Company in Austria, specialised in Land Monitoring. His main interests are: Open-source applications like R, (geospatial) statistics and data-management, web-mapping and visualization. He loves travelling, geocaching, photography and TV series.

3 Comments

You can post comments in this post.


  • Thanks Martin for your speed testing. The experiences I have made with the raster package in R confirm that in some way it is very slow and often unstable.
    Now, why is GDAL the winner? imho because it is written in C. Do you agree?

    Simon 1 year ago Reply


    • Dear Simon,
      I am happy to hear from you. Yes, as you already pointed out it’s due to the fact that GDAL is written in C / Python which makes it much faster. I recently stumbled upon the possibility of parallelisation with R, this increases the performance of many processing tasks enormously. You should definitely check out my posts on that topic.
      Hope you are doing fine.
      Cheers Martin

      Martin 12 months ago Reply


  • Interesting post. I use both R and Python (via Rasterio) for spatial data analysis and processing of raster data. Python does indeed feel faster for many purposes. However, a comment regarding your benchmarking is that your Rasterio script is loading the entire image stack into a numpy array for processing, which of course is very fast. However, for many analyses involving high-resolution imagery and/or large grids, loading the entire dataset into numpy is not possible. I can rarely use such an approach. R via the raster package (applies to both of your examples) is reading chunks from the rasters on disk, and also using a temp file to store the processing results. Therefore it will be much slower, but this approach is more sophisticated because grids of any dimension can be processed. The equivalent Python code requires quite a bit more work in terms of coding, and involves looping through the raster using windowed reading, performing the calculation in numpy, followed by windowed writing of the results. This will be much slower, and it would be interesting to compare the benchmark using windowed reading and writing in Rasterio involving looping in Python, vs R where I believe this component of the Raster package is written in C++.

    Steve 9 months ago Reply


Post A Reply

*