xgBoost for landcover classification in R

My favourite supervised classification method for land cover classification until now was the very popular Random Forest. Recently however, I stumbled upon the xgBoost algorithm which made me very curious because of its huge success on the machine learning competition platform Kaggle where it has won several competitions. Since I didn’t know much about this method, I did some research. Summarised in this post you will find the main difference between Random Forest and xgBoost, as well as a minimal example for running a xgBoost landcover classification on a Sentinel-2 image using R.

 

Random Forest vs. xgBoost

What is the difference you might ask? Well both of these methods are based on ensembles of decision trees, the main difference lies in the way how these trees are grow:

 

Random Forest is a Bagging algorithm which reduces variance. I found a very nice explanation on this topic in the following stackexchange post:

Say that you have very unreliable models, such as Decision Trees. (Why unreliable? Because if you change your data a little bit, the decision tree created can be very different.) In such a case, you can build a robust model (reduce variance) through bagging — bagging is when you create different models by resampling your data to make the resulting model more robust.

Random forest is what we call to bagging applied to decision trees, but it’s no different than other bagging algorithm.

Why would you want to do this? It depends on the problem. But usually, it is highly desirable for the model to be stable.

If you still are not 100% what bagging means, here is a short youtube video that will give you some further insights. It can be basically said that Random Forst grows the decision trees in parallel – each independently.

 

xgBoost is a Boosting algorithm,  a very nice explanation of what this means,  can be found in the very same stackexchange post:

Boosting reduces variance, and also reduces bias. It reduces variance because you are using multiple models (bagging). It reduces bias by training the subsequent model by telling him what errors the previous models made (the boosting part). […] In these ensembles, your base learner must be weak. If it overfits the data, there won’t be any residuals or errors for the subsequent models to build upon. Why are these good models? Well, most competitions in websites like Kaggle have been won using gradient boosting trees. Data science is an empirical science, “because it works” is good enough.

Again, here is a short youtube video that might help you understand boosting a little bit better. xgBoost leanrs from previous models and grows iteratively (it learns step by step by looking at the residuals for example).

 

xgBoost 101 for landcover in R

Let’s do a quick landcover classification! For this we need two things as an input:

  • Satellite image (we will be using a 4 Band Sentinel-2 stack as a minimal example)
  • Sample points

Here a short overview over my input data visualised with QGIS:

Input data for xgBoost classification

Input data for xgBoost classification

Step 1 – Load data into R

library(raster)

shp <- shapefile("yourPath/samples.shp")
ras <- stack("yourPath/sentinel2_2348_clip.tif")

And this is how our data looks in R:

Raster

> ras
class : RasterStack 
dimensions : 1535, 1813, 2782955, 4 (nrow, ncol, ncell, nlayers)
resolution : 10, 10 (x, y)
extent : 624190, 642320, 5304770, 5320120 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
names : red, green, blue, nir 
min values : 820, 609, 377, 282 
max values : 3907, 3845, 5208, 5205

As you can see, the raster has 4 bands (red,green,blue and nir) and is in UTM 33 projection.

Shapefile

> shp
class : SpatialPointsDataFrame 
features : 346 
extent : 624751.2, 641951.9, 5304893, 5319267 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
variables : 1
names : class 
min values : 1 
max values : 5

The shapefile contains 346 points and has 5 classes:

  • 1 – Water
  • 2 – Reeds
  • 3 – Agriculture
  • 4 – Built-up
  • 5 – Forest

Step 2 – Install xgBoost package

Before we can start, we need one last thing: The xgBoost R package – installing it is really straight forward simply type the following and load it afterwards:

install.packages("xbgoost")
library(xgboost)

Step 3 – Prepare training data

Now we need to prepare our training data: The first thing that we will do is to extract the values for each sample point from the raster:

vals <- extract(ras,shp)

Depending on the number of samples and raster size, this may take a while. Go grab a coffee if it takes too long! 🙂 Now we should have our data in a dataframe. For each sample point (row) we should have spectral information of each layer (column):

 red green blue nir
 [1,] 1484 1464 1115 375
 [2,] 1471 1468 1119 371
 [3,] 1493 1474 1151 376
 [4,] 1470 1435 1097 397
...

Now we convert the dataframe into a matrix – this is needed for xgBoost:

train <- data.matrix(vals)

We are almost ready to start the training. One last thing: Remeber how my classes start with class number 1 and go up to 5? xgBoost is a littl bit of a “diva” and doesn’t like that. For multiclass classification like ours, classes need to start with 0 and go to n-1. Let’s take care of this:

# We must convert factors to numeric
# They must be starting from number 0 to use multiclass
# For instance: 0, 1, 2, 3, 4...
classes <- as.numeric(as.factor(shp@data$class)) - 1

Step 4 – Train the xgBoost model

Training xgBoost is much more complex than randomForest because there are many more hyperparameters that need tuning. Finding out the right parametes is very important for a good and stable result. If you are interested in this topic, let me know and I will write a designated blog post about this. For now I will refer you to a couple of websites that cover this topic quite well (including R code):

For know I will just show you how the set up the training with standard parameters:

xgb <- xgboost(data = train, 
               label = classes, 
               eta = 0.1,
               max_depth = 6, 
               nround=100, 
               objective = "multi:softmax",
               num_class = length(unique(classes)),
               nthread = 3)

The data argument takes our previously created matrix of values extracted from the raster. The labels are our reclassified classes scaled from 0 to 4. The other parameters  (and optionally others as well) need to be tuned in your model individually.

Step 5 – Use model to classify a raster

For this you can use the predict function – in my case a little hacky version:

# prediction on the raw raster yielded an error. That's why I extract the values from the raster and predict on them.
# 1:(dim(ras)[1]*dim(ras)[2]) is simply a little bit hacky indexing. Should also be possible to do this cleaner.
# Note: works only with smaller raster or huge RAM. Let me know if it works for you.
result   <- predict(xgb, ras[1:(dim(ras)[1]*dim(ras)[2])]

#create dummy result raster
res      <- raster(ras)

#fill in results and add a "1" to them (to get back to inital class numbering! - see above "Prepate data" for more information)
res      <- setValues(res,result+1)

Voila! We are done. Lets have a look at our result in QGIS:

xgboost_result - classified in R

xgboost_result – classified in R

 

The classification looks quite homogenous (no salt and pepper effekt). Especially if we keep in mind that we have quite complex classes like reeds and agriculture and only sampled with about 300 samples and used 4 input bands. Impressive!! To achieve better results one might use multiple seasons, increase the number of bands orsamples and tune the hyper paramaters.

Good luck building your classifier! Let me know if you have any questions.

Cheers

Martin

About This Author

Martin was born in Czech Republic and studied at the University of Natural Resources and Life Sciences, Vienna. He is currently working at GeoVille - an Earth Observation Company based in Austria, specialised in Land Monitoring. His main interests are: Open-source applications like R, (geospatial) statistics and data-management, web-mapping and visualization. He loves travelling, geocaching, photography and sports.

2 Comments

You can post comments in this post.


  • Excellent post, very informative. Thanks for sharing! For step 5, did you try to use the predict function from the raster package? Also raster::beginCluster() and endCluster() may speed up prediction through parallelization. I wrote a post in my blog (at amsantac .co) showing an example of how to implement it in case you are interested. Thanks again for your contribution! I’m looking forward to reading your next posts!

    Ali 2 months ago Reply


    • Hey Ali! Thanks for the nice feedback!
      I actually tried using the raster predict function also in combination with clusterR but I wasnt able to make it run. (With randomForest it runs flawlessly). If you find out more, let me know.
      Cheers!
      P.S.: great blog btw!

      Martin 2 months ago Reply


Post A Reply

*