Until now, my favourite method for supervised land cover classification was the very popular Random Forest. Recently, however, I stumbled upon the xgBoost algorithm, which made me very curious because of its huge success on the machine learning competition platform Kaggle, where it has won several competitions. Since I didn’t know much about this method, I did some research. Summarised in this post you will find the main differences between Random Forest and xgBoost, as well as a minimal example of running an xgBoost land cover classification on a Sentinel-2 image using R.
Random Forest vs. xgBoost
What is the difference, you might ask? Well, both of these methods are based on ensembles of decision trees; the main difference lies in the way these trees are grown:
Random Forest is a Bagging algorithm which reduces variance. I found a very nice explanation on this topic in the following stackexchange post:
Say that you have very unreliable models, such as Decision Trees. (Why unreliable? Because if you change your data a little bit, the decision tree created can be very different.) In such a case, you can build a robust model (reduce variance) through bagging — bagging is when you create different models by resampling your data to make the resulting model more robust.
Random forest is what we call bagging applied to decision trees, but it’s no different from any other bagging algorithm.
Why would you want to do this? It depends on the problem. But usually, it is highly desirable for the model to be stable.
If you are still not 100% sure what bagging means, here is a short YouTube video that will give you some further insights. In short: Random Forest grows the decision trees in parallel – each one independently.
xgBoost is a Boosting algorithm. A very nice explanation of what this means can be found in the very same stackexchange post:
Boosting reduces variance, and also reduces bias. It reduces variance because you are using multiple models (bagging). It reduces bias by training the subsequent model by telling him what errors the previous models made (the boosting part). […] In these ensembles, your base learner must be weak. If it overfits the data, there won’t be any residuals or errors for the subsequent models to build upon. Why are these good models? Well, most competitions in websites like Kaggle have been won using gradient boosting trees. Data science is an empirical science, “because it works” is good enough.
Again, here is a short YouTube video that might help you understand boosting a little better. xgBoost learns from the previous models and grows iteratively (it learns step by step, for example by looking at the residuals).
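To make the "learning from the residuals" idea concrete, here is a tiny base-R toy (my own sketch, not the xgboost API): each round fits the weakest learner imaginable – a single constant equal to the mean of the current residuals – and adds a damped version of it to the ensemble prediction:

```r
# Toy illustration of boosting: every weak learner is just a constant
# (the mean of the current residuals), added with a learning rate.
boost_means <- function(y, eta = 0.5, rounds = 20) {
  pred <- rep(0, length(y))
  for (i in seq_len(rounds)) {
    residuals <- y - pred        # what the ensemble still gets wrong
    weak <- mean(residuals)      # "fit" a weak learner to the residuals
    pred <- pred + eta * weak    # learning rate eta damps each step
  }
  pred
}

boost_means(c(2, 4, 6, 8))
# after 20 rounds every prediction has converged towards 5 (the best
# constant fit) - the residuals shrink a little with every round
```

Real gradient boosting replaces the constant with a small decision tree fitted to the residuals, but the iterative "correct what is still wrong" loop is the same.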
xgBoost 101 for landcover in R
Let’s do a quick landcover classification! For this we need two things as an input:
- Satellite image (we will be using a 4 Band Sentinel-2 stack as a minimal example)
- Sample points
Here a short overview over my input data visualised with QGIS:
Step 1 – Load data into R
library(raster)
shp <- shapefile("yourPath/samples.shp")
ras <- stack("yourPath/sentinel2_2348_clip.tif")
And this is how our data looks in R:
> ras
class       : RasterStack
dimensions  : 1535, 1813, 2782955, 4  (nrow, ncol, ncell, nlayers)
resolution  : 10, 10  (x, y)
extent      : 624190, 642320, 5304770, 5320120  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
names       : red, green, blue, nir
min values  : 820, 609, 377, 282
max values  : 3907, 3845, 5208, 5205
As you can see, the raster has 4 bands (red,green,blue and nir) and is in UTM 33 projection.
> shp
class       : SpatialPointsDataFrame
features    : 346
extent      : 624751.2, 641951.9, 5304893, 5319267  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
variables   : 1
names       : class
min values  : 1
max values  : 5
The shapefile contains 346 points and has 5 classes:
- 1 – Water
- 2 – Reeds
- 3 – Agriculture
- 4 – Built-up
- 5 – Forest
Step 2 – Install xgBoost package
Before we can start, we need one last thing: the xgBoost R package. Installing it is really straightforward, simply type the following and load it afterwards:

install.packages("xgboost")
library(xgboost)
Step 3 – Prepare training data
Now we need to prepare our training data: The first thing that we will do is to extract the values for each sample point from the raster:
vals <- extract(ras,shp)
Depending on the number of samples and raster size, this may take a while. Go grab a coffee if it takes too long! 🙂 Now we should have our data in a dataframe. For each sample point (row) we should have spectral information of each layer (column):
      red green blue nir
[1,] 1484 1464 1115 375
[2,] 1471 1468 1119 371
[3,] 1493 1474 1151 376
[4,] 1470 1435 1097 397
...
Now we convert the dataframe into a matrix – this is needed for xgBoost:
train <- data.matrix(vals)
We are almost ready to start the training. One last thing: Remember how my classes start with class number 1 and go up to 5? xgBoost is a little bit of a “diva” and doesn’t like that. For multiclass classification like ours, classes need to start with 0 and go up to n-1. Let’s take care of this:
# We must convert factors to numeric
# They must start from 0 for multiclass classification
# For instance: 0, 1, 2, 3, 4...
classes <- as.numeric(as.factor(shp@data$class)) - 1
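A quick sanity check on a toy vector (hypothetical labels, not my shapefile) shows what this line does: the factor trick maps arbitrary class labels onto consecutive integers starting at 0:

```r
# Hypothetical toy labels - note they don't even need to be consecutive
labels <- c(1, 3, 5, 1, 3)
# as.factor() assigns level indices 1..n; subtracting 1 makes them 0-based
zero_based <- as.numeric(as.factor(labels)) - 1
zero_based
# 0 1 2 0 1
```

This also means the mapping back to your original class numbers is only a simple "+ 1" if your classes were already consecutive integers, as mine are.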
Step 4 – Train the xgBoost model
Training xgBoost is much more complex than randomForest because there are many more hyperparameters that need tuning. Finding the right parameters is very important for a good and stable result. If you are interested in this topic, let me know and I will write a dedicated blog post about it. For now, I will refer you to a couple of websites that cover this topic quite well (including R code):
- Hypertuning xgboost parameters (stackoverflow)
For now, I will just show you how to set up the training with standard parameters:
xgb <- xgboost(data      = train,
               label     = classes,
               eta       = 0.1,
               max_depth = 6,
               nround    = 100,
               objective = "multi:softmax",
               num_class = length(unique(classes)),
               nthread   = 3)
The data argument takes our previously created matrix of values extracted from the raster. The labels are our reclassified classes, scaled from 0 to 4. These parameters (and optionally others as well) need to be tuned individually for your model.
Step 5 – Use model to classify a raster
For this you can use the predict function – in my case a little hacky version:
# Prediction on the raw raster yielded an error, so I extract the
# values from the raster and predict on those instead.
# Note: works only with a small raster or huge RAM. Let me know if it works for you.
result <- predict(xgb, ras[1:(nrow(ras)*ncol(ras))])
# Create a dummy result raster
res <- raster(ras)
# Fill in the results and add a "1" to them (to get back to the initial
# class numbering - see "Prepare training data" above for more information)
res <- setValues(res, result + 1)
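If your raster does not fit into RAM in one go, the raster package’s block-processing helpers (blockSize(), writeStart(), writeValues(), writeStop()) offer a more memory-friendly alternative. This is only a sketch, assuming the xgb model and the ras stack created in the steps above:

```r
# Sketch: classify the raster block by block instead of all at once.
# Assumes 'xgb' (trained model) and 'ras' (RasterStack) from above.
res <- raster(ras)
res <- writeStart(res, filename = "result.tif", overwrite = TRUE)
bs  <- blockSize(ras)                 # sensible chunking suggested by raster
for (i in seq_len(bs$n)) {
  vals  <- getValues(ras, row = bs$row[i], nrows = bs$nrows[i])
  preds <- predict(xgb, as.matrix(vals)) + 1  # back to 1-based classes
  res   <- writeValues(res, preds, bs$row[i])
}
res <- writeStop(res)
```

The result is written to disk chunk by chunk, so only one block of pixel values is ever held in memory.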
Voilà! We are done. Let’s have a look at our result in QGIS:
The classification looks quite homogeneous (no salt-and-pepper effect), especially if we keep in mind that we have quite complex classes like reeds and agriculture, and only sampled with about 300 samples and used 4 input bands. Impressive! To achieve better results, one might use multiple seasons, increase the number of bands or samples, and tune the hyperparameters.
Good luck building your classifier! Let me know if you have any questions.