Until now, my favourite supervised method for land cover classification was the very popular Random Forest. Recently, however, I stumbled upon the xgBoost algorithm, which made me very curious because of its huge success on the machine learning competition platform Kaggle, where it has won several competitions. Since I didn’t know much about this method, I did some research. Summarised in this post you will find the main differences between Random Forest and xgBoost, as well as a minimal example for running an xgBoost landcover classification on a Sentinel-2 image using R.
Random Forest vs. xgBoost
What is the difference, you might ask? Well, both of these methods are based on ensembles of decision trees; the main difference lies in the way these trees are grown:
Random Forest is a Bagging algorithm which reduces variance. I found a very nice explanation on this topic in the following stackexchange post:
Say that you have very unreliable models, such as Decision Trees. (Why unreliable? Because if you change your data a little bit, the decision tree created can be very different.) In such a case, you can build a robust model (reduce variance) through bagging — bagging is when you create different models by resampling your data to make the resulting model more robust.
Random forest is what we call bagging applied to decision trees, but it’s no different from other bagging algorithms.
Why would you want to do this? It depends on the problem. But usually, it is highly desirable for the model to be stable.
If you still are not 100% sure what bagging means, here is a short youtube video that will give you some further insights. Basically, Random Forest grows the decision trees in parallel – each one independently of the others.
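To make the bagging idea a bit more tangible, here is a toy sketch in R (using rpart simply as a stand-in for any unstable base learner – this is not the full Random Forest algorithm, which additionally samples the predictor variables at each split):

library(rpart)

# Toy data: a noisy signal that a single tree fits rather unstably
set.seed(1)
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- sin(dat$x) + rnorm(200, sd = 0.2)

# Bagging: fit many trees on bootstrap resamples and average their predictions
preds <- replicate(50, {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]  # resample the data
  tree <- rpart(y ~ x, data = boot)                 # one (unstable) tree
  predict(tree, dat)
})
bagged <- rowMeans(preds)  # averaging stabilises the individual trees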
xgBoost is a Boosting algorithm; a very nice explanation of what this means can be found in the very same stackexchange post:
Boosting reduces variance, and also reduces bias. It reduces variance because you are using multiple models (bagging). It reduces bias by training the subsequent model by telling him what errors the previous models made (the boosting part). […] In these ensembles, your base learner must be weak. If it overfits the data, there won’t be any residuals or errors for the subsequent models to build upon. Why are these good models? Well, most competitions in websites like Kaggle have been won using gradient boosting trees. Data science is an empirical science, “because it works” is good enough.
Again, here is a short youtube video that might help you understand boosting a little bit better. xgBoost learns from previous models and grows iteratively (it learns step by step, for example by looking at the residuals).
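If you want to see the residual idea in action, here is a tiny boosting sketch in R (again with rpart as the weak learner; this only illustrates the principle – xgBoost itself adds gradient-based optimisation and regularisation on top):

library(rpart)

set.seed(42)
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- sin(dat$x) + rnorm(200, sd = 0.2)

pred <- rep(mean(dat$y), nrow(dat))  # start from a constant prediction
eta  <- 0.1                          # learning rate, like xgBoost's eta

for (i in 1:100) {
  dat$resid <- dat$y - pred                           # what is still unexplained
  weak <- rpart(resid ~ x, data = dat, maxdepth = 2)  # small ("weak") tree fit to the residuals
  pred <- pred + eta * predict(weak, dat)             # add a small correction
}

mean((dat$y - pred)^2)  # the training error shrinks as more trees are added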
xgBoost 101 for landcover in R
Let’s do a quick landcover classification! For this we need two things as input:
- Satellite image (we will be using a 4 Band Sentinel-2 stack as a minimal example)
- Sample points
Here is a short overview of my input data, visualised with QGIS:
Step 1 – Load data into R
library(raster)

shp <- shapefile("yourPath/samples.shp")
ras <- stack("yourPath/sentinel2_2348_clip.tif")
And this is how our data looks in R:
Raster
> ras
class       : RasterStack
dimensions  : 1535, 1813, 2782955, 4  (nrow, ncol, ncell, nlayers)
resolution  : 10, 10  (x, y)
extent      : 624190, 642320, 5304770, 5320120  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
names       :  red, green, blue,  nir
min values  :  820,   609,  377,  282
max values  : 3907,  3845, 5208, 5205
As you can see, the raster has 4 bands (red, green, blue and nir) and is in the UTM zone 33 projection.
Shapefile
> shp
class       : SpatialPointsDataFrame
features    : 346
extent      : 624751.2, 641951.9, 5304893, 5319267  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
variables   : 1
names       : class
min values  : 1
max values  : 5
The shapefile contains 346 points and has 5 classes:
- 1 – Water
- 2 – Reeds
- 3 – Agriculture
- 4 – Built-up
- 5 – Forest
Step 2 – Install xgBoost package
Before we can start, we need one last thing: the xgBoost R package. Installing it is really straightforward – simply type the following and load it afterwards:
install.packages("xgboost")
library(xgboost)
Step 3 – Prepare training data
Now we need to prepare our training data. The first thing we will do is extract the values for each sample point from the raster:
vals <- extract(ras,shp)
Depending on the number of samples and raster size, this may take a while. Go grab a coffee if it takes too long! 🙂 Now we should have our data in a dataframe. For each sample point (row) we should have spectral information of each layer (column):
      red green blue nir
[1,] 1484  1464 1115 375
[2,] 1471  1468 1119 371
[3,] 1493  1474 1151 376
[4,] 1470  1435 1097 397
...
Now we convert the dataframe into a matrix – this is needed for xgBoost:
train <- data.matrix(vals)
We are almost ready to start the training. One last thing: remember how my classes start with class number 1 and go up to 5? xgBoost is a little bit of a “diva” and doesn’t like that. For a multiclass classification like ours, classes need to start at 0 and go up to n-1. Let’s take care of this:
# We must convert factors to numeric
# They must be starting from number 0 to use multiclass
# For instance: 0, 1, 2, 3, 4...
classes <- as.numeric(as.factor(shp@data$class)) - 1
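If you want to double-check the relabelling, a quick cross-tabulation of the old and new class numbers does the job:

# Each original class should map to exactly one new label (1 -> 0, 2 -> 1, ...)
table(original = shp@data$class, relabelled = classes)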
Step 4 – Train the xgBoost model
Training xgBoost is much more complex than training randomForest because there are many more hyperparameters that need tuning. Finding the right parameters is very important for a good and stable result. If you are interested in this topic, let me know and I will write a dedicated blog post about it. For now I will refer you to a couple of websites that cover this topic quite well (including R code).
For now I will just show you how to set up the training with standard parameters:
xgb <- xgboost(data = train,
               label = classes,
               eta = 0.1,
               max_depth = 6,
               nround = 100,
               objective = "multi:softmax",
               num_class = length(unique(classes)),
               nthread = 3)
The data argument takes our previously created matrix of values extracted from the raster. The labels are our reclassified classes, scaled from 0 to 4. These parameters (and optionally others as well) need to be tuned individually for your model.
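If you want a first feel for the tuning, the xgboost package ships a cross-validation helper, xgb.cv. Here is a rough sketch of how one might evaluate the settings used above on our data – the parameter values are the same illustrative ones as before, not recommendations:

cv <- xgb.cv(data = train, label = classes,
             nfold = 5, nrounds = 100,
             eta = 0.1, max_depth = 6,
             objective = "multi:softmax",
             num_class = length(unique(classes)),
             metrics = "merror", verbose = 0)

# In recent package versions the per-round results are stored in cv$evaluation_log;
# look at the mean test error of the last rounds
tail(cv$evaluation_log)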
Step 5 – Use model to classify a raster
For this you can use the predict function – in my case a slightly hacky version:
# Prediction on the raw raster yielded an error, so I extract the values
# from the raster and predict on them.
# Note: works only with a small raster or huge RAM. Let me know if it works for you.
result <- predict(xgb, ras[1:(nrow(ras)*ncol(ras))])

# Create a dummy result raster
res <- raster(ras)

# Fill in the results and add 1 to get back to the initial class numbering
# (see "Prepare training data" above for more information)
res <- setValues(res, result + 1)
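If your raster is too big for this all-at-once approach, a block-wise variant of the same idea might help. This is only a sketch – the output filename is a placeholder and I have not benchmarked it:

res <- raster(ras)
bs  <- blockSize(ras)
res <- writeStart(res, filename = "yourPath/xgb_classification.tif", overwrite = TRUE)
for (i in seq_len(bs$n)) {
  v <- getValues(ras, row = bs$row[i], nrows = bs$nrows[i])  # one block of pixel values
  p <- predict(xgb, data.matrix(v)) + 1                      # back to class numbers 1-5
  res <- writeValues(res, p, bs$row[i])
}
res <- writeStop(res)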
Voila! We are done. Let’s have a look at our result in QGIS:
The classification looks quite homogeneous (no salt-and-pepper effect), especially if we keep in mind that we have quite complex classes like reeds and agriculture, only used about 300 samples and just 4 input bands. Impressive! To achieve better results one might use multiple seasons, increase the number of bands or samples and tune the hyperparameters.
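By the way, if you want to put a number on “impressive”, you could hold back part of the samples and look at a confusion matrix. A minimal sketch – the 70/30 split and the variable names are just illustrative:

set.seed(7)
idx <- sample(nrow(train), round(0.7 * nrow(train)))   # 70% of the samples for training

xgb_part <- xgboost(data = train[idx, ], label = classes[idx],
                    eta = 0.1, max_depth = 6, nrounds = 100,
                    objective = "multi:softmax",
                    num_class = length(unique(classes)),
                    verbose = 0)

pred <- predict(xgb_part, train[-idx, ])                # predict the held-back samples
table(observed = classes[-idx], predicted = pred)       # confusion matrix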
Good luck building your classifier! Let me know if you have any questions.
Cheers
Martin
11 Comments
Excellent post, very informative. Thanks for sharing! For step 5, did you try to use the predict function from the raster package? Also raster::beginCluster() and endCluster() may speed up prediction through parallelization. I wrote a post in my blog (at amsantac .co) showing an example of how to implement it in case you are interested. Thanks again for your contribution! I’m looking forward to reading your next posts!
Ali 5 years ago
Hey Ali! Thanks for the nice feedback!
I actually tried using the raster predict function, also in combination with clusterR, but I wasn’t able to make it run. (With randomForest it runs flawlessly.) If you find out more, let me know.
Cheers!
P.S.: great blog btw!
Martin 5 years ago
Has anyone been able to use an xgboost model to predict classes using the raster::predict multi-core function? Your hack to predict to a raster works well, but it just takes much longer than a multi-core prediction of a RandomForest or BRT model.
Evan 4 years ago
Do you need to use points as training sites, or can you use polygons?
James Magidi 4 years ago
Hi,
what would be the solution if I want to train the model on polygon samples, not on point samples?
Thank You in advance
Dino 4 years ago
Thanks for your very interesting post !
As James and Dino, I would be quite interested to know how to do the same job with polygon samples as training sites.
Thanks again!
Paul Taconet 4 years ago
Hey Martin
Greetings from Vienna. I have immensely profited from your posts. I have a raster that has integer values, say 0 to 10. I can convert that to some categories such as c("CL1", "CL2", ...) with the ratify function. The problem is that when I save my raster as ASCII, the stored attribute table is no longer there.
Can you please suggest some options on how to store the categories of a raster so that later on such rasters can be read and visualised?
best wishes
dev
Dev 4 years ago
Hey Martin,
Great post, it was very helpful for me to understand XGB better. I’m currently working on a ~230 MB raster with 8GB of RAM, and your prediction method works well (of course, it takes some time – but it gets the job done!).
Thank you!
Robert 4 years ago
Hi Martin, thank you very much for your post I’ve really appreciated it 🙂
I’ve found a way to speed up the prediction.
First you need to convert your raster to a data.frame:
mdf <- data.frame(rasterToPoints(ras))
Then you can use the native XGBoost procedure:
result <- predict(xgb, as.matrix(mdf %>%
  dplyr::select(-x, -y)))
Hope this helps
Tristan 3 years ago
Thank you for the hint! How is your experience with large rasters?
Martin 3 years ago
so the classes are combined as one shape file
Ghazal Ba khadher 3 years ago