Introduction to Extreme Value Analysis in R – Part 3: Peak over Threshold Approach

Hey there,

welcome to part 3 of our short introduction to extreme value analysis using the extRemes package in R.

Having discussed the block maxima method the last time, we will now have a look at the threshold excess method.

According to Coles (2001), the threshold approach is more efficient than the block maxima method if complete (time) series without gaps are available, as all values exceeding a certain threshold can serve as a basis for model fitting. In some cases, fitting distributions to block maxima data is a wasteful approach as only one value per block is used for modelling, while a threshold excess approach potentially provides more information on extremes.

However, analogous to the choice of the block size in the block maxima approach, the selection of the threshold value for partial duration models is also subject to a trade-off between bias (low threshold) and variance (high threshold).

Coles (2001) describes two different methods for threshold selection. Firstly, there is an exploratory approach based on the mean residual life plot. This technique is applied prior to the actual model fitting. Secondly, an alternative approach is to assess the stability of parameter estimates. Hence, this sensitivity analysis for model fitting is performed across a range of different thresholds.

However, selecting an appropriate threshold is probably the most crucial part of performing an extreme value analysis using partial duration series. Scarrott and MacDonald provide a quite good overview of approaches for threshold estimation in their 2012 article A review of extreme value threshold estimation and uncertainty quantification (REVSTAT 10(1): 33–59).

Having found an appropriate threshold, the resulting subset of extreme values exceeding this threshold is used for fitting a generalized Pareto distribution.

According to the Pickands-Balkema-de Haan theorem, the distribution of values exceeding a threshold can be approximated by a generalized pareto distribution.

The following code shows a short practical example of fitting a generalized pareto distribution to a time series of precipitation data using the extRemes package in R. The sample data set features precipitation data of Oberwang (Salzburg, Austria) from 1981 through 2013 and is provided by the hydrographic services of Austria via via eHYD.

# load packages
library(extRemes)
library(ismev)
library(RCurl)

# get data from eHYD
link <- getURL("http://ehyd.gv.at/eHYD/MessstellenExtraData/nlv?id=105700&file=2")
df <- read.csv(text = link, sep=";", dec=",", skip=21, header=FALSE)
df[df == "Lücke"] = NA
colnames(df) <- c("Date","Precip")
df$Date <- as.Date(df$Date, format = "%d.%m.%Y")
df$Precip <- as.numeric(gsub(",", ".", as.character(df$Precip)))
df <- na.omit(df)
head(df)

# mean residual life plot:
mrlplot(df$Precip, main="Mean Residual Life Plot")
# The mean residual life plot depicts the Thresholds (u) vs Mean Excess flow.
# The idea is to find the lowest threshold where the plot is nearly linear;
# taking into account the 95% confidence bounds.

# fitting the GPD model over a range of thresholds
gpd.fitrange(df$Precip, umin=30, umax=45, nint = 16)

# set threshold
th <- 40

# maximum likelihood estimation
pot.mle <- fevd(df$Precip, method = "MLE", type="GP", threshold=th)
plot(pot.mle) # diagnostic plots
rl.mle <- return.level(pot.mle, conf = 0.05, return.period= c(2,5,10,20,50,100))

# L-moments estimation
pot.lmom <- fevd(df$Precip, method = "Lmoments", type="GP", threshold=th)
plot(pot.lmom) # diagnostic plots
rl.lmom <- return.level(pot.lmom, conf = 0.05, return.period= c(2,5,10,20,50,100))

# return level plots
par(mfcol=c(1,2))
# return level plot w/ MLE
plot(pot.mle, type="rl",
     main="Return Level Plot for Oberwang w/ MLE",
     ylim=c(0,200), pch=16)
loc <- as.numeric(return.level(pot.mle, conf = 0.05,return.period=100))
segments(100, 0, 100, loc, col= 'midnightblue',lty=6)
segments(0.01,loc,100, loc, col='midnightblue', lty=6)

# return level plot w/ LMOM
plot(pot.lmom, type="rl",
     main="Return Level Plot for Oberwang w/ L-Moments",
     ylim=c(0,200))
loc <- as.numeric(return.level(pot.lmom, conf = 0.05,return.period=100))
segments(100, 0, 100, loc, col= 'midnightblue',lty=6)
segments(0.01,loc,100, loc, col='midnightblue', lty=6)

# comparison of return levels
res <- t(data.frame(mle=as.numeric(rl.mle),
                    lmom=as.numeric(rl.lmom)))
colnames(res) <- c(2,5,10,20,50,100)
round(res,1)

return_levels_pot
This is an example that illustrates nicely why the approach based on L-moments may be preferred over maximum likelihood estimation, as the right image proves clearly that the influence of the outlier is much smaller when using L-moments estimation.

Reference: Coles, Stuart (2001). An Introduction to Statistical Modeling of Extreme Values. Springer Series in Statistics. London: Springer.

About This Author

Matthias studied Environmental and Bio-Resources Management with a specialization in Environmental Information Management at the University of Natural Resources and Life Sciences (Vienna). He is currently a PhD student working at the Austrian Institute of Technology. Having written his master's thesis about extreme weather risk identification for the Austrian road network, he currently focuses on modeling of adverse weather events as a basis for risk assessment of road infrastructure networks.

5 Comments

You can post comments in this post.


  • Congratulations for this post. It is really helpful.
    I am working on extremes in R and Ι have already estimated the parameters (scale, shape) for GEV and Pareto distributions using “MLE” and “L moments” methods. Moreover, I have to calculate the values of shape and scale for GEV distribution, using the Bayesian method. But I can’t estimate these parameters for Pareto distribution using the Bayesian method. More specifically I am writing
    fevd(data, threshold, type="GP", method="Bayesian")
    and I am taking the following message from R:
    Error in colMeans(x$chain.info[, 2:(pdim[2] - 1)], na.rm = TRUE):
    'x' must be an array of at least two dimensions
    .

    Could you please help me with my problem?
    Thanks in advance!

    Georgia L 11 months ago Reply


    • Hey,
      it is a bit difficult to reproduce this error message. I guess that this is not an issue related to the extRemes package, but rather to your input data set.
      What do your input data data look like?

      A minimal working example for using Bayesian parameter estimation would be:

      library(extRemes)
      library(ismev)
      
      # Southwest-England rainfall data from Coles
      data(rain)
      
      # simple threshold (usually one should not use fixed quantiles)
      u < - quantile(rain, 0.9)
      
      # fit GP using Bayesian parameter estimation
      x <- fevd(rain, threshold = u , type='GP', method='Bayesian')
      

      This takes a couple of seconds, but just works fine.

      Regards,
      Matthias

      Matthias 8 months ago Reply


  • Hi, thanks for this post. Very clear and simple to understand.

    I’m working on extreme event in R using fevd {extRemes}. I have a big sample (more than 4 million rows). I’m trying to fit my data but I found a huge AIC and it doesn’t match with the plot. Bigger AIC better fits my data, the return period are empty for the smallest AIC. So my question is: should I select my model using the plots or the AIC?

    thanks for your help

    my R code :

    B1.fit=fevd(B1[,2], data=B1, threshold=B1.th, type="GEV",method ="MLE", units="KVA",time.units="seconds",period = "Seconds") 
    B1.fit1=fevd(B1[,2], data=B1, threshold=B1.th,type="GP",method ="MLE", units="KVA",time.units="seconds",period = "Seconds") 
    B1.fit2=fevd(B1[,2], data=B1, threshold=B1.th,type="Gumbel", method ="MLE",units="KVA",time.units="seconds",period = "Seconds")
    B1.fit3=fevd(B1[,2], data=B1, threshold=B1.th,type="Exponential",method ="MLE", units="KVA",time.units="seconds",period = "Seconds")
    
    fit.AIC=summary(B1.fit, silent=TRUE)$AIC
    fit1.AIC=summary(B1.fit1, silent=TRUE)$AIC
    fit2.AIC=summary(B1.fit2, silent=TRUE)$AIC
    fit3.AIC=summary(B1.fit3, silent=TRUE)$AIC
    
    fit.AIC
    # [1] 39976258
    fit1.AIC
    # [1] 466351.5
    fit2.AIC
    # [1] 13934878
    fit3.AIC
    # [1] 466330.8 
    
    plot(B1.fit)
    plot(B1.fit1)
    plot(B1.fit2)
    plot(B1.fit3)
    

    Feriel B 5 months ago Reply


    • Hi,

      without knowing your data set I can only provide the following recommendations:

      • Threshold selection is arguably the most crucial part when employing the threshold excess approach. Make sure to consult diagnostic plots (parameter stability plots, mean excess plot) for selecting an appropriate threshold.
      • You seem to use only one data set. This should allow you to carefully consult various diagnostic plots of the resulting fitted functions. Personally, I think that visual inspection of results is a very important aspect. Goodness-of-fit measures might lead to somewhat confusing conclusions, especially in the threshold-excess approach. A model that fits the bulk of data at the lower tail of the distribution very well but performs poorly at the upper tail (i.e. high return periods) might still have better GOF metrics than a model that offers a sound performance at the upper tail (which is actually the part we are interested in).
      • You might want to try a different fitting method to double-check your results. extRemes provides several useful options, which can be controlled by the argument method.

      I hope this helps.
      Regards,
      Matthias

      Matthias 5 months ago Reply


  • Hi Matthias, your posts here are really amazing. Great job! I am also working now on a similar topic, basically on extreme events in R where I should use Block Maxima Method and Peaks Over Thershold Method, and also different distributions to calculate Value at Risk (VaR) and Expected Shortfall (ES) of given data. Could you be able to help me with the codes?

    Best regards,
    Martina

    Martina 2 months ago Reply


Post A Reply

*