Introduction to Extreme Value Analysis in R – Part 3: Peak over Threshold Approach

Hey there,

welcome to part 3 of our short introduction to extreme value analysis using the extRemes package in R.

Having discussed the block maxima method last time, we will now have a look at the threshold excess (peak over threshold) method.

According to Coles (2001), the threshold approach is more efficient than the block maxima method if complete (time) series without gaps are available, as all values exceeding a certain threshold can serve as a basis for model fitting. In some cases, fitting distributions to block maxima data is a wasteful approach as only one value per block is used for modelling, while a threshold excess approach potentially provides more information on extremes.

However, analogous to the choice of the block size in the block maxima approach, the selection of the threshold value for partial duration models is also subject to a trade-off between bias (low threshold) and variance (high threshold).

Coles (2001) describes two methods for threshold selection. The first is an exploratory approach based on the mean residual life plot, which is applied prior to the actual model fitting. The second is to assess the stability of the parameter estimates: the model is fitted across a range of thresholds, and this sensitivity analysis indicates above which threshold the estimates remain approximately constant.

However, selecting an appropriate threshold is probably the most crucial part of performing an extreme value analysis using partial duration series. Scarrott and MacDonald provide a thorough overview of approaches for threshold estimation in their 2012 article "A review of extreme value threshold estimation and uncertainty quantification" (REVSTAT 10(1): 33–59).

Having found an appropriate threshold, the resulting subset of extreme values exceeding this threshold is used for fitting a generalized Pareto distribution.

According to the Pickands–Balkema–de Haan theorem, the distribution of excesses over a sufficiently high threshold can be approximated by a generalized Pareto distribution.
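As a quick, self-contained illustration of the theorem (a minimal sketch using base R and simulated data only, not the eHYD series used below): for an exponential parent distribution, the excesses over any threshold are again exponentially distributed, i.e. they follow a generalized Pareto distribution with shape parameter zero.

```r
# Minimal illustration of the Pickands-Balkema-de Haan theorem with
# simulated data (base R only): for an exponential parent distribution,
# threshold excesses follow a GPD with shape parameter 0, i.e. they are
# again exponential with the same rate, whatever the threshold.
set.seed(42)
x <- rexp(1e5, rate = 1)   # simulated "observations"
u <- quantile(x, 0.95)     # a high threshold
excess <- x[x > u] - u     # threshold excesses
mean(excess)               # close to 1/rate = 1, independent of u
```

The same check with, say, the 0.99 quantile as threshold yields a mean excess close to 1 as well, which is exactly the threshold-stability property the GPD formalizes.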

The following code shows a short practical example of fitting a generalized Pareto distribution to a time series of precipitation data using the extRemes package in R. The sample data set features precipitation data of Oberwang (Salzburg, Austria) from 1981 through 2014 and is provided by the hydrographic services of Austria via eHYD. The function read_ehyd() for importing the data set can be found at Reading data from eHYD using R.

# load packages
library(extRemes)
library(xts)

# get data from eHYD
ehyd_url <- "http://ehyd.gv.at/eHYD/MessstellenExtraData/nlv?id=105700&file=2"
precipitation_xts <- read_ehyd(ehyd_url)

# mean residual life plot:
mrlplot(precipitation_xts, main="Mean Residual Life Plot")
# The mean residual life plot depicts the threshold (u) vs. the mean excess.
# The idea is to find the lowest threshold above which the plot is
# approximately linear, taking the 95% confidence bounds into account.

# fitting the GPD model over a range of thresholds
threshrange.plot(precipitation_xts, r = c(30, 45), nint = 16)
# ismev implementation is faster:
# ismev::gpd.fitrange(precipitation_xts, umin=30, umax=45, nint = 16)
# set threshold
th <- 40

# maximum likelihood estimation
pot_mle <- fevd(as.vector(precipitation_xts), method = "MLE", type="GP", threshold=th)
# diagnostic plots
plot(pot_mle)
rl_mle <- return.level(pot_mle, conf = 0.05, return.period= c(2,5,10,20,50,100))

# L-moments estimation
pot_lmom <- fevd(as.vector(precipitation_xts), method = "Lmoments", type="GP", threshold=th)
# diagnostic plots
plot(pot_lmom)
rl_lmom <- return.level(pot_lmom, conf = 0.05, return.period= c(2,5,10,20,50,100))

# return level plots
par(mfcol=c(1,2))
# return level plot w/ MLE
plot(pot_mle, type="rl",
     main="Return Level Plot for Oberwang w/ MLE",
     ylim=c(0,200), pch=16)
loc <- as.numeric(return.level(pot_mle, conf = 0.05,return.period=100))
segments(100, 0, 100, loc, col= 'midnightblue',lty=6)
segments(0.01,loc,100, loc, col='midnightblue', lty=6)

# return level plot w/ LMOM
plot(pot_lmom, type="rl",
     main="Return Level Plot for Oberwang w/ L-Moments",
     ylim=c(0,200))
loc <- as.numeric(return.level(pot_lmom, conf = 0.05,return.period=100))
segments(100, 0, 100, loc, col= 'midnightblue',lty=6)
segments(0.01,loc,100, loc, col='midnightblue', lty=6)

# comparison of return levels
results <- t(data.frame(mle=as.numeric(rl_mle),
                        lmom=as.numeric(rl_lmom)))
colnames(results) <- c(2,5,10,20,50,100)
round(results,1)
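If confidence intervals for the return levels are of interest, extRemes also provides a ci() method for fevd objects. A short sketch, assuming the pot_mle object fitted above:

```r
# Sketch: assumes pot_mle from the code above.
# 95% confidence intervals for selected return levels
# (normal approximation; see ?ci.fevd for "boot" and "proflik" options)
ci(pot_mle, return.period = c(10, 50, 100))
```

Profile-likelihood intervals (method = "proflik") are often preferable for long return periods, as the normal approximation can be overly symmetric at the upper tail.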

[Figure: return level plots for Oberwang, MLE (left) vs. L-moments (right)]
This example illustrates nicely why the approach based on L-moments may be preferred over maximum likelihood estimation: the right-hand plot clearly shows that the influence of the outlier is much smaller when using L-moments estimation. In addition to these classical estimation methods, extRemes offers generalized maximum likelihood estimation (GMLE; Martins and Stedinger, 2000) and Bayesian estimation methods (Gilleland and Katz, 2016).
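For completeness, here is a hedged sketch of how these two additional estimation methods can be invoked via the method argument of fevd(). To keep the example self-contained it uses the Southwest England rainfall data from the ismev package rather than the eHYD series; the threshold of 30 mm for this data set follows Coles (2001).

```r
# Sketch of the additional estimation methods offered by extRemes,
# shown on the Southwest England rainfall data from ismev.
library(extRemes)
library(ismev)
data(rain)

# generalized maximum likelihood estimation (prior on the shape parameter)
fit_gmle <- fevd(rain, method = "GMLE", type = "GP", threshold = 30)

# Bayesian estimation (MCMC; takes noticeably longer to run)
fit_bayes <- fevd(rain, method = "Bayesian", type = "GP", threshold = 30)

# return levels can be inspected just like for MLE and L-moments
return.level(fit_gmle, return.period = c(2, 5, 10, 20, 50, 100))
```

The diagnostic plots via plot() work for these fits as well, so the same visual comparison as above can be carried out across all four estimation methods.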

Reference: Coles, Stuart (2001). An Introduction to Statistical Modeling of Extreme Values. Springer Series in Statistics. London: Springer.

About This Author

Matthias studied Environmental Information Management at the University of Natural Resources and Life Sciences Vienna and holds a PhD in environmental statistics. The focus of his thesis was on the statistical modelling of rare (extreme) events as a basis for vulnerability assessment of critical infrastructure. He is working at the Austrian national weather and geophysical service (ZAMG) and at the Institute of Mountain Risk Engineering at BOKU University. He currently focuses on the (statistical) assessment of adverse weather events and natural hazards, and on disaster risk reduction. His main interests are statistical modelling of environmental phenomena as well as open source tools for data science, geoinformation and remote sensing.

9 Comments



  • Congratulations for this post. It is really helpful.
    I am working on extremes in R and I have already estimated the parameters (scale, shape) for GEV and Pareto distributions using “MLE” and “L moments” methods. Moreover, I have to calculate the values of shape and scale for GEV distribution, using the Bayesian method. But I can’t estimate these parameters for Pareto distribution using the Bayesian method. More specifically I am writing
    fevd(data, threshold, type="GP", method="Bayesian")
    and I am taking the following message from R:
    Error in colMeans(x$chain.info[, 2:(pdim[2] - 1)], na.rm = TRUE):
    'x' must be an array of at least two dimensions

    Could you please help me with my problem?
    Thanks in advance!

    Georgia L 7 years ago Reply


    • Hey,
      it is a bit difficult to reproduce this error message. I guess that this is not an issue related to the extRemes package, but rather to your input data set.
      What does your input data look like?

      A minimal working example for using Bayesian parameter estimation would be:

      library(extRemes)
      library(ismev)
      
      # Southwest-England rainfall data from Coles
      data(rain)
      
      # simple threshold (usually one should not use fixed quantiles)
      u <- quantile(rain, 0.9)
      
      # fit GP using Bayesian parameter estimation
      x <- fevd(rain, threshold = u , type='GP', method='Bayesian')
      

      This takes a couple of seconds, but just works fine.

      Regards,
      Matthias

      Matthias 6 years ago Reply


  • Hi, thanks for this post. Very clear and simple to understand.

    I’m working on extreme events in R using fevd {extRemes}. I have a big sample (more than 4 million rows). I’m trying to fit my data, but I get a huge AIC that doesn’t match the plots: the model with the bigger AIC visually fits my data better, and the return periods are empty for the model with the smallest AIC. So my question is: should I select my model using the plots or the AIC?

    thanks for your help

    my R code :

    B1.fit=fevd(B1[,2], data=B1, threshold=B1.th, type="GEV",method ="MLE", units="KVA",time.units="seconds",period = "Seconds") 
    B1.fit1=fevd(B1[,2], data=B1, threshold=B1.th,type="GP",method ="MLE", units="KVA",time.units="seconds",period = "Seconds") 
    B1.fit2=fevd(B1[,2], data=B1, threshold=B1.th,type="Gumbel", method ="MLE",units="KVA",time.units="seconds",period = "Seconds")
    B1.fit3=fevd(B1[,2], data=B1, threshold=B1.th,type="Exponential",method ="MLE", units="KVA",time.units="seconds",period = "Seconds")
    
    fit.AIC=summary(B1.fit, silent=TRUE)$AIC
    fit1.AIC=summary(B1.fit1, silent=TRUE)$AIC
    fit2.AIC=summary(B1.fit2, silent=TRUE)$AIC
    fit3.AIC=summary(B1.fit3, silent=TRUE)$AIC
    
    fit.AIC
    # [1] 39976258
    fit1.AIC
    # [1] 466351.5
    fit2.AIC
    # [1] 13934878
    fit3.AIC
    # [1] 466330.8 
    
    plot(B1.fit)
    plot(B1.fit1)
    plot(B1.fit2)
    plot(B1.fit3)
    

    Feriel B 6 years ago Reply


    • Hi,

      without knowing your data set I can only provide the following recommendations:

      • Threshold selection is arguably the most crucial part when employing the threshold excess approach. Make sure to consult diagnostic plots (parameter stability plots, mean excess plot) for selecting an appropriate threshold.
      • You seem to use only one data set. This should allow you to carefully consult various diagnostic plots of the resulting fitted functions. Personally, I think that visual inspection of results is a very important aspect. Goodness-of-fit measures might lead to somewhat confusing conclusions, especially in the threshold-excess approach. A model that fits the bulk of data at the lower tail of the distribution very well but performs poorly at the upper tail (i.e. high return periods) might still have better GOF metrics than a model that offers a sound performance at the upper tail (which is actually the part we are interested in).
      • You might want to try a different fitting method to double-check your results. extRemes provides several useful options, which can be controlled by the argument method.

      I hope this helps.
      Regards,
      Matthias

      Matthias 6 years ago Reply


  • Hi Matthias, your posts here are really amazing. Great job! I am also working now on a similar topic, basically on extreme events in R where I should use Block Maxima Method and Peaks Over Thershold Method, and also different distributions to calculate Value at Risk (VaR) and Expected Shortfall (ES) of given data. Could you be able to help me with the codes?

    Best regards,
    Martina

    Martina 6 years ago Reply


  • Please, can I get the code used to draw the peak-over-threshold figure shown above?

    Emmanuel Koranteng 4 years ago Reply


  • How can I read precipitation values from a NetCDF file (TRMM data set) for an analysis of daily extreme precipitation?

    Pawan 4 years ago Reply


  • Hello and thank you for your really helpful blog.

    How can I get the matrix of the mrlplot? The one with the columns of the threshold, number of exceedances, mean excess etc. that is returned invisibly.

    Thank you in advance,

    Paspati Maria

    Maria Paspati 4 years ago Reply


  • I am working on extremes rainfall in R but I failed to estimate parameters for gamma pareto and gamma generalized pareto distributions using R. Could you please help me with the codes in R studio?

    Jean 4 years ago Reply

