Reading data from eHYD using R

Dear all,

While streamlining my posts on extreme value analysis, I have reworked the import of the data sets used in those examples. Normally, one would wrap a recurring task like this data import in a custom function, and doing so is useful in this demonstrative case for two reasons. Firstly, the same procedure is used in several consecutive posts, and repeatedly performing the same operations is probably the clearest sign that a function is called for. Secondly, the way the data were imported originally is somewhat clumsy, and the updated version is easier on the eye as well.

In the original version, all data sets were imported manually like this:

# load package RCurl
library(RCurl)

# get data from eHYD and build data frame
link <- getURL("http://ehyd.gv.at/eHYD/MessstellenExtraData/nlv?id=107540&file=2")
df <- read.csv(text = link, sep=";", dec=",", skip=21, header=FALSE)
df[df == "Lücke"] = NA
colnames(df) <- c("Date","Precipitation")
df$Date <- as.Date(df$Date, format = "%d.%m.%Y")
df$Precipitation <- as.numeric(gsub(",", ".", as.character(df$Precipitation)))
head(df)

This resulted in a simple data frame called df with two columns, “Date” and “Precipitation”. While it is not wrong to import the data set like this, there is room for improvement, especially in terms of coding style. Reading data from the website of the Austrian hydrographic office can be done much more conveniently with a simple function like this:

# load package xts
library(xts)

read.ehyd <- function(ehyd_url) {
    # separate the header, open the connection with correct encoding
    con <- url(ehyd_url, encoding = "latin1")
    header <- readLines(con, n = 50)
    lines.header <- grep("Werte:", header, fixed = TRUE)
    # read data, define time and values
    infile <- read.csv2(con, header = FALSE, skip = lines.header,
        col.names = c("time", "value"),
        colClasses = c("character", "numeric"),
        na.strings = "Lücke",
        strip.white = TRUE, as.is = TRUE, fileEncoding = "latin1")
    infile$time <- as.POSIXct(infile$time, format = "%d.%m.%Y %H:%M:%S")
    # return time series object of class xts
    return(xts(infile$value, order.by = infile$time))
}

This way, we no longer need the package RCurl. The resulting object is a proper xts time series and thus a closer representation of the actual data structure, which is in fact a hydrological time series.
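
As a minimal usage sketch (assuming the eHYD endpoint from the original example is still reachable), the function is simply called with the full URL and returns an xts object:

# same station and file as in the original example above
prec <- read.ehyd("http://ehyd.gv.at/eHYD/MessstellenExtraData/nlv?id=107540&file=2")
head(prec)   # first few daily precipitation values
plot(prec)   # xts objects come with a plot method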

This function is sufficient for use in my tutorials on extreme value analysis and dygraphs. However, we could still enhance it in several ways.

For instance, the argument ehyd_url of the function could be disassembled further. The URL includes the ID of the station (107540), an indicator that the requested file belongs to the area of ‘precipitation, air temperature and evaporation’ (in German: ‘Niederschlag, Lufttemperatur und Verdunstung’, hence ‘NLV’), as well as an indicator for the hydrological parameter (2 denotes precipitation). Thus, the function could be modified to either accept a URL directly or build it from some information about the desired file. In a simple version, we could start constructing the function like this:

suppressPackageStartupMessages({
  library(xts)
})

read.ehyd <- function(hzbnr, parameter = c("niederschlag", "neuschnee",
                                           "schneehöhe", "wasserstand",
                                           "abfluss"), ehyd_url) {
  parameter <- match.arg(parameter)
  if (missing(ehyd_url)) {
    # water level and discharge are served under 'owf', all others under 'nlv'
    area <- ifelse(parameter %in% c("wasserstand", "abfluss"), "owf", "nlv")
    # file number of the requested parameter within that area
    param_ids <- c(niederschlag = 2, neuschnee = 3, "schneehöhe" = 4,
                   wasserstand = 7, abfluss = 4)
    file_nr <- param_ids[[parameter]]
    ehyd_url <- paste0("http://ehyd.gv.at/eHYD/MessstellenExtraData/",
                       area, "?id=", hzbnr, "&file=", file_nr)
  }
  ...
}

Here, hzbnr is the station ID, parameter is the environmental variable we are interested in, area is either ‘nlv’ or ‘owf’ (depending on the chosen parameter), and file_nr is the file indicator corresponding to the desired environmental variable.
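
As a small usage sketch (assuming the parameter-to-file-number mapping above is correct and the elided body is filled in with the reading code from before), a call could then look like this:

# usage sketch: once the body of read.ehyd() is completed (the "..." above),
# daily precipitation for station 107540 could be requested by ID alone
prec <- read.ehyd(hzbnr = 107540, parameter = "niederschlag")
# this builds the same URL as before:
# http://ehyd.gv.at/eHYD/MessstellenExtraData/nlv?id=107540&file=2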

You can also mess around with some regular expressions to obtain the station name and the name of the hydrographic parameter (in German) from the header of the respective file.
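
Just to illustrate the idea on a toy example (the header line below is only an assumed format, a semicolon-separated label/value pair, not necessarily the exact eHYD layout):

# hypothetical header line, format assumed for illustration only
hdr <- "Messstelle:                ;Gumpoldskirchen"
gsub("(^.*;)([a-zA-Z]+)", "\\2", hdr)   # returns "Gumpoldskirchen"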

In a simple version (which does not include building the URL from the hydrographic parameter and the station ID as outlined above), this might look something like this:

read_ehyd <- function(ehyd_url) {
  # separate the header, open the connection with correct encoding
  con <- url(ehyd_url, encoding = "latin1")
  header <- readLines(con, n = 50)
  lines.header <- grep("Werte:", header, fixed = TRUE)
  # read data, define time and values
  infile <- read.csv2(con, header = FALSE, skip = lines.header,
                      col.names = c("time", "value"),
                      colClasses = c("character", "numeric"),
                      na.strings = "Lücke",
                      strip.white = TRUE, as.is = TRUE, fileEncoding = "latin1")
  infile$time <- as.POSIXct(infile$time, format = "%d.%m.%Y %H:%M:%S")
  # output message: the station name is the last field of the first header line
  station <- gsub("(^.*;)([a-zA-Z]+)", "\\2", header[1])
  # the parameter name is taken from the 'Exportzeitreihe:' header line
  line_param <- grep("Exportzeitreihe:", header, fixed = TRUE)
  if (endsWith(ehyd_url, "2")) {
    param <- gsub("(^.*;)([a-zA-Z]+)(,.+$)", "\\2", header[line_param])
  } else {
    param <- gsub("Exportzeitreihe:           ;", "", header[line_param])
    param <- strsplit(param, split = ",")[[1]][[2]]
  }
  message("data set: ", param, " in ", station)
  # return time series object of class xts
  return(xts(infile$value, order.by = infile$time))
}
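
When called, this version additionally prints an informational message assembled from the header before returning the series (the exact wording depends on the header of the downloaded file):

prec <- read_ehyd("http://ehyd.gv.at/eHYD/MessstellenExtraData/nlv?id=107540&file=2")
# prints a message along the lines of: data set: Niederschlag in <station name>
head(prec)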

A nicely implemented version that solves the task at hand is the function read.hzb(), which can be found at https://github.com/mundl/readhyd.

About This Author

Matthias studied Environmental Information Management at the University of Natural Resources and Life Sciences, Vienna, and holds a PhD in environmental statistics. The focus of his thesis was the statistical modelling of rare (extreme) events as a basis for vulnerability assessment of critical infrastructure. He works at the Austrian national weather and geophysical service (ZAMG) and at the Institute of Mountain Risk Engineering at BOKU University. He currently focuses on the (statistical) assessment of adverse weather events and natural hazards, and on disaster risk reduction. His main interests are statistical modelling of environmental phenomena as well as open source tools for data science, geoinformation and remote sensing.
