Dear all,
during the process of streamlining my posts on extreme value analysis, I have reworked the import of the data sets used in my elaborations. Usually, one would use custom functions for tasks like this data import, and I figured that this is also useful in this demonstrative case for two reasons. Firstly, this procedure is used in several consecutive posts, and repeatedly performing the same operations is probably the most striking indication that a function is called for. Secondly, the way the data were imported originally is somewhat clumsy, and the updated version is easier on the eye as well.
In the original version, all data sets were imported manually like this:
# load package RCurl
library(RCurl)

# get data from eHYD and build a data frame
link <- getURL("http://ehyd.gv.at/eHYD/MessstellenExtraData/nlv?id=107540&file=2")
df <- read.csv(text = link, sep = ";", dec = ",", skip = 21, header = FALSE)
df[df == "Lücke"] <- NA
colnames(df) <- c("Date", "Precipitation")
df$Date <- as.Date(df$Date, format = "%d.%m.%Y")
df$Precipitation <- as.numeric(gsub(",", ".", as.character(df$Precipitation)))
head(df)
This resulted in a simple data frame called df with two columns, “Date” and “Precipitation”. Even though it is definitely not wrong to import our data set like this, there is room for improvement, especially as far as the coding style is concerned. The process of reading data from the website of the hydrographic office of Austria can be done much more conveniently using a simple function like this:
# load package xts
library(xts)

read.ehyd <- function(ehyd_url) {
  # separate the header, open the connection with correct encoding
  con <- url(ehyd_url, encoding = "latin1")
  header <- readLines(con, n = 50)
  lines.header <- grep("Werte:", header, fixed = TRUE)
  # read data, define time and values
  infile <- read.csv2(con, header = FALSE, skip = lines.header,
                      col.names = c("time", "value"),
                      colClasses = c("character", "numeric"),
                      na.strings = "Lücke", strip.white = TRUE,
                      as.is = TRUE, fileEncoding = "latin1")
  infile$time <- as.POSIXct(infile$time, format = "%d.%m.%Y %H:%M:%S")
  # return time series object of class xts
  return(xts(infile$value, order.by = infile$time))
}
In doing so, we can avoid using the package RCurl. The resulting object is a nice xts time series and thus a closer representation of the actual data structure, which is in fact a hydrological time series.
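For illustration, calling the function with the same station and file as in the manual import above might look like this (a minimal sketch, assuming the eHYD endpoint is still reachable):

precipitation <- read.ehyd("http://ehyd.gv.at/eHYD/MessstellenExtraData/nlv?id=107540&file=2")
str(precipitation)   # an xts time series object
plot(precipitation)  # xts provides a plot method for time series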
This function is sufficient for use in my tutorials on extreme value analysis and dygraphs. However, we could still enhance it in several ways.
For instance, the argument ehyd_url of the function could be disassembled further. The URL includes the ID of the station (107540), an indicator that the requested file belongs to the area of ‘precipitation, air temperature and evaporation’ (in German: ‘Niederschlag, Lufttemperatur und Verdunstung’, hence ‘NLV’), as well as an indicator for the hydrological parameter (2 denotes precipitation). Thus, the function could be modified to either accept a URL directly or build the URL from some input information about the properties of the file. In a simple version, we could start constructing the function like this:
suppressPackageStartupMessages({
  library(xts)
})

read.ehyd <- function(hzbnr,
                      parameter = c("niederschlag", "neuschnee", "schneehöhe",
                                    "wasserstand", "abfluss"),
                      ehyd_url) {
  parameter <- match.arg(parameter)
  if (missing(ehyd_url)) {
    # derive the area: 'owf' for surface water parameters, 'nlv' otherwise
    area <- ifelse(parameter %in% c("wasserstand", "abfluss"), "owf", "nlv")
    param_names <- c("niederschlag", "neuschnee", "schneehöhe",
                     "wasserstand", "abfluss")
    param_ids <- c(2, 3, 4, 7, 4)
    file_nr <- param_ids[which(param_names == parameter)]
    # build the URL from station ID, area and file number
    ehyd_url <- paste0("http://ehyd.gv.at/eHYD/MessstellenExtraData/",
                       area, "?id=", hzbnr, "&file=", file_nr)
  }
  ...
}
with hzbnr indicating the station ID, parameter being the environmental variable we are interested in, area being either ‘nlv’ or ‘owf’ (depending on the input parameter), and file_nr being the indicator for the correct file, depending on the desired environmental variable.
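Assuming the elided body of the function is filled in with the reading code from above, a call could then be as simple as this (a hypothetical example, using the station ID from earlier in this post):

# station 107540, precipitation: "niederschlag" maps to area "nlv" and file 2
precipitation <- read.ehyd(hzbnr = 107540, parameter = "niederschlag")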
You can also mess around with some regular expressions to obtain the station name and the name of the hydrographic parameter (in German) from the header of the respective file.
In a simple version (which doesn’t include building the URL from the hydrographic parameter and the station ID as mentioned above), this might look something like this:
read_ehyd <- function(ehyd_url) {
  # separate the header, open the connection with correct encoding
  con <- url(ehyd_url, encoding = "latin1")
  header <- readLines(con, n = 50)
  lines.header <- grep("Werte:", header, fixed = TRUE)
  # read data, define time and values
  infile <- read.csv2(con, header = FALSE, skip = lines.header,
                      col.names = c("time", "value"),
                      colClasses = c("character", "numeric"),
                      na.strings = "Lücke", strip.white = TRUE,
                      as.is = TRUE, fileEncoding = "latin1")
  infile$time <- as.POSIXct(infile$time, format = "%d.%m.%Y %H:%M:%S")
  # extract station name and parameter name from the header for a message
  station <- gsub("(^.*;)([a-zA-Z]+)", "\\2", header[1])
  line_param <- grep("Exportzeitreihe:", header, fixed = TRUE)
  if (endsWith(ehyd_url, "2")) {
    param <- gsub("(^.*;)([a-zA-Z]+)(,.+$)", "\\2", header[line_param])
  } else {
    param <- gsub("Exportzeitreihe: ;", "", header[line_param])
    param <- strsplit(param, split = ",")[[1]][[2]]
  }
  message("data set: ", param, " in ", station)
  # return time series object of class xts
  return(xts(infile$value, order.by = infile$time))
}
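To see what the first gsub() call does, here is a small illustration on a made-up header line (the actual eHYD header layout may differ):

# made-up header line for illustration only -- not actual eHYD output
h1 <- "Messstelle: ;Musterdorf"
gsub("(^.*;)([a-zA-Z]+)", "\\2", h1)
# [1] "Musterdorf" -- everything up to the last semicolon is dropped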
A nicely implemented version that solves the task at hand is the function read.hzb(), which can be found at https://github.com/mundl/readhyd.