Performing table joins by group and interval in R

Dear all,

when working with linestring features (e.g. river or infrastructure networks), we might want to join two data frames by group and interval. I recently had to perform such a task when I wanted to join traffic volume data of permanent traffic counters (available at https://www.asfinag.at/verkehr/verkehrszaehlung/) on a road graph of the Austrian highway network. Obviously, traffic volume remains constant between junctions, entailing that traffic volumes measured at a certain point can be attributed to the whole section between the last and the next junction. Having constructed the two data frames I intended to join, I expected that performing the table joins would be a straightforward task – however, I soon had to realize that this was a little bit more complex than initially expected.

I will try to illustrate the problem at the example of two small data sets:

For instance, assume that we have a small road graph of several highways (called A1, A2 and A3) consisting of segments of 1 kilometer. Usually, various features are linked to each segment
(such as data on roadway geometry, e.g. skid resistance, number of lanes or speed limits), but in order to keep the example as simple as possible I have not added extra columns for additional parameters.

segments <- data_frame(road_name = rep(paste0("A", 1:3), each = 10),
                       from = rep(0:9, 3),
                       to = rep(1:10, 3)
                       )

Our second data set contains our traffic counters, indicating a value for the annual average daily traffic (AADT) as well as the range between two junctions where this measured value is valid.

set.seed(42)
measurement <- data_frame(road = rep(paste0("A", 1:3), each = 2),
                          km_start = c(2.4, 4.2, 1.2, 6.5, 0, 7.3),
                          km_end = c(4.2, 8.7, 6.5, 12.4, 7.3, 9.2),
                          aadt = sample(x = 5000:10000, size = 6, replace = FALSE)
                          )

When searching for appropriate functions I could use to match the AADT-values of the intervals in measurement to the intervals in segments while also taking into account the road ID, I did not come across a convenient function that would perform such a task. There is the %within% operator in the lubridate package – which would be helpful if I was dealing with date-time-objects. A possible solution could be based on rolling joins as implemented in the data.table package, but I eventually decided to write my own function using the base function findInterval().

match_intervals <- function(obj, obj_start, obj_end, obj_group,
                            match_to, match_start, match_end,
                            match_group, match_id, sf = FALSE){

  # helper function to obtain simple vectors from tibbles
  uu <- function(x) unname(unlist(x))

  # safe copy of input obj
  obj_tbl <- obj

  # define list to store temporary results
  tmp <- list()

  # remove geometry if obj is sf
  # if(sf) st_geometry(obj_tbl) <- NULL

  # rename group columns and id column
  names(obj_tbl)[names(obj_tbl) == obj_group] <- "group"
  names(match_to)[names(match_to) == match_group] <- "group"
  names(match_to)[names(match_to) == match_id] <- "id"

  # loop over all rows
  for (i in 1:nrow(match_to)){
    # get limits
    limits <- c(uu(match_to[match_start])[i],
                uu(match_to[match_end])[i])
    
    # index for start
    index_start <- obj_tbl %>%
      filter(group == match_to$group[i]) %>%
      select(obj_start) %>%
      uu %>%
      findInterval(limits)
    
    # index for end
    index_end <- obj_tbl %>%
      filter(group == match_to$group[i]) %>%
      select(obj_end) %>%
      uu %>%
      findInterval(limits)

    # combine indices
    index <- ifelse(index_start == 1 | index_end == 1, TRUE, FALSE)

    if(i == 1){
      # exception for first element
      res <- obj_tbl %>%
        filter(group == match_to$group[i]) %>%
        mutate(match = ifelse(index, match_to$id[i], NA))
      tmp[[i]] <- res
    } else {
      # if subsequent rows belong to the same group
      if(match_to$group[i] == match_to$group[i-1]){
        res <- tmp[[i-1]] %>%
        mutate(match = ifelse(index, match_to$id[i], match))
      tmp[[i]] <- res
      } else {
        # if new group
        res <- obj_tbl %>%
          filter(group == match_to$group[i]) %>%
          mutate(match = ifelse(index, match_to$id[i], NA))
        tmp[[i]] <- res
      }
    }
  }
  selector <- (1:nrow(match_to))[!duplicated(match_to$group, fromLast=T)]
  res <- bind_rows(tmp[selector])

 # revert names to original names
 names(res)[names(res) == "group"] <- obj_group
 names(res)[names(res) == "match"] <- match_id

 # revert back to sf
 if(sf) st_geometry(res) <- st_geometry(obj)

 # return final output
 return(res)
}

Using this – admittedly rather ugly function, we can perform the table join by specifying all relevant arguments. Concerning the the first data frame (obj) we have to set determine the column names for segment limits (obj_start and obj_end) and the grouping variable (obj_group). Subsequently, we specify match_to, which is our second data frame to be matched to obj, as well as the column names for segment limits (match_start and match_end), the grouping variable (match_group) and the column we want to join to obj (match_id).

segments_matched <- match_intervals(obj = segments,
                                    obj_start = "from",
                                    obj_end = "to",
                                    obj_group = "road_name",
                                    match_to = measurement,
                                    match_start = "km_start",
                                    match_end = "km_end",
                                    match_group = "road",
                                    match_id = "aadt")

This will eventually yield the desired output:

> segments_matched
# A tibble: 30 x 4
   road_name  from    to  aadt
       <chr> <int> <int> <int>
 1        A1     0     1    NA
 2        A1     1     2    NA
 3        A1     2     3  9574
 4        A1     3     4  9574
 5        A1     4     5  9685
 6        A1     5     6  9685
 7        A1     6     7  9685
 8        A1     7     8  9685
 9        A1     8     9  9685
10        A1     9    10    NA
# ... with 20 more rows

Please be aware that this is just my first attempt to provide a function for performing this task and thus will need some refactoring – especially the nested ifelse-statement at the end is somewhat really ugly.

If you have any suggestions on how to improve this workflow feel free to comment below 🙂

Regards,
Matthias

About This Author

Matthias studied Environmental Information Management at the University of Natural Resources and Life Sciences Vienna and holds a PhD in environmental statistics. The focus of his thesis was on the statistical modelling of rare (extreme) events as a basis for vulnerability assessment of critical infrastructure. He is working at the Austrian national weather and geophysical service (ZAMG) and at the Institute of Mountain Risk Engineering at BOKU University. He currently focuses the (statistical) assessment of adverse weather events and natural hazards, and disaster risk reduction. His main interests are statistical modelling of environmental phenomena as well as open source tools for data science, geoinformation and remote sensing.

Post A Reply

*