Dear all,
when working with linestring features (e.g. river or infrastructure networks), we might want to join two data frames by group and interval. I recently had to perform such a task when I wanted to join traffic volume data of permanent traffic counters (available at https://www.asfinag.at/verkehr/verkehrszaehlung/) on a road graph of the Austrian highway network. Obviously, traffic volume remains constant between junctions, entailing that traffic volumes measured at a certain point can be attributed to the whole section between the last and the next junction. Having constructed the two data frames I intended to join, I expected that performing the table joins would be a straightforward task – however, I soon had to realize that this was a little bit more complex than initially expected.
I will try to illustrate the problem at the example of two small data sets:
For instance, assume that we have a small road graph of several highways (called A1, A2 and A3) consisting of segments of 1 kilometer. Usually, various features are linked to each segment
(such as data on roadway geometry, e.g. skid resistance, number of lanes or speed limits), but in order to keep the example as simple as possible I have not added extra columns for additional parameters.
segments <- data_frame(road_name = rep(paste0("A", 1:3), each = 10), from = rep(0:9, 3), to = rep(1:10, 3) )
Our second data set contains our traffic counters, indicating a value for the annual average daily traffic (AADT) as well as the range between two junctions where this measured value is valid.
set.seed(42) measurement <- data_frame(road = rep(paste0("A", 1:3), each = 2), km_start = c(2.4, 4.2, 1.2, 6.5, 0, 7.3), km_end = c(4.2, 8.7, 6.5, 12.4, 7.3, 9.2), aadt = sample(x = 5000:10000, size = 6, replace = FALSE) )
When searching for appropriate functions I could use to match the AADT-values of the intervals in measurement
to the intervals in segments
while also taking into account the road ID, I did not come across a convenient function that would perform such a task. There is the %within%
operator in the lubridate
package – which would be helpful if I was dealing with date-time-objects. A possible solution could be based on rolling joins as implemented in the data.table
package, but I eventually decided to write my own function using the base function findInterval()
.
match_intervals <- function(obj, obj_start, obj_end, obj_group, match_to, match_start, match_end, match_group, match_id, sf = FALSE){ # helper function to obtain simple vectors from tibbles uu <- function(x) unname(unlist(x)) # safe copy of input obj obj_tbl <- obj # define list to store temporary results tmp <- list() # remove geometry if obj is sf # if(sf) st_geometry(obj_tbl) <- NULL # rename group columns and id column names(obj_tbl)[names(obj_tbl) == obj_group] <- "group" names(match_to)[names(match_to) == match_group] <- "group" names(match_to)[names(match_to) == match_id] <- "id" # loop over all rows for (i in 1:nrow(match_to)){ # get limits limits <- c(uu(match_to[match_start])[i], uu(match_to[match_end])[i]) # index for start index_start <- obj_tbl %>% filter(group == match_to$group[i]) %>% select(obj_start) %>% uu %>% findInterval(limits) # index for end index_end <- obj_tbl %>% filter(group == match_to$group[i]) %>% select(obj_end) %>% uu %>% findInterval(limits) # combine indices index <- ifelse(index_start == 1 | index_end == 1, TRUE, FALSE) if(i == 1){ # exception for first element res <- obj_tbl %>% filter(group == match_to$group[i]) %>% mutate(match = ifelse(index, match_to$id[i], NA)) tmp[[i]] <- res } else { # if subsequent rows belong to the same group if(match_to$group[i] == match_to$group[i-1]){ res <- tmp[[i-1]] %>% mutate(match = ifelse(index, match_to$id[i], match)) tmp[[i]] <- res } else { # if new group res <- obj_tbl %>% filter(group == match_to$group[i]) %>% mutate(match = ifelse(index, match_to$id[i], NA)) tmp[[i]] <- res } } } selector <- (1:nrow(match_to))[!duplicated(match_to$group, fromLast=T)] res <- bind_rows(tmp[selector]) # revert names to original names names(res)[names(res) == "group"] <- obj_group names(res)[names(res) == "match"] <- match_id # revert back to sf if(sf) st_geometry(res) <- st_geometry(obj) # return final output return(res) }
Using this – admittedly rather ugly function, we can perform the table join by specifying all relevant arguments. Concerning the the first data frame (obj
) we have to set determine the column names for segment limits (obj_start
and obj_end
) and the grouping variable (obj_group
). Subsequently, we specify match_to, which is our second data frame to be matched to obj
, as well as the column names for segment limits (match_start
and match_end
), the grouping variable (match_group
) and the column we want to join to obj
(match_id
).
segments_matched <- match_intervals(obj = segments, obj_start = "from", obj_end = "to", obj_group = "road_name", match_to = measurement, match_start = "km_start", match_end = "km_end", match_group = "road", match_id = "aadt")
This will eventually yield the desired output:
> segments_matched # A tibble: 30 x 4 road_name from to aadt <chr> <int> <int> <int> 1 A1 0 1 NA 2 A1 1 2 NA 3 A1 2 3 9574 4 A1 3 4 9574 5 A1 4 5 9685 6 A1 5 6 9685 7 A1 6 7 9685 8 A1 7 8 9685 9 A1 8 9 9685 10 A1 9 10 NA # ... with 20 more rows
Please be aware that this is just my first attempt to provide a function for performing this task and thus will need some refactoring – especially the nested ifelse-statement at the end is somewhat really ugly.
If you have any suggestions on how to improve this workflow feel free to comment below 🙂
Regards,
Matthias
Post A Reply