## Performing table joins by group and interval in R  Dear all,

when working with linestring features (e.g. river or infrastructure networks), we might want to join two data frames by group and interval. I recently had to perform such a task when I wanted to join traffic volume data of permanent traffic counters (available at https://www.asfinag.at/verkehr/verkehrszaehlung/) on a road graph of the Austrian highway network. Obviously, traffic volume remains constant between junctions, entailing that traffic volumes measured at a certain point can be attributed to the whole section between the last and the next junction. Having constructed the two data frames I intended to join, I expected that performing the table joins would be a straightforward task – however, I soon had to realize that this was a little bit more complex than initially expected.

I will try to illustrate the problem at the example of two small data sets:

For instance, assume that we have a small road graph of several highways (called A1, A2 and A3) consisting of segments of 1 kilometer. Usually, various features are linked to each segment
(such as data on roadway geometry, e.g. skid resistance, number of lanes or speed limits), but in order to keep the example as simple as possible I have not added extra columns for additional parameters.

```segments <- data_frame(road_name = rep(paste0("A", 1:3), each = 10),
from = rep(0:9, 3),
to = rep(1:10, 3)
)```

Our second data set contains our traffic counters, indicating a value for the annual average daily traffic (AADT) as well as the range between two junctions where this measured value is valid.

```set.seed(42)
measurement <- data_frame(road = rep(paste0("A", 1:3), each = 2),
km_start = c(2.4, 4.2, 1.2, 6.5, 0, 7.3),
km_end = c(4.2, 8.7, 6.5, 12.4, 7.3, 9.2),
aadt = sample(x = 5000:10000, size = 6, replace = FALSE)
)```

When searching for appropriate functions I could use to match the AADT-values of the intervals in `measurement` to the intervals in `segments` while also taking into account the road ID, I did not come across a convenient function that would perform such a task. There is the `%within%` operator in the `lubridate` package – which would be helpful if I was dealing with date-time-objects. A possible solution could be based on rolling joins as implemented in the `data.table` package, but I eventually decided to write my own function using the base function `findInterval()`.

```match_intervals <- function(obj, obj_start, obj_end, obj_group,
match_to, match_start, match_end,
match_group, match_id, sf = FALSE){

# helper function to obtain simple vectors from tibbles
uu <- function(x) unname(unlist(x))

# safe copy of input obj
obj_tbl <- obj

# define list to store temporary results
tmp <- list()

# remove geometry if obj is sf
# if(sf) st_geometry(obj_tbl) <- NULL

# rename group columns and id column
names(obj_tbl)[names(obj_tbl) == obj_group] <- "group"
names(match_to)[names(match_to) == match_group] <- "group"
names(match_to)[names(match_to) == match_id] <- "id"

# loop over all rows
for (i in 1:nrow(match_to)){
# get limits
limits <- c(uu(match_to[match_start])[i],
uu(match_to[match_end])[i])

# index for start
index_start <- obj_tbl %>%
filter(group == match_to\$group[i]) %>%
select(obj_start) %>%
uu %>%
findInterval(limits)

# index for end
index_end <- obj_tbl %>%
filter(group == match_to\$group[i]) %>%
select(obj_end) %>%
uu %>%
findInterval(limits)

# combine indices
index <- ifelse(index_start == 1 | index_end == 1, TRUE, FALSE)

if(i == 1){
# exception for first element
res <- obj_tbl %>%
filter(group == match_to\$group[i]) %>%
mutate(match = ifelse(index, match_to\$id[i], NA))
tmp[[i]] <- res
} else {
# if subsequent rows belong to the same group
if(match_to\$group[i] == match_to\$group[i-1]){
res <- tmp[[i-1]] %>%
mutate(match = ifelse(index, match_to\$id[i], match))
tmp[[i]] <- res
} else {
# if new group
res <- obj_tbl %>%
filter(group == match_to\$group[i]) %>%
mutate(match = ifelse(index, match_to\$id[i], NA))
tmp[[i]] <- res
}
}
}
selector <- (1:nrow(match_to))[!duplicated(match_to\$group, fromLast=T)]
res <- bind_rows(tmp[selector])

# revert names to original names
names(res)[names(res) == "group"] <- obj_group
names(res)[names(res) == "match"] <- match_id

# revert back to sf
if(sf) st_geometry(res) <- st_geometry(obj)

# return final output
return(res)
}```

Using this – admittedly rather ugly function, we can perform the table join by specifying all relevant arguments. Concerning the the first data frame (`obj`) we have to set determine the column names for segment limits (`obj_start` and `obj_end`) and the grouping variable (`obj_group`). Subsequently, we specify match_to, which is our second data frame to be matched to `obj`, as well as the column names for segment limits (`match_start` and `match_end`), the grouping variable (`match_group`) and the column we want to join to `obj` (`match_id`).

```segments_matched <- match_intervals(obj = segments,
obj_start = "from",
obj_end = "to",
match_to = measurement,
match_start = "km_start",
match_end = "km_end",

This will eventually yield the desired output:

```> segments_matched
# A tibble: 30 x 4
<chr> <int> <int> <int>
1        A1     0     1    NA
2        A1     1     2    NA
3        A1     2     3  9574
4        A1     3     4  9574
5        A1     4     5  9685
6        A1     5     6  9685
7        A1     6     7  9685
8        A1     7     8  9685
9        A1     8     9  9685
10        A1     9    10    NA
# ... with 20 more rows```

Please be aware that this is just my first attempt to provide a function for performing this task and thus will need some refactoring – especially the nested ifelse-statement at the end is somewhat really ugly.

If you have any suggestions on how to improve this workflow feel free to comment below 🙂

Regards,
Matthias 