Efficient import of large CSV files into R

Hey everybody,

after a short summer break we are ready to emerge from several weeks in the doldrums 🙂

If you work with large data sets stored in .csv format, the files are likely to hold hundreds of MB or even several GB of data.
Importing such files with the common read.csv or read.table functions takes a considerable amount of time. You can tweak read.table to run somewhat faster by knowing your data set and specifying the parameters accordingly, or use scan instead. Still, the fastest and most convenient method I have come across so far is fread from the data.table package, which makes data.table a really helpful package if you work with large CSV files.
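
As a rough illustration of what "specifying the parameters accordingly" can look like for read.table, here is a minimal sketch in base R. The temporary file, the demo data and the object names (path, demo, tuned) are made up for this example, and how much time the extra arguments save will depend on your data.

```r
# Write a small illustrative CSV to a temporary file (hypothetical data).
path <- tempfile(fileext = ".csv")
demo <- data.frame(
    id    = 1:1000,
    value = rnorm(1000),
    label = sample(c("one", "two", "three"), 1000, replace = TRUE)
)
write.csv(demo, path, row.names = FALSE, quote = FALSE)

# Stating the column types, the number of rows, and that the file has
# no comments or quoted fields spares read.table from working these out.
tuned <- read.table(
    path,
    header       = TRUE,
    sep          = ",",
    colClasses   = c("integer", "numeric", "character"),
    nrows        = 1000,
    comment.char = "",
    quote        = ""
)

str(tuned)
```

Of these arguments, colClasses is usually the first one worth trying, since without it read.table has to infer the type of every column from the data.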

Other packages you might want to check out in this context are readr, sqldf, bigmemory and ff.

Here is a short comparison of how much the read time improves when using fread instead of read.table on a randomly generated data set. Our random data set will feature more than 8 million rows and 8 columns (seven data columns plus the row names that write.table adds) and comprises around 570 MB of data:

library(data.table)

# n depends on the current date; it was 8,336,000 for the run shown below
n <- as.numeric(Sys.Date())*500
sampledata <- data.table(
    r1 = sample(1:10000, n, replace=TRUE),
    r2 = rnorm(n),
    r3 = sample(1:5000, n, replace=TRUE),
    r4 = sample(c("one","two","three","four","five"), n, replace=TRUE),
    r5 = rpois(n, lambda = 2000),
    r6 = sample(state.name, n, replace=TRUE),
    r7 = runif(n, 0, 42)
)

write.table(sampledata,"sample.csv", sep=",", quote=FALSE)

system.time(df1 <- read.table("sample.csv", sep=","))
system.time(df2 <- fread("sample.csv"))

The results are clear: reading all the data with the basic read.table command takes more than 100 seconds, while fread completes the task in less than 20 seconds:

system.time(df1 <- read.table("sample.csv", sep=","))
       user      system     elapsed 
     105.23        2.09      108.48

system.time(df2 <- fread("sample.csv"))
Read 8336000 rows and 8 (of 8) columns from 0.566 GB file in 00:00:18
       user      system     elapsed 
      16.94        0.20       17.40
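
If you would like to try the comparison without generating a 570 MB file first, a scaled-down sketch along the following lines should do. The row count, the temporary file and the object names are arbitrary choices for illustration, and at this size the absolute timings will be much smaller and noisier than the ones shown above.

```r
library(data.table)

set.seed(1)
n <- 1e5  # far fewer rows than in the run above

small <- data.table(
    r1 = sample(1:10000, n, replace = TRUE),
    r2 = rnorm(n),
    r4 = sample(c("one", "two", "three"), n, replace = TRUE)
)

# Write to a temporary file so nothing is left in the working directory.
path <- tempfile(fileext = ".csv")
write.csv(small, path, row.names = FALSE, quote = FALSE)

# Time both readers on the same file.
t_base  <- system.time(df1 <- read.table(path, header = TRUE, sep = ","))
t_fread <- system.time(df2 <- fread(path))

# Elapsed seconds for each reader.
print(c(read.table = t_base[["elapsed"]], fread = t_fread[["elapsed"]]))

# Both imports should have the same number of rows and columns.
stopifnot(all(dim(df1) == dim(df2)))
```
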
Matthias

Matthias studied Environmental and Bio-Resources Management with a specialization in Environmental Information Management at the University of Natural Resources and Life Sciences (Vienna). He is currently a PhD student working at the Austrian Institute of Technology. Having written his master's thesis about extreme weather risk identification for the Austrian road network, he currently focuses on modeling of adverse weather events as a basis for risk assessment of road infrastructure networks.
