Efficient import of large CSV files into R

Hey everybody,

After a short summer break, we are ready to emerge from several weeks in the doldrums 🙂

If you are working with large data sets stored as CSV files, the files are likely to comprise hundreds of MB or even several GB of data. Importing these files with the common read.csv or read.table commands will take a considerable amount of time. You can certainly make read.table run a bit faster if you know your data set and specify its parameters accordingly (see the sketch below), or you can use scan instead. The fastest and most convenient method I have come across so far, however, is fread from the package data.table, which makes data.table a really helpful package for working efficiently with large CSV files in R.
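For reference, here is a minimal sketch of such a tuned read.table call. The file name, the column classes and the row count are assumptions for a hypothetical file and need to be adapted to your own data:

# A tuned read.table call: specifying the column classes and a generous
# upper bound on the number of rows avoids repeated type guessing and
# re-allocation, and disabling comment scanning saves further time.
# The file name, colClasses and nrows below are placeholders.
df <- read.table("mydata.csv", header = TRUE, sep = ",",
                 colClasses = c("integer", "numeric", "character"),
                 nrows = 1e7, comment.char = "")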

Other packages you might want to check out in this context are readr, sqldf, bigmemory and ff.
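As a quick taste of one of these alternatives, here is a minimal sketch using readr (the file name is again a placeholder):

# Minimal sketch using the readr package; read_csv guesses the column
# types from the data and returns a tibble rather than a data.frame
library(readr)
df <- read_csv("mydata.csv")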

Here is a short comparison of how much the import time improves when using fread instead of read.table on a randomly generated data set. Our random data set features more than 8 million rows and seven columns (the row names written by write.table add an eighth column to the file) and comprises around 570 MB of data:

library(data.table)

# Number of rows: days since 1970-01-01 times 500, which amounts to
# more than 8 million rows
n <- as.numeric(Sys.Date()) * 500

sampledata <- data.table(
    r1 = sample(1:10000, n, replace = TRUE),
    r2 = rnorm(n),
    r3 = sample(1:5000, n, replace = TRUE),
    r4 = sample(c("one", "two", "three", "four", "five"), n, replace = TRUE),
    r5 = rpois(n, lambda = 2000),
    r6 = sample(state.name, n, replace = TRUE),
    r7 = runif(n, 0, 42)
)

# Write the sample to disk; write.table stores the row names by default,
# which adds an extra first column to the file
write.table(sampledata, "sample.csv", sep = ",", quote = FALSE)

system.time(df1 <- read.table("sample.csv", sep=","))
system.time(df2 <- fread("sample.csv"))

The results are quite clear: while it takes more than 100 seconds to read all the data with the basic read.table command, fread completes the same task in less than 20 seconds:

system.time(df1 <- read.table("sample.csv", sep=","))
       user      system     elapsed 
     105.23        2.09      108.48

system.time(df2 <- fread("sample.csv"))
Read 8336000 rows and 8 (of 8) columns from 0.566 GB file in 00:00:18
       user      system     elapsed 
      16.94        0.20       17.40
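
One thing to keep in mind: fread returns a data.table rather than a plain data.frame. If your downstream code expects a data.frame, you can ask fread for one directly:

# fread returns a data.table by default; set data.table = FALSE
# to get a plain data.frame instead
df2 <- fread("sample.csv", data.table = FALSE)
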
About This Author

Matthias studied Environmental Information Management at the University of Natural Resources and Life Sciences Vienna and holds a PhD in environmental statistics. The focus of his thesis was on the statistical modelling of rare (extreme) events as a basis for vulnerability assessment of critical infrastructure. He is working at the Austrian national weather and geophysical service (ZAMG) and at the Institute of Mountain Risk Engineering at BOKU University. He currently focuses on the (statistical) assessment of adverse weather events and natural hazards, and on disaster risk reduction. His main interests are statistical modelling of environmental phenomena as well as open source tools for data science, geoinformation and remote sensing.
