After a short summer break, we are ready to emerge from several weeks in the doldrums 🙂
If you are working with large data sets stored in .csv format, the files are likely to comprise hundreds of MB or several GB of data.
Importing these csv files with the common read.table command takes a considerable amount of time. You can tweak read.table to run a bit faster by knowing your data set and specifying the parameters accordingly, or simply use scan instead, but the fastest and most convenient method I have come across so far is fread from the package data.table. data.table is a really helpful package that improves efficiency in R if you work with large csv files.
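To make the "specifying the parameters accordingly" remark concrete, here is a minimal sketch of a tuned read.table call. The file, column names and row count are made up for illustration; the arguments themselves (colClasses, nrows, comment.char, quote) are standard read.table arguments. How much time they save depends on your data.

```r
# Write a small illustrative file (hypothetical columns, temporary file name).
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(
  id    = 1:1000,
  value = rnorm(1000),
  label = sample(c("one", "two", "three"), 1000, replace = TRUE)
)
write.table(demo, tmp, sep = ",", quote = FALSE, row.names = FALSE)

# Telling read.table what to expect saves it from guessing.
tuned <- read.table(
  tmp,
  sep          = ",",
  header       = TRUE,
  colClasses   = c("integer", "numeric", "character"),  # skip type detection
  nrows        = 1000,  # known row count helps memory allocation
  comment.char = "",    # do not scan for comment characters
  quote        = ""     # no quote processing; the file was written unquoted
)

str(tuned)
```

Even with these settings, read.table remains a general-purpose reader, so treat this as a way to shave off some time rather than a replacement for fread.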
Here is a short comparison of how the read time improves when using fread instead of read.table on a randomly generated data set. The data set has more than 8 million rows and 8 columns (seven generated variables plus the row names that write.table adds by default) and comes to around 570 MB:
library(data.table)

# Number of rows; depends on the current date (8,336,000 on the day of this benchmark)
n <- as.numeric(Sys.Date()) * 500

sampledata <- data.table(
  r1 = sample(1:10000, n, replace = TRUE),
  r2 = rnorm(n),
  r3 = sample(1:5000, n, replace = TRUE),
  r4 = sample(c("one", "two", "three", "four", "five"), n, replace = TRUE),
  r5 = rpois(n, lambda = 2000),
  r6 = sample(state.name, n, replace = TRUE),
  r7 = runif(n, 0, 42)
)

# row.names is left at its default (TRUE), so the row index becomes an eighth column
write.table(sampledata, "sample.csv", sep = ",", quote = FALSE)

# Time both readers on the same file
system.time(df1 <- read.table("sample.csv", sep = ","))
system.time(df2 <- fread("sample.csv"))
The results are clear: reading all the data with the basic read.table command takes more than 100 seconds, while fread completes the task in less than 20 seconds:
system.time(df1 <- read.table("sample.csv", sep=","))
   user  system elapsed
 105.23    2.09  108.48

system.time(df2 <- fread("sample.csv"))
Read 8336000 rows and 8 (of 8) columns from 0.566 GB file in 00:00:18
   user  system elapsed
  16.94    0.20   17.40
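To put the two elapsed times in proportion, the speedup can be worked out directly from the numbers reported above. This is a small sketch using only those two figures; the variable names are mine, and the ratio applies to this one run on this one machine.

```r
# Elapsed times in seconds, copied from the system.time output above.
elapsed_read_table <- 108.48
elapsed_fread      <- 17.40

# Ratio of the two elapsed times: how many times faster fread was in this run.
speedup <- elapsed_read_table / elapsed_fread
round(speedup, 1)
```

In this run fread was roughly six times faster than read.table. Expect the exact factor to vary with hardware, file layout and package version.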