Hey everybody,
after a short summer break we are ready to emerge from several weeks in the doldrums 🙂
If you are working with large data sets stored in .csv format, the files are likely to comprise hundreds of MB or even several GB of data. Importing these csv files with the common read.csv or read.table command will take a considerable amount of time. You can certainly make read.table run a bit faster by knowing your data set and specifying its parameters accordingly (see the sketch below), or you can use scan instead, but the fastest and most convenient method I have come across so far is fread from the package data.table. Thus, data.table is a really helpful package that improves efficiency in R when you work with large csv files. Other packages you might want to check out in this context are readr, sqldf, bigmemory and ff.
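To give an idea of what specifying the parameters means in practice: if you already know the column types and roughly how many rows to expect, telling read.table about them saves R a lot of guessing. A minimal sketch, assuming a hypothetical file mydata.csv with a header row and three columns (an integer id, a numeric value and a character label):

# read.table sped up by declaring what we already know about the file
df <- read.table("mydata.csv",
                 sep = ",",
                 header = TRUE,
                 colClasses = c("integer", "numeric", "character"),  # skip type guessing
                 nrows = 10e6,          # a generous upper bound on the row count
                 comment.char = "")     # do not scan for comment characters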
Here is a short comparison of how the import time improves when using fread instead of read.table on a randomly generated data set. Our random data set features more than 8 million rows and 8 columns and comprises around 570 MB of data:
library(data.table)

# generate a random sample data set with more than 8 million rows
n <- as.numeric(Sys.Date()) * 500
sampledata <- data.table(
  r1 = sample(1:10000, n, replace = TRUE),
  r2 = rnorm(n),
  r3 = sample(1:5000, n, replace = TRUE),
  r4 = sample(c("one", "two", "three", "four", "five"), n, replace = TRUE),
  r5 = rpois(n, lambda = 2000),
  r6 = sample(state.name, n, replace = TRUE),
  r7 = runif(n, 0, 42)
)

# write the data to disk, then time both import methods
write.table(sampledata, "sample.csv", sep = ",", quote = FALSE)
system.time(df1 <- read.table("sample.csv", sep = ","))
system.time(df2 <- fread("sample.csv"))
The results are quite clear: while it takes more than 100 seconds to read all the data using the basic read.table command, fread completes the same task in less than 20 seconds:
system.time(df1 <- read.table("sample.csv", sep=","))
   user  system elapsed
 105.23    2.09  108.48

system.time(df2 <- fread("sample.csv"))
Read 8336000 rows and 8 (of 8) columns from 0.566 GB file in 00:00:18
   user  system elapsed
  16.94    0.20   17.40
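One detail worth noting: fread returns a data.table rather than a plain data.frame (it also inherits from data.frame, so most downstream code works unchanged). If you prefer a plain data.frame, here is a small sketch of two options:

# fread returns a data.table by default
class(df2)
# [1] "data.table" "data.frame"

# option 1: ask fread for a plain data.frame directly
df3 <- fread("sample.csv", data.table = FALSE)

# option 2: convert the existing object in place, without copying
setDF(df2)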