Hello there!
Welcome to the first part of my series on data management with R. Today I am going to introduce the {tidyr}
package that is used to prepare and clean data before further transformation steps are applied. Its main purpose is to make the underlying information more accessible and easier to use in other R functions. (Please note that the content of this post follows the webinar on Data wrangling with R by the RStudio-Team.)
Lets start with some basics – tidy data, what’s that?:
Datasets come in many formats, most R functions however prefere a certain format type to others. This prefered format is called tidy and has the following characteristics:
- Each variable is saved in its own column.
- Each obervation is saved in its own row.
- Each “type” of obersvation is stored in a single table.
This is how a tidy dataset looks like (using free data from the EDAWR package):
devtools::install_github("rstudio/EDAWR") library(EDAWR) storms Source: local data frame [6 x 4] storm wind pressure date 1 Alberto 110 1007 2000-08-03 2 Alex 45 1009 1998-07-27 3 Allison 65 1005 1995-06-03 4 Ana 40 1013 1997-06-30 5 Arlene 50 1010 1999-06-11 6 Arthur 45 1010 1996-06-17
As you can see every variable (stormname, wind, pressure, date) has its own column and every obsertvation is saved in its own row.
The next examples show non-tidy datasets from the same package:
cases country 2011 2012 2013 1 FR 7000 6900 7000 2 DE 5800 6000 6200 3 US 15000 14000 13000
The cases
datasets has three variables: the country code in each row, the year in each column and a count for every combination of row and columns. As you can see this structure doesn’t comply with the characteristics of a tidy dataset mentioned above.
Let’s have a look at another example of a non-tidy dataset called pollution:
pollution city size amount 1 New York large 23 2 New York small 14 3 London large 22 4 London small 16 5 Beijing large 121 6 Beijing small 56
Here we have three variables: city names, amount of small particles and amount of large particles each city has. Again, this isn’t complying with the characteristics of a tidy dataset. The next picture summarises the structure of all three datasets graphically:

Tidy vs. Messy Data by R Studio
So you are asking yourself right now, what are the advantages of tidy data?
Advantages of tidy data:
To explain the advantages of a tidy data format in R, we will have a look on how to pull out values out of the above mentioned datasets. Let’s start with the tidy dataset storms.
If you for example want to extract the storm names and pressure, this is the straight forward and works like this:
storms$storm storm$pressure
The second example shows the “untidy” pollution
dataset. If we want to extract its city names or amounts of all large particles, the code already looks much messier and more complex:
pollution$city[1,3,5] pollution$amount[2,4,6]
You can see that a tidy dataset enables an automatised extraction of values that always follows the same pattern. This automatisation is not always possible with untidy data. Imagine for example large datasets with several thousand observations, here value extraction will soon get very complicated and time consuming. Shortly said: the tidy format is the most efficient way to operate R.
How to tidy up data?
So how do I make my dataset tidy, you ask? A very nice package that is able to clean up your data is called tidyr
. Its main purpose is to reshape the layout of dataframes where two main functions come to its use: gather()
and spread()
.
gather()
If we look up the documentation for this function we can find: “You use gather() when you notice that you have columns that are not variables.” This statement applies to the cases
dataset described above. So let`s try to make it tidy by using the gather()
function.
tidy.cases <- gather(cases, "year", "n", 2:4)
Lets compare the resulting dataframe to the initial dataframe:
#Inital DF: cases country 2011 2012 2013 1 FR 7000 6900 7000 2 DE 5800 6000 6200 3 US 15000 14000 13000 #Tidy DF: tidy.cases country year n 1 FR 2011 7000 2 DE 2011 5800 3 US 2011 15000 4 FR 2012 6900 5 DE 2012 6000 6 US 2012 14000 7 FR 2013 7000 8 DE 2013 6200 9 US 2013 13000
The result is a tidy dataframe with three columns, where every column represents a variable and every obeservation is saved in its own row. The following picture below gives an insight how the gather()
function works in detail:

gather() – tidyr by RStudio
spread()
The second main function from tidyr is spread()
. This function takes data that is in a key-value format and returns a rectangular tidy cell format. This may sound confusing, so lets apply this function on the pollution dataframe to illustrate its functionality:
tidy.pollution <- spread(polution, size, amount)
Let`s again compare the resulting tidy dataframe to the initial dataframe:
#Initial DF: pollution city size amount 1 New York large 23 2 New York small 14 3 London large 22 4 London small 16 5 Beijing large 121 6 Beijing small 56 #Tidy DF: tidy.pollution city large small 1 Beijing 121 56 2 London 22 16 3 New York 23 14
As you can see spread()
restructures the dataframe by removing redundant rows without losing any information. The following picture below gives an insight how the spread()
function works in detail:

spread – tidyr by RStudio
Overview – gather vs. spread
If you paid close attention in the sections above, you maybe noticed that gather()
and spread()
actually do the opposite of each other. As gather()
is used to tranform a dataframe from a cell format to a tidy key-value format, spread()
makes the transition from a cell format to a tidy key-value format. The picture below summarises this:

spread vs. gather by RStudio
You don’t have to apply these two functions to the same columns each time you use them. This means, that by iterating over those to functions with different parameters, you can take the same data and reshape its layout into the right format need for your analysis.
Other useful functions from tidyr – separate & unite
Two other useful functions form tidy are unite()
and separate()
. To show their use, lets look at the storms
dataframe again:
storm wind pressure date 1 Alberto 110 1007 2000-08-03 2 Alex 45 1009 1998-07-27 3 Allison 65 1005 1995-06-03 4 Ana 40 1013 1997-06-30 5 Arlene 50 1010 1999-06-11 6 Arthur 45 1010 1996-06-17
separate()
can be used to seperate a column into multiple other columns by using a separator. Let’s for example say, instead of the date format YYYY-MM-DD in a single column, we want three sepearate columns: one with the year, one with the month and one with the day. This can be achieved with the following command:
#create new df with 3 new columns "year", "month", "day" out of the inital "date" column storms.sep <- separate(storms, date, c("year", "month", "day"), sep="-") storm wind pressure year month day 1 Alberto 110 1007 2000 08 03 2 Alex 45 1009 1998 07 27 3 Allison 65 1005 1995 06 03 4 Ana 40 1013 1997 06 30 5 Arlene 50 1010 1999 06 11 6 Arthur 45 1010 1996 06 17
unite()
does the exact opposite. It unites multiple columns into a single column. This can be demonstrated using our newly created storms.sep
dataframe:
#unite the columns 4:6 into a new column unite(storms.sep, "date" , 4:6 , sep ="-") storm wind pressure date 1 Alberto 110 1007 2000-08-03 2 Alex 45 1009 1998-07-27 3 Allison 65 1005 1995-06-03 4 Ana 40 1013 1997-06-30 5 Arlene 50 1010 1999-06-11 6 Arthur 45 1010 1996-06-17
I hope you enjoyed the brief introduction into the tidyr
package and I would be happy if you will apply some of the mentioned functions in your next data processing work to make it more efficent and time saving.
The next part will be on dplyr
and the %>%
operator. Hope to see you soon!
Cheers
Martin
4 Comments
You can post comments in this post.
Hi very helpful resource,
I have a question/problem with tidy, spread() operator using my own function, lest see the example below.
my_summarise %
group_by(!!group_var1,!!group_var2) %>%
summarise(CR=mean(a))%>%
spread(!!group_var2,CR)
}
then when we write the function we have to use quo() operator,
my_summarise(df = Data_S,quo(Age),quo(segment))
and works with the group_by area (diplyr), but not with the spread(), we get the follow error
Error: Invalid column specification
Any helps?
Thanks in advance
Gonzalo 6 years ago
Hii,
Great Post and the best thing for a blog post is to engage people with it.
Charles 4 years ago
Thanks for the wonderful article of data management with R, I think it is helpful for me.
Data Conversion Service 4 years ago
Hi, I am peter ross I own business in the USA . The blog you have posted was very informative keep posting stuff like this. I would like to read more from you.
Peter Ross 3 years ago
Post A Reply