Data Management with R {tidyr} – Part 1

Tidy vs Messy Data

Hello there!

Welcome to the first part of my series on data management with R. Today I am going to introduce the {tidyr} package that is used to prepare and clean data before further transformation steps are applied. Its main purpose is to make the underlying information more accessible and easier to use in other R functions. (Please note that the content of this post follows the webinar on Data wrangling with R by the RStudio-Team.)

Lets start with some basics – tidy data, what’s that?:

Datasets come in many formats, most R functions however prefere a certain format type to others. This prefered format is called tidy and has the following characteristics:

  1. Each variable is saved in its own column.
  2. Each obervation is saved in its own row.
  3. Each “type” of obersvation is stored in a single table.

This is how a tidy dataset looks like (using free data from the EDAWR package):

devtools::install_github("rstudio/EDAWR")
library(EDAWR)
storms
Source: local data frame [6 x 4]

    storm wind pressure       date
1 Alberto  110     1007 2000-08-03
2    Alex   45     1009 1998-07-27
3 Allison   65     1005 1995-06-03
4     Ana   40     1013 1997-06-30
5  Arlene   50     1010 1999-06-11
6  Arthur   45     1010 1996-06-17

As you can see every variable (stormname, wind, pressure, date) has its own column and every obsertvation is saved in its own row.

The next examples show non-tidy datasets from the same package:

cases
  country  2011  2012  2013
1      FR  7000  6900  7000
2      DE  5800  6000  6200
3      US 15000 14000 13000

The cases datasets has three variables: the country code in each row,  the year in each column and a count for every combination of row and columns. As you can see this structure doesn’t comply with the characteristics of a tidy dataset mentioned above.

Let’s have a look at another example of a non-tidy dataset called pollution:

pollution
      city  size amount
1 New York large     23
2 New York small     14
3   London large     22
4   London small     16
5  Beijing large    121
6  Beijing small     56

Here we have three variables: city names, amount of small particles and amount of large particles each city has. Again, this isn’t complying with the characteristics of a tidy dataset. The next picture summarises the structure of all three datasets graphically:

Tidy vs Messy Data

Tidy vs. Messy Data by R Studio

So you are asking yourself right now, what are the advantages of tidy data?

Advantages of tidy data:

To explain the advantages of a tidy data format in R, we will have a look on how to pull out values out of the above mentioned datasets. Let’s start with the tidy dataset storms. If you for example want to extract the storm names and pressure, this is the straight forward and works like this:

storms$storm
storm$pressure

The second example shows the “untidy” pollution dataset. If we want to extract its city names or amounts of all large particles, the code already looks much messier and more complex:

pollution$city[1,3,5]
pollution$amount[2,4,6]

You can see that a tidy dataset enables an automatised extraction of values that always follows the same pattern. This automatisation is not always possible with untidy data. Imagine for example large datasets with several thousand observations, here value extraction will soon get very complicated and time consuming. Shortly said: the tidy format is the most efficient way to operate R.

How to tidy up data?

So how do I make my dataset tidy, you ask? A very nice package that is able to clean up your data is called tidyr. Its main purpose is to reshape the layout of dataframes where two main functions come to its use:   gather() and spread().

gather()

If we look up the documentation for this function we can find: “You use gather() when you notice that you have columns that are not variables.” This statement applies to the cases dataset described above.  So let`s try to make it tidy by using the gather() function.

tidy.cases <- gather(cases, "year", "n", 2:4)

Lets compare the resulting dataframe to the initial dataframe:

#Inital DF: cases
  country  2011  2012  2013
1      FR  7000  6900  7000
2      DE  5800  6000  6200
3      US 15000 14000 13000


#Tidy DF: tidy.cases
  country year     n
1      FR 2011  7000
2      DE 2011  5800
3      US 2011 15000
4      FR 2012  6900
5      DE 2012  6000
6      US 2012 14000
7      FR 2013  7000
8      DE 2013  6200
9      US 2013 13000

The result is a tidy dataframe with three columns, where every column represents a variable and every obeservation is saved in its own row. The following picture below gives an insight how the gather() function works in detail:

 

gather() - tidyr by RStudio

gather() – tidyr by RStudio

spread()

The second main function from tidyr is spread(). This function takes data that is in a key-value format and returns a rectangular tidy cell format. This may sound confusing, so lets apply this function on the pollution dataframe to illustrate its functionality:

tidy.pollution <- spread(polution, size, amount)

Let`s again compare the resulting tidy dataframe to the initial dataframe:

#Initial DF: pollution
      city  size amount
1 New York large     23
2 New York small     14
3   London large     22
4   London small     16
5  Beijing large    121
6  Beijing small     56

#Tidy DF: tidy.pollution
      city large small
1  Beijing   121    56
2   London    22    16
3 New York    23    14

As you can see spread() restructures the dataframe by removing redundant rows without losing any information. The following picture below gives an insight how the spread() function works in detail:

spread - tidyr by RStudio

spread – tidyr by RStudio

Overview – gather vs. spread

If you paid close attention in the sections above, you maybe noticed that gather() and spread() actually do the opposite of each other. As gather() is used to tranform a dataframe from a cell format to a tidy key-value format, spread() makes the transition from a cell format to a tidy key-value format. The picture below summarises this:

spread vs. gather by RStudio

spread vs. gather by RStudio

You don’t have to apply these two functions to the same columns each time you use them. This means, that by iterating over those to functions with different parameters, you can take the same data and reshape its layout into the right format need  for your analysis.

Other useful functions from tidyr – separate & unite

Two other useful functions form tidy are unite() and separate(). To show their use, lets look at the storms dataframe again:

    storm wind pressure       date
1 Alberto  110     1007 2000-08-03
2    Alex   45     1009 1998-07-27
3 Allison   65     1005 1995-06-03
4     Ana   40     1013 1997-06-30
5  Arlene   50     1010 1999-06-11
6  Arthur   45     1010 1996-06-17

separate() can be used to seperate a column into multiple other columns by using a separator. Let’s for example say, instead of the date format YYYY-MM-DD in a single column, we want three sepearate columns: one with the year, one with the month and one with the day. This can be achieved with the following command:

#create new df with 3 new columns "year", "month", "day" out of the inital "date" column
storms.sep <- separate(storms, date, c("year", "month", "day"), sep="-")

    storm wind pressure year month day
1 Alberto  110     1007 2000    08  03
2    Alex   45     1009 1998    07  27
3 Allison   65     1005 1995    06  03
4     Ana   40     1013 1997    06  30
5  Arlene   50     1010 1999    06  11
6  Arthur   45     1010 1996    06  17

unite() does the exact opposite. It unites multiple columns into a single column. This can be demonstrated using our newly created storms.sep dataframe:

#unite the columns 4:6 into a new column
unite(storms.sep, "date" , 4:6 , sep ="-")

    storm wind pressure       date
1 Alberto  110     1007 2000-08-03
2    Alex   45     1009 1998-07-27
3 Allison   65     1005 1995-06-03
4     Ana   40     1013 1997-06-30
5  Arlene   50     1010 1999-06-11
6  Arthur   45     1010 1996-06-17

I hope you enjoyed the brief introduction into the tidyr package and I would be happy if you will apply some of the mentioned functions in your next data processing work to make it more efficent and time saving.
The next part will be on dplyr and the %>% operator. Hope to see you soon!

Cheers
Martin

About This Author

Martin was born in Czech Republic and studied at the University of Natural Resources and Life Sciences, Vienna. He is currently working at an Earth Observation Company in Austria, specialised in Land Monitoring. His main interests are: Open-source applications like R, (geospatial) statistics and data-management, web-mapping and visualization. He loves travelling, geocaching, photography and TV series.

1 Comment

You can post comments in this post.


  • Hi very helpful resource,

    I have a question/problem with tidy, spread() operator using my own function, lest see the example below.

    my_summarise %
    group_by(!!group_var1,!!group_var2) %>%
    summarise(CR=mean(a))%>%
    spread(!!group_var2,CR)
    }

    then when we write the function we have to use quo() operator,

    my_summarise(df = Data_S,quo(Age),quo(segment))

    and works with the group_by area (diplyr), but not with the spread(), we get the follow error

    Error: Invalid column specification

    Any helps?

    Thanks in advance

    Gonzalo 1 month ago Reply


Post A Reply

*