Data Management with R {dplyr} – Part 2

dplyr by RStudio


It`s been a while, since my last post on data management with tidyr.

Today I would like to go one step further and finally introduce you to the dplyr package from RStudio.

This post is a summary of the webinar on Data wrangling with R and RStudio, so if you want to learn more, just visit the link above.


dplyr – what`s that?

dplyr is a data management / analysis package for R that helps you to transform tabluar data and to access “hidden” information in your data.

So what are the ways to access information form your data and how do you extract this information? Below you can see a list a possible ways and the corresponding dplyr function that helps you with that task:

  1. Extract existing variables: select()
  2. Extract existing observations: filter()
  3. Derive new variables: mutate()
  4. Change the unit of analysis: summarise()

Extract existing variables: select()

select() is a very simple but useful function, it helps you to pull out extsiting columns/variables out of a dataframe. Let`s load some sample data form the “EDWAR” package:

    storm wind pressure       date
1 Alberto  110     1007 2000-08-03
2    Alex   45     1009 1998-07-27
3 Allison   65     1005 1995-06-03
4     Ana   40     1013 1997-06-30
5  Arlene   50     1010 1999-06-11
6  Arthur   45     1010 1996-06-17

Let`s say we only want to select the storm name and the pressure:

select(storms, storm, pressure)
    storm pressure
1 Alberto     1007
2    Alex     1009
3 Allison     1005
4     Ana     1013
5  Arlene     1010
6  Arthur     1010

Intuitive and easy, right? You can use other notations and functions which make the select function really powerful:

Select everything but
: Select range
contains() Select columns whose name contains a character string
ends_with() Select columns whose name ends wit a string
everything() Select every column
matches() Select columns whose name matches a regular expression
num_range() Select columns named x1, x2, x3, x4, x5
one_of() Select columns whose names are in a group of names
starts_with() Select columns whose name starts with a character string

Extract existing observations: filter()

filter() extracts observations based on a test specified by the user. Let`s have a look at the storms dataset again, but this time we only want to select rows with a wind speed >= 50:

filter(storms, wind>= 50)
    storm wind pressure       date
1 Alberto  110     1007 2000-08-03
2 Allison   65     1005 1995-06-03
3  Arlene   50     1010 1999-06-11

You can combine tests with a comma inbetween tests:

filter(storms, wind>= 50, pressure > 1005)
    storm wind pressure       date
1 Alberto  110     1007 2000-08-03
2  Arlene   50     1010 1999-06-11

Derive new variables: mutate()

With mutate() you can derive and calculate new variables form existing variables. Let`s for example create a new column called ratio that corresponds to the expression pressure/wind:

mutate(storms, ratio = pressure/wind)
    storm wind pressure       date     ratio
1 Alberto  110     1007 2000-08-03  9.154545
2    Alex   45     1009 1998-07-27 22.422222
3 Allison   65     1005 1995-06-03 15.461538
4     Ana   40     1013 1997-06-30 25.325000
5  Arlene   50     1010 1999-06-11 20.200000
6  Arthur   45     1010 1996-06-17 22.444444

There are a lot of functions that make mutate() even more powerful:


pmin(), pmax() Element-wise min and max
cumin(), cummax() Cumulative min and max
cumsum(), cumprod() Cumulative sum and product
between() Are values between a and b?
cume_dist() Cumulative distribution of values
cumall(), cumany() Cumulative all and any
cummean() Cumulative mean
lead(), lag() Copy with values one position
ntile() Bin vector into n buckets
dense_rank(), min_rank() Various ranking methods
percent_rank(), row_number()

Change the unit of analysis: summarise()

summarise() takes a dataframe and calculates statistics from it. Let`s look at an example using the pollution dataframe:

      city  size amount
1 New York large     23
2 New York small     14
3   London large     22
4   London small     16
5  Beijing large    121
6  Beijing small     56

summarise(pollution, median=median(amount), variance=var(amount))

  median variance
1   22.5   1731.6

Here the pollution dataframe is used to calculate its median and variance of the amount value.
Here is a list of useful functions you can inside the summarise() function:

min(), max() Minimum and maximum values
mean() Mean value
median() Median value
sum() Sum of values
var(), sd() Variance and standard dev. of vector
first() First value in vector
last() Last value in vector
nth() Nth value in a vector
n() The number of values in a vector
n_distinct() The number of distinct values in a vector

That`s it for today. I hope I could give you an good overview and introduction into the dplyr package from RStudio. If you want to learn more, please visit the webinar Data Wrangling with R and RStudio, where you can find a much more detailed version of what I`ve just explained.

The next post will be explaining the %>% pipe operator and how all of the above presented functions can be used together in an efficient way.



About This Author

Martin was born in Czech Republic and studied at the University of Natural Resources and Life Sciences, Vienna. He is currently working at an Earth Observation Company in Austria, specialised in Land Monitoring. His main interests are: Open-source applications like R, (geospatial) statistics and data-management, web-mapping and visualization. He loves travelling, geocaching, photography and TV series.

Post A Reply