Hey!
It`s been a while, since my last post on data management with tidyr.
Today I would like to go one step further and finally introduce you to the dplyr
package from RStudio.
This post is a summary of the webinar on Data wrangling with R and RStudio, so if you want to learn more, just visit the link above.
dplyr – what`s that?
dplyr is a data management / analysis package for R that helps you to transform tabluar data and to access “hidden” information in your data.
So what are the ways to access information form your data and how do you extract this information? Below you can see a list a possible ways and the corresponding dplyr function that helps you with that task:
- Extract existing variables:
select()
- Extract existing observations:
filter()
- Derive new variables:
mutate()
- Change the unit of analysis:
summarise()
Extract existing variables: select()
select() is a very simple but useful function, it helps you to pull out extsiting columns/variables out of a dataframe. Let`s load some sample data form the “EDWAR” package:
install.packages("dplyr") devtools::install_github("rstudio/EDAWR") library(EDAWR) library(dplyr) storms storm wind pressure date 1 Alberto 110 1007 2000-08-03 2 Alex 45 1009 1998-07-27 3 Allison 65 1005 1995-06-03 4 Ana 40 1013 1997-06-30 5 Arlene 50 1010 1999-06-11 6 Arthur 45 1010 1996-06-17
Let`s say we only want to select the storm name and the pressure:
select(storms, storm, pressure) storm pressure 1 Alberto 1007 2 Alex 1009 3 Allison 1005 4 Ana 1013 5 Arlene 1010 6 Arthur 1010
Intuitive and easy, right? You can use other notations and functions which make the select function really powerful:
– | Select everything but |
: | Select range |
contains() | Select columns whose name contains a character string |
ends_with() | Select columns whose name ends wit a string |
everything() | Select every column |
matches() | Select columns whose name matches a regular expression |
num_range() | Select columns named x1, x2, x3, x4, x5 |
one_of() | Select columns whose names are in a group of names |
starts_with() | Select columns whose name starts with a character string |
Extract existing observations: filter()
filter() extracts observations based on a test specified by the user. Let`s have a look at the storms dataset again, but this time we only want to select rows with a wind speed >= 50:
filter(storms, wind>= 50) storm wind pressure date 1 Alberto 110 1007 2000-08-03 2 Allison 65 1005 1995-06-03 3 Arlene 50 1010 1999-06-11
You can combine tests with a comma inbetween tests:
filter(storms, wind>= 50, pressure > 1005) storm wind pressure date 1 Alberto 110 1007 2000-08-03 2 Arlene 50 1010 1999-06-11
Derive new variables: mutate()
With mutate() you can derive and calculate new variables form existing variables. Let`s for example create a new column called ratio that corresponds to the expression pressure/wind:
mutate(storms, ratio = pressure/wind) storm wind pressure date ratio 1 Alberto 110 1007 2000-08-03 9.154545 2 Alex 45 1009 1998-07-27 22.422222 3 Allison 65 1005 1995-06-03 15.461538 4 Ana 40 1013 1997-06-30 25.325000 5 Arlene 50 1010 1999-06-11 20.200000 6 Arthur 45 1010 1996-06-17 22.444444
There are a lot of functions that make mutate() even more powerful:
pmin(), pmax() | Element-wise min and max |
cumin(), cummax() | Cumulative min and max |
cumsum(), cumprod() | Cumulative sum and product |
between() | Are values between a and b? |
cume_dist() | Cumulative distribution of values |
cumall(), cumany() | Cumulative all and any |
cummean() | Cumulative mean |
lead(), lag() | Copy with values one position |
ntile() | Bin vector into n buckets |
dense_rank(), min_rank() | Various ranking methods |
percent_rank(), row_number() |
Change the unit of analysis: summarise()
summarise() takes a dataframe and calculates statistics from it. Let`s look at an example using the pollution dataframe:
pollution city size amount 1 New York large 23 2 New York small 14 3 London large 22 4 London small 16 5 Beijing large 121 6 Beijing small 56 summarise(pollution, median=median(amount), variance=var(amount)) median variance 1 22.5 1731.6
Here the pollution dataframe is used to calculate its median and variance of the amount value.
Here is a list of useful functions you can inside the summarise() function:
min(), max() | Minimum and maximum values |
mean() | Mean value |
median() | Median value |
sum() | Sum of values |
var(), sd() | Variance and standard dev. of vector |
first() | First value in vector |
last() | Last value in vector |
nth() | Nth value in a vector |
n() | The number of values in a vector |
n_distinct() | The number of distinct values in a vector |
That`s it for today. I hope I could give you an good overview and introduction into the dplyr package from RStudio. If you want to learn more, please visit the webinar Data Wrangling with R and RStudio, where you can find a much more detailed version of what I`ve just explained.
The next post will be explaining the %>% pipe operator and how all of the above presented functions can be used together in an efficient way.
Cheers
Martin
Post A Reply