6 Tidyverse: dplyr

Now that you understand vectors, tables, functions and pipes, and you know what our end goal is (a tidy table), we can start with data wrangling and Tidyverse way of doing it. All functions discussed below are part of dplyr46 “grammar of data manipulation” package. Grab the exercise notebook!

6.1 Tidyverse philosophy

Data analysis is different from “normal” programming as it mostly involves a series of sequential operations on the same table. You might load the table, transform some variables, filter data, select smaller subset of columns, aggregate by summarizing across different groups of variables before plotting it or formally analyzing it via statistical tests. Tidyverse is built around this serial nature of data analysis of piping a table through a chain of functions. Accordingly, Tidyverse functions take a table (data.frame or tibble) as their first parameter, which makes piping simpler, and return a modified table as an output. This table-in → table-out consistency makes it easy to pipe these operations one after another. For me, it helps to think about Tidyverse functions as verbs: Actions that I perform on the table at each step.

Here is quick teaser of how such sequential piping works. Below, we will examine each verb/function separately and I will also show you how same operations can be carried out using base R. Note that I put each verb/function on a separate line. I don’t need to do this, but it makes it easier to understand how many different operations you perform (number of lines), how complex they are (how long individuals lines of code are), and makes them easy to read line-by-line. Note that even though we have many lines, it is fairly easy to follow the entire code sequence.

# miles-per-gallon to kilometers-per-liter conversion factor
mpg2kpl <- 2.82481061

mpg |>
  # we filter the table by rows, 
  # only keeping rows for which year is 2008
  filter(year == 2008) |>
  
  # we change cty and hwy columns by turning
  # miles/gallon into liters/kilometer
  mutate(cty_kpl = cty / mpg2kpl,
         hwy_kpl = hwy / mpg2kpl) |>
  
  # we create a new column by computing an 
  # average efficiency as mean between city and highway cycles
  mutate(avg_kpl = (cty_kpl + hwy_kpl) / 2) |>
  
  # we convert kilometers-per-liter to liters for 100 KM
  mutate(avg_for_100 = 100 / avg_kpl) |>
  
  # we reduce the table to only two columns
  # class (of car) and avg_mpg
  select(class, avg_for_100)  |>

  # we group by each class of car
  # and compute average efficiency for each group (class of car)
  group_by(class) |>
  summarise(class_avg = mean(avg_for_100), .groups = "keep") |>
  
  # we round the value to just one digit after the decimal point
  mutate(class_avg = round(class_avg, 1)) |>
  
  # we sort table rows to go from best to worst on efficiency
  arrange(class_avg) |>
  
  # we rename the class_avg_lpk to have a more meaningul name
  rename("Average liters per 100 KM" = class_avg) |>

  # we kable (Knit the tABLE) to make it look nicer in the document
  knitr::kable()
class Average liters per 100 KM
compact 11.6
midsize 12.0
subcompact 12.6
2seater 14.0
minivan 15.3
suv 18.1
pickup 19.5

6.2 select() columns by name

Select verb allows you to select/pick columns in a table using their names. This is very similar to using columns names as indexes for tables that you have learned in seminar 3.

First, let us make a shorter version of mpg table by keeping only the first five rows. Note that you can also pick first N rows via head() function or slide_head() function from dplyr itself .

short_mpg <- mpg[1:5, ]

# same "first five rows" but via head() function
short_mpg <- head(mpg, 5)

# same "first five rows" but via dplyr::slice_head() function
short_mpg <- slice_head(mpg, n = 5)


knitr::kable(short_mpg)
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact

Here is how you can select only model and cty columns via preserving [] subsetting

short_mpg[, c("model", "cty")] 
model cty
a4 18
a4 21
a4 20
a4 21
a4 16

And here is how it is done via select().

short_mpg |>
  select(model, cty)
model cty
a4 18
a4 21
a4 20
a4 21
a4 16

The idea of Tidyverse functions is to adopt to you, so you can use quotes or pass a vector of strings with column names. All calls below produce the same effect, so pick the style you prefer (mine, is in the code above) and stick to it47.

short_mpg |>
  select(c("model", "cty"))
  
short_mpg |>
  select("model", "cty")
  
short_mpg |>
  select(c(model, cty))

As you surely remember, you can use negation to select other indexes within a vector (c(4, 5, 6)[-2] gives you [4, 6]). For the single brackets [] this mechanism does not work with column names (only with their indexes). However, select has you covered, so we can select everything but cty and model

short_mpg |>
  select(-cty, -model)
manufacturer displ year cyl trans drv hwy fl class
audi 1.8 1999 4 auto(l5) f 29 p compact
audi 1.8 1999 4 manual(m5) f 29 p compact
audi 2.0 2008 4 manual(m6) f 31 p compact
audi 2.0 2008 4 auto(av) f 30 p compact
audi 2.8 1999 6 auto(l5) f 26 p compact

In the current version of dplyr, you can do the same negation via ! (a logical not operator, you will meet later), moreover, it is now a recommended way of writing the selection48. The - and ! are not synonyms and the difference is subtle but important, see below.

# This will NOT produce the same result as above
# Note that the model column is still in the table
short_mpg |>
  select(!cty, !model)
manufacturer model displ year cyl trans drv hwy fl class cty
audi a4 1.8 1999 4 auto(l5) f 29 p compact 18
audi a4 1.8 1999 4 manual(m5) f 29 p compact 21
audi a4 2.0 2008 4 manual(m6) f 31 p compact 20
audi a4 2.0 2008 4 auto(av) f 30 p compact 21
audi a4 2.8 1999 6 auto(l5) f 26 p compact 16

However, if you stick to putting all column names into a vector, as with the direct selection above, you can use negation with names as strings, you can negate a vector of names, etc. Again, it is mostly a matter of taste with consistency being more important than a specific choice you make.

# will produce the same result as for "-"
short_mpg |>
  select(!c("cty", "model"))

short_mpg |>
  select(!"cty", !"model")
  
short_mpg |>
  select(!c(cty, model))

Unlike vector indexing that forbids mixing positive and negative indexing, select does allow it. However, do not use it49 because results can be fairly counter-intuitive and, on top of that, - and ! work somewhat differently. Note the difference between ! and -: In the former case only the !model part appears to have the effect, whereas in case of - only cty works.

short_mpg |>
  select(cty, !model)
cty manufacturer displ year cyl trans drv hwy fl class
18 audi 1.8 1999 4 auto(l5) f 29 p compact
21 audi 1.8 1999 4 manual(m5) f 29 p compact
20 audi 2.0 2008 4 manual(m6) f 31 p compact
21 audi 2.0 2008 4 auto(av) f 30 p compact
16 audi 2.8 1999 6 auto(l5) f 26 p compact
short_mpg |>
  select(cty, -model)
cty
18
21
20
21
16

To make things even, worse select(-model, cty) work the same way as select(cty, !model) (sigh…)

short_mpg |>
  select(-model, cty)
manufacturer displ year cyl trans drv cty hwy fl class
audi 1.8 1999 4 auto(l5) f 18 29 p compact
audi 1.8 1999 4 manual(m5) f 21 29 p compact
audi 2.0 2008 4 manual(m6) f 20 31 p compact
audi 2.0 2008 4 auto(av) f 21 30 p compact
audi 2.8 1999 6 auto(l5) f 16 26 p compact

So, bottom line, do not mix positive and negative indexing in select! I am showing you this only to signal the potential danger.

Do exercise 1.

Simple names and their negation will be sufficient for most of your projects. However, I would recommend taking a look at the official manual just to see that select offers a lot of flexibility (selecting range of columns, by column type, by partial name matching, etc), something that might be useful for you in your work.

6.3 Conditions

Before we can work with the next verb, you need to understand conditions. Conditions are statements about values that are either TRUE or FALSE. In the simplest case, you can check whether two values (one in a variable and one hard-coded) are equal via == operator

x <- 5
print(x == 5)
## [1] TRUE
print(x == 3)
## [1] FALSE

For numeric values, you can use all usual comparison operators including not equal !=, less than <, greater than >, less than or equal to <= (note the order of symbols!), and greater than or equal to >= (again, note the order of symbols).

Do exercise 2.

You can negate a statement via not ! symbol as !TRUE is FALSE and vice versa. However, note that round brackets in the examples below! They are critical to express the order of computation. Anything inside the brackets is evaluated first. And if you have brackets inside the brackets, similar to nested functions, it is the innermost expression that get evaluated first. In the example below, x==5 is evaluated first and logical inversion happens only after it. In this particular example, you may not need them but I would suggest using them to ensure clarity.

x <- 5
print(!(x == 5))
## [1] FALSE
print(!(x == 3))
## [1] TRUE

Do exercise 3.

You can also combine several conditions using and & and or | operators50. Again, note round brackets that explicitly define what is evaluated first.

x <- 5
y <- 2

# x is not equal to 5 OR y is equal to 1
print((x != 5) | (y == 1))
## [1] FALSE
# x less than 10 AND y is greater than or equal to 1
print((x < 10) & (y >= 1))
## [1] TRUE

Do exercise 4.

All examples above used scalars but you remember that everything is a vector, including values that we used (they are just vectors of length one). Accordingly, same logic works for vectors of arbitrary length with comparisons working element-wise, so you get a vector of the same length with TRUE or FALSE values for each pairwise comparison.

Do exercise 5.

6.4 Logical indexing

In the second seminar, you learned about vector indexing when you access some elements of a vector by specifying their index. There is an alternative way, called logical indexing. Here, you supply a vector of equal length with logical values and you get elements of the original vector whenever the logical value is TRUE

x <- 1:5
x[c(TRUE, TRUE, FALSE, TRUE, FALSE)]
## [1] 1 2 4

This is particularly useful, if you are interested in elements that satisfy certain condition. For example, you want all negative values and you can use condition x<5 that will produce a vector of logical values that, in turn, can be used as index

x <- c(-2, 5, 3, -5, -1)
x[x < 0]
## [1] -2 -5 -1

You can have conditions of any complexity by combining them via and & and or | operators. For example, if you want number below -1 or above 3 (be careful to have space between < and -, otherwise it will be interpreted as assignment <-).

x <- c(-2, 5, 3, -5, -1)
x[(x < -1) | (x > 3)]
## [1] -2  5 -5

Do exercise 6.

Sometimes you may want to know the actual index of elements for which some condition is TRUE. Function which() does exactly that.

x <- c(-2, 5, 3, -5, -1)
which( (x< -1) | (x>3) )
## [1] 1 2 4

6.5 filter() rows by values

Now that you understand conditions and logical indexing, using filter() is very straightforward: You simply put condition that describes rows that you want to retain inside the filter() call. For example, we can look at efficiency only for two-seater cars.

mpg |>
  filter(class == "2seater")
manufacturer model displ year cyl trans drv cty hwy fl class
chevrolet corvette 5.7 1999 8 manual(m6) r 16 26 p 2seater
chevrolet corvette 5.7 1999 8 auto(l4) r 15 23 p 2seater
chevrolet corvette 6.2 2008 8 manual(m6) r 16 26 p 2seater
chevrolet corvette 6.2 2008 8 auto(s6) r 15 25 p 2seater
chevrolet corvette 7.0 2008 8 manual(m6) r 15 24 p 2seater

You can use information from any row, so we can look for midsize cars with four-wheel drive.

mpg |>
  filter(class == "midsize" & drv == "4")
manufacturer model displ year cyl trans drv cty hwy fl class
audi a6 quattro 2.8 1999 6 auto(l5) 4 15 24 p midsize
audi a6 quattro 3.1 2008 6 auto(s6) 4 17 25 p midsize
audi a6 quattro 4.2 2008 8 auto(s6) 4 16 23 p midsize

Do exercise 7.

Note that you can emulate filter() in a very straightforward way using single-brackets base R, the main difference is that you need to prefix every column with the table name, so mpg[["class"]] instead of just class51.

mpg[mpg[["class"]] == "midsize" & mpg[["drv"]] == "4", ]
manufacturer model displ year cyl trans drv cty hwy fl class
audi a6 quattro 2.8 1999 6 auto(l5) 4 15 24 p midsize
audi a6 quattro 3.1 2008 6 auto(s6) 4 17 25 p midsize
audi a6 quattro 4.2 2008 8 auto(s6) 4 16 23 p midsize

So why use filter() then? In isolation, as a single line computation, both options are equally compact and clear (apart from all the extra table[["..."]] in base R). But pipe-oriented nature of the filter() makes it more suitable for chains of computations, which is the main advantage of Tidyverse.

6.6 arrange() rows in a particular order

Sometimes you might need to sort your table so that rows go in a particular order52. In Tidyverse, you arrange rows based on values of specific variables. This verb is very straightforward, you simply list all variables, which must be used for sorting, in the order the sorting must be carried out. I.e., first the table is sorted based on values of the first variable. Then, for equal values of that variable, rows are sorted based on the second variable, etc. By default, rows are arranged in ascending order but you can reverse it by putting a variable inside of desc() function. Here is the short_mpg table arranged by city cycle highway efficiency (ascending order) and engine displacement (descending order, note the order of the last two rows).

short_mpg |>
  arrange(cty, desc(displ)) 
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact

Do exercise 8.

You can arrange a table using base R via order() function that gives index of ordered elements and can be used inside of preserving subsetting via single brackets [] notation. You can control for ascending/descending of a specific variable using rev() function that is applied after ordering, so rev(order(...)).

short_mpg[order(short_mpg[["cty"]], rev(short_mpg[["displ"]])), ]
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact

Do exercise 9.

6.7 mutate() columns

In Tidyverse, mutate function allows you to both add new columns/variables to a table and change the existing ones. In essence, it is equivalent to a simple column assignment statement in base R.

# base R
short_mpg[["avg_mpg"]] <- (short_mpg[["cty"]] + short_mpg[["hwy"]]) / 2

# Tidyverse equivalent
short_mpg <- 
  short_mpg |>
  mutate(avg_mpg = (cty + hwy) / 2)

Note two critical differences. First, mutate() takes a table as an input and returns a table as an output. This is why you start with a table, pipe it to mutate, and assign the results back to the original variable. If you have more verbs/lines, it is the output of the last computation that is assigned to the variable on the left-hand side of assignment53. Look at the listing below that indexes each line by when it is executed.

some_table <-    # 3. We assign the result to the original table, only once all the code below has been executed.
  some_table |>  # 1. We start here, with the original table and pipe to the next computation
  mutate(...)    # 2. We add/change columns inside of the table. The output is a table which we use for assignment all the way at the top.

Second, you are performing a computation inside the call of the mutate() function, so avg_mpg = (short_mpg$cty + short_mpg$hwy) / 2 is a parameter that you pass to it (yes, it does not look like one). This is why you use = rather than a normal assignment arrow <-. Unfortunately, you can use <- inside the mutate and the computation will work as intended but, for internal-processing reasons, the entire statement, rather than just the left-hand side, will be used as a column name. Thus, use <- outside and = inside of Tydiverse verbs.

short_mpg |>
  select(cty, hwy) |>
  mutate(avg_mpg =  (cty + hwy) / 2,      # column name will be avp_mpg
         avg_mpg <- (cty + hwy) / 2) |>  # column name will be `avg_mpg <- (short_mpg$cty + short_mpg$hwy) / 2`
  knitr::kable()
cty hwy avg_mpg avg_mpg <- (cty + hwy)/2
18 29 23.5 23.5
21 29 25.0 25.0
20 31 25.5 25.5
21 30 25.5 25.5
16 26 21.0 21.0

As shown in the example above, you can perform several computations within a single mutate call and they are executed one after another, just as they would be when using base R.

Do exercise 10.

Finally, mutate has two cousin-verbs called transmute and add_column. The former — transmute — works the same way but discards all original columns that were not modified. You probably won’t use this verb all too often but I want you to be able to recognize it, as its name and function are very similar to mutate and the two are easy to confuse.

short_mpg |>
  transmute(avg_mpg = (cty + hwy) / 2) |>
  knitr::kable(align = "c")
avg_mpg
23.5
25.0
25.5
25.5
21.0

The latter — add_column — is similar to mutate if you need to add a new column rather than to modify a new one. Its advantage is that it will produce an error, if you try to overwrite an existing column. Its disadvantage is that it does not appear to respect data grouping (see below), which can be very confusing. In short, stick to mutate unless you need either of these two functions specifically.

6.8 summarize() table

This verb is used when you aggregate across all rows, reducing them to a single value. Some examples of aggregating functions that are probably already familiar to you are mean, median, standard deviation, min/max. However, you can “aggregate” by taking a first or a last value or even by putting in a constant. Important is that you should assign a single value to the column when using summarize.

If you use summarize on an ungrouped table (these are the only tables we’ve been working on so far), it keeps only the computed columns, which makes you wonder “what’s the point?”

mpg |>
  summarise(avg_cty = mean(cty),
            avg_hwy = mean(hwy))
## # A tibble: 1 x 2
##   avg_cty avg_hwy
##     <dbl>   <dbl>
## 1    16.9    23.4

6.9 Work on individual groups of rows

The real power of summarize and of mutate becomes evident when they are applied to the data that is grouped by certain criteria. group_by() verb groups rows of the table based on values of variables you specified. Behind the scenes, this turns your single table into set of tables, so that your Tidyverse verbs are applied to each table separately. This ability to parse your table into different groups of rows (all rows that belong to a particular participant, or participant and condition, or rows per block, or per trial), change that grouping on the fly, return back to the original full table, etc. makes analysis a breeze. Here is how we can compute average efficiency not across all cars (as in the code above) but for each car class separately.

mpg |>
  # there are seven different classes of cars, so group_by(cars)  will
  # create seven hidden independent tables and all verbs below will be 
  # applied to each table separately
  group_by(class)  |>
  
  # same mean computation but per table and we've got seven of them
  summarise(avg_cty = mean(cty),
            avg_hwy = mean(hwy),
            .groups = "drop") |>
  knitr::kable()
class avg_cty avg_hwy
2seater 15.40000 24.80000
compact 20.12766 28.29787
midsize 18.75610 27.29268
minivan 15.81818 22.36364
pickup 13.00000 16.87879
subcompact 20.37143 28.14286
suv 13.50000 18.12903

Note that we compute a single value per table but because we do it for seven tables, we get seven rows in our resultant table. And group_by makes it easy to group data in any way you want. Are you interested in manufacturers instead car classes? Easy!

mpg |>
  group_by(manufacturer)  |>
  # same mean computation but per table and we've got seven of them
  summarise(avg_cty = mean(cty),
            avg_hwy = mean(hwy),
            .groups = "drop") |>
  knitr::kable()
manufacturer avg_cty avg_hwy
audi 17.61111 26.44444
chevrolet 15.00000 21.89474
dodge 13.13514 17.94595
ford 14.00000 19.36000
honda 24.44444 32.55556
hyundai 18.64286 26.85714
jeep 13.50000 17.62500
land rover 11.50000 16.50000
lincoln 11.33333 17.00000
mercury 13.25000 18.00000
nissan 18.07692 24.61538
pontiac 17.00000 26.40000
subaru 19.28571 25.57143
toyota 18.52941 24.91176
volkswagen 20.92593 29.22222

How about efficiency per class and year? Still easy!

mpg |>
  group_by(class, year)  |>
  # same mean computation but per table and we've got seven of them
  summarise(avg_cty = mean(cty),
            avg_hwy = mean(hwy),
            .groups = "drop") |>
  knitr::kable()
class year avg_cty avg_hwy
2seater 1999 15.50000 24.50000
2seater 2008 15.33333 25.00000
compact 1999 19.76000 27.92000
compact 2008 20.54545 28.72727
midsize 1999 18.15000 26.50000
midsize 2008 19.33333 28.04762
minivan 1999 16.16667 22.50000
minivan 2008 15.40000 22.20000
pickup 1999 13.00000 16.81250
pickup 2008 13.00000 16.94118
subcompact 1999 21.57895 29.00000
subcompact 2008 18.93750 27.12500
suv 1999 13.37931 17.55172
suv 2008 13.60606 18.63636

The .groups parameter of the summarize function determines whether grouping you use should be dropped (.groups = "drop", table become a single ungrouped table) or kept (.groups = "keep"). You will get a warning, if you do not specify .groups parameter, so it is a good idea to do this explicitly. In general, use .groups = "drop" as it is better to later regroup table again then work on a groupped table without realizing it (leads to some weird looking output and a tricky to chase problem). You can also explicitly drop grouping via ungroup() verb.

Finally, there is a cousin verb rowwise() that groups by row. I.e., every row of the table becomes its own group, which could be useful if you need to apply a computation per row and the usual mutate() approach does not work (however, this is something of an advanced topic, so it is more about recognizing it than using it).

Do exercise 11.

You can replicate the functionality of group_by + summarize in base R via aggregate() and group_by + mutate via by functions. However, they are somewhat less straightforward in use as they rely on functional programming (which you haven’t learned about yet) and require both grouping and summary function within a single call. Hence, we will skip on those.

6.10 Putting it all together

Now you have enough tools at your disposal to start programming a continuous analysis pipeline!

Do exercise 12.

6.11 Should I use Tidyverse?

As you saw above, whatever Tidyverse can do, base R can do as well. So why use a non-standard family of packages? If you are using each function in isolation, there is probably not much sense in this. Base R can do it equally well and each individual function is also compact and simple. However, if you need to chain your computation, which is almost always the case, Tidyverse’s ability to pipe the entire sequence of functions in a simple consistent and, therefore, easy to understand way is a game-changer. In the long run, pick your style. Either go “all in” with Tidyverse (that is my approach), stick to base R, or find some alternative package family (e.g., data.table). However, as far as the book is concerned, it will be almost exclusively Tydiverse from now on.