6 Tidyverse: dplyr

Now that you understand vectors, tables, functions and pipes, and you know what our end goal is (a tidy table), we can start with data wrangling and Tidyverse way of doing it. All functions discussed below are part of dplyr ⁴⁶ “grammar of data manipulation” package. Grab the exercise notebook!

6.1 Tidyverse philosophy

Data analysis is different from “normal” programming as it mostly involves a series of sequential operations on the same table. You might load the table, transform some variables, filter data, select smaller subset of columns, aggregate by summarizing across different groups of variables before plotting it or formally analyzing it via statistical tests. Tidyverse is built around this serial nature of data analysis of piping a table through a chain of functions. Accordingly, Tidyverse functions take a table (data.frame or tibble) as their first parameter, which makes piping simpler, and return a modified table as an output. This table-in → table-out consistency makes it easy to pipe these operations one after another. For me, it helps to think about Tidyverse functions as verbs: Actions that I perform on the table at each step.

Here is quick teaser of how such sequential piping works. Below, we will examine each verb/function separately and I will also show you how same operations can be carried out using base R. Note that I put each verb/function on a separate line. I don’t need to do this, but it makes it easier to understand how many different operations you perform (number of lines), how complex they are (how long individuals lines of code are), and makes them easy to read line-by-line. Note that even though we have many lines, it is fairly easy to follow the entire code sequence.

# miles-per-gallon to kilometers-per-liter conversion factor
mpg2kpl <- 2.82481061

mpg |>
  # we filter the table by rows, 
  # only keeping rows for which year is 2008
  filter(year == 2008) |>
  
  # we change cty and hwy columns by turning
  # miles/gallon into liters/kilometer
  mutate(cty_kpl = cty / mpg2kpl,
         hwy_kpl = hwy / mpg2kpl) |>
  
  # we create a new column by computing an 
  # average efficiency as mean between city and highway cycles
  mutate(avg_kpl = (cty_kpl + hwy_kpl) / 2) |>
  
  # we convert kilometers-per-liter to liters for 100 KM
  mutate(avg_for_100 = 100 / avg_kpl) |>
  
  # we reduce the table to only two columns
  # class (of car) and avg_mpg
  select(class, avg_for_100)  |>

  # we group by each class of car
  # and compute average efficiency for each group (class of car)
  group_by(class) |>
  summarise(class_avg = mean(avg_for_100), .groups = "keep") |>
  
  # we round the value to just one digit after the decimal point
  mutate(class_avg = round(class_avg, 1)) |>
  
  # we sort table rows to go from best to worst on efficiency
  arrange(class_avg) |>
  
  # we rename the class_avg_lpk to have a more meaningul name
  rename("Average liters per 100 KM" = class_avg) |>

  # we kable (Knit the tABLE) to make it look nicer in the document
  knitr::kable()

class	Average liters per 100 KM
compact	11.6
midsize	12.0
subcompact	12.6
2seater	14.0
minivan	15.3
suv	18.1
pickup	19.5

6.2 `select()` columns by name

Select verb allows you to select/pick columns in a table using their names. This is very similar to using columns names as indexes for tables that you have learned in seminar 3.

First, let us make a shorter version of mpg table by keeping only the first five rows. Note that you can also pick first N rows via head() function or slide_head() function from dplyr itself .

short_mpg <- mpg[1:5, ]

# same "first five rows" but via head() function
short_mpg <- head(mpg, 5)

# same "first five rows" but via dplyr::slice_head() function
short_mpg <- slice_head(mpg, n = 5)


knitr::kable(short_mpg)

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact

Here is how you can select only model and cty columns via preserving [] subsetting

short_mpg[, c("model", "cty")]

model	cty
a4	18
a4	21
a4	20
a4	21
a4	16

And here is how it is done via select().

short_mpg |>
  select(model, cty)

model	cty
a4	18
a4	21
a4	20
a4	21
a4	16

The idea of Tidyverse functions is to adopt to you, so you can use quotes or pass a vector of strings with column names. All calls below produce the same effect, so pick the style you prefer (mine, is in the code above) and stick to it⁴⁷.

short_mpg |>
  select(c("model", "cty"))
  
short_mpg |>
  select("model", "cty")
  
short_mpg |>
  select(c(model, cty))

As you surely remember, you can use negation to select other indexes within a vector (c(4, 5, 6)[-2] gives you [4, 6]). For the single brackets [] this mechanism does not work with column names (only with their indexes). However, select has you covered, so we can select everything but cty and model

short_mpg |>
  select(-cty, -model)

manufacturer	displ	year	cyl	trans	drv	hwy	fl	class
audi	1.8	1999	4	auto(l5)	f	29	p	compact
audi	1.8	1999	4	manual(m5)	f	29	p	compact
audi	2.0	2008	4	manual(m6)	f	31	p	compact
audi	2.0	2008	4	auto(av)	f	30	p	compact
audi	2.8	1999	6	auto(l5)	f	26	p	compact

In the current version of dplyr, you can do the same negation via ! (a logical not operator, you will meet later), moreover, it is now a recommended way of writing the selection⁴⁸. The - and ! are not synonyms and the difference is subtle but important, see below.

# This will NOT produce the same result as above
# Note that the model column is still in the table
short_mpg |>
  select(!cty, !model)

manufacturer	model	displ	year	cyl	trans	drv	hwy	fl	class	cty
audi	a4	1.8	1999	4	auto(l5)	f	29	p	compact	18
audi	a4	1.8	1999	4	manual(m5)	f	29	p	compact	21
audi	a4	2.0	2008	4	manual(m6)	f	31	p	compact	20
audi	a4	2.0	2008	4	auto(av)	f	30	p	compact	21
audi	a4	2.8	1999	6	auto(l5)	f	26	p	compact	16

However, if you stick to putting all column names into a vector, as with the direct selection above, you can use negation with names as strings, you can negate a vector of names, etc. Again, it is mostly a matter of taste with consistency being more important than a specific choice you make.

# will produce the same result as for "-"
short_mpg |>
  select(!c("cty", "model"))

short_mpg |>
  select(!"cty", !"model")
  
short_mpg |>
  select(!c(cty, model))

Unlike vector indexing that forbids mixing positive and negative indexing, select does allow it. However, do not use it⁴⁹ because results can be fairly counter-intuitive and, on top of that, - and ! work somewhat differently. Note the difference between ! and -: In the former case only the !model part appears to have the effect, whereas in case of - only cty works.

short_mpg |>
  select(cty, !model)

cty	manufacturer	displ	year	cyl	trans	drv	hwy	fl	class
18	audi	1.8	1999	4	auto(l5)	f	29	p	compact
21	audi	1.8	1999	4	manual(m5)	f	29	p	compact
20	audi	2.0	2008	4	manual(m6)	f	31	p	compact
21	audi	2.0	2008	4	auto(av)	f	30	p	compact
16	audi	2.8	1999	6	auto(l5)	f	26	p	compact

short_mpg |>
  select(cty, -model)

cty
18
21
20
21
16

To make things even, worse select(-model, cty) work the same way as select(cty, !model) (sigh…)

short_mpg |>
  select(-model, cty)

manufacturer	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	2.8	1999	6	auto(l5)	f	16	26	p	compact

So, bottom line, do not mix positive and negative indexing in select! I am showing you this only to signal the potential danger.

Do exercise 1.

Simple names and their negation will be sufficient for most of your projects. However, I would recommend taking a look at the official manual just to see that select offers a lot of flexibility (selecting range of columns, by column type, by partial name matching, etc), something that might be useful for you in your work.

6.3 Conditions

Before we can work with the next verb, you need to understand conditions. Conditions are statements about values that are either TRUE or FALSE. In the simplest case, you can check whether two values (one in a variable and one hard-coded) are equal via == operator

x <- 5
print(x == 5)

## [1] TRUE

print(x == 3)

## [1] FALSE

For numeric values, you can use all usual comparison operators including not equal !=, less than <, greater than >, less than or equal to <= (note the order of symbols!), and greater than or equal to >= (again, note the order of symbols).

Do exercise 2.

You can negate a statement via not ! symbol as !TRUE is FALSE and vice versa. However, note that round brackets in the examples below! They are critical to express the order of computation. Anything inside the brackets is evaluated first. And if you have brackets inside the brackets, similar to nested functions, it is the innermost expression that get evaluated first. In the example below, x==5 is evaluated first and logical inversion happens only after it. In this particular example, you may not need them but I would suggest using them to ensure clarity.

x <- 5
print(!(x == 5))

## [1] FALSE

print(!(x == 3))

## [1] TRUE

Do exercise 3.

You can also combine several conditions using and & and or | operators⁵⁰. Again, note round brackets that explicitly define what is evaluated first.

x <- 5
y <- 2

# x is not equal to 5 OR y is equal to 1
print((x != 5) | (y == 1))

## [1] FALSE

# x less than 10 AND y is greater than or equal to 1
print((x < 10) & (y >= 1))

## [1] TRUE

Do exercise 4.

All examples above used scalars but you remember that everything is a vector, including values that we used (they are just vectors of length one). Accordingly, same logic works for vectors of arbitrary length with comparisons working element-wise, so you get a vector of the same length with TRUE or FALSE values for each pairwise comparison.

Do exercise 5.

6.4 Logical indexing

In the second seminar, you learned about vector indexing when you access some elements of a vector by specifying their index. There is an alternative way, called logical indexing. Here, you supply a vector of equal length with logical values and you get elements of the original vector whenever the logical value is TRUE

x <- 1:5
x[c(TRUE, TRUE, FALSE, TRUE, FALSE)]

## [1] 1 2 4

This is particularly useful, if you are interested in elements that satisfy certain condition. For example, you want all negative values and you can use condition x<5 that will produce a vector of logical values that, in turn, can be used as index

x <- c(-2, 5, 3, -5, -1)
x[x < 0]

## [1] -2 -5 -1

You can have conditions of any complexity by combining them via and & and or | operators. For example, if you want number below -1 or above 3 (be careful to have space between < and -, otherwise it will be interpreted as assignment <-).

x <- c(-2, 5, 3, -5, -1)
x[(x < -1) | (x > 3)]

## [1] -2  5 -5

Do exercise 6.

Sometimes you may want to know the actual index of elements for which some condition is TRUE. Function which() does exactly that.

x <- c(-2, 5, 3, -5, -1)
which( (x< -1) | (x>3) )

## [1] 1 2 4

6.5 `filter()` rows by values

Now that you understand conditions and logical indexing, using filter() is very straightforward: You simply put condition that describes rows that you want to retain inside the filter() call. For example, we can look at efficiency only for two-seater cars.

mpg |>
  filter(class == "2seater")

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
chevrolet	corvette	5.7	1999	8	manual(m6)	r	16	26	p	2seater
chevrolet	corvette	5.7	1999	8	auto(l4)	r	15	23	p	2seater
chevrolet	corvette	6.2	2008	8	manual(m6)	r	16	26	p	2seater
chevrolet	corvette	6.2	2008	8	auto(s6)	r	15	25	p	2seater
chevrolet	corvette	7.0	2008	8	manual(m6)	r	15	24	p	2seater

You can use information from any row, so we can look for midsize cars with four-wheel drive.

mpg |>
  filter(class == "midsize" & drv == "4")

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a6 quattro	2.8	1999	6	auto(l5)	4	15	24	p	midsize
audi	a6 quattro	3.1	2008	6	auto(s6)	4	17	25	p	midsize
audi	a6 quattro	4.2	2008	8	auto(s6)	4	16	23	p	midsize

Do exercise 7.

Note that you can emulate filter() in a very straightforward way using single-brackets base R, the main difference is that you need to prefix every column with the table name, so mpg[["class"]] instead of just class⁵¹.

mpg[mpg[["class"]] == "midsize" & mpg[["drv"]] == "4", ]

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a6 quattro	2.8	1999	6	auto(l5)	4	15	24	p	midsize
audi	a6 quattro	3.1	2008	6	auto(s6)	4	17	25	p	midsize
audi	a6 quattro	4.2	2008	8	auto(s6)	4	16	23	p	midsize

So why use filter() then? In isolation, as a single line computation, both options are equally compact and clear (apart from all the extra table[["..."]] in base R). But pipe-oriented nature of the filter() makes it more suitable for chains of computations, which is the main advantage of Tidyverse.

6.6 `arrange()` rows in a particular order

Sometimes you might need to sort your table so that rows go in a particular order⁵². In Tidyverse, you arrange rows based on values of specific variables. This verb is very straightforward, you simply list all variables, which must be used for sorting, in the order the sorting must be carried out. I.e., first the table is sorted based on values of the first variable. Then, for equal values of that variable, rows are sorted based on the second variable, etc. By default, rows are arranged in ascending order but you can reverse it by putting a variable inside of desc() function. Here is the short_mpg table arranged by city cycle highway efficiency (ascending order) and engine displacement (descending order, note the order of the last two rows).

short_mpg |>
  arrange(cty, desc(displ))

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact

Do exercise 8.

You can arrange a table using base R via order() function that gives index of ordered elements and can be used inside of preserving subsetting via single brackets [] notation. You can control for ascending/descending of a specific variable using rev() function that is applied after ordering, so rev(order(...)).

short_mpg[order(short_mpg[["cty"]], rev(short_mpg[["displ"]])), ]

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact

Do exercise 9.

6.7 `mutate()` columns

In Tidyverse, mutate function allows you to both add new columns/variables to a table and change the existing ones. In essence, it is equivalent to a simple column assignment statement in base R.

# base R
short_mpg[["avg_mpg"]] <- (short_mpg[["cty"]] + short_mpg[["hwy"]]) / 2

# Tidyverse equivalent
short_mpg <- 
  short_mpg |>
  mutate(avg_mpg = (cty + hwy) / 2)

Note two critical differences. First, mutate() takes a table as an input and returns a table as an output. This is why you start with a table, pipe it to mutate, and assign the results back to the original variable. If you have more verbs/lines, it is the output of the last computation that is assigned to the variable on the left-hand side of assignment⁵³. Look at the listing below that indexes each line by when it is executed.

some_table <-    # 3. We assign the result to the original table, only once all the code below has been executed.
  some_table |>  # 1. We start here, with the original table and pipe to the next computation
  mutate(...)    # 2. We add/change columns inside of the table. The output is a table which we use for assignment all the way at the top.

Second, you are performing a computation inside the call of the mutate() function, so avg_mpg = (short_mpg$cty + short_mpg$hwy) / 2 is a parameter that you pass to it (yes, it does not look like one). This is why you use = rather than a normal assignment arrow <-. Unfortunately, you can use <- inside the mutate and the computation will work as intended but, for internal-processing reasons, the entire statement, rather than just the left-hand side, will be used as a column name. Thus, use <- outside and = inside of Tydiverse verbs.

short_mpg |>
  select(cty, hwy) |>
  mutate(avg_mpg =  (cty + hwy) / 2,      # column name will be avp_mpg
         avg_mpg <- (cty + hwy) / 2) |>  # column name will be `avg_mpg <- (short_mpg$cty + short_mpg$hwy) / 2`
  knitr::kable()

cty	hwy	avg_mpg	avg_mpg <- (cty + hwy)/2
18	29	23.5	23.5
21	29	25.0	25.0
20	31	25.5	25.5
21	30	25.5	25.5
16	26	21.0	21.0

As shown in the example above, you can perform several computations within a single mutate call and they are executed one after another, just as they would be when using base R.

Do exercise 10.

Finally, mutate has two cousin-verbs called transmute and add_column. The former — transmute — works the same way but discards all original columns that were not modified. You probably won’t use this verb all too often but I want you to be able to recognize it, as its name and function are very similar to mutate and the two are easy to confuse.

short_mpg |>
  transmute(avg_mpg = (cty + hwy) / 2) |>
  knitr::kable(align = "c")

avg_mpg
23.5
25.0
25.5
25.5
21.0

The latter — add_column — is similar to mutate if you need to add a new column rather than to modify a new one. Its advantage is that it will produce an error, if you try to overwrite an existing column. Its disadvantage is that it does not appear to respect data grouping (see below), which can be very confusing. In short, stick to mutate unless you need either of these two functions specifically.

6.8 `summarize()` table

This verb is used when you aggregate across all rows, reducing them to a single value. Some examples of aggregating functions that are probably already familiar to you are mean, median, standard deviation, min/max. However, you can “aggregate” by taking a first or a last value or even by putting in a constant. Important is that you should assign a single value to the column when using summarize.

If you use summarize on an ungrouped table (these are the only tables we’ve been working on so far), it keeps only the computed columns, which makes you wonder “what’s the point?”

mpg |>
  summarise(avg_cty = mean(cty),
            avg_hwy = mean(hwy))

## # A tibble: 1 × 2
##   avg_cty avg_hwy
##     <dbl>   <dbl>
## 1    16.9    23.4

6.9 Work on individual groups of rows

The real power of summarize and of mutate becomes evident when they are applied to the data that is grouped by certain criteria. group_by() verb groups rows of the table based on values of variables you specified. Behind the scenes, this turns your single table into set of tables, so that your Tidyverse verbs are applied to each table separately. This ability to parse your table into different groups of rows (all rows that belong to a particular participant, or participant and condition, or rows per block, or per trial), change that grouping on the fly, return back to the original full table, etc. makes analysis a breeze. Here is how we can compute average efficiency not across all cars (as in the code above) but for each car class separately.

mpg |>
  # there are seven different classes of cars, so group_by(cars)  will
  # create seven hidden independent tables and all verbs below will be 
  # applied to each table separately
  group_by(class)  |>
  
  # same mean computation but per table and we've got seven of them
  summarise(avg_cty = mean(cty),
            avg_hwy = mean(hwy),
            .groups = "drop") |>
  knitr::kable()

class	avg_cty	avg_hwy
2seater	15.40000	24.80000
compact	20.12766	28.29787
midsize	18.75610	27.29268
minivan	15.81818	22.36364
pickup	13.00000	16.87879
subcompact	20.37143	28.14286
suv	13.50000	18.12903

Note that we compute a single value per table but because we do it for seven tables, we get seven rows in our resultant table. And group_by makes it easy to group data in any way you want. Are you interested in manufacturers instead car classes? Easy!

mpg |>
  group_by(manufacturer)  |>
  # same mean computation but per table and we've got seven of them
  summarise(avg_cty = mean(cty),
            avg_hwy = mean(hwy),
            .groups = "drop") |>
  knitr::kable()

manufacturer	avg_cty	avg_hwy
audi	17.61111	26.44444
chevrolet	15.00000	21.89474
dodge	13.13514	17.94595
ford	14.00000	19.36000
honda	24.44444	32.55556
hyundai	18.64286	26.85714
jeep	13.50000	17.62500
land rover	11.50000	16.50000
lincoln	11.33333	17.00000
mercury	13.25000	18.00000
nissan	18.07692	24.61538
pontiac	17.00000	26.40000
subaru	19.28571	25.57143
toyota	18.52941	24.91176
volkswagen	20.92593	29.22222

How about efficiency per class and year? Still easy!

mpg |>
  group_by(class, year)  |>
  # same mean computation but per table and we've got seven of them
  summarise(avg_cty = mean(cty),
            avg_hwy = mean(hwy),
            .groups = "drop") |>
  knitr::kable()

class	year	avg_cty	avg_hwy
2seater	1999	15.50000	24.50000
2seater	2008	15.33333	25.00000
compact	1999	19.76000	27.92000
compact	2008	20.54545	28.72727
midsize	1999	18.15000	26.50000
midsize	2008	19.33333	28.04762
minivan	1999	16.16667	22.50000
minivan	2008	15.40000	22.20000
pickup	1999	13.00000	16.81250
pickup	2008	13.00000	16.94118
subcompact	1999	21.57895	29.00000
subcompact	2008	18.93750	27.12500
suv	1999	13.37931	17.55172
suv	2008	13.60606	18.63636

The .groups parameter of the summarize function determines whether grouping you use should be dropped (.groups = "drop", table become a single ungrouped table) or kept (.groups = "keep"). You will get a warning, if you do not specify .groups parameter, so it is a good idea to do this explicitly. In general, use .groups = "drop" as it is better to later regroup table again then work on a groupped table without realizing it (leads to some weird looking output and a tricky to chase problem). You can also explicitly drop grouping via ungroup() verb.

Finally, there is a cousin verb rowwise() that groups by row. I.e., every row of the table becomes its own group, which could be useful if you need to apply a computation per row and the usual mutate() approach does not work (however, this is something of an advanced topic, so it is more about recognizing it than using it).

Do exercise 11.

You can replicate the functionality of group_by + summarize in base R via aggregate() and group_by + mutate via by functions. However, they are somewhat less straightforward in use as they rely on functional programming (which you haven’t learned about yet) and require both grouping and summary function within a single call. Hence, we will skip on those.

6.10 Putting it all together

Now you have enough tools at your disposal to start programming a continuous analysis pipeline!

Do exercise 12.

6.11 Should I use Tidyverse?

As you saw above, whatever Tidyverse can do, base R can do as well. So why use a non-standard family of packages? If you are using each function in isolation, there is probably not much sense in this. Base R can do it equally well and each individual function is also compact and simple. However, if you need to chain your computation, which is almost always the case, Tidyverse’s ability to pipe the entire sequence of functions in a simple consistent and, therefore, easy to understand way is a game-changer. In the long run, pick your style. Either go “all in” with Tidyverse (that is my approach), stick to base R, or find some alternative package family (e.g., data.table). However, as far as the book is concerned, it will be almost exclusively Tydiverse from now on.

5 ggplot2: Grammar of Graphics

7 Working with Factors