6 Tidyverse: dplyr
Now that you understand vectors, tables, functions and pipes, and you know what our end goal is (a tidy table), we can start with data wrangling and Tidyverse way of doing it. All functions discussed below are part of dplyr46 “grammar of data manipulation” package. Grab the exercise notebook!
6.1 Tidyverse philosophy
Data analysis is different from “normal” programming as it mostly involves a series of sequential operations on the same table. You might load the table, transform some variables, filter data, select smaller subset of columns, aggregate by summarizing across different groups of variables before plotting it or formally analyzing it via statistical tests. Tidyverse is built around this serial nature of data analysis of piping a table through a chain of functions. Accordingly, Tidyverse functions take a table (data.frame
or tibble
) as their first parameter, which makes piping simpler, and return a modified table as an output. This table-in → table-out consistency makes it easy to pipe these operations one after another. For me, it helps to think about Tidyverse functions as verbs: Actions that I perform on the table at each step.
Here is quick teaser of how such sequential piping works. Below, we will examine each verb/function separately and I will also show you how same operations can be carried out using base R. Note that I put each verb/function on a separate line. I don’t need to do this, but it makes it easier to understand how many different operations you perform (number of lines), how complex they are (how long individuals lines of code are), and makes them easy to read line-by-line. Note that even though we have many lines, it is fairly easy to follow the entire code sequence.
# miles-per-gallon to kilometers-per-liter conversion factor
mpg2kpl <- 2.82481061
mpg |>
# we filter the table by rows,
# only keeping rows for which year is 2008
filter(year == 2008) |>
# we change cty and hwy columns by turning
# miles/gallon into liters/kilometer
mutate(cty_kpl = cty / mpg2kpl,
hwy_kpl = hwy / mpg2kpl) |>
# we create a new column by computing an
# average efficiency as mean between city and highway cycles
mutate(avg_kpl = (cty_kpl + hwy_kpl) / 2) |>
# we convert kilometers-per-liter to liters for 100 KM
mutate(avg_for_100 = 100 / avg_kpl) |>
# we reduce the table to only two columns
# class (of car) and avg_mpg
select(class, avg_for_100) |>
# we group by each class of car
# and compute average efficiency for each group (class of car)
group_by(class) |>
summarise(class_avg = mean(avg_for_100), .groups = "keep") |>
# we round the value to just one digit after the decimal point
mutate(class_avg = round(class_avg, 1)) |>
# we sort table rows to go from best to worst on efficiency
arrange(class_avg) |>
# we rename the class_avg_lpk to have a more meaningul name
rename("Average liters per 100 KM" = class_avg) |>
# we kable (Knit the tABLE) to make it look nicer in the document
knitr::kable()
class | Average liters per 100 KM |
---|---|
compact | 11.6 |
midsize | 12.0 |
subcompact | 12.6 |
2seater | 14.0 |
minivan | 15.3 |
suv | 18.1 |
pickup | 19.5 |
6.2 select()
columns by name
Select verb allows you to select/pick columns in a table using their names. This is very similar to using columns names as indexes for tables that you have learned in seminar 3.
First, let us make a shorter version of mpg
table by keeping only the first five rows. Note that you can also pick first N rows via head() function or slide_head() function from dplyr itself .
short_mpg <- mpg[1:5, ]
# same "first five rows" but via head() function
short_mpg <- head(mpg, 5)
# same "first five rows" but via dplyr::slice_head() function
short_mpg <- slice_head(mpg, n = 5)
knitr::kable(short_mpg)
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
Here is how you can select only model
and cty
columns via preserving []
subsetting
short_mpg[, c("model", "cty")]
model | cty |
---|---|
a4 | 18 |
a4 | 21 |
a4 | 20 |
a4 | 21 |
a4 | 16 |
And here is how it is done via select()
.
short_mpg |>
select(model, cty)
model | cty |
---|---|
a4 | 18 |
a4 | 21 |
a4 | 20 |
a4 | 21 |
a4 | 16 |
The idea of Tidyverse functions is to adopt to you, so you can use quotes or pass a vector of strings with column names. All calls below produce the same effect, so pick the style you prefer (mine, is in the code above) and stick to it47.
short_mpg |>
select(c("model", "cty"))
short_mpg |>
select("model", "cty")
short_mpg |>
select(c(model, cty))
As you surely remember, you can use negation to select other indexes within a vector (c(4, 5, 6)[-2]
gives you [4, 6]
). For the single brackets []
this mechanism does not work with column names (only with their indexes). However, select
has you covered, so we can select everything but cty
and model
short_mpg |>
select(-cty, -model)
manufacturer | displ | year | cyl | trans | drv | hwy | fl | class |
---|---|---|---|---|---|---|---|---|
audi | 1.8 | 1999 | 4 | auto(l5) | f | 29 | p | compact |
audi | 1.8 | 1999 | 4 | manual(m5) | f | 29 | p | compact |
audi | 2.0 | 2008 | 4 | manual(m6) | f | 31 | p | compact |
audi | 2.0 | 2008 | 4 | auto(av) | f | 30 | p | compact |
audi | 2.8 | 1999 | 6 | auto(l5) | f | 26 | p | compact |
In the current version of dplyr
, you can do the same negation via !
(a logical not
operator, you will meet later), moreover, it is now a recommended way of writing the selection48. The -
and !
are not synonyms and the difference is subtle but important, see below.
# This will NOT produce the same result as above
# Note that the model column is still in the table
short_mpg |>
select(!cty, !model)
manufacturer | model | displ | year | cyl | trans | drv | hwy | fl | class | cty |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 29 | p | compact | 18 |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 29 | p | compact | 21 |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 31 | p | compact | 20 |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 30 | p | compact | 21 |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 26 | p | compact | 16 |
However, if you stick to putting all column names into a vector, as with the direct selection above, you can use negation with names as strings, you can negate a vector of names, etc. Again, it is mostly a matter of taste with consistency being more important than a specific choice you make.
# will produce the same result as for "-"
short_mpg |>
select(!c("cty", "model"))
short_mpg |>
select(!"cty", !"model")
short_mpg |>
select(!c(cty, model))
Unlike vector indexing that forbids mixing positive and negative indexing, select
does allow it. However, do not use it49 because results can be fairly counter-intuitive and, on top of that, -
and !
work somewhat differently. Note the difference between !
and -
: In the former case only the !model
part appears to have the effect, whereas in case of -
only cty
works.
short_mpg |>
select(cty, !model)
cty | manufacturer | displ | year | cyl | trans | drv | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|
18 | audi | 1.8 | 1999 | 4 | auto(l5) | f | 29 | p | compact |
21 | audi | 1.8 | 1999 | 4 | manual(m5) | f | 29 | p | compact |
20 | audi | 2.0 | 2008 | 4 | manual(m6) | f | 31 | p | compact |
21 | audi | 2.0 | 2008 | 4 | auto(av) | f | 30 | p | compact |
16 | audi | 2.8 | 1999 | 6 | auto(l5) | f | 26 | p | compact |
short_mpg |>
select(cty, -model)
cty |
---|
18 |
21 |
20 |
21 |
16 |
To make things even, worse select(-model, cty)
work the same way as select(cty, !model)
(sigh…)
short_mpg |>
select(-model, cty)
manufacturer | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|
audi | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
So, bottom line, do not mix positive and negative indexing in select
! I am showing you this only to signal the potential danger.
Do exercise 1.
Simple names and their negation will be sufficient for most of your projects. However, I would recommend taking a look at the official manual just to see that select
offers a lot of flexibility (selecting range of columns, by column type, by partial name matching, etc), something that might be useful for you in your work.
6.3 Conditions
Before we can work with the next verb, you need to understand conditions. Conditions are statements about values that are either TRUE
or FALSE
. In the simplest case, you can check whether two values (one in a variable and one hard-coded) are equal via ==
operator
x <- 5
print(x == 5)
## [1] TRUE
print(x == 3)
## [1] FALSE
For numeric values, you can use all usual comparison operators including not equal !=
, less than <
, greater than >
, less than or equal to <=
(note the order of symbols!), and greater than or equal to >=
(again, note the order of symbols).
Do exercise 2.
You can negate a statement via not !
symbol as !TRUE
is FALSE
and vice versa. However, note that round brackets in the examples below! They are critical to express the order of computation. Anything inside the brackets is evaluated first. And if you have brackets inside the brackets, similar to nested functions, it is the innermost expression that get evaluated first. In the example below, x==5
is evaluated first and logical inversion happens only after it. In this particular example, you may not need them but I would suggest using them to ensure clarity.
x <- 5
print(!(x == 5))
## [1] FALSE
print(!(x == 3))
## [1] TRUE
Do exercise 3.
You can also combine several conditions using and &
and or |
operators50. Again, note round brackets that explicitly define what is evaluated first.
x <- 5
y <- 2
# x is not equal to 5 OR y is equal to 1
print((x != 5) | (y == 1))
## [1] FALSE
# x less than 10 AND y is greater than or equal to 1
print((x < 10) & (y >= 1))
## [1] TRUE
Do exercise 4.
All examples above used scalars but you remember that everything is a vector, including values that we used (they are just vectors of length one). Accordingly, same logic works for vectors of arbitrary length with comparisons working element-wise, so you get a vector of the same length with TRUE
or FALSE
values for each pairwise comparison.
Do exercise 5.
6.4 Logical indexing
In the second seminar, you learned about vector indexing when you access some elements of a vector by specifying their index. There is an alternative way, called logical indexing. Here, you supply a vector of equal length with logical values and you get elements of the original vector whenever the logical value is TRUE
x <- 1:5
x[c(TRUE, TRUE, FALSE, TRUE, FALSE)]
## [1] 1 2 4
This is particularly useful, if you are interested in elements that satisfy certain condition. For example, you want all negative values and you can use condition x<5
that will produce a vector of logical values that, in turn, can be used as index
x <- c(-2, 5, 3, -5, -1)
x[x < 0]
## [1] -2 -5 -1
You can have conditions of any complexity by combining them via and &
and or |
operators. For example, if you want number below -1 or above 3 (be careful to have space between <
and -
, otherwise it will be interpreted as assignment <-
).
x <- c(-2, 5, 3, -5, -1)
x[(x < -1) | (x > 3)]
## [1] -2 5 -5
Do exercise 6.
Sometimes you may want to know the actual index of elements for which some condition is TRUE
. Function which() does exactly that.
## [1] 1 2 4
6.5 filter()
rows by values
Now that you understand conditions and logical indexing, using filter() is very straightforward: You simply put condition that describes rows that you want to retain inside the filter()
call. For example, we can look at efficiency only for two-seater cars.
mpg |>
filter(class == "2seater")
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
chevrolet | corvette | 5.7 | 1999 | 8 | manual(m6) | r | 16 | 26 | p | 2seater |
chevrolet | corvette | 5.7 | 1999 | 8 | auto(l4) | r | 15 | 23 | p | 2seater |
chevrolet | corvette | 6.2 | 2008 | 8 | manual(m6) | r | 16 | 26 | p | 2seater |
chevrolet | corvette | 6.2 | 2008 | 8 | auto(s6) | r | 15 | 25 | p | 2seater |
chevrolet | corvette | 7.0 | 2008 | 8 | manual(m6) | r | 15 | 24 | p | 2seater |
You can use information from any row, so we can look for midsize cars with four-wheel drive.
mpg |>
filter(class == "midsize" & drv == "4")
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a6 quattro | 2.8 | 1999 | 6 | auto(l5) | 4 | 15 | 24 | p | midsize |
audi | a6 quattro | 3.1 | 2008 | 6 | auto(s6) | 4 | 17 | 25 | p | midsize |
audi | a6 quattro | 4.2 | 2008 | 8 | auto(s6) | 4 | 16 | 23 | p | midsize |
Do exercise 7.
Note that you can emulate filter()
in a very straightforward way using single-brackets base R, the main difference is that you need to prefix every column with the table name, so mpg[["class"]]
instead of just class
51.
mpg[mpg[["class"]] == "midsize" & mpg[["drv"]] == "4", ]
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a6 quattro | 2.8 | 1999 | 6 | auto(l5) | 4 | 15 | 24 | p | midsize |
audi | a6 quattro | 3.1 | 2008 | 6 | auto(s6) | 4 | 17 | 25 | p | midsize |
audi | a6 quattro | 4.2 | 2008 | 8 | auto(s6) | 4 | 16 | 23 | p | midsize |
So why use filter()
then? In isolation, as a single line computation, both options are equally compact and clear (apart from all the extra table[["..."]]
in base R). But pipe-oriented nature of the filter()
makes it more suitable for chains of computations, which is the main advantage of Tidyverse.
6.6 arrange()
rows in a particular order
Sometimes you might need to sort your table so that rows go in a particular order52. In Tidyverse, you arrange rows based on values of specific variables. This verb is very straightforward, you simply list all variables, which must be used for sorting, in the order the sorting must be carried out. I.e., first the table is sorted based on values of the first variable. Then, for equal values of that variable, rows are sorted based on the second variable, etc. By default, rows are arranged in ascending order but you can reverse it by putting a variable inside of desc() function. Here is the short_mpg
table arranged by city cycle highway efficiency (ascending order) and engine displacement (descending order, note the order of the last two rows).
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
Do exercise 8.
You can arrange a table using base R via order() function that gives index of ordered elements and can be used inside of preserving subsetting via single brackets []
notation. You can control for ascending/descending of a specific variable using rev() function that is applied after ordering, so rev(order(...))
.
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
Do exercise 9.
6.7 mutate()
columns
In Tidyverse, mutate function allows you to both add new columns/variables to a table and change the existing ones. In essence, it is equivalent to a simple column assignment statement in base R.
# base R
short_mpg[["avg_mpg"]] <- (short_mpg[["cty"]] + short_mpg[["hwy"]]) / 2
# Tidyverse equivalent
short_mpg <-
short_mpg |>
mutate(avg_mpg = (cty + hwy) / 2)
Note two critical differences. First, mutate()
takes a table as an input and returns a table as an output. This is why you start with a table, pipe it to mutate, and assign the results back to the original variable. If you have more verbs/lines, it is the output of the last computation that is assigned to the variable on the left-hand side of assignment53. Look at the listing below that indexes each line by when it is executed.
some_table <- # 3. We assign the result to the original table, only once all the code below has been executed.
some_table |> # 1. We start here, with the original table and pipe to the next computation
mutate(...) # 2. We add/change columns inside of the table. The output is a table which we use for assignment all the way at the top.
Second, you are performing a computation inside the call of the mutate()
function, so avg_mpg = (short_mpg$cty + short_mpg$hwy) / 2
is a parameter that you pass to it (yes, it does not look like one). This is why you use =
rather than a normal assignment arrow <-
. Unfortunately, you can use <-
inside the mutate
and the computation will work as intended but, for internal-processing reasons, the entire statement, rather than just the left-hand side, will be used as a column name. Thus, use <-
outside and =
inside of Tydiverse verbs.
short_mpg |>
select(cty, hwy) |>
mutate(avg_mpg = (cty + hwy) / 2, # column name will be avp_mpg
avg_mpg <- (cty + hwy) / 2) |> # column name will be `avg_mpg <- (short_mpg$cty + short_mpg$hwy) / 2`
knitr::kable()
cty | hwy | avg_mpg | avg_mpg <- (cty + hwy)/2 |
---|---|---|---|
18 | 29 | 23.5 | 23.5 |
21 | 29 | 25.0 | 25.0 |
20 | 31 | 25.5 | 25.5 |
21 | 30 | 25.5 | 25.5 |
16 | 26 | 21.0 | 21.0 |
As shown in the example above, you can perform several computations within a single mutate
call and they are executed one after another, just as they would be when using base R.
Do exercise 10.
Finally, mutate
has two cousin-verbs called transmute and add_column. The former — transmute
— works the same way but discards all original columns that were not modified. You probably won’t use this verb all too often but I want you to be able to recognize it, as its name and function are very similar to mutate
and the two are easy to confuse.
avg_mpg |
---|
23.5 |
25.0 |
25.5 |
25.5 |
21.0 |
The latter — add_column
— is similar to mutate
if you need to add a new column rather than to modify a new one. Its advantage is that it will produce an error, if you try to overwrite an existing column. Its disadvantage is that it does not appear to respect data grouping (see below), which can be very confusing. In short, stick to mutate
unless you need either of these two functions specifically.
6.8 summarize()
table
This verb is used when you aggregate across all rows, reducing them to a single value. Some examples of aggregating functions that are probably already familiar to you are mean, median, standard deviation, min/max. However, you can “aggregate” by taking a first or a last value or even by putting in a constant. Important is that you should assign a single value to the column when using summarize
.
If you use summarize
on an ungrouped table (these are the only tables we’ve been working on so far), it keeps only the computed columns, which makes you wonder “what’s the point?”
## # A tibble: 1 × 2
## avg_cty avg_hwy
## <dbl> <dbl>
## 1 16.9 23.4
6.9 Work on individual groups of rows
The real power of summarize
and of mutate
becomes evident when they are applied to the data that is grouped by certain criteria. group_by() verb groups rows of the table based on values of variables you specified. Behind the scenes, this turns your single table into set of tables, so that your Tidyverse verbs are applied to each table separately. This ability to parse your table into different groups of rows (all rows that belong to a particular participant, or participant and condition, or rows per block, or per trial), change that grouping on the fly, return back to the original full table, etc. makes analysis a breeze. Here is how we can compute average efficiency not across all cars (as in the code above) but for each car class separately.
mpg |>
# there are seven different classes of cars, so group_by(cars) will
# create seven hidden independent tables and all verbs below will be
# applied to each table separately
group_by(class) |>
# same mean computation but per table and we've got seven of them
summarise(avg_cty = mean(cty),
avg_hwy = mean(hwy),
.groups = "drop") |>
knitr::kable()
class | avg_cty | avg_hwy |
---|---|---|
2seater | 15.40000 | 24.80000 |
compact | 20.12766 | 28.29787 |
midsize | 18.75610 | 27.29268 |
minivan | 15.81818 | 22.36364 |
pickup | 13.00000 | 16.87879 |
subcompact | 20.37143 | 28.14286 |
suv | 13.50000 | 18.12903 |
Note that we compute a single value per table but because we do it for seven tables, we get seven rows in our resultant table. And group_by
makes it easy to group data in any way you want. Are you interested in manufacturers instead car classes? Easy!
mpg |>
group_by(manufacturer) |>
# same mean computation but per table and we've got seven of them
summarise(avg_cty = mean(cty),
avg_hwy = mean(hwy),
.groups = "drop") |>
knitr::kable()
manufacturer | avg_cty | avg_hwy |
---|---|---|
audi | 17.61111 | 26.44444 |
chevrolet | 15.00000 | 21.89474 |
dodge | 13.13514 | 17.94595 |
ford | 14.00000 | 19.36000 |
honda | 24.44444 | 32.55556 |
hyundai | 18.64286 | 26.85714 |
jeep | 13.50000 | 17.62500 |
land rover | 11.50000 | 16.50000 |
lincoln | 11.33333 | 17.00000 |
mercury | 13.25000 | 18.00000 |
nissan | 18.07692 | 24.61538 |
pontiac | 17.00000 | 26.40000 |
subaru | 19.28571 | 25.57143 |
toyota | 18.52941 | 24.91176 |
volkswagen | 20.92593 | 29.22222 |
How about efficiency per class and year? Still easy!
mpg |>
group_by(class, year) |>
# same mean computation but per table and we've got seven of them
summarise(avg_cty = mean(cty),
avg_hwy = mean(hwy),
.groups = "drop") |>
knitr::kable()
class | year | avg_cty | avg_hwy |
---|---|---|---|
2seater | 1999 | 15.50000 | 24.50000 |
2seater | 2008 | 15.33333 | 25.00000 |
compact | 1999 | 19.76000 | 27.92000 |
compact | 2008 | 20.54545 | 28.72727 |
midsize | 1999 | 18.15000 | 26.50000 |
midsize | 2008 | 19.33333 | 28.04762 |
minivan | 1999 | 16.16667 | 22.50000 |
minivan | 2008 | 15.40000 | 22.20000 |
pickup | 1999 | 13.00000 | 16.81250 |
pickup | 2008 | 13.00000 | 16.94118 |
subcompact | 1999 | 21.57895 | 29.00000 |
subcompact | 2008 | 18.93750 | 27.12500 |
suv | 1999 | 13.37931 | 17.55172 |
suv | 2008 | 13.60606 | 18.63636 |
The .groups
parameter of the summarize function determines whether grouping you use should be dropped (.groups = "drop"
, table become a single ungrouped table) or kept (.groups = "keep"
). You will get a warning, if you do not specify .groups
parameter, so it is a good idea to do this explicitly. In general, use .groups = "drop"
as it is better to later regroup table again then work on a groupped table without realizing it (leads to some weird looking output and a tricky to chase problem). You can also explicitly drop grouping via ungroup() verb.
Finally, there is a cousin verb rowwise() that groups by row. I.e., every row of the table becomes its own group, which could be useful if you need to apply a computation per row and the usual mutate() approach does not work (however, this is something of an advanced topic, so it is more about recognizing it than using it).
Do exercise 11.
You can replicate the functionality of group_by
+ summarize
in base R via aggregate() and group_by
+ mutate
via by functions. However, they are somewhat less straightforward in use as they rely on functional programming (which you haven’t learned about yet) and require both grouping and summary function within a single call. Hence, we will skip on those.
6.10 Putting it all together
Now you have enough tools at your disposal to start programming a continuous analysis pipeline!
Do exercise 12.
6.11 Should I use Tidyverse?
As you saw above, whatever Tidyverse can do, base R can do as well. So why use a non-standard family of packages? If you are using each function in isolation, there is probably not much sense in this. Base R can do it equally well and each individual function is also compact and simple. However, if you need to chain your computation, which is almost always the case, Tidyverse’s ability to pipe the entire sequence of functions in a simple consistent and, therefore, easy to understand way is a game-changer. In the long run, pick your style. Either go “all in” with Tidyverse (that is my approach), stick to base R, or find some alternative package family (e.g., data.table). However, as far as the book is concerned, it will be almost exclusively Tydiverse from now on.