8 Tidying your data: joins and pivots

It is fun to work with tidy, complete data. Unfortunately, more often than not, you will need to preprocess and tidy your data before you can analyze it. In addition to dplyr, Tidyverse includes the tidyr package, which helps with some of these problems. Grab the exercise notebook and let’s get started.

8.1 Joining tables

Sometimes, the results of a study are stored in separate tables. For example, demographic information can be stored separately from responses, as the former is the same for all trials and conditions, so it makes little sense to duplicate it. For the actual analysis, you may need to add this information back by merging or, in SQL/Tidyverse-speak, joining the two tables (I’ll use the term “join” from here on).

Examples make joining operations much easier to understand, so let us create two tables.

demographics <- tibble(ID = c("A", "B", "C", "D"),
                       Age = c(20, 19, 30, 22))
reports <- 
  tibble(ID = c("A", "B", "C", "D", "A", "B", "C", "D")) |>
  mutate(Report = round(runif(length(ID), 1, 7))) |>
  arrange(ID)
Table 8.1: demographics (left) and reports (right) tables
ID Age
A 20
B 19
C 30
D 22
ID Report
A 5
A 7
B 6
B 4
C 6
C 5
D 7
D 1

To join two tables, you need “key” columns: columns whose values are used to identify matching rows in the two tables. You can use multiple, differently named columns for this purpose, but we will start with the simplest case: a single column with the same name, "ID", in both tables. A join function (inner_join() here; see below for details on different joins) takes the first row of the demographics table, which has "A" in the ID column, and looks for all rows in the reports table that have ID == "A". Then, it does the same for all remaining rows of the demographics table, one row at a time. The function takes three parameters: table x (the first table), table y (the second table), and by, a vector with key column names57. A call to the merge() function is very similar and produces identical results. Note how the column order has changed for merge(), because I used a different order of tables, but the content itself is the same.

via_join <- inner_join(demographics, reports, by = "ID")
via_merge <- merge(reports, demographics, by = "ID")
Table 8.2: Joining/merging tables with fully matching keys via join (left) and merge (right). Note how a different order of tables results in a different order of columns.
ID Age Report
A 20 5
A 20 7
B 19 6
B 19 4
C 30 6
C 30 5
D 22 7
D 22 1
ID Report Age
A 5 20
A 7 20
B 6 19
B 4 19
C 6 30
C 5 30
D 7 22
D 1 22

Things are easy when every key in the first table has its counterpart in the second one, and vice versa. Things get tricky if that is not the case, which is why there are four different ways to join tables (note that they all produce identical results for a fully matching set of keys, as in the example above):

  • inner join: uses only key values that are present in both tables.
  • full join: uses all key values from both tables; if matching keys are missing in one of the tables, rows are filled with NA.
  • left join: uses only key values that are present in the left (first) table; if matching keys are missing in the right table, rows are filled with NA.
  • right join: mirror twin of the left join, uses only key values that are present in the right (second) table; if matching keys are missing in the left table, rows are filled with NA.

To see each join in action, let us slightly modify the reports table to include ID “E” instead of “D”. Now, “D” is missing in the reports table but “E” is missing in demographics:

reports <- 
  tibble(ID = c("A", "B", "C", "E", "A", "B", "C", "E")) |>
  mutate(Report = runif(length(ID), 1, 7)) |>
  arrange(ID)
Table 8.3: Now the demographics (left) and reports (right) tables have unique keys (“D” for demographics, “E” for reports) without a match in the other table.
ID Age
A 20
B 19
C 30
D 22
ID Report
A 5.049373
A 2.364149
B 5.406722
B 1.712982
C 4.460129
C 2.055873
E 5.500388
E 2.932832

Inner join is the most conservative and excludes any non-matching keys; note that rows for both ID == "D" and ID == "E" are missing. This is the default behavior for the merge() function.

inner_tidy <- inner_join(reports, demographics, by = "ID") 
inner_base <- merge(reports, demographics, by = "ID")
Table 8.4: Inner join. Only rows for matching keys are merged
ID Report Age
A 5.049373 20
A 2.364149 20
B 5.406722 19
B 1.712982 19
C 4.460129 30
C 2.055873 30

In contrast, full join is the most liberal, as it includes all rows from both tables, filling in missing values with NA (e.g., see Report for ID == "D" and Age for ID == "E"). In the base R merge() function, you turn an inner join into a full one via all = TRUE.

full_tidy <- full_join(demographics, reports, by = "ID") 
full_base <- merge(demographics, reports, by = "ID", all = TRUE)
Table 8.5: Full join. All rows are merged and NA is used for missing values from the complementary table
ID Age Report
A 20 5.049373
A 20 2.364149
B 19 5.406722
B 19 1.712982
C 30 4.460129
C 30 2.055873
D 22 NA
E NA 5.500388
E NA 2.932832

Left join uses only rows from the left (first) table, dropping extra rows from the second one. Note the NA in the Report column for ID == "D" and the absence of rows for ID == "E". To do a left join via merge(), you need to specify all.x = TRUE.

left_tidy <- left_join(demographics, reports, by = "ID")
left_base <- merge(demographics, reports, by = "ID", all.x = TRUE)
Table 8.6: Left join. All rows from the demographics table are used and missing matching rows are filled with NA
ID Age Report
A 20 5.049373
A 20 2.364149
B 19 5.406722
B 19 1.712982
C 30 4.460129
C 30 2.055873
D 22 NA

Right join is the mirror twin of the left join, so now rows for ID == "D" are missing and there are missing values for ID == "E". You include all.y = TRUE for a right join via merge().

right_tidy <- right_join(demographics, reports, by = "ID") 
right_base <- merge(demographics, reports, by = "ID", all.y = TRUE)
Table 8.7: Right join. All rows from the reports table are used and missing matching rows are filled with NA
ID Age Report
A 20 5.049373
A 20 2.364149
B 19 5.406722
B 19 1.712982
C 30 4.460129
C 30 2.055873
E NA 5.500388
E NA 2.932832

As noted above, you can also use more than one key.

demographics <- tibble(ID = c("A", "B", "A", "B"),
                       Gender = c("M", "F", "F", "M"),
                       Age = c(20, 19, 30, 22))
reports <- tibble(ID = c("A", "B", "A", "B"),
                  Gender = c("M", "F", "F", "M"),
                  Report = runif(length(ID), 1, 7))
Table 8.8: Two identically named key columns: ID and Gender.
ID Gender Age
A M 20
B F 19
A F 30
B M 22
ID Gender Report
A M 5.501210
B F 2.377711
A F 3.543415
B M 4.912266
inner_multi_tidy <- inner_join(demographics, reports, by = c("ID", "Gender"))
inner_multi_base <- merge(demographics, reports, by = c("ID", "Gender"))
Table 8.9: Joining/merging two tables by ID and Gender.
ID Gender Age Report
A M 20 5.501210
B F 19 2.377711
A F 30 3.543415
B M 22 4.912266

Finally, key columns can be named differently in the two tables. In this case, you need to “match” them explicitly. For dplyr joins, you use a named vector to match pairs of individual columns. For merge(), you supply two vectors: one for columns in the first table (parameter by.x) and one for columns in the second one (parameter by.y). Here, you need to be careful and check that the column order matches in both parameters.

demographics <- tibble(VPCode = c("A", "B", "A", "B"),
                       Sex = c("M", "F", "F", "M"),
                       Age = c(20, 19, 30, 22))

reports <- tibble(ID = c("A", "B", "A", "B"),
                  Gender = c("M", "F", "F", "M"),
                  Report = runif(length(ID), 1, 7))
Table 8.10: Differently named key columns. VPCode in the demographics table corresponds to ID in reports; Sex in demographics corresponds to Gender in reports
VPCode Sex Age
A M 20
B F 19
A F 30
B M 22
ID Gender Report
A M 1.475147
B F 3.427061
A F 2.174806
B M 4.177253
inner_diff_tidy <- inner_join(demographics, reports, by=c("VPCode"="ID", "Sex"="Gender"))
inner_diff_base <- merge(demographics, reports, by.x=c("VPCode", "Sex"), by.y=c("ID", "Gender"))
Table 8.11: Joining tables by matching VPCode to ID and Sex to Gender.
VPCode Sex Age Report
A F 30 2.174806
A M 20 1.475147
B F 19 3.427061
B M 22 4.177253

As you saw in the examples above, dplyr joins and merge() produce identical results. However, I would recommend using the former, simply because the function names make it explicit which kind of join you perform (something you can figure out for merge() only by checking its additional parameters).
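
Note that newer versions of dplyr (1.1.0 and later) also provide the join_by() helper, which makes the key specification even more explicit. A minimal sketch, reproducing the differently-named-keys join from Table 8.11:

# same join as inner_diff_tidy above, but via join_by()
# (requires dplyr >= 1.1.0)
inner_diff_tidy2 <- inner_join(demographics, reports,
                               by = join_by(VPCode == ID, Sex == Gender))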

Download files IM.csv and GP.csv that you will need for exercise #1. These are participants’ responses to two questionnaires, with each participant identified by their ID (Participant in IM.csv and Respondent in GP.csv), Condition (which experimental group they belong to), and their Gender. Read both tables and join them so that there are no missing values in the table (some participants are missing in GP.csv, so there are three joins that can do this; which one will you pick?). Then, turn Condition and Gender into factors, so that Condition has levels "game" (1) and "control" (2) and Gender has levels "female" (1) and "male" (2). Your final table should look as follows (I’ve dropped most of the columns from IM.csv, so that the table fits the page):

Participant Condition Gender IM01_01 IM01_02 IM01_03 GP01_01 GP02_01 GP02_02 GP02_03 GP02_04 GP02_05 GP02_06 GP02_07 GP02_08 GP02_09 GP03_01
wi1Kv game female 3 2 5 1 1 1 1 1 6 7 2 1 1 1
9QQjf control female 3 4 4 1 4 4 6 4 6 4 4 4 4 4
qcp03 control female 2 3 6 1 1 1 1 1 4 4 5 4 1 1
eV1RY control female 2 5 6 1 1 5 1 5 7 7 1 7 1 3
JymeM control female 2 1 7 1 1 1 5 1 5 1 3 5 1 3
hxrts game female 1 3 3 1 1 3 5 1 7 5 3 4 4 2
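
As a reminder from the previous seminar, the factor conversion might look something like this (a sketch; I am assuming that the joined table is called joined and that the columns are coded with the numbers given in parentheses above):

joined <- joined |>
  mutate(Condition = factor(Condition, levels = c(1, 2), labels = c("game", "control")),
         Gender = factor(Gender, levels = c(1, 2), labels = c("female", "male")))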

Do exercise 1.

Repeat the same exercise but use merge().

Do exercise 2.

Now let us practice joining and simulating data as well. Create two tables that need to be joined by a single key column. In the first table, use the rnorm() function to generate normally distributed data with a mean of 175 and a standard deviation of 7.6 for column Height (what range would you expect to cover 95% of the data?). In the second table, use the same normal distribution but with a mean of 80 and a standard deviation of 10 for column Weight. When filling in the key column for both tables, do it so that inner and right joins would give the same final table but left and full joins would give you a longer one (test this explicitly!). After joining the two tables, plot Height against Weight and superimpose a linear regression fit. Are the two columns correlated? Should they be, given how we generated the data?
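
If you are unsure where to start, here is a sketch of the first table (the number of rows and the ID scheme are my assumptions, not part of the exercise):

# ~95% of normally distributed values fall within mean ± 1.96 * SD,
# so Height should mostly lie between roughly 160 and 190
height_df <- tibble(ID = 1:20,
                    Height = rnorm(20, mean = 175, sd = 7.6))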

Do exercise 3.

8.2 Pivoting

Recall the idea of tidy data:

  • variables are in columns,
  • observations are in rows,
  • values are in cells.

And also recall that, quite often, data is stored in a wide format that is easier for humans to read.

Table 8.12: Wide non-tidy table.
Participant Face Symmetry Attractiveness Trustworthiness
1 M-1 6 4 3
1 M-2 4 7 6
2 M-1 5 2 1
2 M-2 3 7 2

Here, Symmetry, Attractiveness, and Trustworthiness are different face properties that participants rated, and the values are the responses they gave. You can work with a table like this, but it is often more convenient to have a column Scale that codes which face property participants responded on and a column Response that holds the values. Then, you can easily split or group your data by property while performing the same analysis on all of them.

You can do pivoting using the base R reshape() function. But pivoting is a fairly confusing business, so Tidyverse alone offers three different solutions, starting with the reshape2 package58 and continuing from the original gather() and spread() functions in tidyr to the modern pivot_longer() and pivot_wider() functions in the same package.
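
For reference, the now-superseded gather() call looks quite similar to pivot_longer() (a sketch, assuming the wide table from Table 8.12 is stored in widish_df, as in the pivot_longer() example below; key and value correspond to names_to and values_to):

# superseded tidyr API: key/value instead of names_to/values_to
long_via_gather <- tidyr::gather(widish_df,
                                 key = "Scale",
                                 value = "Response",
                                 Symmetry, Attractiveness, Trustworthiness)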

8.3 Pivot longer

pivot_longer() takes a table, which you can pipe into it Tidyverse-style, and a vector of names of the columns that need to be transformed. All column names go into one new column and all the values go into another new column. The default names of these two columns are "name" and "value", but you can specify something more suitable via the names_to and values_to parameters, respectively.

There are many more bells and whistles (name and value transformations, removing a prefix via regular expressions, etc.), so I recommend looking at the manual and the vignette. However, in most cases these four parameters will be all you need, so let us see pivot_longer() in action.

I assume that the table presented above is stored in the widish_df table. The columns that we want to transform are Symmetry, Attractiveness, and Trustworthiness. Thus, the simplest call with all defaults is

long_tidy <- tidyr::pivot_longer(widish_df, 
                                cols=c("Symmetry", "Attractiveness", "Trustworthiness"))
Participant Face name value
1 M-1 Symmetry 6
1 M-1 Attractiveness 4
1 M-1 Trustworthiness 3
1 M-2 Symmetry 4
1 M-2 Attractiveness 7
1 M-2 Trustworthiness 6
2 M-1 Symmetry 5
2 M-1 Attractiveness 2
2 M-1 Trustworthiness 1
2 M-2 Symmetry 3
2 M-2 Attractiveness 7
2 M-2 Trustworthiness 2

When you compare the two tables, you will see that the original three columns × four rows are now stretched into twelve rows and that the name-value pairs are consistent across the two tables59. As noted above, we can improve on this by specifying proper names for the new columns.

long_tidy <- tidyr::pivot_longer(widish_df, 
                               cols=c("Symmetry", "Attractiveness", "Trustworthiness"),
                               names_to = "Scale",
                               values_to = "Response")
Participant Face Scale Response
1 M-1 Symmetry 6
1 M-1 Attractiveness 4
1 M-1 Trustworthiness 3
1 M-2 Symmetry 4
1 M-2 Attractiveness 7
1 M-2 Trustworthiness 6
2 M-1 Symmetry 5
2 M-1 Attractiveness 2
2 M-1 Trustworthiness 1
2 M-2 Symmetry 3
2 M-2 Attractiveness 7
2 M-2 Trustworthiness 2
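
This long format pays off immediately: for example, computing an average response per scale is now a single grouped summary (a minimal sketch):

long_tidy |>
  group_by(Scale) |>
  summarise(Average = mean(Response))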

If you want to stick to base R, you can pivot longer via the reshape() function, which can pivot both to longer and to wider tables. It is more flexible and, therefore, much more confusing (at least for me). Here are the parameters that we need to specify in order to emulate the pivot_longer() call above:

  • direction: either "long" (here) or "wide".
  • idvar: names of variables that identify multiple records in the long format. In contrast, pivot_longer() assumes that all columns that you do not transform are identity columns.
  • varying: names of columns that will be turned into a single variable that contains only values in the new long table. Corresponds to the cols argument of pivot_longer().
  • v.names: name of the column with values. Corresponds to the values_to parameter of pivot_longer().
  • timevar: roughly corresponds to the names_to parameter of pivot_longer(); it is the name of the column where indexes or labels of the transformed columns will go.
  • times: by default, reshape() does not put column names into the timevar column but uses their relative indexes instead. E.g., the Symmetry column will get index 1, Attractiveness will get 2, and Trustworthiness will get 3. You can replace these indexes with any labels. Below, I used the same labels as the column names, but I could have used any three values.
long_base <- reshape(widish_df, 
                     direction = "long",
                     idvar = c("Participant",  "Face"),
                     varying = c("Symmetry", "Attractiveness", "Trustworthiness"),
                     v.names="Response",
                     times = c("Symmetry", "Attractiveness", "Trustworthiness"),
                     timevar = "Scale")
Table 8.13: Same table pivoted via reshape.
Participant Face Scale Response
1 M-1 Symmetry 6
1 M-2 Symmetry 4
2 M-1 Symmetry 5
2 M-2 Symmetry 3
1 M-1 Attractiveness 4
1 M-2 Attractiveness 7
2 M-1 Attractiveness 2
2 M-2 Attractiveness 7
1 M-1 Trustworthiness 3
1 M-2 Trustworthiness 6
2 M-1 Trustworthiness 1
2 M-2 Trustworthiness 2

For using reshape(), I strongly suggest experimenting with its parameters to get a feeling for how it should (and should not!) be used. As you can see, it is more involved and relies on fewer defaults than pivot_longer(), so you need to make sure you understand it.
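
For instance, here is what happens if you drop the times argument: as described above, the Scale column then contains relative indexes instead of the property names (a sketch):

long_indexed <- reshape(widish_df,
                        direction = "long",
                        idvar = c("Participant", "Face"),
                        varying = c("Symmetry", "Attractiveness", "Trustworthiness"),
                        v.names = "Response",
                        timevar = "Scale")
# long_indexed$Scale now holds 1, 2, or 3 rather than "Symmetry", etc.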

8.4 Practice using Likert scale data

Let us put this new knowledge to practice, using the GP.csv file. This is a questionnaire on gaming habits, which was conducted prior to an experiment to check whether the two groups of participants assigned to the game and control conditions have similar gaming habits. We would like to visually inspect how responses to individual items of the questionnaire are distributed across conditions, as this will tell us whether we should expect a difference. For this, we will reuse the approach for summarizing and plotting ordinal responses from the last seminar. Split the computations below into two pipelines: one that loads and preprocesses the data (steps 1-4) and another that produces a summary and stores it in a different table (steps 5-6; see the previous seminar if you forgot how we did it). My advice: implement it one step at a time, checking the table and making sure that you get the expected results before piping in the next operation.

  1. Read the file, making sure you specify column types.
  2. Convert the Condition column to a factor with levels “game” (1) and “control” (2).
  3. Pivot all GP.. columns. You should get a table with five columns: Respondent, Condition, Gender, Item (where the original column names go), and Response (where the original values go). Hint: you can use slicing : to specify the range of columns or the starts_with() function to specify a common prefix. Try both approaches (see the sketch after the table previews below).
  4. Convert the Response column to a factor assuming a seven-point scale with levels “Not at all”, “Slightly”, “Somewhat”, “Moderately”, “Quite a bit”, “Considerably”, and “Very much”.
  5. Count responses for each item and condition.
  6. Convert counts to proportions of responses for each item and condition.
Your first table, in long format, should look like this (I show only the first four rows):
Respondent Condition Gender Item Response
NV5UJ game 1 GP01_01 Not at all
NV5UJ game 1 GP02_01 Not at all
NV5UJ game 1 GP02_02 Not at all
NV5UJ game 1 GP02_03 Somewhat
And your second table, with aggregated results, should look like this:
Condition Item Response n Proportion
game GP01_01 Not at all 6 0.6
game GP01_01 Slightly 3 0.3
game GP01_01 Very much 1 0.1
game GP02_01 Not at all 8 0.8
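
Here is a sketch of the pivoting in step 3, assuming the raw table is stored in gp (both column-selection approaches from the hint are shown):

# via a common prefix
long_gp <- gp |>
  pivot_longer(cols = starts_with("GP"),
               names_to = "Item",
               values_to = "Response")

# alternatively, via slicing (assuming GP01_01 and GP03_01 are
# the first and last questionnaire columns)
long_gp <- gp |>
  pivot_longer(cols = GP01_01:GP03_01,
               names_to = "Item",
               values_to = "Response")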

Do exercise 4.

Repeat the exercise but now using the base R reshape() function.

Do exercise 5.

Now that you have a table with the proportion of responses for each condition and item, let’s plot them to compare visually. Use Response for the x-axis, Proportion for the y-axis, and Condition for color, and think about how to facet the plot.

Do exercise 6.

Note that our computation and plot have a problem: whenever a response level was not used by any participant, it is missing both in the table and in the plot. However, we should have an entry for every response level, just with 0 for both count and proportion: a count of 0 for a response level is not the same as a missing or absent response level. A way to solve this problem is the complete() function, which adds missing combinations of the data based on a subset of columns you specify and fills in the values of the remaining columns either with NA or with values that you specify. Here is an example for a table that contains columns Class and Count. Note that I defined Class as a factor with five levels (“A” through “E”), the last of which is missing from the table. complete() adds the missing row and fills the remaining column with NA.

class_df <-
  tibble(Class = c("A", "B", "C", "D"),
         Count = c(20, 19, 30, 22)) |>
  mutate(Class = factor(Class, levels = c("A", "B", "C", "D", "E")))

class_df |>
  complete(Class)
## # A tibble: 5 x 2
##   Class Count
##   <fct> <dbl>
## 1 A        20
## 2 B        19
## 3 C        30
## 4 D        22
## 5 E        NA

Although in some cases filling with NA may be reasonable, here we want to fill in the count, and it should be 0 (the fact that a particular class is missing means that we never counted it). Note that the fill parameter takes a named list, not a named vector (Why is that? Think about the data types of different columns and, if you are still not sure, reread the difference between atomic vectors and lists).

class_df |>
  complete(Class, fill = list(Count = 0))
## # A tibble: 5 x 2
##   Class Count
##   <fct> <dbl>
## 1 A        20
## 2 B        19
## 3 C        30
## 4 D        22
## 5 E         0
Here is how the plot looks once we have completed the table.
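
A minimal sketch of such a plot, assuming the aggregated table from exercise 6 is stored in proportions (faceting by Item is one reasonable choice; you may want to tweak the axis labels):

ggplot(proportions, aes(x = Response, y = Proportion, color = Condition)) +
  geom_point() +
  facet_wrap(~ Item) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))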

Do exercise 7.

8.5 Pivot wider

You can always go back from a long to a wide representation via the pivot_wider() or reshape() functions. The logic is reversed: you need to specify which columns identify rows that belong together, which column contains the future column names, and which contains their values. For our example table on face ratings, the names of the columns are in the column Scale and the values are in Response. But which columns identify the rows that belong together? In our case, these are Participant and Face, so all rows of the long table that have the same combination of Participant and Face values should be merged together into a single row. If you do not explicitly specify id_cols, pivot_wider() will by default use all remaining columns to identify which rows belong together. This is irrelevant in this toy example, as Participant and Face are all we have left anyhow, but below I will show you how things can get confusing and how to overcome this.

Let us undo our previous wide-to-long transformation and get our original wide table back!60

wide_again_tidy <-
  long_tidy |>
  pivot_wider(names_from = "Scale", values_from="Response")

Or, using explicit id_cols

wide_again_tidy <-
  long_tidy |>
  pivot_wider(id_cols = c("Participant", "Face"), names_from = "Scale", values_from="Response")
Table 8.14: Our table is wide again!
Participant Face Symmetry Attractiveness Trustworthiness
1 M-1 6 4 3
1 M-2 4 7 6
2 M-1 5 2 1
2 M-2 3 7 2

You can pivot wider using reshape() as well. However, note that, as of 07.10.2023, it works correctly only with data frames, so if you have a tibble (as I do), you need to convert it to a data frame via data.frame() or as.data.frame() (drop the as.data.frame() in the example below to see what I mean). In addition, you need to specify

  • direction = "wide"
  • idvar: columns that identify different rows that belong together. Same as id_cols for pivot_wider(), but with no default here.
  • timevar: same as names_from for pivot_wider(); the column whose values will be used as column names.
  • v.names: same as values_from.
  • sep: the new column names will be constructed as v.names + sep + timevar. By default, sep = ".".

The main difference, as compared to pivot_wider(), is how the column names are constructed. With reshape() function, the v.names + sep + timevar rule means that you end up with column names such as Response.Symmetry instead of just Symmetry.

wide_again_base <- reshape(as.data.frame(long_tidy),
                           direction = "wide",
                           idvar = c("Participant", "Face"),
                           timevar = "Scale",
                           v.names = "Response",
                           sep = "_")
Table 8.15: Our table is wide again but via reshape()!
Participant Face Response_Symmetry Response_Attractiveness Response_Trustworthiness
1 1 M-1 6 4 3
4 1 M-2 4 7 6
7 2 M-1 5 2 1
10 2 M-2 3 7 2

Let us take a look at the importance of id_cols. Imagine that we have another column, say, response times. So, our long table will look like this

long_tidy_rt <-
  long_tidy |>
  ungroup() |>
  mutate(RT = round(rgamma(n(), 4, 3), 2))
Participant Face Scale Response RT
1 M-1 Symmetry 6 0.65
1 M-1 Attractiveness 4 1.59
1 M-1 Trustworthiness 3 1.35
1 M-2 Symmetry 4 1.54
1 M-2 Attractiveness 7 0.38
1 M-2 Trustworthiness 6 1.27
2 M-1 Symmetry 5 2.78
2 M-1 Attractiveness 2 0.93
2 M-1 Trustworthiness 1 1.28
2 M-2 Symmetry 3 1.02
2 M-2 Attractiveness 7 1.35
2 M-2 Trustworthiness 2 1.08

For pivot_wider(), if we do not specify which columns identify rows that belong together, RT will be used as well. But because RT is different for every response, each row in the original table is unique and we end up with a weird-looking table with lots of NAs.

wide_odd_rt <-
  pivot_wider(long_tidy_rt, names_from = "Scale", values_from="Response")
Participant Face RT Symmetry Attractiveness Trustworthiness
1 M-1 0.65 6 NA NA
1 M-1 1.59 NA 4 NA
1 M-1 1.35 NA NA 3
1 M-2 1.54 4 NA NA
1 M-2 0.38 NA 7 NA
1 M-2 1.27 NA NA 6
2 M-1 2.78 5 NA NA
2 M-1 0.93 NA 2 NA
2 M-1 1.28 NA NA 1
2 M-2 1.02 3 NA NA
2 M-2 1.35 NA 7 NA
2 M-2 1.08 NA NA 2

To remedy that, we need to specify id columns explicitly, so that pivot_wider() can ignore and drop the rest:

wide_rt <-
  pivot_wider(long_tidy_rt,
              id_cols = c("Participant", "Face"),
              names_from = "Scale",
              values_from="Response") 
Participant Face Symmetry Attractiveness Trustworthiness
1 M-1 6 4 3
1 M-2 4 7 6
2 M-1 5 2 1
2 M-2 3 7 2

For practice, let us take the adaptation data and turn it into a wide format that is easier for humans to read. In its original form, the table is in a long format with a row for each pair of prime and probe stimuli.

Participant Prime Probe Nsame Ntotal
LUH-1992-M Sphere Sphere 22 119
LUH-1992-M Sphere Quadro 23 118
LUH-1992-M Sphere Dual 15 120
LUH-1992-M Sphere Single 31 115

Let us turn it into a wider table, so that a single row corresponds to a single prime and four new columns contain the proportion of same responses for individual probes. The table should look like this (use the round() function to reduce the number of digits):

Participant Prime Sphere Quadro Dual Single Average
LUH-1992-M Sphere 0.18 0.19 0.12 0.27 0.19
LUH-1992-M Quadro 0.21 0.22 0.14 0.34 0.23
LUH-1992-M Dual 0.25 0.30 0.27 0.48 0.32
LUH-1992-M Single 0.34 0.30 0.48 0.39 0.38

The overall procedure is fairly straightforward (should be a single pipe!):

  1. Read the file (don’t forget to specify column types!).
  2. Compute Psame, the proportion of same responses given the total number of responses, for each Probe (see the sketch below).
  3. Pivot the table wider, thinking about your id columns. Also try this without specifying any and see what you get.
  4. Compute the average stability across all probes and put it into a new Average column.
  5. Pipe it to the output, using knitr::kable().
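
Step 2 is simple arithmetic inside mutate(); a sketch, assuming the table is stored in adaptation (column names as in the preview above):

adaptation |>
  mutate(Psame = round(Nsame / Ntotal, 2))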

Use pivot_wider() in exercise 8.

Do exercise 8.

Repeat the analysis but now using the reshape() function.

Do exercise 9.

Let us practice more and create a group-average summary as a square 4×4 table with a single row per Prime and four columns for Probe, plus a column that says which prime the row corresponds to. Each cell should contain the median value. The table should look like this:

Prime Sphere Quadro Dual Single
Sphere 0.13 0.13 0.17 0.32
Quadro 0.19 0.12 0.21 0.38
Dual 0.15 0.30 0.27 0.48
Single 0.34 0.30 0.48 0.51

You know almost everything you need, so think about how you would implement this as a single pipeline. Hints: to match my table, you will definitely need to convert Prime and Probe to factors to ensure consistent ordering (otherwise, they will be sorted alphabetically), and you will need to group individual combinations of prime and probe before computing the summary statistic. And, of course, you will need to pivot the table wider (use your preferred method).

Do exercise 10.