11 Working with strings

When working with strings, I strongly suggest consulting the manual and vignettes of the stringr package. It has many functions that cover most needs. Grab the exercise notebook before we start.

11.1 Warming up

Before we start working with strings, let us warm up by preprocessing band-adaptation.csv that we will be working with.

Read it (try specifying the URL instead of the local filename). Do not forget to specify column types!
Compute Psame, the proportion of “same” responses, using Nsame (number of “same” responses) and Ntotal (total number of trials).
Convert the Prime and Probe columns to factors with the order “Sphere”, “Quadro”, “Dual”, “Single”.
Compute the median and the median absolute deviation from the median for Psame for all combinations of Prime and Probe.

Your table should look as follows:

Table 11.1: bands_df
Prime	Probe	Pmedian	Pmad
Sphere	Sphere	0.13	0.06
Sphere	Quadro	0.13	0.04
Sphere	Dual	0.17	0.10
Sphere	Single	0.32	0.08
Quadro	Sphere	0.19	0.07
Quadro	Quadro	0.12	0.12
Quadro	Dual	0.21	0.19
Quadro	Single	0.38	0.07
Dual	Sphere	0.15	0.15
Dual	Quadro	0.30	0.14
Dual	Dual	0.27	0.15
Dual	Single	0.48	0.16
Single	Sphere	0.34	0.18
Single	Quadro	0.30	0.20
Single	Dual	0.48	0.12
Single	Single	0.51	0.18

Do exercise 1.

11.2 Formatting strings via `glue()`

The table above gives us information about the median probability of seeing the same rotation and about its absolute deviation from the median. However, it would be more convenient for a reader if we combined these two pieces of information into a single entry in the form of “median ± MAD”. Plus, it would be easier to see the pattern in a square table with one Prime per row and one Probe per column. The table I have in mind looks like this:

Table 11.2: Probability of persistence, median ± MAD
Prime	Sphere	Quadro	Dual	Single
Sphere	0.13 ± 0.06	0.13 ± 0.04	0.17 ± 0.1	0.32 ± 0.08
Quadro	0.19 ± 0.07	0.12 ± 0.12	0.21 ± 0.19	0.38 ± 0.07
Dual	0.15 ± 0.15	0.3 ± 0.14	0.27 ± 0.15	0.48 ± 0.16
Single	0.34 ± 0.18	0.3 ± 0.2	0.48 ± 0.12	0.51 ± 0.18

You already know how to perform the second step (pivoting the table wider to turn Probe factor levels into columns). For the first step, you need to combine two values into a string. There are different ways to construct this string: via sprintf(), paste(), or the glue package. We will start with Tidyverse’s glue() and explore base R functions later.

The glue package is part of the Tidyverse, so it should already be installed. However, it is not part of the core tidyverse, so it does not get imported automatically via library(tidyverse) and you need to import it separately or use the glue:: prefix. Function glue() allows you to “glue” values and code directly into a string. You simply surround any R code with curly (“wiggly”) brackets inside the string and the result of the code execution is glued in. If you use just a variable, its value will be glued in. You can put any code inside but the more code you put, the harder it will be to read and understand.

answer <- 42
bad_answer <- 41
glue::glue("The answer is {answer}, not {abs(bad_answer / -4)}")

## The answer is 42, not 10.25
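Like most R functions, glue() is vectorized, which is what makes it usable on table columns. Here is a minimal sketch of that behavior; the fruit and weight vectors are made up for illustration and are not part of the chapter's data.

```r
library(glue)

# Hypothetical toy vectors, used only for illustration
fruit <- c("apple", "pear")
weight <- c(0.1834, 0.25)

# glue() returns one string per element; code inside {} is evaluated first
labels <- glue("{fruit}: {round(weight, 2)} kg")
print(labels)
# apple: 0.18 kg
# pear: 0.25 kg
```

Because the result has the same length as the inputs, the same call should also work inside mutate().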

Use the table that you prepared during exercise 1 to compute a new column with “median ± MAD” (you will want to use the round() function to restrict values to just 2 digits after the decimal point). Think about when you want to perform this computation to make it easier (before or after pivoting?) and which column(s) you need to pivot wider.

Do exercise 2.

11.3 Formatting strings via `paste()`

Base R has functions paste() and paste0() that concatenate a vector of strings into a single string. If you recall, vector values can only be of one (most flexible) type. Therefore, if you have a vector that intersperses strings with other values, they will be first converted to strings anyhow. The difference between paste() and paste0() is that the former puts a separator string in-between each value (defaults to ' ' but you can define your own via sep argument), whereas paste0() uses no separator. We can replicate our glue() example.

answer <- 42
bad_answer <- 41
paste("The answer is ", answer, ", not ", abs(bad_answer / -4), sep = "")

## [1] "The answer is 42, not 10.25"

paste0("The answer is ", answer, ", not ", abs(bad_answer / -4))

## [1] "The answer is 42, not 10.25"
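A few more calls may help to see how the separator and vectorization work. This is a small sketch with made-up values; note that merging the elements of a single vector into one string is done via the collapse argument, which is not discussed above.

```r
# paste() with a custom separator between the pieces
dashed <- paste("M", 1, sep = "-")
dashed
# [1] "M-1"

# Both functions are vectorized: shorter arguments are recycled
codes <- paste0("M", 1:3)
codes
# [1] "M1" "M2" "M3"

# collapse merges the elements of the resulting vector into a single string
merged <- paste(c("Sphere", "Quadro", "Dual"), collapse = ", ")
merged
# [1] "Sphere, Quadro, Dual"
```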

Redo exercise 2 but use one of the paste functions instead of glue().

Do exercise 3.

11.4 Formatting strings via `sprintf()`

For detailed string formatting, base R has a sprintf() function that provides C-style string formatting (same as Python’s original string formatting and a common way to format a string in many programming languages). The general function call is sprintf("string with formatting", value1, value2, value3), where the values are inserted into the string. In "string with formatting", you specify where you want to put a value via the % symbol that is followed by optional formatting info and the required symbol that defines the type of the value. The type symbols are

s for string
d for an integer
f for a float value using a “fixed point” decimal notation
e for a float value using a scientific notation (e.g., 1e2).
g for an “optimally” printed float value, so that scientific notation is used for very large or very small values (e.g., 1e+10 instead of 10000000000 and 1e-05 for 0.00001).
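To make the list above concrete, here is a short sketch that applies each type symbol to a made-up value. The variable names are arbitrary and the values were picked only to show the differences between the formats.

```r
s_fmt   <- sprintf("%s", "text")      # "text"
d_fmt   <- sprintf("%d", 42L)         # "42"
f_fmt   <- sprintf("%f", 3.14159)     # "3.141590" (six digits after the point by default)
e_fmt   <- sprintf("%e", 12345.678)   # "1.234568e+04"
g_big   <- sprintf("%g", 1e10)        # "1e+10"   (very large: scientific notation)
g_small <- sprintf("%g", 0.00001)     # "1e-05"   (very small: scientific notation)
g_mid   <- sprintf("%g", 123.456)     # "123.456" (moderate: fixed notation)

print(c(s_fmt, d_fmt, f_fmt, e_fmt, g_big, g_small, g_mid))
```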

Here is an example of formatting a string using an integer:

sprintf("I had %d pancakes for breakfast", 10)

## [1] "I had 10 pancakes for breakfast"

You are not limited to a single value that you can put into a string. You can specify more locations via % but you must make sure that you pass the matching number of values. If there are fewer values than you specified in the string, you will receive an error. If there are too many, only a warning⁶⁹. Before running it, can you figure out which call will actually work (and what the output will be) and which will produce an error or a warning?

sprintf("I had %d pancakes and either %d or %d steaks for dinner", 2)
sprintf("I had %d pancakes and %d steaks for dinner", 7, 10)
sprintf("I had %d pancake and %d steaks for dinner", 1, 7, 10)

In case of real values you have two main options: %f and %g. The latter uses scientific notation (e.g. 1e+10 for 10000000000) to make the representation more compact. When formatting floating point numbers, you can specify the number of digits to be displayed after the decimal point.

e <- 2.71828182845904523536028747135266249775724709369995
sprintf("Euler's number is roughly %.4f", e)

## [1] "Euler's number is roughly 2.7183"
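One practical difference from round() is worth a quick sketch: a fixed-point format keeps trailing zeros, whereas a rounded number that is converted to a string drops them. The value 0.3 is just an illustration.

```r
# round() returns a number, so trailing zeros are lost once it becomes a string
rounded <- as.character(round(0.3, 2))
rounded
# [1] "0.3"

# "%.2f" always prints exactly two digits after the decimal point
fixed <- sprintf("%.2f", 0.3)
fixed
# [1] "0.30"

# An optional width (here 6) pads the result with spaces on the left
padded <- sprintf("%6.2f", 0.3)
padded
# [1] "  0.30"
```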

Note that, like most functions in R, sprintf() is vectorized, so when you pass a vector of values it will generate a vector of strings with one formatted string per value.

sprintf("The number is %d", c(2, 3))

## [1] "The number is 2" "The number is 3"

This means that you can use sprintf() to work on columns both in base R and inside the mutate() Tidyverse verb.

tibble(Number = 1:3) |>
  mutate(Message = sprintf("The number is %d", Number)) |>
  knitr::kable()

Number	Message
1	The number is 1
2	The number is 2
3	The number is 3

Redo exercise #2 but use sprintf() instead of glue().

Do exercise 4.

11.5 Extracting information from a string

Previous exercises dealt with combining various bits of information into a single string. Often, you also need to do the opposite: extract bits of information from a single string. For example, in the toy table on face perception that we have been working with, the Face column encodes two pieces of information: the first is the gender of the face, "M" (the table is short but you can easily assume that faces of both genders were used), and the second is its index (1 and 2). When we worked with persistence, the Participant column encoded year of birth and gender, whereas Session contained detailed information about year, month, day, hour, minutes, and seconds all merged together. There are several ways to extract this information. You can extract one piece at a time via substr() or the string processing library stringr. Alternatively, you can split a string column into several columns via separate() or use the extract() function.

11.6 Splitting strings via `separate()`

Function separate() is part of tidyr and its use is very straightforward: you pass 1) the name of the column that you want to split, 2) names of the columns it needs to be split into, 3) a separator symbol or indexes of splitting positions. Examples using the face table should make it clear. Reminder, this is the original wide table and we want to separate Face into FaceGender and FaceIndex.

widish_df <- 
  tibble(Participant = c(1, 1, 2, 2),
         Face = rep(c("M-1", "M-2"), 2), 
         Symmetry = c(6, 4, 5, 3),
         Attractiveness = c(4, 7, 2, 7),
         Trustworthiness = c(3, 6, 1, 2))

knitr::kable(widish_df)

Participant	Face	Symmetry	Attractiveness	Trustworthiness
1	M-1	6	4	3
1	M-2	4	7	6
2	M-1	5	2	1
2	M-2	3	7	2

As there is a very convenient “dash” between the two, we can use it for a separator symbol:

widish_df |>
  separate(Face, into=c("FaceGender", "FaceIndex"), sep="-")

Participant	FaceGender	FaceIndex	Symmetry	Attractiveness	Trustworthiness
1	M	1	6	4	3
1	M	2	4	7	6
2	M	1	5	2	1
2	M	2	3	7	2

Note that the original Face column is gone. We can keep it via the remove=FALSE option:

widish_df |>
  separate(Face, into=c("FaceGender", "FaceIndex"), sep="-", remove=FALSE)

Participant	Face	FaceGender	FaceIndex	Symmetry	Attractiveness	Trustworthiness
1	M-1	M	1	6	4	3
1	M-2	M	2	4	7	6
2	M-1	M	1	5	2	1
2	M-2	M	2	3	7	2

We also do not need to extract all information. For example, we can extract only face gender or face index. To get only the gender, we only specify one into column and add extra="drop" parameter, telling separate() to drop any extra piece it obtained:

widish_df |>
  separate(Face, into=c("Gender"), sep="-", remove=FALSE, extra="drop")

Participant	Face	Gender	Symmetry	Attractiveness	Trustworthiness
1	M-1	M	6	4	3
1	M-2	M	4	7	6
2	M-1	M	5	2	1
2	M-2	M	3	7	2

Alternatively, we can explicitly ignore pieces by using NA for their column name:

widish_df |>
  separate(Face, into=c("Gender", NA), sep="-", remove=FALSE)

Participant	Face	Gender	Symmetry	Attractiveness	Trustworthiness
1	M-1	M	6	4	3
1	M-2	M	4	7	6
2	M-1	M	5	2	1
2	M-2	M	3	7	2

What about keeping only the second piece in an Index column? We ignore the first one via NA:

widish_df |>
  separate(Face, into=c(NA, "Index"), sep="-", remove=FALSE)

Participant	Face	Index	Symmetry	Attractiveness	Trustworthiness
1	M-1	1	6	4	3
1	M-2	2	4	7	6
2	M-1	1	5	2	1
2	M-2	2	3	7	2

Let’s practice. Use separate() to preprocess the persistence data and create two new columns, Hour and Minutes, from the Session column. Do it in a single pipeline, starting with reading all files (use tidyverse read_csv() and specify column types!) and renaming the Shape1 (Prime) and Shape2 (Probe) columns. Your results should look like the table below, so think about which columns you drop or keep. Only the first four rows are shown; think of how you can limit your output the same way via the head() or slice_head() functions:

Participant	Hour	Minutes	Trial	OnsetDelay	Bias	Prime	Probe	Response1	Response2	RT1	RT2
AKM1995M	14	07	0	0.5746952	left	stripes-8	stripes-4	right	left	5.055481	1.0238089
AKM1995M	14	07	1	0.5741707	left	stripes-4	heavy poles sphere	left	right	2.969246	0.8239294
AKM1995M	14	07	2	0.5082200	left	stripes-2	stripes-2	right	left	3.162331	0.6718403
AKM1995M	14	07	3	0.6065058	right	stripes-8	stripes-2	right	right	1.021163	0.5919555

Do exercise 5.

As noted above, if the position of individual pieces is fixed, you can specify it explicitly. Let us make our toy table a bit more explicit:

Participant	Face	Symmetry	Attractiveness	Trustworthiness
1	M-01	6	4	3
1	F-02	4	7	6
2	M-01	5	2	1
2	F-02	3	7	2
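The code that builds this updated table is not shown, so here is one way it could be defined. This is an assumption that simply mirrors the earlier widish_df definition with the new Face values.

```r
library(tibble)

# Assumed definition: the same toy table as before, but with both
# genders and zero-padded two-digit face indexes in the Face column
widish_df <-
  tibble(Participant = c(1, 1, 2, 2),
         Face = rep(c("M-01", "F-02"), 2),
         Symmetry = c(6, 4, 5, 3),
         Attractiveness = c(4, 7, 2, 7),
         Trustworthiness = c(3, 6, 1, 2))

print(widish_df)
```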

For our toy faces table, the first piece is the gender and the last one is its index. Thus, we tell separate() the positions at which to split the string: after the first character and after the second one.

widish_df |>
  separate(Face, 
           into = c("FaceGender", "Dash", "FaceIndex"), 
           sep = c(1, 2), 
           remove = FALSE)

Participant	Face	FaceGender	Dash	FaceIndex	Symmetry	Attractiveness	Trustworthiness
1	M-01	M	-	01	6	4	3
1	F-02	F	-	02	4	7	6
2	M-01	M	-	01	5	2	1
2	F-02	F	-	02	3	7	2

Here, I’ve created a Dash column for the separator but, of course, I could have omitted it via an NA column name.

widish_df |>
  separate(Face, into=c("FaceGender", NA, "FaceIndex"), sep=c(1, 2))

Participant	FaceGender	FaceIndex	Symmetry	Attractiveness	Trustworthiness
1	M	01	6	4	3
1	F	02	4	7	6
2	M	01	5	2	1
2	F	02	3	7	2

Practice time! Using the same persistence data, extract the birth year and gender of participants from the Participant code (however, keep the code column). Add a nice extra touch by converting the year to a number (separate() splits a string into strings) and the gender into a factor with better labels. Here is how it should look:

Participant	BirthYear	Gender	Hour	Minutes	Trial	OnsetDelay	Bias	Prime	Probe	Response1	Response2	RT1	RT2
AKM1995M	1995	Male	14	07	0	0.5746952	left	stripes-8	stripes-4	right	left	5.055481	1.0238089
AKM1995M	1995	Male	14	07	1	0.5741707	left	stripes-4	heavy poles sphere	left	right	2.969246	0.8239294
AKM1995M	1995	Male	14	07	2	0.5082200	left	stripes-2	stripes-2	right	left	3.162331	0.6718403
AKM1995M	1995	Male	14	07	3	0.6065058	right	stripes-8	stripes-2	right	right	1.021163	0.5919555
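Regarding the type conversion mentioned above, one possible route is the convert argument of separate(), which tries to guess a better type for the new columns. The sketch below uses a made-up Code column rather than the persistence data; an explicit as.integer() inside mutate() would be an alternative.

```r
library(tibble)
library(tidyr)

# Hypothetical toy table, used only for illustration
codes_df <- tibble(Code = c("A-10", "B-20"))

# By default, both new columns are character vectors
as_strings <- separate(codes_df, Code, into = c("Letter", "Number"), sep = "-")
class(as_strings$Number)
# [1] "character"

# With convert = TRUE, the column that holds only whole numbers becomes integer
as_numbers <- separate(codes_df, Code, into = c("Letter", "Number"), sep = "-",
                       convert = TRUE)
class(as_numbers$Number)
# [1] "integer"
```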

Do exercise 6.

11.7 Extracting a substring when you know its location

Base R provides the substr() function to extract a substring (or many substrings); you can also use the closely related substring(). It takes a string (or a vector of strings) and vectors with the start and stop indexes of each substring.

face_img <- c("M01", "M02", "F01", "F02")
substr(face_img, 2, 3)

## [1] "01" "02" "01" "02"

Repeat exercise 6 but use substr() to extract each column (BirthYear and Gender) from the participant code.

Do exercise 7.

Tidyverse has its own stringr library for working with strings. It uses a consistent naming scheme str_<action> for its functions and covers virtually all tasks that are related to working with strings. The stringr equivalent of substr() is str_sub(), which behaves similarly.

face_img <- c("M01", "M02", "F01", "F02")
str_sub(face_img, 2, 3)

## [1] "01" "02" "01" "02"
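One convenience of str_sub() that substr() lacks is negative indexes, which count from the end of the string. A minimal sketch using the same face_img vector:

```r
library(stringr)

face_img <- c("M01", "M02", "F01", "F02")

# Negative indexes count from the end: -2 is the second-to-last character
last_two <- str_sub(face_img, -2, -1)
last_two
# [1] "01" "02" "01" "02"

# Identical start and end indexes extract a single character
first_one <- str_sub(face_img, 1, 1)
first_one
# [1] "M" "M" "F" "F"
```

This is handy when the piece you need sits at a fixed distance from the end but the strings differ in length.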

Repeat exercise 7 but use the str_sub() function.

Do exercise 8.

11.8 Detecting a substring using regular expressions

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Jamie Zawinski

One of the most powerful ways to work with strings is via regular expressions that allow you to code a flexible pattern that is matched to a substring within a string. For example, you can detect whether a string contains a number without knowing where it is located. Here, the pattern "\\d{3}" means that we are looking for 3 (hence the {3}) digits (hence the \\d). Base R has functions grepl()⁷⁰ and grep() that return, respectively, a vector of logical values indicating whether the pattern was matched and the indexes of the vector elements for which it was matched.

QandA <- c("What was the answer, 42, right?", "No idea! What could it be, 423?")
# returns logical vector for each element
grepl("\\d{3}", QandA)

## [1] FALSE  TRUE

# returns index of elements for which pattern was matched
grep("\\d{3}", QandA)

## [1] 2
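Either result can be used to pick the matching elements themselves. Here is a brief sketch; the value = TRUE argument of grep() is not discussed above but is part of base R.

```r
QandA <- c("What was the answer, 42, right?", "No idea! What could it be, 423?")

# Use the logical vector from grepl() as a mask
has_three_digits <- grepl("\\d{3}", QandA)
matched_by_mask <- QandA[has_three_digits]

# Or ask grep() to return the matching elements instead of their indexes
matched_by_grep <- grep("\\d{3}", QandA, value = TRUE)

matched_by_grep
# [1] "No idea! What could it be, 423?"
```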

The stringr library has its own version with a more obvious name, str_detect(), that acts similarly to grepl(), i.e., it returns a vector of logical values indicating whether the pattern was matched. Note, however, the reverse order of arguments, as str_ functions always take (a vector of) strings as the first parameter.

str_detect(QandA, "\\d{3}")

## [1] FALSE  TRUE

You can also look for 1 or more digits (which is +)

str_detect(QandA, "\\d+")

## [1] TRUE TRUE

Or for a specific word

str_detect(QandA, "What")

## [1] TRUE TRUE

Or for a specific word only at the beginning (^) of the string

str_detect(QandA, "^What")

## [1]  TRUE FALSE
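Two details are easy to trip over: a pattern matches anywhere inside a longer run, and matching is case sensitive by default. The sketch below illustrates both with base R's grepl() on the same QandA vector; ignore.case is a grepl() argument that is not covered above.

```r
QandA <- c("What was the answer, 42, right?", "No idea! What could it be, 423?")

# "\\d{2}" also matches inside "423", because two digits are found within it
two_digits <- grepl("\\d{2}", QandA)
two_digits
# [1] TRUE TRUE

# Matching is case sensitive: lower-case "what" is not found
lower_case <- grepl("what", QandA)
lower_case
# [1] FALSE FALSE

# ignore.case = TRUE relaxes that
any_case <- grepl("what", QandA, ignore.case = TRUE)
any_case
# [1] TRUE TRUE
```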

When it comes to regular expressions, what I have shown you so far is not even the tip of the iceberg, it is the tip of the tip of the iceberg at best. They are very flexible, allowing you to code very complicated patterns, but they are also hard to read and, therefore, hard to debug⁷¹. For example, this is a regular expression to check the validity of an email address⁷²

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Still, if you need to work with text they are indispensable, so you should remember about them. When facing an actual task, grab a cheatsheet and use an online expression tester to debug the pattern.

In the next exercise, use a regular expression to filter() out Primes and Probes that end with a single digit. I know that all of them end with a single digit, if a digit is in them, so you can make a very simple expression that would do the job. But I want you to practice working with the cheatsheet, so it must specify that only one digit is allowed and that it must be the last symbol. When your pattern works, you should end up with a table where all Primes and Probes are "heavy poles sphere".

Do exercise 9.

11.9 Extracting substring defined by a regular expression

You can not only detect a substring defined by a regular expression but also extract it. The advantage is that you may not know how many symbols are in the substring or where it starts, so regular expressions give you maximal flexibility. The function for this is str_extract(), which works very similarly to str_detect() but returns the actual detected substring instead of just TRUE or FALSE. Use it to extract the participant’s unique code, the first three letters of the Participant column. Again, here you could simply use substr() but I want you to write a pattern that matches 1) one or more 2) upper case letters 3) at the beginning of the string.
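To see the difference from str_detect(), here is a small sketch on the QandA vector from the previous section (redefined here so that the code runs on its own). It is not the exercise pattern, just an illustration of what gets returned.

```r
library(stringr)

QandA <- c("What was the answer, 42, right?", "No idea! What could it be, 423?")

# The first run of one or more digits in each string
numbers <- str_extract(QandA, "\\d+")
numbers
# [1] "42"  "423"

# Strings without a match yield NA instead of FALSE
three_digits <- str_extract(QandA, "\\d{3}")
three_digits
# [1] NA    "423"
```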

Do exercise 10.

11.10 Replacing substring defined by a regular expression

Another manipulation is to replace an arbitrary substring with a fixed one. Base R provides functions sub(), which replaces only the first occurrence of the matched pattern, and gsub(), which replaces all matched substrings. The stringr equivalents are str_replace() and str_replace_all(). The main difference, as with grepl() versus str_detect(), is the order of parameters: for str_replace() the input string is the first parameter, followed by a pattern and a replacement string, whereas for sub() the order is pattern, replacement, input string.
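A toy illustration of the first-versus-all distinction and of the argument order, using a made-up word instead of the participant codes:

```r
library(stringr)

word <- "banana"

# Base R: pattern, replacement, input string
first_base <- sub("a", "-", word)
all_base <- gsub("a", "-", word)

# stringr: input string, pattern, replacement
first_stringr <- str_replace(word, "a", "-")
all_stringr <- str_replace_all(word, "a", "-")

c(first_base, all_base)
# [1] "b-nana" "b-n-n-"
```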

As an exercise, use sub() and str_replace() to anonymize the birth year of our participants. You need to replace the four digits that represent their birth year with a single "-". The table should look as follows:

Participant	Hour	Minutes	Trial	OnsetDelay	Bias	Prime	Probe	Response1	Response2	RT1	RT2
AKM-M	14	07	0	0.5746952	left	stripes-8	stripes-4	right	left	5.055481	1.0238089
AKM-M	14	07	1	0.5741707	left	stripes-4	heavy poles sphere	left	right	2.969246	0.8239294
AKM-M	14	07	2	0.5082200	left	stripes-2	stripes-2	right	left	3.162331	0.6718403
AKM-M	14	07	3	0.6065058	right	stripes-8	stripes-2	right	right	1.021163	0.5919555

Do exercise 11.

Now, repeat the exercise but replace any single digit in the Participant code with ‘-’. Which functions do you use to produce the same results as in exercise 11?

Do exercise 12.
