11 Working with strings
When working with strings, I strongly suggest consulting the manual and vignettes of the stringr package. It has many functions that cover most needs. Grab the exercise notebook before we start.
11.1 Warming up
Before we start working with strings, let us warm up by preprocessing band-adaptation.csv that we will be working with.
- Read it (try specifying the URL instead of the local filename). Do not forget to specify column types!
- Compute the proportion of “same” responses as a new column Psame using Nsame (number of “same” responses) and Ntotal (total number of trials).
- Convert the Prime and Probe columns to factors with the order “Sphere”, “Quadro”, “Dual”, “Single”.
- Compute the median and the median absolute deviation from the median of Psame for all combinations of Prime and Probe.
Your table should look as follows:
Prime | Probe | Pmedian | Pmad |
---|---|---|---|
Sphere | Sphere | 0.13 | 0.06 |
Sphere | Quadro | 0.13 | 0.04 |
Sphere | Dual | 0.17 | 0.10 |
Sphere | Single | 0.32 | 0.08 |
Quadro | Sphere | 0.19 | 0.07 |
Quadro | Quadro | 0.12 | 0.12 |
Quadro | Dual | 0.21 | 0.19 |
Quadro | Single | 0.38 | 0.07 |
Dual | Sphere | 0.15 | 0.15 |
Dual | Quadro | 0.30 | 0.14 |
Dual | Dual | 0.27 | 0.15 |
Dual | Single | 0.48 | 0.16 |
Single | Sphere | 0.34 | 0.18 |
Single | Quadro | 0.30 | 0.20 |
Single | Dual | 0.48 | 0.12 |
Single | Single | 0.51 | 0.18 |
Do exercise 1.
11.2 Formatting strings via glue()
The table above gives us information about the median probability of seeing the same rotation and about its absolute deviation from the median. However, it would be more convenient for a reader if we combined these two pieces of information into a single entry of the form “median ± MAD” and arranged the entries with one Prime per row and one Probe per column. The table I have in mind looks like this:
Prime | Sphere | Quadro | Dual | Single |
---|---|---|---|---|
Sphere | 0.13 ± 0.06 | 0.13 ± 0.04 | 0.17 ± 0.1 | 0.32 ± 0.08 |
Quadro | 0.19 ± 0.07 | 0.12 ± 0.12 | 0.21 ± 0.19 | 0.38 ± 0.07 |
Dual | 0.15 ± 0.15 | 0.3 ± 0.14 | 0.27 ± 0.15 | 0.48 ± 0.16 |
Single | 0.34 ± 0.18 | 0.3 ± 0.2 | 0.48 ± 0.12 | 0.51 ± 0.18 |
You already know how to perform the second step (pivoting the table wider to turn Probe factor levels into columns). For the first step, you need to combine two values into a string. There are different ways to construct this string: via sprintf(), paste(), or the glue package. We will start with Tidyverse’s glue() and explore base R functions later.
The glue package is part of the Tidyverse, so it should already be installed. However, it is not part of the core tidyverse, so it is not imported automatically via library(tidyverse) and you need to import it separately or use the glue:: prefix. Function glue() allows you to “glue” values and code directly into a string. You simply surround any R code with curly brackets inside the string and the result of the code execution is glued in. If you use just a variable, its value will be glued in. You can put any code inside but the more code you put, the harder it will be to read and understand.
answer <- 42
bad_answer <- 41
glue::glue("The answer is {answer}, not {abs(bad_answer / -4)}")
## The answer is 42, not 10.25
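Because glue() is vectorized, the same syntax also works on whole columns, for example inside mutate(). Below is a minimal sketch on a made-up table; the names toy_df, Item, Price, and Label are assumptions for illustration only and are not part of the exercise data.

```r
library(tibble)
library(dplyr)

# Hypothetical toy table, not part of the exercise data
toy_df <- tibble(Item = c("tea", "coffee"),
                 Price = c(2.5, 3))

# glue() is evaluated for every row, producing one string per row
toy_df <-
  toy_df |>
  mutate(Label = glue::glue("{Item} costs {Price} euro"))

toy_df$Label
## tea costs 2.5 euro
## coffee costs 3 euro
```

Note that the new column holds glue objects; wrap the call in as.character() if you need a plain character column.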
Use the table that you prepared during exercise 1 to compute a new column with “median ± MAD” strings and then pivot the table wider, so that it matches the table above.
Do exercise 2.
11.3 Formatting strings via paste()
Base R has functions paste() and paste0() that concatenate a vector of strings into a single string. If you recall, vector values can only be of one (most flexible) type. Therefore, if you have a vector that intersperses strings with other values, they will first be converted to strings. The difference between paste() and paste0() is that the former puts a separator string in between each value (it defaults to ' ' but you can define your own via the sep argument), whereas paste0() uses no separator. We can replicate our glue() example.
answer <- 42
bad_answer <- 41
paste("The answer is ", answer, ", not ", abs(bad_answer / -4), sep = "")
## [1] "The answer is 42, not 10.25"
paste0("The answer is ", answer, ", not ", abs(bad_answer / -4))
## [1] "The answer is 42, not 10.25"
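The effect of the separator is easier to see on a few toy values. The sketch below also uses the collapse argument, which is not discussed above: it is what merges the elements of a single vector into one string. The variable names ids and merged are made up for this illustration.

```r
# Default separator is a single space
paste("a", "b", "c")
## [1] "a b c"

# Custom separator via sep
paste("a", "b", "c", sep = "-")
## [1] "a-b-c"

# Both functions are vectorized: one string per element of the longest input
ids <- paste("id", 1:3, sep = "_")
ids
## [1] "id_1" "id_2" "id_3"

# collapse merges the elements of a vector into a single string
merged <- paste(c("a", "b", "c"), collapse = "-")
merged
## [1] "a-b-c"
```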
Redo exercise 2 but using one of the paste functions instead of glue().
Do exercise 3.
11.4 Formatting strings via sprintf()
For detailed string formatting, base R has a sprintf() function that provides C-style string formatting (the same as Python’s original string formatting and a common way to format a string in many programming languages). The general function call is sprintf("string with formatting", value1, value2, value3), where the values are inserted into the string. In "string with formatting", you specify where you want to put a value via the % symbol that is followed by optional formatting info and a required symbol that defines the type of the value. The type symbols are
- s for a string
- d for an integer
- f for a float value using a “fixed point” decimal notation
- e for a float value using a scientific notation (e.g., 1e2).
- g for an “optimally” printed float value, so that scientific notation is used for very large or very small values (e.g., 1e+06 instead of 1000000 and 1e-05 for 0.00001).
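The type symbols listed above can be compared side by side in a short sketch. The values and the vector name formatted are made up for illustration; the commented results assume the default precision of six digits.

```r
# One formatted string per type symbol
formatted <- c(
  s       = sprintf("%s", "text"),    # string
  d       = sprintf("%d", 42L),       # integer
  f       = sprintf("%f", 3.14159),   # fixed point, six decimals by default
  e       = sprintf("%e", 100),       # always scientific notation
  g_mid   = sprintf("%g", 1234.5),    # moderate value stays in plain notation
  g_large = sprintf("%g", 1000000),   # large value switches to scientific
  g_small = sprintf("%g", 0.00001)    # small value switches to scientific
)

formatted
##              s              d              f              e          g_mid
##         "text"           "42"     "3.141590" "1.000000e+02"       "1234.5"
##        g_large        g_small
##        "1e+06"        "1e-05"
```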
Here is an example of formatting a string using an integer:
sprintf("I had %d pancakes for breakfast", 10)
## [1] "I had 10 pancakes for breakfast"
You are not limited to a single value that you can put into a string. You can specify more locations via % but you must make sure that you pass the matching number of values. If there are fewer values than you specified in the string, you will receive an error. If there are too many, you will only get a warning. Before running the code, can you figure out which call will actually work (and what the output will be) and which will produce an error or a warning?
sprintf("I had %d pancakes and either %d or %d steaks for dinner", 2)
sprintf("I had %d pancakes and %d steaks for dinner", 7, 10)
sprintf("I had %d pancake and %d steaks for dinner", 1, 7, 10)
In case of real values you have two options: %f and %g. The latter switches to scientific notation for very large or very small values (e.g., 1e+10 for 10000000000) to make the representation more compact. When formatting floating point numbers, you can specify the number of decimal places to be displayed.
e <- 2.71828182845904523536028747135266249775724709369995
sprintf("Euler's number is roughly %.4f", e)
## [1] "Euler's number is roughly 2.7183"
Note that, like most functions in R, sprintf() is vectorized, so when you pass a vector of values it will generate a vector of strings with one formatted string per value.
sprintf("The number is %d", c(2, 3))
## [1] "The number is 2" "The number is 3"
This means that you can use sprintf() to work on a column both in base R and inside the mutate() Tidyverse verb.
Number | Message |
---|---|
1 | The number is 1 |
2 | The number is 2 |
3 | The number is 3 |
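A table like the one above could be produced along the following lines. This is a minimal sketch: the column names come from the table, whereas the variable names numbers_df and messages_df are assumptions.

```r
library(tibble)
library(dplyr)

# Toy table with a single numeric column (values taken from the table above)
numbers_df <- tibble(Number = 1:3)

# sprintf() is vectorized, so it formats every value of the column at once
messages_df <-
  numbers_df |>
  mutate(Message = sprintf("The number is %d", Number))

messages_df
```

The base R counterpart would be an assignment such as numbers_df$Message <- sprintf("The number is %d", numbers_df$Number).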
Redo exercise #2 but use sprintf() instead of glue().
Do exercise 4.
11.5 Extracting information from a string
Previous exercises dealt with combining various bits of information into a single string. Often, you also need to do the opposite: extract bits of information from a single string. For example, in the toy table on face perception that we have been working with, the Face column codes the gender of the face in its first piece ("M"; the table is short but you can easily imagine that faces of both genders were used) and the index of the face in the second (1 and 2). When we worked with persistence data, the Participant column encoded year of birth and gender, whereas Session contained detailed information about year, month, day, hour, minutes, and seconds all merged together. There are several ways to extract this information: one piece at a time via substr() or the string processing library stringr, or by splitting a string column into several columns via separate() or extract().
11.6 Splitting strings via separate()
Function separate() is part of tidyr and its use is very straightforward: you pass 1) the name of the column that you want to split, 2) names of the columns it needs to be split into, 3) a separator symbol or indexes of splitting positions. Examples using the face table should make it clear. As a reminder, this is the original wide table and we want to separate Face into FaceGender and FaceIndex.
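library(tidyverse)  # provides tibble() and separate()

# toy table on face perception, one row per participant and face
widish_df <-
  tibble(Participant = c(1, 1, 2, 2),
         Face = rep(c("M-1", "M-2"), 2),
         Symmetry = c(6, 4, 5, 3),
         Attractiveness = c(4, 7, 2, 7),
         Trustworthiness = c(3, 6, 1, 2))
knitr::kable(widish_df)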
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
1 | M-1 | 6 | 4 | 3 |
1 | M-2 | 4 | 7 | 6 |
2 | M-1 | 5 | 2 | 1 |
2 | M-2 | 3 | 7 | 2 |
As there is a very convenient “dash” between the two, we can use it for a separator symbol:
Participant | FaceGender | FaceIndex | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|---|
1 | M | 1 | 6 | 4 | 3 |
1 | M | 2 | 4 | 7 | 6 |
2 | M | 1 | 5 | 2 | 1 |
2 | M | 2 | 3 | 7 | 2 |
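A call along the following lines should reproduce the split shown in the table above. To keep the sketch short, it uses a trimmed-down copy of the toy table with only the Participant and Face columns; the names faces_df and split_df are assumptions for illustration.

```r
library(tibble)
library(tidyr)

# Trimmed-down copy of the toy table (assumed name: faces_df)
faces_df <- tibble(Participant = c(1, 1, 2, 2),
                   Face = rep(c("M-1", "M-2"), 2))

# Split Face at the dash into two new columns; Face itself is removed
split_df <-
  faces_df |>
  separate(Face, into = c("FaceGender", "FaceIndex"), sep = "-")

split_df
```

Both new columns are character columns, so FaceIndex holds "1" and "2" rather than numbers.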
Note that the original Face column is gone. We can keep it via the remove=FALSE option:
Participant | Face | FaceGender | FaceIndex | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|---|---|
1 | M-1 | M | 1 | 6 | 4 | 3 |
1 | M-2 | M | 2 | 4 | 7 | 6 |
2 | M-1 | M | 1 | 5 | 2 | 1 |
2 | M-2 | M | 2 | 3 | 7 | 2 |
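The variant that keeps the original column can be sketched like this, again on a trimmed-down copy of the toy table (faces_df and kept_df are assumed names used only for this illustration).

```r
library(tibble)
library(tidyr)

# Trimmed-down copy of the toy table (assumed name: faces_df)
faces_df <- tibble(Participant = c(1, 1, 2, 2),
                   Face = rep(c("M-1", "M-2"), 2))

# remove = FALSE keeps Face; the new columns are placed right after it
kept_df <-
  faces_df |>
  separate(Face, into = c("FaceGender", "FaceIndex"), sep = "-", remove = FALSE)

kept_df
```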
We also do not need to extract all information. For example, we can extract only face gender or face index. To get only the gender, we specify only one into column and add the extra="drop" parameter, telling separate() to drop any extra pieces it obtained:
Participant | Face | Gender | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|---|
1 | M-1 | M | 6 | 4 | 3 |
1 | M-2 | M | 4 | 7 | 6 |
2 | M-1 | M | 5 | 2 | 1 |
2 | M-2 | M | 3 | 7 | 2 |
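Dropping the extra piece might look as follows. The sketch assumes that the table above was produced with remove=FALSE (the Face column is still present) and uses a trimmed-down copy of the toy table; faces_df and gender_df are made-up names.

```r
library(tibble)
library(tidyr)

# Trimmed-down copy of the toy table (assumed name: faces_df)
faces_df <- tibble(Participant = c(1, 1, 2, 2),
                   Face = rep(c("M-1", "M-2"), 2))

# Only one target column: the second piece is silently dropped
gender_df <-
  faces_df |>
  separate(Face, into = "Gender", sep = "-", extra = "drop", remove = FALSE)

gender_df
```

Without extra = "drop", separate() would still return the same table but would warn about the discarded pieces.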
Alternatively, we can explicitly ignore pieces by using NA for their column name:
Participant | Face | Gender | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|---|
1 | M-1 | M | 6 | 4 | 3 |
1 | M-2 | M | 4 | 7 | 6 |
2 | M-1 | M | 5 | 2 | 1 |
2 | M-2 | M | 3 | 7 | 2 |
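The NA-based variant can be sketched in the same way; as before, faces_df is an assumed, trimmed-down copy of the toy table and gender_na_df is a made-up name.

```r
library(tibble)
library(tidyr)

# Trimmed-down copy of the toy table (assumed name: faces_df)
faces_df <- tibble(Participant = c(1, 1, 2, 2),
                   Face = rep(c("M-1", "M-2"), 2))

# NA in place of a column name means "do not keep this piece"
gender_na_df <-
  faces_df |>
  separate(Face, into = c("Gender", NA), sep = "-", remove = FALSE)

gender_na_df
```

Compared with extra="drop", this form states explicitly how many pieces are expected and which of them are ignored.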
What about keeping only the second piece in an Index column? We ignore the first one via NA:
widish_df |>
  separate(Face, into=c(NA, "Index"), sep="-", remove=FALSE) |>
  knitr::kable(align = "c")
Participant | Face | Index | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|---|
1 | M-1 | 1 | 6 | 4 | 3 |
1 | M-2 | 2 | 4 | 7 | 6 |
2 | M-1 | 1 | 5 | 2 | 1 |
2 | M-2 | 2 | 3 | 7 | 2 |
Let’s practice. Use separate() to preprocess persistence data and create two new columns for hour and minutes from the Session column. Do it in a single pipeline, starting with reading all files (use tidyverse read_csv() and specify column types!) and renaming Shape1 (Prime) and Shape2 (Probe) columns. Your results should look like this; think about which columns you drop or keep (these are only the first four rows, think of how you can limit your output the same way via head() or slice_head() functions):
Participant | Hour | Minutes | Block | Trial | OnsetDelay | Bias | Prime | Probe | Response1 | Response2 | RT1 | RT2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
AKM1995M | 14 | 07 | 0 | 0 | 0.5746952 | left | stripes-8 | stripes-4 | right | left | 5.055481 | 1.0238089 |
AKM1995M | 14 | 07 | 0 | 1 | 0.5741707 | left | stripes-4 | heavy poles sphere | left | right | 2.969246 | 0.8239294 |
AKM1995M | 14 | 07 | 0 | 2 | 0.5082200 | left | stripes-2 | stripes-2 | right | left | 3.162331 | 0.6718403 |
AKM1995M | 14 | 07 | 0 | 3 | 0.6065058 | right | stripes-8 | stripes-2 | right | right | 1.021163 | 0.5919555 |
Do exercise 5.
As noted above, if the position of individual pieces is fixed, you can specify it explicitly. Let us make our toy table a bit more explicit:
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
1 | M-01 | 6 | 4 | 3 |
1 | F-02 | 4 | 7 | 6 |
2 | M-01 | 5 | 2 | 1 |
2 | F-02 | 3 | 7 | 2 |
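The updated table above could be created as follows. This is a sketch that simply redefines widish_df with the Face codes shown in the table; all other columns are copied from the earlier definition.

```r
library(tibble)

# Same toy table but with both genders and zero-padded face indexes
widish_df <-
  tibble(Participant = c(1, 1, 2, 2),
         Face = rep(c("M-01", "F-02"), 2),
         Symmetry = c(6, 4, 5, 3),
         Attractiveness = c(4, 7, 2, 7),
         Trustworthiness = c(3, 6, 1, 2))

widish_df
```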
For our toy faces table, the first piece is the gender and the last one is its index. Thus, we tell separate() where to cut the string: after the first and after the second character.
widish_df |>
  separate(Face,
           into = c("FaceGender", "Dash", "FaceIndex"),
           sep = c(1, 2),
           remove = FALSE) |>
  knitr::kable()
Participant | Face | FaceGender | Dash | FaceIndex | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|---|---|---|
1 | M-01 | M | - | 01 | 6 | 4 | 3 |
1 | F-02 | F | - | 02 | 4 | 7 | 6 |
2 | M-01 | M | - | 01 | 5 | 2 | 1 |
2 | F-02 | F | - | 02 | 3 | 7 | 2 |
Here, I’ve created a Dash column for the separator but, of course, I could have omitted it via an NA column name.
widish_df |>
  separate(Face,
           into = c("FaceGender", NA, "FaceIndex"),
           sep = c(1, 2)) |>
  knitr::kable()
Participant | FaceGender | FaceIndex | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|---|
1 | M | 01 | 6 | 4 | 3 |
1 | F | 02 | 4 | 7 | 6 |
2 | M | 01 | 5 | 2 | 1 |
2 | F | 02 | 3 | 7 | 2 |
Practice time! Using the same persistence data, extract birth year and gender of participants from the Participant code (however, keep the code column). Put a nice extra touch by converting year to a number (separate() splits a string into strings) and gender into a factor type with better labels. Here is how it should look:
Participant | BirthYear | Gender | Hour | Minutes | Block | Trial | OnsetDelay | Bias | Prime | Probe | Response1 | Response2 | RT1 | RT2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AKM1995M | 1995 | Male | 14 | 07 | 0 | 0 | 0.5746952 | left | stripes-8 | stripes-4 | right | left | 5.055481 | 1.0238089 |
AKM1995M | 1995 | Male | 14 | 07 | 0 | 1 | 0.5741707 | left | stripes-4 | heavy poles sphere | left | right | 2.969246 | 0.8239294 |
AKM1995M | 1995 | Male | 14 | 07 | 0 | 2 | 0.5082200 | left | stripes-2 | stripes-2 | right | left | 3.162331 | 0.6718403 |
AKM1995M | 1995 | Male | 14 | 07 | 0 | 3 | 0.6065058 | right | stripes-8 | stripes-2 | right | right | 1.021163 | 0.5919555 |
Do exercise 6.
11.7 Extracting a substring when you know its location
Base R provides a function to extract a substring (or many substrings): substr() (you can also use its close relative substring()). It takes a string (or a vector of strings) and the start and stop indexes of each substring.
substr(widish_df$Face, 3, 4)
## [1] "01" "02" "01" "02"
Repeat exercise 6 but use substr() to extract each column (BirthYear and Gender) from the participant code.
Do exercise 7.
Tidyverse has its own stringr library for working with strings. It uses a consistent naming scheme str_<action> for its functions and covers virtually all tasks that are related to working with strings. The stringr equivalent of substr() is str_sub(), which behaves similarly.
str_sub(widish_df$Face, 3, 4)
## [1] "01" "02" "01" "02"
Repeat exercise 7 but using str_sub() function.
Do exercise 8.
11.8 Detecting a substring using regular expressions
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Jamie Zawinski
One of the most powerful ways to work with strings is via regular expressions that allow you to code a flexible pattern that is matched to a substring within a string. For example, you can detect whether a string contains a number without knowing where it is located. Here, the pattern "\\d{3}" means that we are looking for 3 (hence the {3}) digits (hence the \\d). Base R has functions grepl() and grep() that return, respectively, a vector of logical values indicating whether the pattern was matched and the indexes of the vector elements for which it was matched.
QandA <- c("What was the answer, 42, right?", "No idea! What could it be, 423?")
# returns logical vector for each element
grepl("\\d{3}", QandA)
## [1] FALSE TRUE
# returns index of elements for which pattern was matched
grep("\\d{3}", QandA)
## [1] 2
The stringr library has its own version with a more obvious name, str_detect(), that acts similarly to grepl(), i.e., returns a vector of logical values on whether the pattern was matched. Note, however, the reverse order of arguments, as str_ functions always take (a vector of) strings as the first parameter:
str_detect(QandA, "\\d{3}")
## [1] FALSE TRUE
You can also look for 1 or more digits (which is +):
str_detect(QandA, "\\d+")
## [1] TRUE TRUE
Or for a specific word
str_detect(QandA, "What")
## [1] TRUE TRUE
Or for a specific word only at the beginning (^) of the string:
str_detect(QandA, "^What")
## [1] TRUE FALSE
When it comes to regular expressions, what I have shown you so far is not even the tip of an iceberg, it is the tip of a tip of an iceberg at best. They are very flexible, allowing you to code very complicated patterns, but they are also hard to read and, therefore, hard to debug. For example, this is a regular expression to check the validity of an email address:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Still, if you need to work with text they are indispensable, so you should remember about them. When facing an actual task, grab a cheatsheet and use an online expression tester to debug the pattern.
In the next exercise, use a regular expression to filter() out Primes and Probes that end with a single digit. I know that all of them end with a single digit if there is a digit in them, so you could make a very simple expression that would do the job. But I want you to practice working with the cheatsheet, so the pattern must specify that only one digit is allowed and that it must be the last symbol. When your pattern works, you should end up with a table where all Primes and Probes are "heavy poles sphere".
Do exercise 9.
11.9 Extracting substring defined by a regular expression
You can not only detect a substring defined by a regular expression but also extract it. The advantage is that you may not know how many symbols are in the substring or where it starts, so regular expressions give you maximal flexibility. The function for this is str_extract(), which works very similarly to str_detect() but returns the actual detected substring instead of just TRUE or FALSE. Use it to extract the participant's unique code, the first three letters of the Participant column. Again, here you could simply use substr() but I want you to write a pattern that matches 1) one or more 2) upper case letters 3) at the beginning of the string.
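To see how str_extract() behaves without giving away the exercise, here is a minimal sketch that reuses the QandA vector from the previous section and pulls out the first run of digits from each string. The variable name numbers is made up for this illustration.

```r
library(stringr)

QandA <- c("What was the answer, 42, right?", "No idea! What could it be, 423?")

# First substring that matches "one or more digits" in each string
numbers <- str_extract(QandA, "\\d+")
numbers
## [1] "42"  "423"

# Strings without a match yield NA rather than FALSE
str_extract("no digits here", "\\d+")
## [1] NA
```

The result is always a character vector, so extracted numbers still need as.integer() or as.numeric() before you can compute with them.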
Do exercise 10.
11.10 Replacing substring defined by a regular expression
Another manipulation is to replace an arbitrary substring with a fixed one. Base R provides the functions sub(), which replaces only the first occurrence of the matched pattern, and gsub(), which replaces all matched substrings. The stringr equivalents are str_replace() and str_replace_all(). The main difference, as with grepl() versus str_detect(), is the order of parameters: for str_replace() the input string is the first parameter, followed by a pattern and a replacement string, whereas for sub() the order is pattern, replacement, input string.
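The difference between replacing the first match and all matches, as well as the different argument order, can be seen in a short sketch. The string timestamp and the result names are made up for illustration and are unrelated to the exercise data.

```r
library(stringr)

timestamp <- "2023-01-15"   # made-up example string

# Base R: pattern, replacement, input string
first_base <- sub("-", "/", timestamp)    # only the first match is replaced
all_base   <- gsub("-", "/", timestamp)   # every match is replaced

# stringr: input string, pattern, replacement
first_str <- str_replace(timestamp, "-", "/")
all_str   <- str_replace_all(timestamp, "-", "/")

c(first_base, all_base, first_str, all_str)
## [1] "2023/01-15" "2023/01/15" "2023/01-15" "2023/01/15"
```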
As an exercise, use sub() and str_replace() to anonymize the birth year of our participants. You need to replace the four digits that represent their birth year with a single "-". The table should look as follows:
Participant | Hour | Minutes | Block | Trial | OnsetDelay | Bias | Prime | Probe | Response1 | Response2 | RT1 | RT2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
AKM-M | 14 | 07 | 0 | 0 | 0.5746952 | left | stripes-8 | stripes-4 | right | left | 5.055481 | 1.0238089 |
AKM-M | 14 | 07 | 0 | 1 | 0.5741707 | left | stripes-4 | heavy poles sphere | left | right | 2.969246 | 0.8239294 |
AKM-M | 14 | 07 | 0 | 2 | 0.5082200 | left | stripes-2 | stripes-2 | right | left | 3.162331 | 0.6718403 |
AKM-M | 14 | 07 | 0 | 3 | 0.6065058 | right | stripes-8 | stripes-2 | right | right | 1.021163 | 0.5919555 |
Do exercise 11.
Now, repeat the exercise but replace any single digit in the Participant code with ‘-’. Which functions do you use to produce the same results as in exercise 11?
Do exercise 12.