9 Controling computation flow

Grab the exercise notebook before we start.

One of the most powerful features of R is that it is vector-based. Remember, everything is a vector (or a list). In the previous seminars you saw you can apply a function, a filter, or perform a computation on all values of a vector or all rows in a table in a single call. However, sometimes, you need to go over one value or row at a time explicitly. For example, if you working with a time-series, it might be easier to use an explicit for loop to compute current value based on previous state of the system. However, such instances are fairly rare, so the general rule for R is “you probably do not need a loop”. Below, we will go through various tools that render explicit loops redundant, but cover the loops themselves as well as conditional control statements (if-else) and functions (ifelse, case_when).

Note that the material does not cover a while () {<do-code} loop mostly because it is rarely used in data analysis⁶¹ and I could not come up with examples that would not be very artificial looking. At the same time, if you do need it, it has a very simple structure, so understanding and using it should not be a challenge.

9.1 `rep()`

The most basic repetition mechanism in R is rep() function. It takes a vector and repeats it specified number of times.

rep(c(1, 2, 3), times = 4)

##  [1] 1 2 3 1 2 3 1 2 3 1 2 3

Alternatively, you can repeat each element specified number of times before repeating the next one via each parameter. The difference between options lies only in the order of elements in the new vector. As you can see both vectors have the same length and each individual value is repeated four times.

rep(c(1, 2, 3), each = 4)

##  [1] 1 1 1 1 2 2 2 2 3 3 3 3

You can specify length of the output vector via length.out. When combined with times it can be useful for producing truncated vectors. E.g., when we repeat a three element vector but we want to get ten values. Using times only, we can get either nine (times = 3) or twelve (times = 4), not ten. length.out = 10 makes it happen.

rep(c(1, 2, 3), times=4, length.out = 10)

##  [1] 1 2 3 1 2 3 1 2 3 1

However, you can also use subsetting of a new vector to achieve the same end.

rep(c(1, 2, 3), times = 4)[1:10]

##  [1] 1 2 3 1 2 3 1 2 3 1

You should be more careful when combining length.out with each, as each value is repeated each times and, if length.out is longer, the same sequence is repeated again. Could be confusing and you might get a very unbalanced repeated sequence.

rep(c(1, 2, 3), each = 8, length.out = 10)

##  [1] 1 1 1 1 1 1 1 1 2 2

Do exercise 1.

9.2 Repeating combinations

To create a table with all combinations of values, you can use either base R expand.grid() or tidyr’s implementation expand_grid(). The latter is a bit more robust and can expand even tables and matrices (but see the documentation for subtle differences in implementation and output).

The usage is very straightforward, you provide column names and values and you get all combinations of their values.

knitr::kable(grid_base)

gender	handidness	colorblindness
female	right	TRUE
male	right	TRUE
female	left	TRUE
male	left	TRUE
female	right	FALSE
male	right	FALSE
female	left	FALSE
male	left	FALSE

expand_grid() works the same they but for the order of values within columns.

expand_grid(gender=c("female", "male"), 
            handidness=c("right", "left"),
            colorblindness=c(TRUE, FALSE))

knitr::kable(grid_tidyr)

gender	handidness	colorblindness
female	right	TRUE
female	right	FALSE
female	left	TRUE
female	left	FALSE
male	right	TRUE
male	right	FALSE
male	left	TRUE
male	left	FALSE

Do exercise 2.

9.3 For loop

You can loop (iterate) over elements of a vector or list via a for loop, which is very similar to for-loops in other programming languages. However, use of the for loop in R is fairly rare, because vectors are a fundamental building block of R and, therefore, it is inherently vectorized (you can do the same thing to all values, not to one value at a time). In a sense, for loop is very un-R, so if you find yourself using it, consider whether there is a simpler or more expressive way to do this. At the same time, if for loop is the simplest, clearest, or more robust way to write you code, by all means, use it!

The general format is

for(loop_variable in vector_or_list){
  ...some operations using loop_variable that
  changes its value on each iteration using
  values from the vector or list... 
}

Note the curly brackets. We used them before to put the code inside a function. Here, we use them to put the code inside the loop. The loop (the code inside curly brackets) is repeated as many times as the number of elements in a vector or a list with a loop variable⁶² getting assigned each vector/list value on each iteration. Thus, to print each value of a vector we can do

for(a_number in c(1, 5, 200)){
  print(a_number)
}

## [1] 1
## [1] 5
## [1] 200

Here, we have three elements in a vector, therefore the code inside curly brackets is repeated three times with the variable a_number taking each value in turn. I.e., a_number is equal to 1 on a first iteration, 5 on a second, 200 on the third. Note that the code above is equivalent to just assigning one value at a time to a_number and calling print() function three times.

a_number <- 1 
print(a_number)

## [1] 1

a_number <- 5
print(a_number)

## [1] 5

a_number <- 200
print(a_number)

## [1] 200

As you can see, it does not really matter how you assign a value to a variable and repeat the code. However, the for loop approach is much more flexible, is easier to read and to maintain. I.e., let us assume that you decided to alter the print() call with cat() instead. In the for loop version there is just one line to take care of. In the copy-paste version, there are three lines you need to alter. And imagine if you need to repeat the same code a hundred or thousand times, copy-paste is clearly not a viable solution.

Also note that you might be interested in repeat the code inside the for loop given number of times but are not interested in a loop variable and values that it takes. For example, we might want to repeat print("Ho!") three times (because it is Christmas time and “Ho! Ho! Ho!” is what Santa Clause says). In this case, we still need a vector with three elements that we can loop over but we do not care what these three elements are.

for(temp in 1:3){
  print("Ho!")
}

## [1] "Ho!"
## [1] "Ho!"
## [1] "Ho!"

Note that we are not using variable temp inside of the loop and it has no effect on the code inside the curly brackets. Which is why we can use any three values, as it is their total number that matters, not the values themselves.

for(temp in c("A", "B", "C")){
  print("Ho!")
}

## [1] "Ho!"
## [1] "Ho!"
## [1] "Ho!"

for(temp in seq(100, 300, length.out=3)){
  print("Ho!")
}

## [1] "Ho!"
## [1] "Ho!"
## [1] "Ho!"

In short, number of elements in the vector determines how many times the code inside the loop will be repeat. Vector elements are stored in the loop variable on each iteration and you can either use it (a_number example above) or ignore them (“Ho! Ho! Ho!”) example.

Do exercises 3 and 4.

One for a typical scenarios for the loop variable is when it is used as an index to access an element of a vector or a list. You can build a vector of indexes via start:stop sequence tool we used for slicing. You can compute a length of an object via length() function. For a data.frame or a tibble, you can figure out number of rows and columns via, respectively, nrow() and ncol() functions.

vector_of_some_numbers <- c(1, 5, 200)
for(index in 1:length(vector_of_some_numbers)){
  print(vector_of_some_numbers[index])
}

## [1] 1
## [1] 5
## [1] 200

Do exercise 5.

You can also nest loops

for(letter in c("A", "B", "C")){
  for(number in 1:2){
    cat(letter, number, "\n")
  }
}

## A 1 
## A 2 
## B 1 
## B 2 
## C 1 
## C 2

Do exercise 6.

One scenario when loops are particularly advantages is when a current value depends on a previous one (or many previous values). In this case, a new value depends on the previous one and, in turn, determines the following ones and they cannot be computed before it is known. This means that vectorized approach that applies the computation to all values in parallel will not work and we need to perform the computation in a sequantial manner.

In the next exercise, use for loop to create a random walk. It should start at zero (so, initial value for your variable should be zero). For each next step, draw a random value from a normal distribution using previous value as a mean (decide on the standard deviation yourself). The function you are looking for is rnorm()). Generate a ten-step random walk, that might look like this (you can my plot replicate exactly, if you set.seed() to 1977):

Do exercise 7.

9.4 Conditional statement

As in all other programming languages, you can control the flow of execution using if-else statements. The general usage is as follows.

if (some_condition) {
  # code runs if some_condition is TRUE
} else {
  # code runs if some_condition is FALSE
}

The rules are the same as for logical indexing, so we can define our condition using mathematical comparisons.

x <- -3
if (x > 0) {
  cat("X is definitely greater than  zero")
} else {
  cat("X is not greater than zero.")
}

## X is not greater than zero.

However, be aware that the if uses a single logical value. If you have more than one value, it stops with an error⁶³.

x <- c(10,  -3)
if (x < 0) {
  cat("Bingo!")
} else {
  cat("I don't like this x.")
}

## Error in if (x < 0) {: the condition has length > 1

As in logical indexing, you can combine several conditions using logical operators such as and && or or ||.

For example, we can check whether x is smaller than zero but larger than -10.

x <- -3
if ((x < 0) && (x > -10)) {
  cat("Bingo!")
} else {
  cat("I don't like this x.")
}

## Bingo!

However, be aware that R has both &/| and &&/|| versions of and and or logical operators (single versus double symbols). Both perform logical operations on vectors but double-symbol works only for a single value. E.g., single-symbol returns TRUE/FALSE for each value of x

x <- c(10, -3, -11)
(x < 0) & (x > -10)

## [1] FALSE  TRUE FALSE

&& and || will generate an error if vectors are longer⁶⁴.

# This works
x <- -3
(x < 0) && (x > -10)

## [1] TRUE

# This generates and error
x <- c(10, -3, -11)
(x < 0) && (x > -10)

## Error in (x < 0) && (x > -10): 'length = 3' in coercion to 'logical(1)'

Let us combine for loop with if-else operator. Generate a vector of ten normally distributed values (again, rnorm() is the function). Loop over them in a for loop and print() or cat() "Positive" if a value is larger than zero and "Not positive" if not. The results should look like this⁶⁵

##  [1]  0.03003683  1.22390943  1.71573769 -0.89994016  0.55507190  0.42319195
##  [7]  0.82993426 -1.28614375  1.21511589 -0.05815403

## [1] "Positive"
## [1] "Positive"
## [1] "Positive"
## [1] "Not positive"
## [1] "Positive"
## [1] "Positive"
## [1] "Positive"
## [1] "Not positive"
## [1] "Positive"
## [1] "Not positive"

Do exercise 8.

9.5 ifelse()

As I wrote above, strength of R lies in built-in vectorization that typically renders loops unnecessary. Although we used the for loop to apply comparison to one element at a time, R has function ifelse() that performs the same task on vectors. It format is ifelse(logical-vector, values-if-true, values-if-false) where the logical vector of TRUE and FALSE values can be produced via a comparison. For example, we can classify letters into "Vowel" and "Consonant" using %in% to check whether each letter is in the vowels list (note the %, this distinguished %in% for matching from in of the for loop).

vowels <- c("a", "e", "i", "o", "u")
letters <- c("a", "b", "c", "d", "e", "f")
ifelse(letters %in% vowels, "Vowel", "Consonant")

## [1] "Vowel"     "Consonant" "Consonant" "Consonant" "Vowel"     "Consonant"

Note that vectors values-if-true and values-if-false must not be of equal length. For example, one can use ifelse() to replace only vowels in the original vector but keep consonant letter as is by using letters vector for values-if-false

ifelse(letters %in% vowels, "Vowel", letters)

## [1] "Vowel" "b"     "c"     "d"     "Vowel" "f"

However vectors values-if-true and values-if-false ideally should either match the length of the logical comparison vector or have a single value. Otherwise, they will be recycled which might lead to confusing results. E.g., here, can you figure out when "Consonant1" and "Consonant2" are used?

ifelse(letters %in% vowels, "Vowel", c("Consonant1", "Consonant2"))

## [1] "Vowel"      "Consonant2" "Consonant1" "Consonant2" "Vowel"     
## [6] "Consonant2"

Use ifelse() to replicate exercise 8 without a loop.

Do exercise 9.

9.6 case_when

Function ifelse() is useful when you have just two outcomes. You can use nested ifelse calls, just like you can use nested if-else statements but it becomes awkward and hard to read very fast. Instead, you can use function case_when() from dplyr library. Here, you can use as many conditions as you want. The general format is

case_when(Condition1 ~ Value1,
          Condition2 ~ Value2,
          ...
          ConditionN ~ ValueN)

For example, the internet tells me that letter y is complicated, as it can be either a consonant or a vowel depending on a word. Thus, we can label it as “Complicated” via case_when() (note the vectorized nature of it, as it applies this comparison to one element of the vector at a time)

vowels <- c("a", "e", "i", "o", "u")
letters <- c("a", "b", "c", "d", "e", "f", "y")
case_when(letters == "y" ~ "Complicated",
          letters %in% vowels ~ "Vowel",
          TRUE ~ "Consonant")

## [1] "Vowel"       "Consonant"   "Consonant"   "Consonant"   "Vowel"      
## [6] "Consonant"   "Complicated"

Note the last condition that is always TRUE, it looks odd but this servers as a “default” branch: if you got that far, all other conditions must be FALSE, so this must be TRUE (you have no other options left).

Do exercise 10.

9.7 Breaking out of the loop

The code inside the for loop is typically repeated for every value of the vector that you have supplied. However, there is a way to break out of it using a break statement. This stops execution of the code inside of the loop immediately and continues with the code immediately after the loop.

for(y in c(1, 2, 3)){
  print("This will be exected")
  break
  print("But this won't be")
}

## [1] "This will be exected"

print("Now to the code AFTER the loop")

## [1] "Now to the code AFTER the loop"

Note that in this case, the code was executed (incompletely!) only once. Typically, break is used in combination with the if-else statement to break out of the loop, if a certain condition is met. Let us practice. Again, generate ten normally distributed numbers, loop over them and print each one. However, break after the fifth value. For this, you need to loop over indexes of values (that go from 1 to the length() of your vector). Thus, you loop variable will contain an index of each element of x (because of that I called it ix) and you need to use it to get a value at this position within the vector. If that index is equal to 5, break out of the loop. For the ten values I’ve generated above, the output will be the following (“Done for today” is printed after the loop, also I could have first printed a value and then exited the loop, that is a analysis logic question, not a technical one).

set.seed(164)
x <- rnorm(10)
for(ix in 1:length(x)){
  if (ix == 5)  break
  print(x[ix])
}

## [1] 0.03003683
## [1] 1.223909
## [1] 1.715738
## [1] -0.8999402

print("Done for today")

## [1] "Done for today"

Do exercise 11.

9.8 Using for loop to load and join multiple data files

Your analysis starts with loading the data. Quite often, data for individual participants is stored in different files but with identical structure, so you need code that figures out which files you need to load, loads them one at a time and then binds them to one final table. Using a for loop is not the most elegant way to implement this but it does the job and gives you another example of how loops can be useful. I will walk you through details and then you We will implement the code that loads and merges individual files for persistence study. Download the persistence.zip and unzip into Persistence subfolder (we do not want to create a mess in your main folder!).

First, you need to have a character vector with relevant file names. Package you are looking for is fs (for File System). It has everything you need to work with the file system, including working with file names (dropping or adding path, extension, etc.), creating/moving/deleting files, checking whether file exists, and what not. One function that I use the most is dir_ls() that list files in a specified folder. The two parameters you need are path to your folder (you can and should use a relative path) and, optionally, glob filter string. The latter is a globbing wildcard pattern, where * stands for “any sequence of characters” and ? stand for “one arbitrary character. For a csv file, this pattern would be "*.csv". Test this single function call using appropriate path and glob parameters and make sure you get all the files in Persistence folder.

Next, you need to create a full table variable (I, typically, call it results or reports) and initialize it to an empty data.frame() (or an empty tibble). You loop over file names, read one file at a time (don’t forget to specify column types), and then use bind_rows() to combine the full table and the new table you loaded. Note that bind_rows() returns a new table, so you need to assign it back to the original full table variable. Once you are done, your table should have 5232 rows and twelve columns.

Participant	Session	Trial	OnsetDelay	Bias	Shape1	Shape2	Response1	Response2	RT1	RT2
AKM1995M	2019-06-12-14-07-17	0	0.5746952	left	stripes-8	stripes-4	right	left	5.0554813	1.0238089
AKM1995M	2019-06-12-14-07-17	1	0.5741707	left	stripes-4	heavy poles sphere	left	right	2.9692460	0.8239294
AKM1995M	2019-06-12-14-07-17	2	0.5082200	left	stripes-2	stripes-2	right	left	3.1623310	0.6718403
AKM1995M	2019-06-12-14-07-17	3	0.6065058	right	stripes-8	stripes-2	right	right	1.0211627	0.5919555
AKM1995M	2019-06-12-14-07-17	4	0.5359504	left	stripes-2	heavy poles sphere	right	right	0.9426957	0.6157635
AKM1995M	2019-06-12-14-07-17	5	0.6435367	right	stripes-4	stripes-4	right	right	1.1646056	0.6398231

Do exercise 12.

9.9 Apply

As noted above, for loops do the job by might not be the most elegant way of doing things. In R, you can apply a function to each row or column of a matrix. In addition, there are more case-specific versions of it, such lapply.

The function is called apply because you apply it to values of a vector. In a sense, you have been applying functions the whole time by calling them. For example, we might compute a sinus of a sequence of numbers as

sin(seq(0, pi, length.out = 5))

## [1] 0.000000e+00 7.071068e-01 1.000000e+00 7.071068e-01 1.224606e-16

Or, we can apply sinus function to a number sequence (note that I pass the name of the function alone sin but do not call it, so no round brackets!)

sapply(seq(0, pi, length.out = 5), sin)

## [1] 0.000000e+00 7.071068e-01 1.000000e+00 7.071068e-01 1.224606e-16

You might ask, what is then the point to use apply? Not much for simple vector cases like this, but it is very useful when you have two dimensional data, as you can apply a function along horizontal (rows) or vertical (columns) margin. For example, imagine you need to compute an average (or median, or any other quantile) of each row or column in a matrix (something you might do fairly often for posterior samples in Bayesian statistics).

Let us create a simple 3 by 4 matrix of normally distributed random numbers.

a_matrix <- matrix(rnorm(12), nrow = 3)
a_matrix

##               [,1]       [,2]       [,3]       [,4]
## [1,] -1.306076e-05 -0.9815949  0.3981706 -0.1809577
## [2,]  1.880631e-02 -0.4210554 -1.2544423 -0.2319481
## [3,]  2.429102e+00 -1.1964317  0.2265607  0.5361628

We would expect median value of any row or column to be 0 but because we have so few data points, they will be close but not exactly zero. Computing median for each row (we should get three numbers)

apply(a_matrix, MARGIN = 1, FUN = median)

## [1] -0.0904854 -0.3265017  0.3813618

Similarly for a column (here, it should be four numbers)

apply(a_matrix, MARGIN = 2, FUN = median)

## [1]  0.01880631 -0.98159487  0.22656072 -0.18095773

I will not go into any further details on these functions, concentrating on similar functionality by purrr package. However, if you find yourself working with matrices or needing to apply a function to rows of a data frame, apply might be a simpler solution. Keep this option in mind, if you feel that either looping or purrring looks inadequate.

9.10 Purrr

Package purrr is part of the tidyverse. It provides functional programming approach similar to apply but it easier to use (IMHO) and it has a more explicit and consistent way to describe and combine the output. Language-wise, you do not apply a function, you use it to map inputs on outputs⁶⁶

The basic map() function always returns a list but you can explicitly state that you expect function to return a number (map_dbl()) and all outputs will be combined into a numeric vector. And, unlike apply, map_dbl() will generate an error if outputs cannot be converted to numeric.

The basic call is similar to apply but is easier to use as you can explicitly address current value via . variable (as in other parts of Tidyverse) and you can write a “normal” function call, prefixing it with a ~. Here is the example of computing the sinus again. First, same a apply via function name only

map_dbl(seq(0, pi, length.out = 5), sin)

## [1] 0.000000e+00 7.071068e-01 1.000000e+00 7.071068e-01 1.224606e-16

Now, we a magic tilde ~. Note an explicit call to sin() function with . as an argument.

map_dbl(seq(0, pi, length.out = 5), ~sin(.))

## [1] 0.000000e+00 7.071068e-01 1.000000e+00 7.071068e-01 1.224606e-16

Again, using map_dbl() in this case looks as a complete overkill. So let us do something more relevant. Let us implement loading and merging of persistence study files. You already know how to get a vector with names of relevant files. Now you can use map() function on this vector to generate a list of tables and then combine them into a single table by piping it to list_rbind() which binds tables by rows (there is a twin list_cbind() function that binds table by columns). When using ~ call notation, remember . would then correspond to a single value from the vector of file names (so, a single filename). Again, you should get a single table with twelve columns and 5232 rows. You need a single pipeline for the entire operation.

Do exercise 13.

You have just mapped inputs on outputs using read_csv() but functional programming is particularly useful, if you program your own functions. Let us program a function that takes a filename, loads the file and returns total number of trials/rows (if you forgot how to compute number of rows in a table, see above). Once you have a function, use it with map_dbl and a vector of persistence filenames. You should get a vector of ten values. Now we can easily see that there was something wrong with one of the files and we must pay attention to the amount of data that we have.

##  [1] 528 528 480 528 528 528 528 528 528 528

Do exercise 14.

8 Tidyng your data: joins and pivots

10 Missing data