10 Missing data

Grab an exercise notebook before we start!

Sometimes data is missing. It can be missing explicitly, marked with NA, which stands for Not Available. Or it can be missing implicitly, when there is no entry at all for a particular condition. In the latter case, the strategy is to make the missing values explicit first (discussed below).

Once you have explicit missing values, represented by NA in R, you must decide how to deal with them: you can use this information directly, as missing data can be diagnostic in itself; you can impute values, using either sophisticated statistical methods or a simple average/default value strategy; or you can exclude them from the analysis. Every option has pros and cons, so think carefully and do not use an option whose effects you do not fully understand, as it will compromise the rest of your analysis.

10.1 Making missing data explicit (completing data)

To make implicit missing data explicit, tidyr provides a function complete() that you already met. It figures out all combinations of values for the columns that you specify, finds the missing combinations, and adds them as new rows, using NA (or some other specified value) for the other columns. Imagine a toy incomplete table (no data for Participant 2 and Face M-2).

Table 10.1: Table with no data for Face M-2 for Participant 2.
Participant Face Symmetry Attractiveness Trustworthiness
1 M-1 6 4 3
1 M-2 4 7 6
2 M-1 5 2 1

We can complete that table by specifying columns that define all required combinations.

complete_df <- complete(incomplete_df, Participant, Face)

Table 10.2: Completed table with explicit NAs
Participant Face Symmetry Attractiveness Trustworthiness
1 M-1 6 4 3
1 M-2 4 7 6
2 M-1 5 2 1
2 M-2 NA NA NA
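
The text does not show how incomplete_df is created, so here is a minimal sketch that builds the toy table by hand and completes it. Using tribble() from the tibble package is an assumption made for illustration; any way of creating a table with the same three rows would work.

```r
library(tibble)
library(tidyr)

# Toy table from Table 10.1, typed in by hand (assumed construction)
incomplete_df <- tribble(
  ~Participant, ~Face, ~Symmetry, ~Attractiveness, ~Trustworthiness,
  1, "M-1", 6, 4, 3,
  1, "M-2", 4, 7, 6,
  2, "M-1", 5, 2, 1
)

# Add a row for every missing Participant x Face combination
complete_df <- complete(incomplete_df, Participant, Face)

# Number of rows after completion and number of missing Symmetry values
nrow(complete_df)
sum(is.na(complete_df$Symmetry))
```

The completed table has one row more than the original, and that row holds NA in all three rating columns.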

For non-factor variables (Participant is numeric and Face is character/string), complete() finds all unique values in each column and builds all combinations of these values. However, if a variable is a factor, complete() uses its levels, even if not all levels are present in the data. E.g., we can use Face as a factor with three levels: “M-1”, “M-2”, and “F-1”. In this case, information is missing for both participants (neither has responses for face “F-1”) and should be filled with NAs. This approach is useful if you know all combinations that should be present in the data and need to ensure completeness.

extended_df <-
  incomplete_df |>
  # converting Face to factor with THREE levels (only TWO are present in the data)
  mutate(Face = factor(Face, levels = c("M-1", "M-2", "F-1"))) |>
  # completing the table
  complete(Participant, Face)

Table 10.3: Completed missing data including F-1 face.
Participant Face Symmetry Attractiveness Trustworthiness
1 M-1 6 4 3
1 M-2 4 7 6
1 F-1 NA NA NA
2 M-1 5 2 1
2 M-2 NA NA NA
2 F-1 NA NA NA

Do exercise 1.

You can also supply default values via the fill parameter, which takes a named list, e.g., list(column_name = default_value). However, I’d like to remind you again that you should only impute values that “make sense” given the rest of your analysis. The zeros here are for illustration only. In a real-life scenario, they would ruin your inferences by artificially lowering the symmetry and attractiveness of the second face or, if you are lucky, they would break and stop an analysis that expects only values within the 1-7 range (rmANOVA won’t be bothered, so that would be the first scenario).

filled_df <- 
  incomplete_df |>
  complete(Participant, Face, fill = list(Attractiveness = 0, Symmetry = 0))

Table 10.4: Completed missing data with non-NA values.
Participant Face Symmetry Attractiveness Trustworthiness
1 M-1 6 4 3
1 M-2 4 7 6
2 M-1 5 2 1
2 M-2 0 0 NA

Do exercise 2.

complete() is an easy-to-use convenience function that you can easily replicate yourself. To do this, create a new table that lists all combinations of the variables that you are interested in (you can use either expand.grid() or expand_grid() for this) and then left join the original table to it (why a left join? Could you use another join for the same purpose?). The result is the same as with complete() itself.
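
To make the idea concrete before you try it on the exercise data, here is a minimal sketch on a different, made-up table. The table scores and its columns Student, Task, and Score are hypothetical and exist only for this illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical table: student B has no entry for task T2
scores <- tribble(
  ~Student, ~Task, ~Score,
  "A", "T1", 10,
  "A", "T2", 12,
  "B", "T1", 9
)

# Step 1: table with all combinations of the two key columns
all_combinations <- expand_grid(
  Student = unique(scores$Student),
  Task = unique(scores$Task)
)

# Step 2: join the original data to it, combinations without data get NA
manual_complete <- left_join(all_combinations, scores, by = c("Student", "Task"))

# For comparison, the same table via complete()
via_complete <- complete(scores, Student, Task)
```

Both manual_complete and via_complete have four rows, with NA as the Score for student B on task T2.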

Do exercise 3.

10.2 Dropping / omitting NAs

There are two approaches to excluding missing values. You can exclude all incomplete rows, i.e., rows that have a missing value in any variable, via na.omit() (base R function) or drop_na() (tidyr package function). Or you can exclude rows only if they have NA in specific columns by specifying the names of these columns.

Consider the table below.

Table 10.5: Table with missing values.
Participant Face Symmetry Attractiveness Trustworthiness
1 M-1 6 NA 3
1 M-2 NA 7 NA
2 M-1 5 2 1
2 M-2 3 7 2

First, we can keep only complete cases via na.omit()

na.omit(widish_df_with_NA)

Table 10.6: Complete cases via na.omit()
Participant Face Symmetry Attractiveness Trustworthiness
2 M-1 5 2 1
2 M-2 3 7 2

or via drop_na()

widish_df_with_NA |>
  drop_na()

Table 10.7: Complete cases via drop_na()
Participant Face Symmetry Attractiveness Trustworthiness
2 M-1 5 2 1
2 M-2 3 7 2

Second, we drop rows only if Attractiveness data is missing.

widish_df_with_NA |>
  drop_na(Attractiveness)

Table 10.8: Complete Attractiveness via drop_na()
Participant Face Symmetry Attractiveness Trustworthiness
1 M-2 NA 7 NA
2 M-1 5 2 1
2 M-2 3 7 2
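
If you want to reproduce the three examples above, here is a minimal sketch that types in the table from Table 10.5 by hand and counts the surviving rows for each approach. Building the table with tribble() is an assumption, as the text does not show how widish_df_with_NA was created. The last line, which names two columns at once, is an extra illustration that is not part of the examples above.

```r
library(tibble)
library(tidyr)

# Table 10.5 typed in by hand (assumed construction)
widish_df_with_NA <- tribble(
  ~Participant, ~Face, ~Symmetry, ~Attractiveness, ~Trustworthiness,
  1, "M-1", 6, NA, 3,
  1, "M-2", NA, 7, NA,
  2, "M-1", 5, 2, 1,
  2, "M-2", 3, 7, 2
)

# Complete cases only: both functions keep the same rows
n_na_omit <- nrow(na.omit(widish_df_with_NA))
n_drop_all <- nrow(drop_na(widish_df_with_NA))

# Rows are dropped only if Attractiveness is missing
n_drop_attr <- nrow(drop_na(widish_df_with_NA, Attractiveness))

# Rows are dropped if either of the two named columns is missing
n_drop_two <- nrow(drop_na(widish_df_with_NA, Attractiveness, Trustworthiness))
```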

Practice time. Create your own table with missing values and exclude missing values using na.omit() and drop_na().

Do exercise 4.

drop_na() is a very convenient function, but you can replicate its functionality using is.na() in combination with either the dplyr filter() function or logical indexing. Implement code that excludes rows if they contain NA in a specific column using these two approaches.
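
As a reminder of how is.na() behaves, here is a minimal sketch on a plain vector rather than a table (the vector ratings is made up for this illustration). The same logical vector is what you would pass to filter() or use for indexing rows.

```r
# Made-up vector with two missing values
ratings <- c(4, NA, 7, NA, 2)

# TRUE for every missing element
missing_mask <- is.na(ratings)

# Logical indexing: keep only elements that are NOT missing
observed <- ratings[!missing_mask]
```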

Do exercises 5 and 6.

Recall that you can write your own functions in R and use them to create convenience wrappers like drop_na(). Turn your logical-indexing code into a function that takes a table (data.frame) as the first argument and the name of a single column as the second, filters out rows with NA in that column, and returns the resulting table.

Do exercise 7.

As noted above, you can also impute values. The simplest strategy is to use either a fixed or an average (mean, median, etc.) value. The tidyr function that performs a simple substitution is replace_na(). As its second parameter, it takes a named list of values, list(column_name = value_for_NA). For our toy table, we can replace missing Attractiveness and Symmetry values with some default value, e.g., 0 and -1 (this is very arbitrary and serves only to demonstrate how it works; do not do things like this in a real analysis unless you know what you are doing!)

widish_df_with_NA |>
  replace_na(list(Attractiveness = 0, Symmetry = -1))

Table 10.9: Missing values filled with 0 and -1
Participant Face Symmetry Attractiveness Trustworthiness
1 M-1 6 0 3
1 M-2 -1 7 NA
2 M-1 5 2 1
2 M-2 3 7 2

Do exercise 8.

Unfortunately, replace_na() works only with constant values and does not handle grouped tables very well. So, to replace an NA with a mean value of grouped data, we need to combine some of our old knowledge with the ifelse(condition, value_if_true, value_if_false) function you learned about before. Recall that this function is a vectorized cousin of if-else that takes 1) a vector of logical values (condition), 2) a vector of values that are returned if the condition is true, 3) a vector of values that are returned if the condition is false. Note that the usual rules of vector length-matching apply, so if the three vectors have different lengths, the second and third will be automatically (and silently) adjusted to match the length of the condition vector. As with all computations, you can use the original values themselves. Here is how to replace only negative values but keep the positive ones:

v <- c(-1, 3, 5, -2, 5)
ifelse(v < 0, 0, v)
## [1] 0 3 5 0 5

We, essentially, tell the function, “if the condition is false, use the original value”. Now, your turn! Using the same vector and ifelse() function, replace negative values with a mean value of the positive values in the vector.

Do exercise 9.
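
A side note on the length-matching rule mentioned above. This minimal sketch (all vectors are made up) shows that the result always has the length of the condition vector: shorter value vectors are recycled, and extra elements of longer ones are silently ignored.

```r
# Condition with four elements, "true" values with only two: they are recycled
condition <- c(TRUE, FALSE, TRUE, FALSE)
recycled <- ifelse(condition, c(1, 2), 0)

# Condition with two elements, "true" values with three: the third is never used
truncated <- ifelse(c(TRUE, FALSE), c(10, 20, 30), 0)
```

Here recycled is 1 0 1 0 and truncated is 10 0, and neither call produces a warning, which is why it pays to check vector lengths yourself.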

Now that you know how to use ifelse(), replacing NA with a mean will be (relatively) easy. Use adaptation_with_na table and replace missing information using participant-specific values.

Table 10.10: adaptation_with_na.csv with missing values
Participant Prime Probe Nsame Ntotal
ma2 Sphere Sphere NA 119
ma2 Sphere Quadro 23 NA
ma2 Sphere Dual NA 120
ma2 Sphere Single 31 115
ma2 Quadro Sphere 25 120
ma2 Quadro Quadro 26 120

We have missing data in different columns, so we have to use a different approach for each case. Here is one way to approach this problem. We cannot know the number of trials for a specific Prime × Probe combination, but we can replace missing values for Ntotal with a participant-specific median value (a “typical” and integer number of trials, but do not forget about the na.rm option, see the manual for details). Nsame is trickier. For this, compute the proportion of same responses for each condition, Psame = Nsame / Ntotal. This will produce missing values whenever Nsame is missing. Now, replace missing Psame values (is.na()) with the mean Psame per participant (again, watch out for na.rm!) using ifelse() (you can use it inside mutate()). Finally, compute the missing values for Nsame from Psame and Ntotal (do not forget to round them, so that you end up with an integer number of trials). This entire computation should be implemented as a single pipeline. You will end up with the following table.

Table 10.11: adaptation_with_na.csv with imputed values
Participant Prime Probe Nsame Ntotal Psame
ma2 Sphere Sphere 36 119 0.2983741
ma2 Sphere Quadro 23 120 0.1916667
ma2 Sphere Dual 36 120 0.2983741
ma2 Sphere Single 31 115 0.2695652
ma2 Quadro Sphere 25 120 0.2083333
ma2 Quadro Quadro 26 120 0.2166667
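
The core building block of this pipeline, replacing NA with a group-specific mean via ifelse() inside mutate(), can be sketched on a made-up table. The table toy_df and its columns Group and Value are hypothetical and are not part of the adaptation data.

```r
library(dplyr)
library(tibble)

# Made-up table with one missing value in each group
toy_df <- tribble(
  ~Group, ~Value,
  "a", 1,
  "a", NA,
  "a", 3,
  "b", 10,
  "b", NA
)

imputed_df <-
  toy_df |>
  group_by(Group) |>
  # within each group: if Value is missing, use the group mean, otherwise keep it
  mutate(Value = ifelse(is.na(Value), mean(Value, na.rm = TRUE), Value)) |>
  ungroup()
```

The missing value in group "a" becomes 2 (the mean of 1 and 3) and the one in group "b" becomes 10. Without na.rm = TRUE, the mean itself would be NA and nothing would be imputed.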

Do exercise 10.