10 Missing data
Grab an exercise notebook before we start!
Sometimes data is missing. It can be missing explicitly, marked with NA, which stands for Not Available / Missing data. Or it can be missing implicitly, when there is no entry for a particular condition. In the latter case, the strategy is to make missing values explicit first (discussed below).
Once you have missing values, represented by NA in R, you must decide how to deal with them: you can use this information directly, as missing data can be diagnostic in itself; you can impute values using either sophisticated statistical methods or a simple average/default value strategy; or you can exclude them from the analysis. Every option has pros and cons, so think carefully and do not use an option whose effects you do not fully understand, as it will compromise the rest of your analysis.
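As a quick reminder of how NA behaves in computations, here is a minimal base R sketch. The vector ratings is a made-up example, not part of the chapter's data.

```r
# A hypothetical vector of ratings with one explicit missing value
ratings <- c(6, NA, 4)

# is.na() flags the missing entries
is_missing <- is.na(ratings)

# Most summary functions propagate NA unless told to drop it
mean_with_na <- mean(ratings)                  # NA
mean_without_na <- mean(ratings, na.rm = TRUE) # mean of 6 and 4
```

The na.rm argument will matter again later, when we compute averages over columns that contain missing values.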
10.1 Making missing data explicit (completing data)
To make implicit missing data explicit, tidyr provides the function complete(), which you have already met. It finds all combinations of values for the columns you specify, identifies the missing combinations, and adds them as new rows, using NA (or some other specified value) for the remaining columns. Imagine a toy incomplete table (no data for Participant 2 and Face M-2).
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
1 | M-1 | 6 | 4 | 3 |
1 | M-2 | 4 | 7 | 6 |
2 | M-1 | 5 | 2 | 1 |
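If you want to follow along, the toy table can be recreated by hand. This is only a sketch: the name incomplete_df matches the code used in this section, but building it with tribble() is my assumption about one convenient way to do it.

```r
library(tibble)

# Recreate the toy table row by row
incomplete_df <- tribble(
  ~Participant, ~Face, ~Symmetry, ~Attractiveness, ~Trustworthiness,
  1,            "M-1", 6,         4,               3,
  1,            "M-2", 4,         7,               6,
  2,            "M-1", 5,         2,               1
)

# Number of rows we have versus number of Participant x Face combinations
n_present <- nrow(incomplete_df)
n_expected <- length(unique(incomplete_df$Participant)) *
  length(unique(incomplete_df$Face))
```

Comparing n_present with n_expected is a cheap way to check whether a table has implicit missing data.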
We can complete that table by specifying columns that define all required combinations.
complete_df <- complete(incomplete_df, Participant, Face)
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
1 | M-1 | 6 | 4 | 3 |
1 | M-2 | 4 | 7 | 6 |
2 | M-1 | 5 | 2 | 1 |
2 | M-2 | NA | NA | NA |
For non-factor variables (Participant is numeric and Face is character/string), complete() finds all unique values in each column and then builds all combinations of these values. However, if a variable is a factor, complete() uses its levels, even if not all levels are present in the data. E.g., we can use Face as a factor with three levels: “M-1”, “M-2”, and “F-1”. In this case, information is missing for both participants (neither has responses for face “F-1”) and should be filled with NAs. This approach is useful if you know all combinations that should be present in the data and need to ensure completeness.
extended_df <-
  incomplete_df |>
  # converting Face to factor with THREE levels (only TWO are present in the data)
  mutate(Face = factor(Face, levels = c("M-1", "M-2", "F-1"))) |>
  # completing the table
  complete(Participant, Face)
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
1 | M-1 | 6 | 4 | 3 |
1 | M-2 | 4 | 7 | 6 |
1 | F-1 | NA | NA | NA |
2 | M-1 | 5 | 2 | 1 |
2 | M-2 | NA | NA | NA |
2 | F-1 | NA | NA | NA |
Do exercise 1.
You can also supply default values via the fill parameter, which takes a named list, e.g., list(column_name = default_value). However, I’d like to remind you again that you should only impute values that “make sense” given the rest of your analysis. The zeros below are for illustration only. In a real-life scenario, they would ruin your inferences, either by artificially lowering the symmetry and attractiveness of the second face or (if you are lucky) by breaking an analysis that expects only values within the 1-7 range. A repeated-measures ANOVA would not be bothered by them, so that would be the first scenario.
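The call that produces such a table could look like the sketch below. The toy table is rebuilt here so the snippet runs on its own, and the choice to fill only Symmetry and Attractiveness with 0 is inferred from the output table, so treat it as an assumption.

```r
library(tibble)
library(tidyr)

# Toy table from this section, rebuilt so the example is self-contained
incomplete_df <- tribble(
  ~Participant, ~Face, ~Symmetry, ~Attractiveness, ~Trustworthiness,
  1,            "M-1", 6,         4,               3,
  1,            "M-2", 4,         7,               6,
  2,            "M-1", 5,         2,               1
)

# Columns named in `fill` get the default value in newly added rows,
# all other columns (here Trustworthiness) still get NA
filled_df <- complete(
  incomplete_df,
  Participant, Face,
  fill = list(Symmetry = 0, Attractiveness = 0)
)
```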
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
1 | M-1 | 6 | 4 | 3 |
1 | M-2 | 4 | 7 | 6 |
2 | M-1 | 5 | 2 | 1 |
2 | M-2 | 0 | 0 | NA |
Do exercise 2.
complete() is an easy-to-use convenience function that you can easily replicate yourself. To do this, create a new table that lists all combinations of the variables you are interested in (you can use either expand.grid() or expand_grid() for this) and then left join the original table to it (why a left join? Could you use another join for the same purpose?). The result is the same as with complete() itself.
Do exercise 3.
10.2 Dropping / omitting NAs
There are two approaches for excluding missing values. You can exclude all incomplete rows, i.e., rows that have missing values in any variable, via na.omit() (base R function) or drop_na() (tidyr package function). Or you can exclude rows only if they have NA in specific columns by specifying their names.
Take the table below:
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
1 | M-1 | 6 | NA | 3 |
1 | M-2 | NA | 7 | NA |
2 | M-1 | 5 | 2 | 1 |
2 | M-2 | 3 | 7 | 2 |
First, we can ensure only complete cases via na.omit()
na.omit(widish_df_with_NA)
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
2 | M-1 | 5 | 2 | 1 |
2 | M-2 | 3 | 7 | 2 |
or via drop_na()
widish_df_with_NA |>
  drop_na()
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
2 | M-1 | 5 | 2 | 1 |
2 | M-2 | 3 | 7 | 2 |
Second, we drop rows only if Attractiveness data is missing.
widish_df_with_NA |>
  drop_na(Attractiveness)
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
1 | M-2 | NA | 7 | NA |
2 | M-1 | 5 | 2 | 1 |
2 | M-2 | 3 | 7 | 2 |
Practice time. Create your own table with missing values and exclude missing values using na.omit() and drop_na().
Do exercise 4.
drop_na() is a very convenient function, but you can replicate its functionality using is.na() in combination with either the dplyr function filter() or logical indexing. Implement code that excludes rows if they contain NA in a specific column using these two approaches.
Do exercises 5 and 6.
Recall that you can write your own functions in R and use them to create convenience wrappers like drop_na(). Implement the code that uses logical indexing as a function that takes a table (data.frame) as its first argument and the name of a single column as its second, filters out rows with NA in that column, and returns the table.
Do exercise 7.
As noted above, you can also impute values. The simplest strategy is to use either a fixed or an average (mean, median, etc.) value. The tidyr function that performs a simple substitution is replace_na(). As its second parameter, it takes a named list of values, list(column_name = value_for_NA). For our toy table, we can replace missing Attractiveness and Symmetry values with some default value, e.g., 0 and -1 (this is very arbitrary and only demonstrates how the function works; do not do things like this in a real analysis unless you know what you are doing!).
widish_df_with_NA |>
  replace_na(list(Attractiveness = 0, Symmetry = -1))
Participant | Face | Symmetry | Attractiveness | Trustworthiness |
---|---|---|---|---|
1 | M-1 | 6 | 0 | 3 |
1 | M-2 | -1 | 7 | NA |
2 | M-1 | 5 | 2 | 1 |
2 | M-2 | 3 | 7 | 2 |
Do exercise 8.
Unfortunately, replace_na() works only with constant values and does not handle grouped tables very well. So, to replace an NA with a mean value of grouped data, we need to combine some of our old knowledge with the ifelse(condition, value_if_true, value_if_false) function you learned about before. Recall that this function is a vectorized cousin of if-else that takes 1) a vector of logical values (condition), 2) a vector of values that are returned if condition is true, and 3) a vector of values that are returned if condition is false. Note that the usual rules of vector length-matching apply, so if the three vectors have different lengths, they will be automatically (and silently) adjusted to match the length of the condition vector. As with all computations, you can use the original values themselves. Here is how to replace only negative values but keep the positive ones:
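The input vector is not shown here, so the sketch below assumes a hypothetical vector v whose negative entries sit at the first and fourth positions. Note that the single 0 is recycled to the length of the condition.

```r
# Hypothetical input vector (an assumption, the original values are not shown)
v <- c(-1, 3, 5, -2, 5)

# Where v is negative use 0, otherwise keep the original value
v_no_negatives <- ifelse(v < 0, 0, v)
v_no_negatives
```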
## [1] 0 3 5 0 5
We, essentially, tell the function, “if the condition is false, use the original value”. Now, your turn! Using the same vector and ifelse() function, replace negative values with a mean value of the positive values in the vector.
Do exercise 9.
Now that you know how to use ifelse(), replacing NA with a mean will be (relatively) easy. Use the adaptation_with_na table and replace missing information using participant-specific values.
Participant | Prime | Probe | Nsame | Ntotal |
---|---|---|---|---|
ma2 | Sphere | Sphere | NA | 119 |
ma2 | Sphere | Quadro | 23 | NA |
ma2 | Sphere | Dual | NA | 120 |
ma2 | Sphere | Single | 31 | 115 |
ma2 | Quadro | Sphere | 25 | 120 |
ma2 | Quadro | Quadro | 26 | 120 |
We have missing data in different columns, so we have to use a different strategy for each case. Here is one way to approach this problem. We cannot know the number of trials for a specific Prime × Probe combination, but we can replace missing values of Ntotal with a participant-specific median value (a “typical” and integer number of trials, but do not forget about the na.rm option, see the manual for details). Nsame is trickier. For this, compute the proportion of same responses for each condition, Psame = Nsame / Ntotal. This will produce missing values whenever Nsame is missing. Now, replace missing Psame values (is.na()) with the mean Psame per participant (again, watch out for na.rm!) using ifelse() (you can use it inside mutate()). Finally, compute the missing values of Nsame from Psame and Ntotal (do not forget to round them, so you end up with an integer number of trials). This entire computation should be implemented as a single pipeline. You will end up with the following table.
Participant | Prime | Probe | Nsame | Ntotal | Psame |
---|---|---|---|---|---|
ma2 | Sphere | Sphere | 36 | 119 | 0.2983741 |
ma2 | Sphere | Quadro | 23 | 120 | 0.1916667 |
ma2 | Sphere | Dual | 36 | 120 | 0.2983741 |
ma2 | Sphere | Single | 31 | 115 | 0.2695652 |
ma2 | Quadro | Sphere | 25 | 120 | 0.2083333 |
ma2 | Quadro | Quadro | 26 | 120 | 0.2166667 |
Do exercise 10.