# 10 Missing data

Grab an exercise notebook before we start!

Sometimes data is missing. It can be missing *explicitly*, with `NA` standing for Not Available / Missing data. Or it can be missing *implicitly*, when there is no entry for a particular condition. In the latter case, the strategy is to make missing values explicit first (discussed below).

Once you have missing values, represented by `NA` in R, you must decide how to deal with them. You can use this information directly, as missing data can be diagnostic in itself. You can impute values using either sophisticated statistical methods or a simple average/default value strategy. Or you can exclude them from the analysis. Every option has pros and cons, so think carefully and do not use an option whose effects you do not fully understand, as it will compromise the rest of your analysis.

## 10.1 Making missing data explicit (completing data)

To make implicit missing data explicit, *tidyr* provides the function `complete()`, which you have already met. It figures out all combinations of values for the columns that you specify, finds missing combinations, and adds them using `NA` (or some other specified value) for the other columns. Imagine a toy incomplete table (no data for Participant `2` and Face `M-2`).

| Participant | Face | Symmetry | Attractiveness | Trustworthiness |
|---|---|---|---|---|
| 1 | M-1 | 6 | 4 | 3 |
| 1 | M-2 | 4 | 7 | 6 |
| 2 | M-1 | 5 | 2 | 1 |
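If you want to follow along, the toy table can be recreated by hand. This is only a sketch: the name `incomplete_df` matches the code used in this section, but building it with `tribble()` from the tibble package is an assumption about how you might do it.

```r
library(tibble)

# Toy table from the text: Participant 2 has no row for Face "M-2"
incomplete_df <- tribble(
  ~Participant, ~Face, ~Symmetry, ~Attractiveness, ~Trustworthiness,
  1, "M-1", 6, 4, 3,
  1, "M-2", 4, 7, 6,
  2, "M-1", 5, 2, 1
)

incomplete_df
```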

We can complete that table by specifying columns that define all required combinations.

`complete_df <- complete(incomplete_df, Participant, Face)`

| Participant | Face | Symmetry | Attractiveness | Trustworthiness |
|---|---|---|---|---|
| 1 | M-1 | 6 | 4 | 3 |
| 1 | M-2 | 4 | 7 | 6 |
| 2 | M-1 | 5 | 2 | 1 |
| 2 | M-2 | NA | NA | NA |

For *non-factor* variables (`Participant` is numeric and `Face` is character/string), `complete()` finds all unique values for each column and then all combinations of these values. However, if a variable is a factor, `complete()` uses its levels, even if not all levels are present in the data. E.g., we can use `Face` as a factor with three levels: “M-1”, “M-2”, and “F-1”. In this case, information is missing for both participants (neither has responses for face “F-1”) and should be filled with `NA`s. This approach is useful if you know all combinations that *should* be present in the data and need to ensure completeness.

```
extended_df <-
  incomplete_df |>
  # converting Face to factor with THREE levels (only TWO are present in the data)
  mutate(Face = factor(Face, levels = c("M-1", "M-2", "F-1"))) |>
  # completing the table
  complete(Participant, Face)
```

| Participant | Face | Symmetry | Attractiveness | Trustworthiness |
|---|---|---|---|---|
| 1 | M-1 | 6 | 4 | 3 |
| 1 | M-2 | 4 | 7 | 6 |
| 1 | F-1 | NA | NA | NA |
| 2 | M-1 | 5 | 2 | 1 |
| 2 | M-2 | NA | NA | NA |
| 2 | F-1 | NA | NA | NA |

Do exercise 1.

You can also supply default values via the `fill` parameter, which takes a named list, e.g., `list(column_name = default_value)`. However, I’d like to remind you again that you should only impute values that “make sense” given the rest of your analysis. The zeros in the table below are for illustration only and, in a real-life scenario, would ruin your inferences either by artificially lowering symmetry and attractiveness of the second face or (if you are lucky) by breaking and stopping an analysis that expects only values within the 1-7 range (rmANOVA won’t be bothered, so that would be the first scenario).
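A call along the following lines would produce the table below. This is a sketch under assumptions: the toy table is rebuilt by hand, and the choice of filling `Symmetry` and `Attractiveness` with zeros is inferred from the table itself.

```r
library(tibble)
library(tidyr)

# Toy incomplete table, recreated by hand
incomplete_df <- tribble(
  ~Participant, ~Face, ~Symmetry, ~Attractiveness, ~Trustworthiness,
  1, "M-1", 6, 4, 3,
  1, "M-2", 4, 7, 6,
  2, "M-1", 5, 2, 1
)

# fill takes a named list: column name = value used instead of NA.
# Trustworthiness is not listed, so it stays NA in the added row.
filled_df <- complete(
  incomplete_df, Participant, Face,
  fill = list(Symmetry = 0, Attractiveness = 0)
)

filled_df
```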

| Participant | Face | Symmetry | Attractiveness | Trustworthiness |
|---|---|---|---|---|
| 1 | M-1 | 6 | 4 | 3 |
| 1 | M-2 | 4 | 7 | 6 |
| 2 | M-1 | 5 | 2 | 1 |
| 2 | M-2 | 0 | 0 | NA |

Do exercise 2.

`complete()` is an easy-to-use convenience function that you can easily replicate yourself. To do this, you need to create a new table that lists all combinations of the variables that you are interested in (you can use either `expand.grid()` or `expand_grid()` for this) and then left join the original table to it (why a left join? Could you use another join for the same purpose?). The result is the same as with `complete()` itself.
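As a hint for the first step only, here is a sketch of what `expand_grid()` returns. The two input vectors are hypothetical values typed in by hand, and the join itself is left for the exercise.

```r
library(tidyr)

# One row per combination of the supplied values (2 x 2 = 4 rows)
all_combinations <- expand_grid(
  Participant = c(1, 2),
  Face = c("M-1", "M-2")
)

all_combinations
```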

Do exercise 3.

## 10.2 Dropping / omitting NAs

There are two approaches for excluding missing values. You can exclude all incomplete rows, i.e., rows that have missing values in *any* variable, via `na.omit()` (base R function) or `drop_na()` (tidyr package function). Or you can exclude rows only if they have `NA` in specific columns by specifying their names.

| Participant | Face | Symmetry | Attractiveness | Trustworthiness |
|---|---|---|---|---|
| 1 | M-1 | 6 | NA | 3 |
| 1 | M-2 | NA | 7 | NA |
| 2 | M-1 | 5 | 2 | 1 |
| 2 | M-2 | 3 | 7 | 2 |
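To try the examples in this section yourself, you can recreate this table by hand. A minimal sketch: the name `widish_df_with_NA` matches the code in this section, while using `tribble()` to build it is an assumption.

```r
library(tibble)

# Toy table with three missing values spread over two rows
widish_df_with_NA <- tribble(
  ~Participant, ~Face, ~Symmetry, ~Attractiveness, ~Trustworthiness,
  1, "M-1", 6, NA, 3,
  1, "M-2", NA, 7, NA,
  2, "M-1", 5, 2, 1,
  2, "M-2", 3, 7, 2
)

# Number of missing values in each column
colSums(is.na(widish_df_with_NA))
```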

First, we can keep only complete cases via `na.omit()`:

`na.omit(widish_df_with_NA)`

| Participant | Face | Symmetry | Attractiveness | Trustworthiness |
|---|---|---|---|---|
| 2 | M-1 | 5 | 2 | 1 |
| 2 | M-2 | 3 | 7 | 2 |

or via drop_na()

```
widish_df_with_NA |>
  drop_na()
```

| Participant | Face | Symmetry | Attractiveness | Trustworthiness |
|---|---|---|---|---|
| 2 | M-1 | 5 | 2 | 1 |
| 2 | M-2 | 3 | 7 | 2 |

Second, we drop rows only if `Attractiveness` data is missing.

```
widish_df_with_NA |>
  drop_na(Attractiveness)
```

| Participant | Face | Symmetry | Attractiveness | Trustworthiness |
|---|---|---|---|---|
| 1 | M-2 | NA | 7 | NA |
| 2 | M-1 | 5 | 2 | 1 |
| 2 | M-2 | 3 | 7 | 2 |

Practice time. Create your own table with missing values and exclude missing values using `na.omit()` and `drop_na()`.

Do exercise 4.

`drop_na()` is a very convenient function, but you can replicate its functionality using `is.na()` in combination with either the dplyr function `filter()` or logical indexing. Implement code that excludes rows if they contain `NA` in a specific column using these two approaches.
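As a reminder of the building block, here is what `is.na()` does on a plain vector. The vector `x` is a hypothetical example; applying the same idea to a table column is the exercise.

```r
# Hypothetical vector with one missing value
x <- c(4, NA, 7)

# TRUE where a value is missing
is.na(x)

# TRUE where a value is present
!is.na(x)

# Logical indexing keeps only the non-missing values
x[!is.na(x)]
```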

Do exercises 5 and 6.

Recall that you can write your own functions in R, which you can use to create convenience wrappers like `drop_na()`. Implement the code that uses logical indexing as a function that takes a table (`data.frame`) as its first argument and the name of a single column as its second, filters out rows with `NA` in that column, and returns the table.

Do exercise 7.

As noted above, you can also impute values. The simplest strategy is to use either a fixed or an average (mean, median, etc.) value. The tidyr function that performs a simple substitution is `replace_na()`^{67} and, as a second parameter, it takes a named list of values, `list(column_name = value_for_NA)`. For our toy table, we can replace missing `Attractiveness` and `Symmetry` values with some default values, e.g., `0` and `-1` (this is very arbitrary, just to demonstrate how it works; do not do things like this in a real analysis unless you know what you are doing!).

```
widish_df_with_NA |>
  replace_na(list(Attractiveness = 0, Symmetry = -1))
```

| Participant | Face | Symmetry | Attractiveness | Trustworthiness |
|---|---|---|---|---|
| 1 | M-1 | 6 | 0 | 3 |
| 1 | M-2 | -1 | 7 | NA |
| 2 | M-1 | 5 | 2 | 1 |
| 2 | M-2 | 3 | 7 | 2 |

Do exercise 8.

Unfortunately, `replace_na()` works only with constant values and does not handle grouped tables very well.^{68} So, to replace an `NA` with a mean value of *grouped* data, we need to combine some of our old knowledge with the `ifelse(condition, value_if_true, value_if_false)` function you learned about before. Recall that this function is a vectorized cousin of if-else that takes 1) a vector of logical values (`condition`), 2) a vector of values that are returned if `condition` is true, and 3) a vector of values that are returned if `condition` is false. Note that the usual rules of vector length-matching apply, so if the three vectors have different lengths, they will be automatically (and silently) adjusted to match the length of the `condition` vector. As with all computations, you can use the original values themselves. Here is how to replace only negative values but keep the positive ones:
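The following sketch first illustrates the silent length-matching and then reproduces the output shown below. The vector `v` is an assumption: only its positive values can be read off the output, so the two negative values are made up.

```r
# Silent recycling: the third argument has length 2 but is stretched
# to the length of the condition (4) before values are picked
recycled <- ifelse(c(TRUE, FALSE, TRUE, FALSE), 1, c(10, 20))
recycled

# Hypothetical vector; the negative values are assumed
v <- c(-1, 3, 5, -2, 5)

# Replace negative values with 0, otherwise keep the original value
no_negatives <- ifelse(v < 0, 0, v)
no_negatives
```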

`## [1] 0 3 5 0 5`

We, essentially, tell the function, “if the condition is false, use the original value”. Now, your turn! Using the same vector and ifelse() function, replace negative values with a mean value of the positive values in the vector.

Do exercise 9.

Now that you know how to use `ifelse()`, replacing `NA` with a mean will be (relatively) easy. Use the `adaptation_with_na` table and replace missing information using participant-specific values.

| Participant | Prime | Probe | Nsame | Ntotal |
|---|---|---|---|---|
| ma2 | Sphere | Sphere | NA | 119 |
| ma2 | Sphere | Quadro | 23 | NA |
| ma2 | Sphere | Dual | NA | 120 |
| ma2 | Sphere | Single | 31 | 115 |
| ma2 | Quadro | Sphere | 25 | 120 |
| ma2 | Quadro | Quadro | 26 | 120 |

First, replace missing `Ntotal` values with a participant-specific median value (a “typical” and integer number of trials, but do not forget about the `na.rm` option, see the manual for details). `Nsame` is trickier. For this, compute the proportion of same responses for each condition, `Psame = Nsame / Ntotal`. This will produce missing values whenever `Nsame` is missing. Now, replace missing `Psame` values (`is.na()`) with the mean `Psame` *per participant* (again, watch out for `na.rm`!) using `ifelse()` (you can use it inside `mutate()`). Finally, compute the missing values of `Nsame` from `Psame` and `Ntotal` (do not forget to round them, so you end up with an integer number of trials). This entire computation should be implemented as a single pipeline. You will end up with the following table.

| Participant | Prime | Probe | Nsame | Ntotal | Psame |
|---|---|---|---|---|---|
| ma2 | Sphere | Sphere | 36 | 119 | 0.2983741 |
| ma2 | Sphere | Quadro | 23 | 120 | 0.1916667 |
| ma2 | Sphere | Dual | 36 | 120 | 0.2983741 |
| ma2 | Sphere | Single | 31 | 115 | 0.2695652 |
| ma2 | Quadro | Sphere | 25 | 120 | 0.2083333 |
| ma2 | Quadro | Quadro | 26 | 120 | 0.2166667 |

Do exercise 10.