Introduction

This book will (try to) teach you how to perform typical data analysis tasks: reading data, transforming and cleaning it up so that you can visualize it and perform a statistical analysis. It covers most topics that you need to get started but it cannot cover them all. One advantage of R is the sheer size of its ecosystem, with incredible new libraries appearing on an almost daily basis. Covering them all is beyond the scope of any book, so instead I will concentrate on (trying to) build a solid understanding of the things that you need to extend your R knowledge. Because of that, some early chapters (e.g., on vectors, tables, or functions) might feel boring and too technical, making you wonder why we did not start with some exciting and useful analysis, working our way down to finer details. I have tried that but, unfortunately, the philosophy of R is about having many almost identical ways of achieving the same end. If you do not learn these finer details, you waste time wondering why seemingly the same code works in one case but fails in mysterious ways in another. Therefore, please bear with me and struggle through vectors (which are everywhere!), the oddities and inconsistencies of subsetting, and learning how to write a function before you have even started to use functions properly. I can only promise that, from my personal experience, this is definitely worth the effort.

An important note: this book will not teach you statistics or machine learning beyond several examples at the very end. The reason for this is that it teaches data preparation, and both statistics and machine learning are 90% about data preparation. This is most obvious in machine learning, where data acquisition, cleaning, feature engineering, etc. (everything needed to make the data suitable for analysis) take most of your time. The actual machine learning part boils down to trying various standard machine learning methods on the data and picking the one that gives the best out-of-sample performance. That last part is so automated by now that it requires little knowledge beyond the details of a specific package. Knowing how the methods work is obviously beneficial but, and I hate to write this, not that critical for machine learning (not so for statistics or deep learning!). The same is true for statistical methods, although there the time is split between preparing data for statistical analysis and interpreting and comparing models. As with machine learning, running the statistical models themselves is easy and automatic. If you know (enough of) statistics, you will have little trouble understanding how to work with these packages and functions. If you do not, no amount of reading manuals will make it clearer.

Goal of the book

The goal is for you to be able to program a typical analysis pipeline:

  • Loading data from CSV/Excel/online files or from other data manipulation programs like SPSS
  • Preprocessing data, either in the entire table or per group (a certain combination of columns), by
    • manipulating text information
    • converting variables to factors
    • transforming numeric data
    • filtering data
    • handling missing data
  • Simplifying data structure by
    • selecting relevant columns
    • reshaping table between long (tidy) and wide (human-readable) formats
  • Summarizing data
  • Plotting data using ggplot2 and its extensions
  • Performing basic statistical analysis, reporting and visualizing model predictions alongside the data.
  • Putting it all together in a single notebook that can be compiled (“knitted”) into a final document: report, presentation, seminar work, thesis, etc.

The programming concepts that you learn will include:

  • Variables and data types
  • Vectors, lists, and tables that are the foundation of R data
  • Writing functions
  • Conditional statements
  • Loops
  • Functional programming, i.e., applying the same function to multiple data points at the same time
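
To give a first taste of the last two points, here is a minimal sketch in base R that computes the same result twice, once with a loop and once by applying a function to all elements at once. The data and the variable names (response_times, mean_rt_loop, mean_rt_functional) are made up for illustration.

```r
# Hypothetical data: response times (in seconds) of three participants.
response_times <- list(
  p1 = c(0.52, 0.61, 0.48),
  p2 = c(0.75, 0.70, 0.81),
  p3 = c(0.43, 0.50, 0.47)
)

# Loop version: prepare a named output vector, then fill it
# one participant at a time.
mean_rt_loop <- numeric(length(response_times))
names(mean_rt_loop) <- names(response_times)
for (participant in names(response_times)) {
  mean_rt_loop[participant] <- mean(response_times[[participant]])
}

# Functional version: apply mean() to every element of the list in one call.
mean_rt_functional <- sapply(response_times, mean)

# Both versions produce the same named numeric vector.
print(all.equal(mean_rt_loop, mean_rt_functional))
```

Both approaches are valid; the functional version is shorter and leaves fewer places for a mistake to hide.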

Please note that exercises are distributed throughout each chapter, embedded in the text, and you should do them when you reach them. They are designed to clarify concepts and to apply the knowledge that was presented just before, so doing them immediately is most helpful.

Why R?

There are many software tools that allow you to preprocess, plot, and analyze your data. Some cost money (SPSS, Matlab), some are free just like R (Python, Julia). Moreover, you can replicate all analyses that we will perform using Python in combination with Jupyter notebooks (for reproducible analysis), Pandas (for Excel-style tables), and statsmodels (for statistical analysis). R is hardly perfect. For example, its subsetting system is confusing and appears to follow a “convenience over safety” approach that does not sit particularly well with me. However, R in combination with piping (an easy way to perform a series of computations on the same table) and the Tidyverse family of packages makes it incredibly easy to write simple, powerful, and expressive code, which is very easy to understand (a huge plus, as you will discover). I would run circles around myself trying to replicate the same analysis in Python or Matlab (or base R). In addition, R is loved by mathematicians and statisticians, so it tends to have implementations of all cutting-edge statistical methods (but Python is your go-to for machine and deep learning).
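
To make the idea of piping a bit more concrete, here is a minimal sketch. It assumes R 4.1 or later, which provides the native pipe |>; Tidyverse code frequently uses the %>% pipe from the magrittr package instead. The table results and its columns are hypothetical.

```r
# Hypothetical table: accuracy of individual trials in two groups.
results <- data.frame(
  group = c("A", "A", "B", "B"),
  accuracy = c(0.80, 0.90, 0.60, 0.70)
)

# Without a pipe: calls are nested, so you read them inside out.
nested <- transform(subset(results, group == "A"), percent = accuracy * 100)

# With a pipe: the same steps, read top to bottom.
piped <- results |>
  subset(group == "A") |>               # keep only rows of group A
  transform(percent = accuracy * 100)   # add accuracy in percent

# Both versions produce the same table.
print(identical(nested, piped))
```

The piped version passes the table from one step to the next, so the order in which you read the code matches the order in which the computations happen.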

Why Tidyverse

The material is heavily skewed towards using the Tidyverse family of packages. It looks different enough from base R that one might call it a “dialect” of R. Learning Tidyverse means that you have twice as many things to learn: I will always introduce both the base R and the Tidyverse versions. Tidyverse is the main reason I use R (rather than Python or Julia), as it makes data analysis a breeze and makes your life so much easier. This is why I want you to learn its ways. At the same time, there is plenty of useful code that uses base R, so you need to know and understand it as well.
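
As a small illustration of what “both versions” means, the sketch below performs the same two steps (keeping some rows, adding a column) first in base R and then with dplyr, one of the Tidyverse packages. It assumes that dplyr is installed; the table trials and its columns are made up for this example.

```r
library(dplyr)  # assumption: the dplyr package is installed

# Hypothetical data: response times (in seconds) for two conditions.
trials <- data.frame(
  condition = c("easy", "easy", "hard", "hard"),
  rt = c(0.4, 0.5, 0.8, 0.9)
)

# Base R: logical subsetting of rows, then adding a column.
base_result <- trials[trials$condition == "hard", ]
base_result$rt_ms <- base_result$rt * 1000

# Tidyverse (dplyr): the same two steps written as a pipeline of verbs.
tidy_result <- trials %>%
  filter(condition == "hard") %>%
  mutate(rt_ms = rt * 1000)

# The computed columns are the same in both results.
print(all.equal(base_result$rt_ms, tidy_result$rt_ms))
```

Only the columns are compared here because the two approaches label the rows differently: base R subsetting keeps the original row names, whereas filter() renumbers the rows.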

As a matter of fact, R is so rich and flexible that there are many dialects and, therefore, plenty of differences in opinion. For example, the data.table package re-implements the same functionality as base R and Tidyverse in a very compact way. It does not fit my style but it might be something that feels natural to you, so I encourage you to take a look. There are also other packages to handle things like laying out your figures or working with summary tables that might suit you better. The point is, this material barely scratches the surface in terms of tools and approaches that you can use. View it as a starting point for your exploration, not the complete map.

Another thing to keep in mind is that Tidyverse is under very active development. This means that parts of this material could be outdated by the time you read it. For example, the dplyr do() verb was superseded by the group_modify() function, a warning generated by the readr package was adapted for humans but now requires an extra step before it can be used for column specification, and we are now on the third set of pivoting functions. None of the changes are breaking and the deprecation process is deliberately slow (e.g., do() still works), so even when outdated, the code in the book should still work for quite some time. However, you should keep in mind that things might have changed, so it is a good idea to check the official manuals from time to time.

About the seminar itself

This is the material for the seminar “Applied data analysis for psychology using the open-source software R” as taught at the Institute of Psychology at the University of Bamberg. Each chapter covers a single seminar, introduces the necessary ideas, and is accompanied by a notebook with exercises, which you need to complete and submit. To pass the seminar, you will need to complete all assignments. You do not need to complete or provide correct solutions for all the exercises to pass the course. Information on how the points for exercises will be converted to an actual grade (if you need one) or a “pass” will be available during the seminar.

The material assumes no prior knowledge of R or of programming in general. Its purpose is to gradually build up your knowledge and introduce you to a typical analysis pipeline. It is based on data that is typical for the field (repeated measures, appearance, accuracy and response time measurements, Likert scale reports, etc.) and you are welcome to suggest your own data set for analysis. Even if you have already performed the analysis using some other program, it would still be useful to compare the different ways and, perhaps, you might gain a new insight. Plus, it is more engaging to work on your own data.

Remember that throughout the seminar you can and should(!) always ask me whenever something is unclear, you do not understand a concept or the logic behind certain code, or you simply got stuck. Do not hesitate to write to me in the team or (better) directly in the chat (in the latter case, the notifications are harder to miss and we don’t spam others with our conversation).

Thinking like a computer

In some exercises you will not be writing code but reading and understanding it. Your job in this case is “to think like a computer”. Your advantage is that computers are very dumb, so instructions for them must be written in a very simple, clear, and unambiguous way. This means that, with practice, reading code is easy for a human. Well, reading well-written code is easy; you will eventually encounter “spaghetti code”, which is easier to rewrite from scratch than to understand. In each case, you simply go through the code line by line, doing all computations by hand and writing down the values stored in the variables (if there are too many to keep track of). Once you go through the code in this manner, it should be completely transparent to you. No mysteries should remain, and you should have no doubts or uncertainty about any(!) line. Moreover, you can then run the code and check that the values you are getting from the computer match yours. Any difference means that you made a mistake and the code is working differently from how you think it does. In any case, if you are not 100% sure about any line of code, ask me, so we can go through it together!
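
Here is a minimal sketch of what such a line-by-line walk-through could look like. The code and the variable names are made up for illustration; the comments record the values you would write down by hand before running it.

```r
x <- c(2, 5, 3)   # x holds three values: 2, 5, 3
total <- 0        # total starts at 0

for (value in x) {
  # Only values larger than 2 are added to the total.
  if (value > 2) {
    total <- total + value
  }
}
# By hand:
#   value = 2: 2 > 2 is FALSE, total stays 0
#   value = 5: 5 > 2 is TRUE,  total becomes 0 + 5 = 5
#   value = 3: 3 > 2 is TRUE,  total becomes 5 + 3 = 8

print(total)  # prints [1] 8, matching the value computed by hand
```

If the printed value had differed from the one computed by hand, that would be the cue to go back and find the line that was misread.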

In a sense, this is the most important programming skill. It is impossible to learn how to write code if you cannot read it first! Moreover, when programming you will probably spend more time reading code and making sure that it works correctly than writing new code. Thus, use this opportunity to practice. There is nothing wrong with using Stack Overflow, but never use code that you do not understand completely (do not blindly copy-paste)!

About the material

The material is free to use and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.