4 Functions! Functions everywhere!

In this chapter you will learn about functions in R, as they are the second most important concept in R and are everywhere (just like vectors and lists). You will also learn how to pipe your computation through series of functions without creating a mess of temporary variables or nested calls. Don’t forget to download the notebook.

4.1 Functions

In the previous chapters, you have learned that you can store information in variables –— “boxes with slots” —– as vectors or as tables (bundles of vectors of equal length). To use these stored values for computation you need functions. In programming, function is an isolated code that receives some (optional) input, performs some action on it, and, optionally, returns a value27. The concept of functions comes from mathematics, so it might be easier to understand them using R implementation of mathematical functions. For example, you may remember a sine function from trigonometry. It is typically abbreviated as sin, it takes a numeric value of angle (in radians) as its input and returns a corresponding value between -1 and 1 (this is its output): \(sin(0) = 0\), \(sin(\pi/2) = 1\), etc.

In R, you write a function using the following template28

name_of_the_function <- function(parameter1, parameter2, parameter3, ...){
  ...some code that computes the value...
  return(value)
}

A sin function with a single parameter angle would look something like this

sin <- function(angle){
  ...some math that actually computes sinus of angle using value of angle parameter ...
  return(sin_of_angle)
}

Once we have the function, we can use it by calling it. You simply write sin(0)29 and get the answer!

sin(0)
## [1] 0

As you hopefully remember, all simple values are (atomic) vectors, so instead of using a scalar 0 (merely a vector of length of one) you can write and apply this function to (compute sinus for) every element in the vector.

sin(seq(0, 3.141593, length.out = 5))
## [1]  0.000000e+00  7.071068e-01  1.000000e+00  7.071066e-01 -3.464102e-07

You can think of functions parameters as local function variables those values are set before the function is called. A function can have any number of parameters, including zero30, one, or many parameters. For example, an arctangent atan2 function takes 2D coordinates (y and x, in that order!) and returns a corresponding angle in radians.

atan2(c(0, 1), c(1, 1))
## [1] 0.0000000 0.7853982

A definition of this function would look something like this

atan2 <- function(y, x){
  ...magic that uses values of y and x parameters...
  ...to compute the angle_in_rad value...
  return(angle_in_rad);
}

Do exercise 1.

4.2 Writing your own function

Let us start practicing computing things in R and writing functions at the same time. We will begin by implementing a very simple function that doubles a given number. We will write this function in steps. First, think about how you would name31 this function (meaningful names are your path to readable and re-usable code!) and how many parameters it will have. Write the definition of the function without any code inside of wiggly brackets (so, no actual computation or a return statement at the end of it).

Do exercise 2.1

Next, think about the code that would double-the-value based on the parameter. This is the code that will eventually go inside the wiggly brackets. Write that code (just the code, without the bits from exercise 2.1) in exercise 2.2 and test it by creating a variable with the same name as your parameter inside the function. E.g., if my parameter name is the_number, I would test it as

the_number <- 1
...my code to double the value usign the_number variable...

Do exercise 2.2

By now you have your formal function definition (exercise 2.1) and the actual code that should go inside (exercise 2.2). Now, we just need to combine them by putting the code inside the function and returning the value. You can do this two ways:

  1. you can store the results of the computation in a separate local variable and then return that variable,
  2. return the results of the computation directly
# 1) first store in a local variable, then return it
  result <- ...some computation you perform...
  return(result)

# 2) returning results of computation directly
  return(...some computation you perform...)

Do it both ways in exercises 2.3 and 2.4. Call the function using different inputs to test that it works.

Do exercise 2.3 and 2.4

More practice is always good, so write a function that converts an angle in degrees to radians. The formula (hint: Constants) is \[rad = \frac{deg \cdot \pi}{180}\] Decide whether you want to a have an intermediate local variable inside the function or to return the results of the computation directly. My personal preference for writing function is to follow the inside-out sequence when writing functions as in exercise 2. Decide on function definition and parameter names, then create variables with same names as parameters to write and debug the code. Once I am sure the code is working, I put it into a function, restart R session or clear the workspace to get rid of temporary variables, and run the function to check that everything works as it should32.

Do exercise 3

4.3 Scopes: Global versus Local variables

I suggested that you use a variable to store the results of double-it-up computation before returning it. But why did I call it local? This is because each function has it own scope (environment with variables and functions) that is (mostly) independent from the scope of the global script. Unfortunately, environment scopes in R are different and more complicated than those in other programming languages, such as Python, C/C++, or Java, so always pay attention and be careful in the future.

The global scope/environment is the environment for the code outside of functions. All global variables and functions, the ones that you define in the code outside of functions (typing in the console, running scripts, running chunks of code in notebooks, etc.), live where. You can see what you have in the global scope at any time by looking at Environment tab (note the Global Environment tag).

In my case, it has one table (mpg, all tables go under Data), three vectors (angle, some_variable, and x, all vectors go under Values), and an example function from exercise #2 that I created (double_it, all functions go under Functions, makes sense). However, it has no access to parameters of the functions and variables that you define inside these function. When you run a function, it has it own scope/environment that includes parameters of the function (e.g., the_number for my double_it function and the value it was assigned during the call), any local variables you create inside that function (e.g., result <- ...some computation you perform... creates such local variable), and a copy(!) of all global variables33. In the code below, take a look at the comments that specify the accessibility of variables between global script and functions (ignore glue for a moment, it glues a variable’s value into the text, so I use it to make printouts easier to trace)34

# this is a GLOBAL variable
global_variable <- 1

i_am_test_function <- function(parameter){
  # parameter live is a local function scope
  # its values are set when you call a function
  print(glue("  parameter inside the function: {parameter}"))
  
  # local variable created inside the function
  # it is invisible from outside
  local_variable <- 2
  print(glue("  local_variable inside the function: {local_variable}"))
  
  # here, a_global_variable is a LOCAL COPY of the the global variable
  # of the same name. You can use this COPY
  print(glue("  COPY of global_variable inside the function: {global_variable}"))
  
  # you can modify the LOCAL COPY but that won't affect the original!
  global_variable <- 3
  print(glue("  CHANGED COPY of global_variable inside the function: {global_variable}"))
}

print(glue("global_variable before the function call: {global_variable}"))
## global_variable before the function call: 1
i_am_test_function(5)
##   parameter inside the function: 5
##   local_variable inside the function: 2
##   COPY of global_variable inside the function: 1
##   CHANGED COPY of global_variable inside the function: 3
# the variable outside is unchanged because we modify its COPY, not the variable itself
print(glue("UNCHANGED global_variable after the function call: {global_variable}"))
## UNCHANGED global_variable after the function call: 1

Do exercise 4 to build understanding of scopes.

4.4 Function with two parameters

Let us write a function that takes two parameters –— x and y –— and computes radius (distance from (0,0) to (x, y))35. The formula is \[R = \sqrt{x^2 + y^2}\] This is very similar to exercises 2 and 3, with number of parameters being the only difference. Again, I would suggest first writing and debugging the code in a separate chunk and then putting it inside of the formal function definition (and the testing it again!)

Do exercise 5.

4.5 Table as a parameter

So far, we only passed vectors (a.k.a. values) to functions but you can pass any object including tables36. Let us use mpg table from the ggplot2 package. Write a function that takes a table as a parameter. This mean that the function should not assume that a table with this name exists in the global environment. Do not use mpg as a parameter name (makes it confusing and hard to understand which table global or local you actually mean), call it something else37. The function should compute and return average miles-per-gallon efficiency of city cty and highway hwy test cycles for each car (so, you need to compute 234 values). Do it two ways. First, compute and return a vector based on the table passed as parameter. Second, do the same computation but add the result to the table that function received as a parameter (call the new column avg_mpg) and return the entire table (remember, modifying table is not enough, as you are working on a copy).

Do exercise 6.

Let us write another function that computes mean efficiency for a particular cycle, either city or highway, over all the cars (so single value as the output). For this, the function will take two parameters: 1) the table itself and 2) a string (text variable) with the name of the column. Then, you use column name to access it via simplifying [[]] subsetting. To summarize, your function takes 1) a table and 2) a string with a column name and returns a single number (mean for the specified column). E.g.

average_efficiency(mpg, "cty") # should return 16.85897

Do exercise 7.

4.6 Named versus positional arguments

Remember than we talked about the assignment statement, I noted that you should always use <- operator for assignments but warned you that you will encounter an alternative = operator when we will use functions. You also may have notice me using it, for example, in seq(0, 2*pi, length.out = 100). Why did I use name for the third parameter and not the other two? Why did I have to use length.out= at all? This is because in R you can pass arguments by position (first two arguments in that example) or by name (that would be length.out =).

Then you pass arguments by position, values are put into corresponding arguments in the order you supplied them in. I.e., the first value goes into the first parameter, the second into the second, etc. Then you specify arguments by name, you explicitly state which argument gets which value arg = value. Here, the exact order does not matter and you can put arguments in the order you need (i.e., an order that makes understanding the code easier). You can also mix these two ways and, R being R, you can really mix them interleaving positional and named arguments any way you like. That being a case, never use named arguments before positional ones. Each time you use a named argument, its position is taken out and that changes position index for remaining positional arguments. This is an almost certain way to put a value into a wrong parameter and, at the same time, create an error that is really hard to find. Thus, despite all the flexibility that R gives you to confuse yourself and others, use only all-positional-followed-by-all-named-arguments order, e.g., seq(0, 2*pi, length.out = 100).

When should you use which? That depends solely on which way is clearer or possible. For a single parameter or widely used mathematical functions, like mean or sin there is little point in using names (particularly, because the only argument is named x). At the same time, I always use named parameters for atan2 simply because I am never 100% sure about the order (i.e., x followed by y occurs far more often). At the same time, formulas for statistical models in R have a very specific look and are typically a first argument, so it is probably redundant to specify that formula = y ~ x.Returning to the seq() function, the length.out is actually its fourth argument with by (an alternative way to define how many values you will get in the sequence) being third. But because both parameters define, essentially, the same thing, it is impossible to specify length.out using positional arguments only. Finally, if a function has more than a couple of parameters, it is probably a good idea to use named arguments. In short, use you own better judgement on whatever makes understanding the code easier.

Do exercise 8.

4.7 Default values

When you write a function, you can combine simplicity of its use with flexibility via default values. I.e., some parameters can have sensible or commonly used values but, if desired, a user can specify their own values to modify functions behavior. For example, function mean has three parameters. You always need to specify the first one (x, a vector of values you are computing the mean of). But you can also specify trimming via trim and whether it should ignore NA via na.rm38. By default, you do not trim (trim = 0) and do not ignore NA (na.rm = FALSE). These defaults are sensible as they produce a typically expected behavior. At the same time their existence means that you can fine-tune your mean computation the way you require.

When writing your own function, you specify the default values when you define an argument. E.g., here, the second parameter r, the radius, has a default value of 1, so you can only specify the direction of the vector to compute (x, y) coordinates for a (default) unit vector.

polar_to_cartesian <- function(angle, r=1) {
  # ...
}

Do exercise 9.

4.8 Nested calls

What if you need to call several functions in a single chain to compute a result? Think about the function from exercise #3 that converts degree to radians. Its most likely usage scenario is to convert degrees to radians and use that to compute sine (or some other trigonometric function). There are different ways you can do this. For example, you can store the angle in radians in some temporary variable (e.g., angle_in_radians) and then pass it to the sine function during the next call.

angle_in_radians <- deg2rad(90) # function returns 1.570796, this value is stored in angle_in_radians
sin(angle_in_radians) # returns 1

Alternatively, you can use the value returned by deg2rad() directly as a parameter for function sin()

sin(deg2rad(90)) # returns 1

In this case, the computation proceeds in an inside-out order: The innermost function gets computed first, the function that uses its return value is next, etc. Kind of like assembling a Russian doll: you start with an innermost, put it inside a slightly bigger one, now take that one put it inside the next, etc. Nesting means that you do not need to pollute you memory with temporary variables39 and make your code more expressive as nesting explicitly informs the reader that intermediate results are of no value by themselves and are not saved for later use.

Do exercise 10.

4.9 Piping

Although nesting is better than having tons of intermediate variables, lots of nesting can be mightily confusing. Tidyverse has an alternative way to chain or, in Tidyverse-speak, pipe a computation through a series of function calls. The magic operator is |> (that’s the pipe). There is another pipe operator that you are very likely to encounter: %>%. This is the “original” pipe introduced by magrittr library and widely used, particularly in Tidyverse, before R version 4.1.0 provided a built-in |> solution. As far as you are concerned, these two operators are synonyms and can be used interchangably (same goes when you read somebody else’s code). However, you do need to import either tidyverse or one of its libraries to enable %>% and |> is an official future, so I strongly recommend to using |> in your code40 Here is how a pipe transforms our nested call

sin(deg2rad(90)) # returns 1

deg2rad(90) |> sin() # also returns 1

90 |> deg2rad() |> sin() # also returns 1

Do exercise 11.

All functions you used in the exercise had only one parameter because single-output |> single-input piping is very straightforward. But what if one of the functions takes more than one parameter? By default, |> puts the piped value into the first parameter. Thus 4 |> radius(3) is equivalent to radius(4, 3). Although this default is very useful (the entire Tidyverse is build around this idea of piping a value into the first parameter to streamline your experience), sometimes you need that value as some other parameter (e.g., when you are pre-processing the data before piping it into a statistical test, the latter typically takes formula as the first parameter and data as second). For this, you can use a special underscore variable _ for |> and a dot variable . for %>%. The latter is more flexible, as it can be used for both named and positional arguments many times, whereas the former is more restrictive and can be used only with named parameters. Both of these variables are hidden 41 temporary variable that holds the value you are piping through via either |> or %>%42. Thus, for a standard pipe |>

z <- 4
w <- 3
z |> radius(w) # equivalent to radius(z, w)
## [1] 5
# Note the underscore and that we use all named parameters!
z |> radius(x = _, y = w) # equivalent to radius(z, w)
## [1] 5
z |> radius(x = w, y = _) # equivalent to radius(w, z)
## [1] 5
# Using it without name of the parameter generates a mistake
z |> radius(_, w)

# Using it twice is also not allowed
z |> radius(x = _, y = _)
## Error: pipe placeholder can only be used as a named argument (<text>:2:6)

For the magritt pipe %>% you can use it anyway you want and even multiple times.

z %>% radius(w) # equivalent to radius(z, w)
## [1] 5
# Note the underscore and that we use all named parameters!
z %>% radius(x = ., y = w) # equivalent to radius(z, w)
## [1] 5
z %>% radius(x = w, y = .) # equivalent to radius(w, z)
## [1] 5
# Now positional parameters
z %>% radius(., w) # equivalent to radius(z, w)
## [1] 5
z %>% radius(w, .) # equivalent to radius(w, z)
## [1] 5
# Now same value twice
z %>% radius(., .) # equivalent to radius(z, z)
## [1] 5.656854

As you can see %>% is more flexible but, to be honest, I have realized that I never really bumped into the limitations of the standard pipe |> until I started to compile these examples. Thus, my advice is to stick to |> and use %>% only if you actually need it (R is so liberal that at least some restrictions are good.)

Do exercise 12.

4.10 Function is just a code stored in a variable

Did you spot the assignment <- operator in function definitions and wondered about this? Yes, this means that you are literally storing function code in a variable. So when you call this function by name, you are asking to run the code stored inside that variable. You can appreciate that fact by typing a name of a function you wrote without round brackets, e.g. double_the_value or radius, and, voila, you will see the code. This means that function name is not really a name (as in some other programming languages), rather it is a name of a variable function’s code is stored in. So you can use a variable with a function code inside just the way you treat any other variable. For example, you can copy it (do radius2 <- radius and then radius2 will work exactly the same way). Or pass it as an argument of another function (effectively, copy its code into a parameter), which will be very handy when your learn about bootstrapping (needs data and a function) or about applying/mapping functions to data.

4.11 Functions! Functions everywhere!

Every computation you perform in R is implemented as a function, even if it does not look like a function call. For example, + addition operator is a function. Typically, you write 2 + 3, so no round brackets, no comma-separated list of parameters, it looks different. But this is just a special implementation of a function call known as function operator that makes code more readable for humans. You can actually call + function the way you call a normal function by using backticks around its name.

2 + 3
## [1] 5
# note the `backticks` around +
`+`(2, 3)
## [1] 5

Even the assignments operator <- is, you’ve guessed it, a function

`<-`(some_variable, 1)
some_variable
## [1] 1

This does not mean that you should start using operators as functions (although, if it helps to make a particular code clearer, then, why not?), merely to stress that there is only one way to program any computation in R —– as a function —– regardless of how it may appear in the code. Later on, you will learn how to apply functions to vectors or tables (or, a Tidyverse version of that, how to use functions to map inputs to outputs), so it helps to know that you can apply any function, even the one that does not look like a function.

4.12 Using (or not using) explicit return statement

In the code above I have always used the return function. However, explicit return(some_value) can be omitted if it is the last line in a function, you just write the value (variable) itself:

some_fun <- function(){
  x <- 1
  return(x)
}

some_other_fun <- function(){
  x <- 2
  x
}

yet_another_fun <- function(){
  3
}

some_fun()
## [1] 1
some_other_fun()
## [1] 2
yet_another_fun()
## [1] 3

The lack of return function in the final line is actually an officially recommended style of Tidyverse but, personally, I am on the fence about this approach because “explicit is better than implicit”. This omission may be reasonable if it comes at the very end of a long pipe but, in general, I would recommend using an explicit return().