Overview
Teaching: 45 min
Exercises: 20 minQuestions
How can I make data-dependent choices in R?
How can I repeat operations in R?
How do I write functions in R?
Objectives
Write conditional statements with
if()
andelse()
.Write and understand
for()
loops.Be able to write functions and test their speed
Before we start
If you programmed before in other languages, you’ll be happy to finally see a section on control flow. However, before you start writing loops in R, I encourage you to look back at some of the functions we used already, particularly
split
andlapply
. These functions can replace many of the loops with a cleaner and faster code.
Often we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions are met. Alternatively, we can also set an action to occur a particular number of times.
Three common control flow and looping statments are if
, for
, and while
:
# if
if (x == some_value) {
# do some stuff in here
} else if (x == other_value) {
# elseif is optional
} else {
#else is optional
}
#for
for (element in some_vector) {
# iteration happens here
}
#while
while (something_is_true) {
# do some stuff
}
You can break out of for and while loops with a break
statement, and advance loops to the next iteration with next
.
# sample a random number from a Poisson distribution
# with a mean (lambda) of 8
x <- rpois(1, lambda=8)
if (x >= 10) {
print("x is greater than or equal to 10")
}
x
[1] 8
Tip: pseudo-random numbers
In the above case, the function
rpois()
generates a random number following a Poisson distribution with a mean (i.e. lambda) of 8. We can use functionset.seed()
guarantees that all machines will generate the exact same ‘pseudo-random’ number (more about pseudo-random numbers).
for(i in 1:10){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
for(i in 1:3){
for(j in c('a', 'b', 'c')){
print(paste(i,j))
}
}
[1] "1 a"
[1] "1 b"
[1] "1 c"
[1] "2 a"
[1] "2 b"
[1] "2 c"
[1] "3 a"
[1] "3 b"
[1] "3 c"
z <- 1
while(z > 0.1){
z <- runif(1)
print(z)
}
[1] 0.3067685
[1] 0.4269077
[1] 0.6931021
[1] 0.08513597
ifelse
R also has a vectorized version of if: the ifelse
function. Rather than control program flow, ifelse(test, yes, no)
returns the yes value for all TRUE cases of test, and no for all FALSE cases. For example:
x <-c(-3,1,-5,2)
ifelse(x < 0, -1, 1)
[1] -1 1 -1 1
Note, that although ifelse()
is readable and clear, it can be slower than the following:
x <- c(-3, 1, -5, 2)
y <- rep(1, length(x))
y[x < 0] <- -1
y
[1] -1 1 -1 1
Pre-allocating vectors
One point worth mentioning is that if you do need to loop over a structure with a
for
orwhile
loop, always preallocate your results vector. You can create empty numeric vectors withnumeric(len)
and empty integer vectors withinteger(len)
, wherelen
is the length. For example, if you were to loop over too vectors to sum their values pairwise, do not use:x <- rnorm(10) y <- rnorm(10) res <- NULL for (i in 1:length(x)) { res <- c(res, x[i] + y[i]) }
First, this is completely unecessary since we could calculate this with
res <- x + y
. Second, this code would be needlessly slow because appending to vectors withres <- c(res, ...)
is extremely computationally expensive in R.R’s vectors are allocated in memory to the exact length that the need to be. So each iteration of the above code requires (1) allocating a new
res
vector one element longer, and (2) copying all ofres
’s current elements over to this new vector.The correct way to do this (assuming in your real code something in the
for
loop can’t be parallelized!) is:x <- rnorm(10) y <- rnorm(10) res <- numeric(10) # preallocation! for (i in 1:length(x)) { res[i] <- x[i] + y[i] }
R, at its heart, is a functional programming (FP) language. This means that it provides many tools for the creation and manipulation of functions.
The most important thing to understand about functions in R
is that they are objects in their own right. You can work
with them exactly the same way you work with any other type
of object.
All R functions have three parts:
body()
, the code inside the function.formals()
, the list of arguments which controls how you can call the function.environment()
, the “map” of the location of the function’s variables.When you print a function in R, it shows you these three components. If the environment isn’t displayed, it means that the function was created in the global environment.
R makes writing functions very easy, both because you should be writing lots of them to organize your code and applying functions is such a common operation in R.
The general syntax for R functions is:
fun_name <- function(args) {
# body, containing R expressions
return(value)
}
In some cases you can use a simplified format:
fun_name <- function(args) #body
For example,
meanRemoveNA <- function(x) mean(x, na.rm=TRUE)
Functions that contain only one line in their body can omit
the braces (as the meanRemoveNA()
function does). Similarly,
using return()
to specify the return value is optional;
R’s functions will automatically return the last evaluated
expression in the body.
We could forgo creating a function named meanRemoveNA()
in
our global environment altogether and instead use an anonymous
function. Anonymous functions are useful when we only
need a function once for a specific task. For example, if we
wrote meanRemoveNA()
to use with lapply, we could write:
lapply(mydata, function(x) mean(x, na.rm=TRUE))
# lapply(mydata, mean, na.rm=TRUE) should also work!
The importance of preallocation
Now that we know how to write functions, we can look at the importance of preallocation. Let’s start by converting our previous code into functions:
sumNoPreallocation <- function(x,y) { res <- NULL for (i in 1:length(x)) { res <- c(res, x[i] + y[i]) } } sumWithPreallocation <- function(x,y) { res <- numeric(length(x)) # preallocation! for (i in 1:length(x)) { res[i] <- x[i] + y[i] } } # Let's test these functions using the microbenchmark package! x <- rnorm(1000) y <- rnorm(1000) library(microbenchmark) microbenchmark(sumNoPreallocation(x,y), sumWithPreallocation(x,y), x+y)
Unit: microseconds expr min lq mean median uq sumNoPreallocation(x, y) 2368.230 2643.744 3998.80823 3086.970 3555.788 sumWithPreallocation(x, y) 1135.581 1222.821 1390.45093 1287.870 1391.619 x + y 1.386 2.245 4.05743 4.441 5.200 max neval 20102.300 100 2180.758 100 14.239 100
Although we’ve used R interactively, in practice your analyses should be kept in scripts that can be run many times throughout development. Scripts can be organized into project directories and checked into Git repositories. There’s also a host of excellent R tools (e.g., Knitr and Rmarkdown) to help in creating well-documented, reproducible projects in R.
You can run R scripts from R using the function source()
. For example, to execute an R script named my_analysis.R
use:
source("my_analysis.R")
Alternatively, we can execute a script in batch mode from the command line with:
$ Rscript --vanilla my_analysis.R
This comes in handy when you need to rerun an analysis on different files or with different arguments
provided on the command line. It’s a good practice to use --vanilla
option because by default, Rscript
will restore any past saved environments and save its current environment after the execution completes.
Usually we don’t want R to restore any past state from previous runs, as this can lead to irreproducible
results. Additionally, saved environments can make it a nightmare to debug a script. See R --help
for
more information.
Lastly, if you want to retrieve command-line arguments passed to your script, use R’s commandArgs()
with trailingOnly=TRUE
. For example, this simple R script just prints all arguments:
## args.R -- a simple script to show command line args
args <- commandArgs(TRUE)
print(args)
We run this with:
$ Rscript --vanilla args.R arg1 arg2 arg3
Working directory
It’s important to mind R’s working directory when writing scripts. Scripts should not use
setwd()
to set their working directory, as this is not portable to other systems (which won’t have the same directory structure). For the same reason, use relative paths like data/achievers.txt when loading in data, and not absolute paths like /Users/jlebowski/data/achievers.txt. Also, it’s a good idea to indicate (either in comments or a README file) which directory the user should set as their working directory.
See pages 257-260 of the book if you need to load and combine multiple files
At some point during an analysis, you’ll need to export data from R. We can export dataframes to plain-text files
using the R function write.table()
. Unfortunately, write.table()
has some poorly chosen defaults that we usually
need to adjust. For example, if we wanted to write our dataframe mtfs to a tab-delimited file named hotspot_motifs.txt,
we would use:
write.table(mtfs, file="hotspot_motifs.txt", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)
write.table()
’s first two arguments are the dataframe (or matrix) to write to file and the file path to save it to.
By default, write.table()
quotes factor and character columns and includes rownames. Qe disable these with quote=FALSE
and row.names=FALSE
. We also can set the column separators to tabs with the sep
argument. A header can be included or excluded with the col.names
argument.
Given that many Unix tools work well with Gzipped files, it may be useful to write a compressed version of a
file directly from R. In addition to a string file path, write.table
’s file argument also handles open file
connections. So to write our dataframe to a compressed file, we can open a gzipped file connect with gzfile():
hs_gzf <- gzfile("hotspot_motifs.txt.gz")
write.table(mtfs, file=hs_gzf, quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)
While plain-text formats are the preferable way to share tabular datasets, it’s not the best way to save
complex R data structures like lists or special R objects. In these cases, it’s usually preferable to save R
objects as R objects. Encoding and saving objects to disk in a way that allows them to be restored as the original
object is known as serialization. R’s functions for saving and loading R objects are save()
and load()
.
tmp <- list(vec=rnorm(4), df=data.frame(a=1:3, b=3:5))
save(tmp, file="example.Rdata")
rm(tmp) # remove the original 'tmp' list
load("example.Rdata") # this fully restores the 'tmp' list from file
str(tmp)
List of 2
$ vec: num [1:4] -0.639 0.173 1.229 -0.24
$ df :'data.frame': 3 obs. of 2 variables:
..$ a: int [1:3] 1 2 3
..$ b: int [1:3] 3 4 5
The function save()
has an analogous function called save.image()
that saves all objects in your workspace
rather than just the objects you specify. Combined with savehistory(), save.image() can be a quick way to store
your past work in R in a rush.
Key Points
Use
if
andelse
to make choices.Use
for
to repeat operations.