R Data Skills for Bioinformatics

Developing Workflows with R Scripts

Overview

Teaching: 45 min
Exercises: 20 min
Questions
  • How can I make data-dependent choices in R?

  • How can I repeat operations in R?

  • How do I write functions in R?

Objectives
  • Write conditional statements with if() and else().

  • Write and understand for() loops.

  • Be able to write functions and test their speed.

Developing Workflows with R Scripts

Control Flow: if, for, and while

Before we start

If you have programmed in other languages before, you’ll be happy to finally see a section on control flow. However, before you start writing loops in R, I encourage you to look back at some of the functions we have already used, particularly split and lapply. These functions can replace many loops with cleaner and faster code.

Often we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions is met. Alternatively, we can also set an action to occur a particular number of times.

Three common control flow and looping statements are if, for, and while:

# if
if (x == some_value) {
  # do some stuff in here
} else if (x == other_value) {
  # the else if clause is optional
} else {
  # the else clause is optional
}

# for
for (element in some_vector) {
  # iteration happens here
}

# while
while (something_is_true) {
  # do some stuff
}

You can break out of for and while loops with a break statement, and advance loops to the next iteration with next.
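For instance, here is a small loop (not part of the examples below) that skips one value with next and stops at another with break:

for (i in 1:10) {
  if (i == 3) next   # skip the rest of this iteration when i is 3
  if (i == 5) break  # leave the loop entirely when i reaches 5
  print(i)
}
[1] 1
[1] 2
[1] 4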

Examples:

# sample a random number from a Poisson distribution
# with a mean (lambda) of 8

x <- rpois(1, lambda=8)

if (x >= 10) {
  print("x is greater than or equal to 10")
}
x
[1] 8

Tip: pseudo-random numbers

In the above case, the function rpois() generates a random number from a Poisson distribution with a mean (i.e., lambda) of 8. The function set.seed() guarantees that all machines will generate the exact same sequence of ‘pseudo-random’ numbers (more about pseudo-random numbers).
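For instance, setting the same seed before each call reproduces exactly the same draw (the seed value 10 below is arbitrary):

set.seed(10)             # fix the state of the random number generator
x1 <- rpois(1, lambda=8)
set.seed(10)             # reset to the same state...
x2 <- rpois(1, lambda=8)
x1 == x2                 # ...and the same 'pseudo-random' number comes back
[1] TRUE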

for(i in 1:10){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
for(i in 1:3){
  for(j in c('a', 'b', 'c')){
    print(paste(i,j))
  }
}
[1] "1 a"
[1] "1 b"
[1] "1 c"
[1] "2 a"
[1] "2 b"
[1] "2 c"
[1] "3 a"
[1] "3 b"
[1] "3 c"
z <- 1
while(z > 0.1){
  z <- runif(1)
  print(z)
}
[1] 0.3067685
[1] 0.4269077
[1] 0.6931021
[1] 0.08513597

ifelse

R also has a vectorized version of if: the ifelse function. Rather than control program flow, ifelse(test, yes, no) returns the yes value for all TRUE cases of test, and no for all FALSE cases. For example:

x <-c(-3,1,-5,2)
ifelse(x < 0, -1, 1)
[1] -1  1 -1  1

Note that although ifelse() is readable and clear, it can be slower than the following approach:

x <- c(-3, 1, -5, 2)
y <- rep(1, length(x))
y[x < 0] <- -1
y
[1] -1  1 -1  1

Pre-allocating vectors

One point worth mentioning is that if you do need to loop over a structure with a for or while loop, always preallocate your results vector. You can create empty numeric vectors with numeric(len) and empty integer vectors with integer(len), where len is the length. For example, if you were to loop over two vectors to sum their values pairwise, do not use:

x <- rnorm(10)
y <- rnorm(10)
res <- NULL
for (i in 1:length(x)) {
    res <- c(res, x[i] + y[i])
}

First, this is completely unnecessary since we could calculate this with res <- x + y. Second, this code would be needlessly slow because appending to vectors with res <- c(res, ...) is extremely computationally expensive in R.

R’s vectors are allocated in memory to the exact length they need to be. So each iteration of the above code requires (1) allocating a new res vector one element longer, and (2) copying all of res’s current elements over to this new vector.

The correct way to do this (assuming in your real code something in the for loop can’t be vectorized!) is:

x <- rnorm(10)
y <- rnorm(10)
res <- numeric(10) # preallocation!
for (i in 1:length(x)) {
    res[i] <- x[i] + y[i]
}

Writing functions

R, at its heart, is a functional programming (FP) language. This means that it provides many tools for the creation and manipulation of functions.

The most important thing to understand about functions in R is that they are objects in their own right. You can work with them exactly the same way you work with any other type of object.
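For instance, an existing function can be assigned to a new name and called through it, just like any other object:

f <- mean        # assign the function mean to a new name
f(c(1, 5, 9))    # calling f() is the same as calling mean()
[1] 5
is.function(f)
[1] TRUE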

All R functions have three parts:

  • the body(), the code inside the function.

  • the formals(), the list of arguments that control how you call the function.

  • the environment(), the “map” of where the function’s variables are located.

When you print a function in R, it shows you these three components. If the environment isn’t displayed, it means that the function was created in the global environment.
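For instance, with a small illustrative function (add_one is just an example name), each part can be inspected directly:

add_one <- function(x) x + 1   # a tiny example function
body(add_one)                  # prints the code inside: x + 1
formals(add_one)               # prints the argument list: $x
environment(add_one)           # prints <environment: R_GlobalEnv>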

R makes writing functions very easy, which is good, because you should be writing lots of them to organize your code; applying functions is one of the most common operations in R.

The general syntax for R functions is:

fun_name <- function(args) {
  # body, containing R expressions
  return(value)
}

In some cases you can use a simplified format:

fun_name <- function(args) expression  # a single-expression body needs no braces

For example,

meanRemoveNA <- function(x) mean(x, na.rm=TRUE)

Functions that contain only one line in their body can omit the braces (as the meanRemoveNA() function does). Similarly, using return() to specify the return value is optional; R’s functions will automatically return the last evaluated expression in the body.
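Calling it on a vector that contains an NA shows the effect of na.rm=TRUE:

meanRemoveNA(c(1, NA, 3))
[1] 2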

We could forgo creating a function named meanRemoveNA() in our global environment altogether and instead use an anonymous function. Anonymous functions are useful when we only need a function once for a specific task. For example, if we wrote meanRemoveNA() to use with lapply, we could write:

lapply(mydata, function(x) mean(x, na.rm=TRUE))
# lapply(mydata, mean, na.rm=TRUE) should also work!

The importance of preallocation

Now that we know how to write functions, we can look at the importance of preallocation. Let’s start by converting our previous code into functions:

sumNoPreallocation <- function(x, y) {
  res <- NULL
  for (i in 1:length(x)) {
      res <- c(res, x[i] + y[i])
  }
  res  # return the accumulated vector
}
sumWithPreallocation <- function(x, y) {
  res <- numeric(length(x)) # preallocation!
  for (i in 1:length(x)) {
      res[i] <- x[i] + y[i]
  }
  res  # return the filled vector
}
# Let's test these functions using the microbenchmark package!
x <- rnorm(1000)
y <- rnorm(1000)
library(microbenchmark)
microbenchmark(sumNoPreallocation(x,y), sumWithPreallocation(x,y), x+y)
Unit: microseconds
                       expr      min       lq       mean   median       uq       max neval
   sumNoPreallocation(x, y) 2368.230 2643.744 3998.80823 3086.970 3555.788 20102.300   100
 sumWithPreallocation(x, y) 1135.581 1222.821 1390.45093 1287.870 1391.619  2180.758   100
                      x + y    1.386    2.245    4.05743    4.441    5.200    14.239   100

Working with R Scripts

Although we’ve used R interactively, in practice your analyses should be kept in scripts that can be run many times throughout development. Scripts can be organized into project directories and checked into Git repositories. There’s also a host of excellent R tools (e.g., knitr and R Markdown) to help in creating well-documented, reproducible projects in R.

You can run R scripts from R using the function source(). For example, to execute an R script named my_analysis.R use:

source("my_analysis.R")

Alternatively, we can execute a script in batch mode from the command line with:

$ Rscript --vanilla my_analysis.R

This comes in handy when you need to rerun an analysis on different files or with different arguments provided on the command line. It’s good practice to use the --vanilla option because, by default, Rscript will restore any previously saved environments and save its current environment after the execution completes. Usually we don’t want R to restore any past state from previous runs, as this can lead to irreproducible results. Additionally, saved environments can make it a nightmare to debug a script. See R --help for more information.

Lastly, if you want to retrieve command-line arguments passed to your script, use R’s commandArgs() with trailingOnly=TRUE. For example, this simple R script just prints all arguments:

## args.R -- a simple script to show command line args
args <- commandArgs(trailingOnly=TRUE)
print(args)

We run this with:

$ Rscript --vanilla args.R arg1 arg2 arg3
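Given these arguments, the script should print them as a character vector:

[1] "arg1" "arg2" "arg3"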

Working directory

It’s important to mind R’s working directory when writing scripts. Scripts should not use setwd() to set their working directory, as this is not portable to other systems (which won’t have the same directory structure). For the same reason, use relative paths like data/achievers.txt when loading in data, and not absolute paths like /Users/jlebowski/data/achievers.txt. Also, it’s a good idea to indicate (either in comments or a README file) which directory the user should set as their working directory.
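For example (the file and directory names here are just an illustration), loading data with a path relative to the project directory keeps the script portable:

# assumes the working directory is the project root, which contains a data/ subdirectory
achievers <- read.delim("data/achievers.txt", header=TRUE)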

Workflows for Loading and Combining Multiple Files

See pages 257-260 of the book if you need to load and combine multiple files.
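As a minimal sketch (not taken from the book), assuming a directory hotspots/ of tab-delimited files that all share the same columns, the usual pattern is list.files() to find the files, lapply() to read each one, and do.call(rbind, ...) to combine them:

files <- list.files("hotspots", pattern="\\.txt$", full.names=TRUE)
dfs <- lapply(files, read.delim, header=TRUE)  # read each file into a dataframe
combined <- do.call(rbind, dfs)                # stack them into a single dataframe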

Exporting Data

At some point during an analysis, you’ll need to export data from R. We can export dataframes to plain-text files using the R function write.table(). Unfortunately, write.table() has some poorly chosen defaults that we usually need to adjust. For example, if we wanted to write our dataframe mtfs to a tab-delimited file named hotspot_motifs.txt, we would use:

write.table(mtfs, file="hotspot_motifs.txt", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)

write.table()’s first two arguments are the dataframe (or matrix) to write and the file path to save it to. By default, write.table() quotes factor and character columns and includes row names. We disable these with quote=FALSE and row.names=FALSE. We can also set the column separator to a tab with the sep argument. A header can be included or excluded with the col.names argument.

Given that many Unix tools work well with gzipped files, it may be useful to write a compressed version of a file directly from R. In addition to a string file path, write.table()’s file argument also handles open file connections. So to write our dataframe to a compressed file, we can open a gzipped file connection with gzfile():

hs_gzf <- gzfile("hotspot_motifs.txt.gz")
write.table(mtfs, file=hs_gzf, quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)

While plain-text formats are the preferable way to share tabular datasets, they are not the best way to save complex R data structures such as lists or special R objects. In these cases, it’s usually preferable to save R objects as R objects. Encoding and saving objects to disk in a way that allows them to be restored as the original object is known as serialization. R’s functions for saving and loading R objects are save() and load().

tmp <- list(vec=rnorm(4), df=data.frame(a=1:3, b=3:5))
save(tmp, file="example.Rdata")
rm(tmp) # remove the original 'tmp' list
load("example.Rdata") # this fully restores the 'tmp' list from file
str(tmp)
List of 2
 $ vec: num [1:4] -0.639 0.173 1.229 -0.24
 $ df :'data.frame':	3 obs. of  2 variables:
  ..$ a: int [1:3] 1 2 3
  ..$ b: int [1:3] 3 4 5

The function save() has an analogous function called save.image() that saves all objects in your workspace rather than just the objects you specify. Combined with savehistory(), save.image() can be a quick way to store your past work in R when you’re in a rush.
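For example (the file names are arbitrary):

save.image("my_workspace.Rdata")    # save every object in the current workspace
savehistory("my_history.Rhistory")  # save the commands entered in this session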

Key Points
  • Use if, else if, and else to make choices that depend on your data.

  • Use for and while loops to repeat operations, and preallocate results vectors before looping.

  • Write functions (including anonymous functions) to organize your code, and benchmark them with microbenchmark.

  • Keep analyses in scripts, run them with source() or Rscript --vanilla, and use relative paths so they stay portable.