This lesson is being piloted (Beta version)

Developing Workflows with R Scripts

Overview

Teaching: 45 min
Exercises: 20 min
Questions
  • How can I make data-dependent choices in R?

  • How can I repeat operations in R?

Objectives
  • Write conditional statements with if() and else().

  • Write and understand for() loops.

Developing Workflows with R Scripts

Control Flow: if, for, and while

Often we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions are met. Alternatively, we can also set an action to occur a particular number of times.

Three common control flow and looping statments are if, for, and while:

# if
if (x == some_value) {
            # do some stuff in here
} else if (x == other_value) {
            # elseif is optional
} else {
            #else is optional
}

#for
for (element in some_vector) {
            # iteration happens here
}

#while
while (something_is_true) {
            # do some stuff
}

You can break out of for and while loops with a break statement, and advance loops to the next iteration with next.

Examples:

# sample a random number from a Poisson distribution
# with a mean (lambda) of 8

x <- rpois(1, lambda=8)

if (x >= 10) {
  print("x is greater than or equal to 10")
}
x
[1] 8

Tip: pseudo-random numbers

In the above case, the function rpois() generates a random number following a Poisson distribution with a mean (i.e. lambda) of 8. We can use function set.seed() guarantees that all machines will generate the exact same ‘pseudo-random’ number (more about pseudo-random numbers).

for(i in 1:10){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
for(i in 1:3){
  for(j in c('a', 'b', 'c')){
    print(paste(i,j))
  }
}
[1] "1 a"
[1] "1 b"
[1] "1 c"
[1] "2 a"
[1] "2 b"
[1] "2 c"
[1] "3 a"
[1] "3 b"
[1] "3 c"
z <- 1
while(z > 0.1){
  z <- runif(1)
  print(z)
}
[1] 0.3067685
[1] 0.4269077
[1] 0.6931021
[1] 0.08513597

ifelse

R also has a vectorized version of if: the ifelse function. Rather than control program flow, ifelse(test, yes, no) returns the yes value for all TRUE cases of test, and no for all FALSE cases. For example:

x <-c(-3,1,-5,2)
ifelse(x < 0, -1, 1)
[1] -1  1 -1  1

Note, that although ifelse() is readable and clear, it can be slower than the following:

x <- c(-3, 1, -5, 2)
y <- rep(1, length(x))
y[x < 0] <- -1
y
[1] -1  1 -1  1

Pre-allocating vectors

One point worth mentioning is that if you do need to loop over a structure with a for or while loop, always preallocate your results vector. You can create empty numeric vectors with numeric(len) and empty integer vectors with integer(len), where len is the length. For example, if you were to loop over two vectors to sum their values pairwise, do not use:

x <- rnorm(10)
y <- rnorm(10)
res <- NULL
for (i in 1:length(x)) {
    res <- c(res, x[i] + y[i])
}

First, this is completely unecessary since we could calculate this with res <- x + y. Second, this code would be needlessly slow because appending to vectors with res <- c(res, ...) is extremely computationally expensive in R.

R’s vectors are allocated in memory to the exact length that the need to be. So each iteration of the above code requires (1) allocating a new res vector one element longer, and (2) copying all of res’s current elements over to this new vector.

The correct way to do this (assuming in your real code something in the for loop can’t be parallelized!) is:

x <- rnorm(10)
y <- rnorm(10)
res <- numeric(10) # preallocation!
for (i in 1:length(x)) {
    res[i] <- x[i] + y[i]
}

Working with R Scripts

Although we’ve used R interactively, in practice your analyses should be kept in scripts that can be run many times throughout development. Scripts can be organized into project directories and checked into Git repositories. There’s also a host of excellent R tools (e.g., Knitr and Rmarkdown) to help in creating well-documented, reproducible projects in R.

You can run R scripts from R using the function source(). For example, to execute an R script named my_analysis.R use:

source("my_analysis.R")

Alternatively, we can execute a script in batch mode from the command line with:

Rscript --vanilla my_analysis.R

This comes in handy when you need to rerun an analysis on different files or with different arguments provided on the command line. It’s a good practice to use –vanilla option because by default, Rscript will restore any past saved environments and save its current environment after the execution completes. Usually we don’t want R to restore any past state from previous runs, as this can lead to irreproducible results. Additionally, saved environments can make it a nightmare to debug a script. See R --help for more information.

Lastly, if you want to retrieve command-line arguments passed to your script, use R’s commandArgs() with trailingOnly=TRUE. For example, this simple R script just prints all arguments:

echo "## args.R -- a simple script to show command line args
args <- commandArgs(TRUE) 
print(args)" > args.R

We run this with:

Rscript --vanilla args.R arg1 arg2 arg3
[1] "arg1" "arg2" "arg3"

Working directory

It’s important to mind R’s working directory when writing scripts. Scripts should not use setwd() to set their working directory, as this is not portable to other systems (which won’t have the same directory structure). For the same reason, use relative paths like ../data/achievers.txt when loading in data, and not absolute paths like /Users/jlebowski/data/achievers.txt. Also, it’s a good idea to indicate (either in comments or a README file) which directory the user should set as their working directory.

Workflows for Loading and Combining Multiple Files

See pages 257-260 of the book if you need to load and combine multiple files

Exporting Data

At some point during an analysis, you’ll need to export data from R. We can export dataframes to plain-text files using the R function write.table(). Unfortunately, write.table() has some poorly chosen defaults that we usually need to adjust. For example, if we wanted to write our dataframe mtfs to a tab-delimited file named hotspot_motifs.txt, we would use:

write.table(mtfs, file="hotspot_motifs.txt", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)

write.table()’s first two arguments are the dataframe (or matrix) to write to file and the file path to save it to. By default, write.table() quotes factor and character columns and includes rownames. We disable these with quote=FALSE and row.names=FALSE. We also can set the column separators to tabs with the sep argument. A header can be included or excluded with the col.names argument.

Given that many Unix tools work well with Gzipped files, it may be useful to write a compressed version of a file directly from R. In addition to a string file path, write.table’s file argument also handles open file connections. So to write our dataframe to a compressed file, we can open a gzipped file connect with gzfile():

hs_gzf <- gzfile("hotspot_motifs.txt.gz")
write.table(mtfs, file=hs_gzf, quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)

While plain-text formats are the preferable way to share tabular datasets, it’s not the best way to save complex R data structures like lists or special R objects. In these cases, it’s usually preferable to save R objects as R objects. Encoding and saving objects to disk in a way that allows them to be restored as the original object is known as serialization. R’s functions for saving and loading R objects are save() and load().

tmp <- list(vec=rnorm(4), df=data.frame(a=1:3, b=3:5))
save(tmp, file="example.Rdata")
rm(tmp) # remove the original 'tmp' list
load("example.Rdata") # this fully restores the 'tmp' list from file
str(tmp)
List of 2
 $ vec: num [1:4] -2.061 -0.6 -0.954 0.558
 $ df :'data.frame':	3 obs. of  2 variables:
  ..$ a: int [1:3] 1 2 3
  ..$ b: int [1:3] 3 4 5

The function save() has an analogous function called save.image() that saves all objects in your workspace rather than just the objects you specify. Combined with savehistory(), save.image() can be a quick way to store your past work in R in a rush.

Key Points

  • Use if and else to make choices.

  • Use for to repeat operations.