Developing Workflows with R Scripts
Overview
Teaching: 45 min
Exercises: 20 minQuestions
How can I make data-dependent choices in R?
How can I repeat operations in R?
Objectives
Write conditional statements with
if()
andelse()
.Write and understand
for()
loops.
Developing Workflows with R Scripts
Control Flow: if, for, and while
Often we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions are met. Alternatively, we can also set an action to occur a particular number of times.
Three common control flow and looping statments are if
, for
, and while
:
# if
if (x == some_value) {
# do some stuff in here
} else if (x == other_value) {
# elseif is optional
} else {
#else is optional
}
#for
for (element in some_vector) {
# iteration happens here
}
#while
while (something_is_true) {
# do some stuff
}
You can break out of for and while loops with a break
statement, and advance loops to the next iteration with next
.
Examples:
# sample a random number from a Poisson distribution
# with a mean (lambda) of 8
x <- rpois(1, lambda=8)
if (x >= 10) {
print("x is greater than or equal to 10")
}
x
[1] 8
Tip: pseudo-random numbers
In the above case, the function
rpois()
generates a random number following a Poisson distribution with a mean (i.e. lambda) of 8. We can use functionset.seed()
guarantees that all machines will generate the exact same ‘pseudo-random’ number (more about pseudo-random numbers).
for(i in 1:10){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
for(i in 1:3){
for(j in c('a', 'b', 'c')){
print(paste(i,j))
}
}
[1] "1 a"
[1] "1 b"
[1] "1 c"
[1] "2 a"
[1] "2 b"
[1] "2 c"
[1] "3 a"
[1] "3 b"
[1] "3 c"
z <- 1
while(z > 0.1){
z <- runif(1)
print(z)
}
[1] 0.3067685
[1] 0.4269077
[1] 0.6931021
[1] 0.08513597
ifelse
R also has a vectorized version of if: the ifelse
function. Rather than control program flow, ifelse(test, yes, no)
returns the yes value for all TRUE cases of test, and no for all FALSE cases. For example:
x <-c(-3,1,-5,2)
ifelse(x < 0, -1, 1)
[1] -1 1 -1 1
Note, that although ifelse()
is readable and clear, it can be slower than the following:
x <- c(-3, 1, -5, 2)
y <- rep(1, length(x))
y[x < 0] <- -1
y
[1] -1 1 -1 1
Pre-allocating vectors
One point worth mentioning is that if you do need to loop over a structure with a
for
orwhile
loop, always preallocate your results vector. You can create empty numeric vectors withnumeric(len)
and empty integer vectors withinteger(len)
, wherelen
is the length. For example, if you were to loop over two vectors to sum their values pairwise, do not use:x <- rnorm(10) y <- rnorm(10) res <- NULL for (i in 1:length(x)) { res <- c(res, x[i] + y[i]) }
First, this is completely unecessary since we could calculate this with
res <- x + y
. Second, this code would be needlessly slow because appending to vectors withres <- c(res, ...)
is extremely computationally expensive in R.R’s vectors are allocated in memory to the exact length that the need to be. So each iteration of the above code requires (1) allocating a new
res
vector one element longer, and (2) copying all ofres
’s current elements over to this new vector.The correct way to do this (assuming in your real code something in the
for
loop can’t be parallelized!) is:x <- rnorm(10) y <- rnorm(10) res <- numeric(10) # preallocation! for (i in 1:length(x)) { res[i] <- x[i] + y[i] }
Working with R Scripts
Although we’ve used R interactively, in practice your analyses should be kept in scripts that can be run many times throughout development. Scripts can be organized into project directories and checked into Git repositories. There’s also a host of excellent R tools (e.g., Knitr and Rmarkdown) to help in creating well-documented, reproducible projects in R.
You can run R scripts from R using the function source()
. For example, to execute an R script named my_analysis.R
use:
source("my_analysis.R")
Alternatively, we can execute a script in batch mode from the command line with:
Rscript --vanilla my_analysis.R
This comes in handy when you need to rerun an analysis on different files or with different arguments
provided on the command line. It’s a good practice to use –vanilla option because by default, Rscript
will restore any past saved environments and save its current environment after the execution completes.
Usually we don’t want R to restore any past state from previous runs, as this can lead to irreproducible
results. Additionally, saved environments can make it a nightmare to debug a script. See R --help
for
more information.
Lastly, if you want to retrieve command-line arguments passed to your script, use R’s commandArgs() with trailingOnly=TRUE. For example, this simple R script just prints all arguments:
echo "## args.R -- a simple script to show command line args
args <- commandArgs(TRUE)
print(args)" > args.R
We run this with:
Rscript --vanilla args.R arg1 arg2 arg3
[1] "arg1" "arg2" "arg3"
Working directory
It’s important to mind R’s working directory when writing scripts. Scripts should not use
setwd()
to set their working directory, as this is not portable to other systems (which won’t have the same directory structure). For the same reason, use relative paths like ../data/achievers.txt when loading in data, and not absolute paths like /Users/jlebowski/data/achievers.txt. Also, it’s a good idea to indicate (either in comments or a README file) which directory the user should set as their working directory.
Workflows for Loading and Combining Multiple Files
See pages 257-260 of the book if you need to load and combine multiple files
Exporting Data
At some point during an analysis, you’ll need to export data from R. We can export dataframes to plain-text files
using the R function write.table()
. Unfortunately, write.table()
has some poorly chosen defaults that we usually
need to adjust. For example, if we wanted to write our dataframe mtfs to a tab-delimited file named hotspot_motifs.txt,
we would use:
write.table(mtfs, file="hotspot_motifs.txt", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)
write.table()
’s first two arguments are the dataframe (or matrix) to write to file and the file path to save it to.
By default, write.table()
quotes factor and character columns and includes rownames. We disable these with quote=FALSE
and row.names=FALSE
. We also can set the column separators to tabs with the sep
argument. A header can be included or excluded with the col.names
argument.
Given that many Unix tools work well with Gzipped files, it may be useful to write a compressed version of a
file directly from R. In addition to a string file path, write.table
’s file argument also handles open file
connections. So to write our dataframe to a compressed file, we can open a gzipped file connect with gzfile():
hs_gzf <- gzfile("hotspot_motifs.txt.gz")
write.table(mtfs, file=hs_gzf, quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)
While plain-text formats are the preferable way to share tabular datasets, it’s not the best way to save
complex R data structures like lists or special R objects. In these cases, it’s usually preferable to save R
objects as R objects. Encoding and saving objects to disk in a way that allows them to be restored as the original
object is known as serialization. R’s functions for saving and loading R objects are save()
and load()
.
tmp <- list(vec=rnorm(4), df=data.frame(a=1:3, b=3:5))
save(tmp, file="example.Rdata")
rm(tmp) # remove the original 'tmp' list
load("example.Rdata") # this fully restores the 'tmp' list from file
str(tmp)
List of 2
$ vec: num [1:4] -2.061 -0.6 -0.954 0.558
$ df :'data.frame': 3 obs. of 2 variables:
..$ a: int [1:3] 1 2 3
..$ b: int [1:3] 3 4 5
The function save()
has an analogous function called save.image()
that saves all objects in your workspace
rather than just the objects you specify. Combined with savehistory()
, save.image()
can be a quick way to store
your past work in R in a rush.
Key Points
Use
if
andelse
to make choices.Use
for
to repeat operations.