This lesson is being piloted (Beta version)

R Data Skills for Bioinformatics

Welcome to the R part of the BCB546 course!

Our goal here is to teach you the basics of R and best practices for using R in data analysis. Although we use the template of the Software Carpentry workshop R for Reproducible Scientific Analysis, most lessons are based on the Bioinformatics Data Skills book by Vince Buffalo, the Advanced R book by Hadley Wickham, and the R for Data Science book by Garrett Grolemund and Hadley Wickham.

What is R?

More technical aspects of R

As a programming language, R adopts syntax and grammar that differ from many other languages: objects in R are ‘vectors’, and functions are ‘vectorized’ to operate on all elements of the object; R objects have ‘copy-on-modify’ and ‘call-by-value’ semantics; paradigms that are common in other languages, such as the ‘for’ loop, are used less often in R.
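The ideas in the paragraph above can be illustrated with a minimal sketch in base R. The variable names and values (`read_lengths`, `set_first_to_zero`, and so on) are hypothetical and chosen only for illustration; they are not part of the lesson's data.

```r
# Even a single number is a vector (of length 1)
length(5)

# Vectorization: the multiplication is applied to every element at once
read_lengths <- c(100, 150, 75)   # hypothetical example values
doubled <- read_lengths * 2

# The same result with an explicit 'for' loop, which is less idiomatic in R
doubled_loop <- numeric(length(read_lengths))
for (i in seq_along(read_lengths)) {
  doubled_loop[i] <- read_lengths[i] * 2
}

# Copy-on-modify: changing 'b' does not change 'a'
a <- c(1, 2, 3)
b <- a
b[1] <- 99

# Call-by-value semantics: the function works on its own copy of the argument,
# so the caller's object 'a' is left untouched
set_first_to_zero <- function(v) {
  v[1] <- 0
  v
}
result <- set_first_to_zero(a)
```

After running this, `doubled` and `doubled_loop` hold the same values, while `a` is still `c(1, 2, 3)` even though both `b` and `result` were derived from it and then changed.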

Why R?

IEEE Spectrum ranking

Brief Introduction to R

(from www.r-project.org)

R is a language and environment for statistical computing and graphics. It is similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different (GNU) implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical and machine-learning methods, and is highly extensible. In addition, R is a data programming language that can be used to explore and understand data in an open-ended, interactive, and iterative way.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

Note that in this course we will focus on the subset of the R language that allows you to conduct exploratory data analysis (EDA). There are many resources on campus and elsewhere to explore the other aspects of R.

Additional resources

Here are some additional resources that we found very useful:

Schedule

Setup: Download files required for the lesson

00:00  1. Getting Started with R and RStudio
       How to find your way around RStudio?
       How to start and organize a new project from RStudio?
       How to put the new project under version control and integrate with GitHub?

00:30  2. R Language Basics
       How to do simple calculations in R?
       How to assign values to variables and call functions?
       What are R’s vectors, vector data types, and vectorization?
       How to install packages?
       How can I get help in R?
       What is an R Markdown file and how can I create one?

01:20  3. Data Structures
       What are the basic data types in R?
       How do I represent categorical information in R?

02:40  4. Data Subsetting
       How can I work with subsets of data in R?

03:20  5. Data Transformation
       How to use dplyr to transform data in R?
       How do I select columns?
       How can I filter rows?
       How can I join two data frames?

04:50  6. Visualizing Data
       How can I explore my data by visualization in R?

06:20  7. Writing and Applying Functions to Data
       What are apply functions in R?
       How can I write a new function in R?

06:50  8. Developing Workflows with R Scripts
       How can I make data-dependent choices in R?
       How can I repeat operations in R?

07:55  9. Data Import
       How to read data into R?
       What are some potential problems with reading data in R?
       How to write data to files?

08:25  Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.