R Data Skills for Bioinformatics

Welcome to the R part of the BCB546 course!

Our goal here is to teach you the basics of R and best practices for using R in data analysis. Although we use the template of the Software Carpentry workshop "R for Reproducible Scientific Analysis", most lessons are based on the book "Bioinformatics Data Skills" by Vince Buffalo and the book "Advanced R" by Hadley Wickham.

Why R?

IEEE Spectrum ranking

Brief Introduction to R

(from www.r-project.org)

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical and machine-learning methods, and is highly extensible. In addition, R is a data programming language that can be used to explore and understand data in an open-ended, interactive, and iterative way.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

Note that in this course we will focus on the subset of the R language that allows you to conduct exploratory data analysis (EDA). There are many resources on campus and elsewhere for exploring the other aspects of R.
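To give a rough idea of what exploratory data analysis looks like in practice, here is a minimal sketch. It uses the `iris` data set that ships with base R; that choice is an assumption made for illustration only, and the course lessons may work with different data.

```r
# `iris` is bundled with base R (datasets package), so nothing needs downloading.
# It is used here only as a stand-in; it is not taken from the course materials.
data(iris)

# Inspect the structure: column names, types, and the first few values.
str(iris)

# Numeric summary (min, quartiles, mean, max) of one column.
summary(iris$Sepal.Length)

# Mean sepal length per species: a small split-apply-combine step.
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(species_means)

# A quick visual check of the same comparison.
boxplot(Sepal.Length ~ Species, data = iris,
        xlab = "Species", ylab = "Sepal length (cm)")
```

Each step can be run on its own at the R console, which is what makes this style of analysis interactive and iterative: look at the data, summarize it, plot it, then refine the question.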

Additional resources

Here are some additional resources that we found very useful:

Prerequisites

Understand that computers store data and instructions (programs, scripts etc.) in files. Files are organised in directories (folders). Know how to access files not in the working directory by specifying the path.
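As a quick self-check of the path prerequisite, the following sketch shows one way to read a file that is not in the working directory. The directory name `demo_data` and the file name `genes.csv` are hypothetical and are created in a temporary location so that the example runs anywhere.

```r
# Show the current working directory, where R looks for files by default.
print(getwd())

# Create a throwaway directory outside the working directory.
# "demo_data" and "genes.csv" are hypothetical names used only for this example.
demo_dir <- file.path(tempdir(), "demo_data")
dir.create(demo_dir, showWarnings = FALSE)

# file.path() joins path components with the correct separator.
demo_file <- file.path(demo_dir, "genes.csv")

# Write a tiny table so there is something to read back.
write.csv(data.frame(gene = c("a", "b"), length = c(120, 340)),
          demo_file, row.names = FALSE)

# Read the file by giving its full path instead of changing the working directory.
genes <- read.csv(demo_file)
print(genes)
```

Passing the path explicitly, rather than calling `setwd()`, keeps a script independent of where it happens to be run from.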

Schedule

00:00  Getting Started with R and RStudio
       How to find your way around RStudio?
       How to start and organize a new project from RStudio?
       How to put the new project under version control and integrate with GitHub?

00:30  R Language Basics
       How to do simple calculations in R?
       How to assign values to variables and call functions?
       What are R's vectors, vector data types, and vectorization?
       How to install packages?
       How can I get help in R?

01:20  Data Structures
       How can I read data in R?
       What are the basic data types in R?
       How do I represent categorical information in R?
       How can I work with subsets of data in R?

02:40  Exploring Data Frames
       How can I manipulate a data frame?
       How can I join two data frames?

04:30  Split-Apply-Combine
       How can I do different calculations on different sets of data?
       How can I manipulate data frames without repeating myself?

05:20  Visualizing Data
       How can I explore my data by visualization in R?

06:40  Developing Workflows with R Scripts
       How can I make data-dependent choices in R?
       How can I repeat operations in R?
       How do I write functions in R?

07:45  Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.