Getting Started with R and RStudio

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How to find your way around RStudio?

How to start and organize a new project from RStudio

How to put the new project under version control and integrate with GitHub?

Objectives

To gain familiarity with the various panes in the RStudio IDE

To gain familiarity with the buttons, short cuts and options in the RStudio IDE

To understand variables and how to assign to them

To be able to manage your workspace in an interactive R session

To be able to create self-contained projects in RStudio

To be able to use git from within RStudio

Motivation

Science is a multi-step process: once you’ve designed an experiment and collected data, the real fun begins (but remember R.A.Fisher)! Here we will explore R as a tool to organize raw data, perform exploratory analyses, and learn how to plot results graphically. But to start, we will review how to interact with R and use the RStudio.

Before Starting This Lesson

Please ensure you have the latest version of R and RStudio installed on your machine. This is important, as some packages may not install correctly (or at all) if R is not up to date.

Download and install the latest version of R here

Download and install RStudio here

Introduction to RStudio

Throughout this lesson, we’ll be using RStudio: a free, open source R integrated development environment. It provides a built in editor, works on all platforms (including on servers) and supports many features useful for working in R: syntax highlighting, quick access to R’s help system, plots visible alongside code, and integration with version control.

How to use R?

Much of your time in R will be spent in the R interactive console. This is where you will run all of your code, and can be a useful environment to try out ideas before adding them to an R script file. This console in RStudio is the same as the one you would get if you typed in R in your command-line environment.

The first thing you will see in the R interactive session is a bunch of information, followed by a “>” and a blinking cursor. In many ways this is similar to the shell environment you learned about during the unix lessons: it operates on the same idea of a “Read, evaluate, print loop”: you type in commands, R tries to execute them, and then returns a result.

Work flow within RStudio

There are three main ways one can work within RStudio.

Test and play within the interactive R console then copy code into a .R file to run later.
- This works well when doing small tests and initially starting off.
- You can use the history panel to copy all the commands to the source file.
Start writing in an .R file and use RStudio’s command / short cut to push current line, selected lines or modified lines to the interactive R console.
- This works well if you are more advanced in writing R code;
- All your code is saved for later
- You will be able to run the file you create from within RStudio or using R’s source() function.
Better yet, you can write your code in R markdown file. An R Markdown file allows you to incorporate your code in a narration that a reader needs to understand it, run it, and display your results. You can use multiple languages including R, Python, and SQL.

Tip: Creating an R Markdown document

You can create a new R Markdown document in RStudio with the menu command File -> New File -> RMarkdown_. A new windows opens and asks you to enter title, author, and default output format for your R Markdown document. After you enter this information, a new R Markdown file opens in a new tab.
Notice that the file contains three types of content:

An (optional) YAML header surrounded by —s;

R code chunks surrounded by ```s;

text mixed with simple text formatting.

You can use the “Knit” button in the RStudio IDE to render the file and preview the output with a single click or keyboard shortcut (Shift + Cmd + K). Note that you can quickly insert chunks of code into your file with the keyboard shortcut Ctrl + Alt + I (OS X: Cmd + Option + I) or the Add Chunk command in the editor toolbar. You can run the code in a chunk by pressing the green triangle on the right. To run the current line, you can:

click on the Run button above the editor panel, or

select “Run Lines” from the “Code” menu, or

hit Ctrl-Enter in Windows or Linux or Command-Enter on OS X.

Basic layout

When you first open RStudio, you will be greeted by three panels:

The interactive R console (entire left)
Environment/History (tabbed in upper right)
Files/Plots/Packages/Help/Viewer (tabbed in lower right)

RStudio layout

Once you open files, such as R scripts, an editor panel will also open in the top left.

RStudio layout with .R file open

Tip: Do not save your working directory

When you exit R/RStudio, you probably get a prompt about saving your workspace image - don’t do this! In fact, it is recommended that you turn this feature off so you are not tempted: Go to Tools > Global Options and click “Never” in the dropdown next to “Save workspace to .RData on exit”

Creating a new project

One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be using this today to create a self-contained, reproducible project.

Challenge: Creating a self-contained project

We’re going to create a new project in RStudio:

Click the “File” menu button, then “New Project”.

Click “New Directory”.

Click “Empty Project”.

Type in the name of the directory to store your project, e.g. “EEOB546_R_lesson”.

Make sure that the checkbox for “Create a git repository” is selected.

Click the “Create Project” button.

Now when we start R in this project directory, or open this project with RStudio, all of our work on this project will be entirely self-contained in this directory.

Best practices for project organization

There are several general principles to adhere to that will make project management easier:

Treat data as read only

This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with them interactively (e.g., in Excel) where they can be modified means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.

Data Cleaning

In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. It can be useful to store these scripts in a separate folder, and create a second “read-only” data folder to hold the “cleaned” data sets.

Treat generated output as disposable

Anything generated by your scripts should be treated as disposable: you should be able to regenerate it from your scripts.

Tip: Good Enough Practices for Scientific Computing

Good Enough Practices for Scientific Computing gives the following recommendations for project organization:

Put each project in its own directory, which is named after the project.

Put text documents associated with the project in the doc directory.

Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.

Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.

Name all files to reflect their content or function.

Separate function definition and application

When your project is new and shiny, the script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to separate these into separate folders; one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.

Tip: avoiding duplication

You may find yourself using data or analysis scripts across several projects. Typically you want to avoid duplication to save space and avoid having to make updates to code in multiple places.

For this, you can use “symbolic links”, which are essentially shortcuts to files somewhere else on a filesystem. On Linux and OS X you can use the ln -s command, and on Windows you can either create a shortcut or use the mklink command from the windows terminal.

Challenge 1

Create a new readme file in the project directory you created and and some information about the project.

Version Control

We also set up our project to integrate with git, putting it under version control. RStudio has a nicer interface to git than shell, but is somewhat limited in what it can do, so you will find yourself occasionally needing to use the shell. Let’s go through and make an initial commit of our template files.

The workspace/history pane has a tab for “Git”. We can stage each file by checking the box: you will see a green “A” next to stage files and folders, and yellow question marks next to files or folders git doesn’t know about yet. RStudio also nicely shows you the difference between files from different commits.

Tip: versioning disposable output

Generally you do not want to version disposable output (or read-only data). You should modify the .gitignore file to tell git to ignore these files and directories.

Challenge 2

Create a directory within your project called graphs.

Modify the .gitignore file to contain graphs/ so that this disposable output isn’t versioned.

Add the newly created folders to version control using the git interface.
Solution to Challenge 2

This can be done with the command line:
$ mkdir graphs
$ echo "graphs/" >> .gitignore

Now we want to push the contents of this commit to GitHub, so it is also backed-up off site and available to collaborators.

Challenge 3
In GitHub, create a New repository, called here BCB546-R-Exercise. Don’t initialize it with a README file because we’ll be importing an existing repository…

Make sure you have a proper public key in your GitHub settings
In RStudio, again click Tools -> Shell … . Enter:
$ echo "# BCB546-R-Exercise" >> README.md
$ git add README.md
$ git commit -m "first commit"
$ git remote add origin git@github.com/[path to your directory]
$ git push -u origin master
Change README.md file to indicate the changes you made

Commit and push it from the Git tab on the Environment/history panel of Rstudio.

Key Points

Use RStudio to create and manage projects with consistent layout.

Treat raw data as read-only.

Treat generated output as disposable.

Separate function definition and application.

Use version control.

lesson home

R Data Skills for Bioinformatics

next episode

Getting Started with R and RStudio

Overview

Motivation

Before Starting This Lesson

Introduction to RStudio

How to use R?

Work flow within RStudio

Tip: Creating an R Markdown document

Tip: Do not save your working directory

Creating a new project

Challenge: Creating a self-contained project

Best practices for project organization

Treat data as read only

Data Cleaning

Treat generated output as disposable

Tip: Good Enough Practices for Scientific Computing

Separate function definition and application

Tip: avoiding duplication

Challenge 1

Version Control

Tip: versioning disposable output

Challenge 2

Solution to Challenge 2

Challenge 3

Key Points

lesson home

next episode