R Data Skills for Bioinformatics

Data Transformation

Overview

Teaching: 60 min
Exercises: 30 min
Questions
  • How to use dplyr to transform data in R?

  • How do I select columns

  • How can I filter rows

  • How can I join two data frames?

Objectives
  • Be able to explore a dataframe

  • Be able to add, rename, and remove columns.

  • Be able to filter rows

  • Be able to append two data frames

  • Be able to join two data frames

Data transformation

Introduction

Every data analysis you conduct will likely involve manipulating dataframes at some point. Quickly extracting, transforming, and summarizing information from dataframes is an essential R skill. In this part we’ll use Hadley Wickham’s dplyr package, which consolidates and simplifies many of the common operations we perform on dataframes. In addition, much of dplyr key functionality is written in C++ for speed.

if (!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)
library(tidyr)

Dataset

Following chapter 8 of the Buffalo’s book, we will use the dataset Dataset_S1.txt from the paper “The Influence of Recombination on Human Genetic Diversity”. The dataset contains estimates of population genetics statistics such as nucleotide diversity (e.g., the columns Pi and Theta), recombination (column Recombination), and sequence divergence as estimated by percent identity between human and chimpanzee genomes (column Divergence). Other columns contain information about the sequencing depth (depth), and GC content (percent.GC) for 1kb windows in human chromosome 20. We’ll only work with a few columns in our examples; see the description of Dataset_S1.txt in the original paper for more detail.

Downloading the dataset

download.file("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/Dataset_S1.txt", destfile = "data/Dataset_S1.txt")
dvst <- read_csv("data/Dataset_S1.txt")
dvst <- read_csv("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/Dataset_S1.txt")

So, let’s read the file directly into a tbl_df or tibble:

Notice the parsing of columns in the file. Also notice that some of the columns names contain spaces and special characters. We’ll replace them soon to facilitate typing. Type dvst to look at the dataframe:

Because it’s common to work with dataframes with more rows and columns than fit in your screen, tibble wraps dataframes so that they don’t fill your screen when you print them (similar to using head()). Notice that the tibble only shows the first few rows and all the columns that fit on one screen. Run View(dvst), which will open the dataset in the RStudio viewer, to see the whole dataframe. Also notice the row of three (or four) letter abbreviations under the column names. These describe the type of each variable:

Abbreviation Variable type
int integers
dbl doubles (real numbers)
fctr factors (categorical)
Variables not in the dataset
chr character vectors (strings)
dttm date-time
lgl logical
date dates

Although the standard R operations work with tibbles (see below), we’ll learn how to use dplyr functions instead

colnames(dvst)[12] <- "percent.GC" #rename X.GC columns
dvst$GC.binned <- cut(dvst$percent.GC, 5) # create a new column from existing data

dplyr basics

There are five key dplyr functions that allow you to solve the vast majority of data manipulation challenges:

Function Action
filter() Pick observations by their values
arrange() Reorder the rows
select() Pick variables by their names
mutate() Create new variables with functions of existing variables
summarise() Collapse many values down to a single summary

None of these functions perform tasks you can’t accomplish with R’s base functions. But dplyr’s advantage is in the added consistency, speed, and versatility of its data manipulation interface.

dplyr functions can be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.

All verbs work similarly:

  1. The first argument is a data frame (actually, a tibble).

  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

  3. The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.

===

Use filter() to filter rows

Let’s say we used summary to get an idea about the total number of SNPs in different windows:

summary(dvst$`total SNPs`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   7.000   8.906  12.000  93.000 
# or with dplyr
summary(select(dvst,`total SNPs`)) #this does not look simpler, but wait...
   total SNPs    
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 7.000  
 Mean   : 8.906  
 3rd Qu.:12.000  
 Max.   :93.000  

We see that this number varies considerably across all windows on chromosome 20 and the data are right-skewed: the third quartile is 12 SNPs, but the maximum is 93 SNPs. Often we want to investigate such outliers more closely. We can use filter() to extract the rows of our dataframe based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

filter(dvst,`total SNPs` >= 85)
# A tibble: 3 x 17
   start    end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <int>  <int>        <int>         <int> <dbl>         <int>  <int>
1 2.62e6 2.62e6           93         11337  11.3            13     10
2 1.30e7 1.30e7           88         11784  11.8            11      1
3 4.74e7 4.74e7           87         12505  12.5             9      7
# … with 10 more variables: `reference Bases` <int>, Theta <dbl>,
#   Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>, Recombination <dbl>,
#   Divergence <dbl>, Constraint <int>, SNPs <int>, GC.binned <fct>

We can build more elaborate queries by using multiple filters. For example, suppose we wanted to see all windows where Pi (nucleotide diversity) is greater than 16 and percent GC is greater than 80. We’d use:

filter(dvst, Pi > 16, percent.GC > 80)
# A tibble: 3 x 17
   start    end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <int>  <int>        <int>         <int> <dbl>         <int>  <int>
1 6.31e7 6.31e7            5           947  2.39             2      1
2 6.32e7 6.32e7            2          1623  3.21             2      0
3 6.32e7 6.32e7            5          1395  1.89             3      2
# … with 10 more variables: `reference Bases` <int>, Theta <dbl>,
#   Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>, Recombination <dbl>,
#   Divergence <dbl>, Constraint <int>, SNPs <int>, GC.binned <fct>

When you run that line of code, dplyr executes the filtering operation and returns a new data frame. Note, dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-:

new_df <- filter(dvst, Pi > 16, percent.GC > 80)

Miscellaneous Tips

R either prints out the results, or saves them to a variable. If you want to do both, you can wrap the assignment in parentheses:

(new_df <- filter(dvst, Pi > 16, percent.GC > 80))
# A tibble: 3 x 17
   start    end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <int>  <int>        <int>         <int> <dbl>         <int>  <int>
1 6.31e7 6.31e7            5           947  2.39             2      1
2 6.32e7 6.32e7            2          1623  3.21             2      0
3 6.32e7 6.32e7            5          1395  1.89             3      2
# … with 10 more variables: `reference Bases` <int>, Theta <dbl>,
#   Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>, Recombination <dbl>,
#   Divergence <dbl>, Constraint <int>, SNPs <int>, GC.binned <fct>

Comparisons

To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal). Note, that because == does not work well with real numbers, use dplyr near function instead!

Logical operators

Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ! is “not”. The figure below shows the complete set of Boolean operations.

Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region show which parts each operator selects.

If you are looking for multiple values within a certain column, use the x %in% y shortcut. This will select every row where x is one of the values in y:

filter(dvst, total.SNPs %in% c(0,1,2))

Finally, whenever you start using complicated, multipart expressions in filter(), consider making them explicit variables instead. That makes it much easier to check your work. We’ll learn how to create new variables shortly.

## Missing values

One important feature of R that can make comparison tricky are missing values, or NAs (“not availables”). NA represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

 NA > 5
 [1] NA
 10 == NA
 [1] NA
 NA + 10
 [1] NA
 NA / 2
 [1] NA
 NA == NA
 [1] NA

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

filter(dvst, is.na(Divergence))

Your turn!

  1. Find all windows …
    • with the lowest/highest GC content
    • that have 0 total SNPs
    • that are incomplete (<1000 bp)
    • that have the highest divergence with the chimp
    • that have less than 5% GC difference from the mean
  2. One useful dplyr filtering helper is between().
    • Can you figure out what does it do?
    • Use it to simplify the code needed to answer the previous challenges.

===

Use mutate() to add new variables

It’s often useful to add new columns that are functions of existing columns (notice, how we used dvst$GC.binned <- cut(dvst$percent.GC, 5) above). That’s the job of mutate().

mutate() always adds new columns at the end of your dataset. Remember that when you’re in RStudio, the easiest way to see all the columns is with View().

Here we’ll add an additional column that indicates whether a window is in the centromere region (nucleotides 25,800,000 to 29,700,000, based on Giemsa banding; see README for details).

We’ll call it cent and make it logical (with TRUE/FALSE values):

mutate(dvst, cent = start >= 25800000 & end <= 29700000)
# A tibble: 59,140 x 18
   start   end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <int> <int>        <int>         <int> <dbl>         <int>  <int>
 1 55001 56000            0          1894  3.41             0      0
 2 56001 57000            5          6683  6.68             2      2
 3 57001 58000            1          9063  9.06             1      0
 4 58001 59000            7         10256 10.3              3      2
 5 59001 60000            4          8057  8.06             4      0
 6 60001 61000            6          7051  7.05             2      1
 7 61001 62000            7          6950  6.95             2      1
 8 62001 63000            1          8834  8.83             1      0
 9 63001 64000            1          9629  9.63             1      0
10 64001 65000            3          7999  8                1      1
# … with 59,130 more rows, and 11 more variables: `reference Bases` <int>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <int>, SNPs <int>,
#   GC.binned <fct>, cent <lgl>
dvst
# A tibble: 59,140 x 17
   start   end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <int> <int>        <int>         <int> <dbl>         <int>  <int>
 1 55001 56000            0          1894  3.41             0      0
 2 56001 57000            5          6683  6.68             2      2
 3 57001 58000            1          9063  9.06             1      0
 4 58001 59000            7         10256 10.3              3      2
 5 59001 60000            4          8057  8.06             4      0
 6 60001 61000            6          7051  7.05             2      1
 7 61001 62000            7          6950  6.95             2      1
 8 62001 63000            1          8834  8.83             1      0
 9 63001 64000            1          9629  9.63             1      0
10 64001 65000            3          7999  8                1      1
# … with 59,130 more rows, and 10 more variables: `reference Bases` <int>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <int>, SNPs <int>,
#   GC.binned <fct>

Challenge 1

Solution

As discussed above, dplyr functions never modify their inputs

Solution

Use filter()

filter(d_df,cent==TRUE)
Error in filter(d_df, cent == TRUE): object 'd_df' not found

or sum()

sum(d_df$cent)
Error in eval(expr, envir, enclos): object 'd_df' not found

In the dataset we are using, the diversity estimate Pi is measured per sampling window (1kb) and scaled up by 10x (see supplementary Text S1 for more details). It would be useful to have this scaled as per basepair nucleotide diversity (so as to make the scale more intuitive).

mutate(dvst, diversity = Pi / (10*1000)) # rescale, removing 10x and making per bp
# A tibble: 59,140 x 18
   start   end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <int> <int>        <int>         <int> <dbl>         <int>  <int>
 1 55001 56000            0          1894  3.41             0      0
 2 56001 57000            5          6683  6.68             2      2
 3 57001 58000            1          9063  9.06             1      0
 4 58001 59000            7         10256 10.3              3      2
 5 59001 60000            4          8057  8.06             4      0
 6 60001 61000            6          7051  7.05             2      1
 7 61001 62000            7          6950  6.95             2      1
 8 62001 63000            1          8834  8.83             1      0
 9 63001 64000            1          9629  9.63             1      0
10 64001 65000            3          7999  8                1      1
# … with 59,130 more rows, and 11 more variables: `reference Bases` <int>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <int>, SNPs <int>,
#   GC.binned <fct>, diversity <dbl>

Notice, that we can combine the two operations in a single command (and also save the result):

dvst <- mutate(dvst, 
       diversity = Pi / (10*1000), 
       cent = start >= 25800000 & end <= 29700000
)

If you only want to keep the new variables, use transmute():

transmute(dvst,
  diversity = Pi / (10*1000), 
  cent = start >= 25800000 & end <= 29700000
)
# A tibble: 59,140 x 2
   diversity cent 
       <dbl> <lgl>
 1  0        FALSE
 2  0.00104  FALSE
 3  0.000199 FALSE
 4  0.000956 FALSE
 5  0.000851 FALSE
 6  0.000912 FALSE
 7  0.000806 FALSE
 8  0.000206 FALSE
 9  0.000188 FALSE
10  0.000541 FALSE
# … with 59,130 more rows

Useful creation functions

There are many functions for creating new variables that you can use with mutate(). The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output. There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:

===

Use arrange() to arrange rows

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(dvst, cent, percent.GC)
# A tibble: 59,140 x 19
    start    end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
    <int>  <int>        <int>         <int> <dbl>         <int>  <int>
 1 4.34e7 4.34e7            5           690  2.66             5      0
 2 1.49e7 1.49e7           19          3620  4.57             7      1
 3 3.78e7 3.78e7            3           615  1.12             3      0
 4 5.49e7 5.49e7            1          2298  3.69             1      0
 5 5.49e7 5.49e7           15          3823  3.87            11      1
 6 4.18e7 4.18e7           15          6950  6.95            10      4
 7 5.05e7 5.05e7           10          7293  7.29             2      2
 8 3.96e7 3.96e7           13          8254  8.25             5      3
 9 5.50e7 5.50e7            7          7028  7.03             7      0
10 2.21e7 2.21e7            4           281  1.2              4      0
# … with 59,130 more rows, and 12 more variables: `reference Bases` <int>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <int>, SNPs <int>,
#   GC.binned <fct>, diversity <dbl>, cent <lgl>

Use desc() to re-order by a column in descending order:

arrange(dvst, desc(cent), percent.GC)
# A tibble: 59,140 x 19
    start    end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
    <int>  <int>        <int>         <int> <dbl>         <int>  <int>
 1 2.59e7 2.59e7           10          7292  7.29             6      2
 2 2.93e7 2.93e7            2          6192  6.19             2      0
 3 2.94e7 2.94e7            5          7422  7.42             3      2
 4 2.62e7 2.62e7            1          7816  7.82             1      0
 5 2.62e7 2.62e7            2          6630  6.63             1      1
 6 2.59e7 2.59e7           30          7127  7.13            15      8
 7 2.62e7 2.62e7            0          9449  9.45             0      0
 8 2.59e7 2.59e7           21          8197  8.2             13      6
 9 2.95e7 2.95e7           15          8135  8.14            13      2
10 2.59e7 2.59e7            9          9126  9.13             5      2
# … with 59,130 more rows, and 12 more variables: `reference Bases` <int>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <int>, SNPs <int>,
#   GC.binned <fct>, diversity <dbl>, cent <lgl>

Note, that missing values are always sorted at the end!

Exercises

  1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

===

Use select() to select columns and rename() to rename them

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

select() is not terribly useful with our data because they only have a few variables, but you can still get the general idea:

# Select columns by name
select(dvst, start, end, Divergence)
# A tibble: 59,140 x 3
   start   end Divergence
   <int> <int>      <dbl>
 1 55001 56000    0.00301
 2 56001 57000    0.0180 
 3 57001 58000    0.00701
 4 58001 59000    0.0120 
 5 59001 60000    0.0240 
 6 60001 61000    0.0160 
 7 61001 62000    0.0120 
 8 62001 63000    0.0150 
 9 63001 64000    0.00901
10 64001 65000    0.00701
# … with 59,130 more rows
# Select all columns between depth and Pi (inclusive)
select(dvst, depth:Pi)
# A tibble: 59,140 x 6
   depth `unique SNPs` dhSNPs `reference Bases` Theta    Pi
   <dbl>         <int>  <int>             <int> <dbl> <dbl>
 1  3.41             0      0               556  0     0   
 2  6.68             2      2              1000  8.01 10.4 
 3  9.06             1      0              1000  3.51  1.99
 4 10.3              3      2              1000  9.93  9.56
 5  8.06             4      0              1000 12.9   8.51
 6  7.05             2      1              1000  7.82  9.12
 7  6.95             2      1              1000  8.60  8.06
 8  8.83             1      0              1000  4.01  2.06
 9  9.63             1      0              1000  3.37  1.88
10  8                1      1              1000  4.16  5.41
# … with 59,130 more rows
# Select all columns except those from start to percent.GC (inclusive)
select(dvst, -(start:percent.GC))
# A tibble: 59,140 x 7
   Recombination Divergence Constraint  SNPs GC.binned   diversity cent 
           <dbl>      <dbl>      <int> <int> <fct>           <dbl> <lgl>
 1       0.00960    0.00301          0     0 (51.6,68.5]  0        FALSE
 2       0.00960    0.0180           0     0 (34.7,51.6]  0.00104  FALSE
 3       0.00960    0.00701          0     0 (34.7,51.6]  0.000199 FALSE
 4       0.00960    0.0120           0     0 (34.7,51.6]  0.000956 FALSE
 5       0.00960    0.0240           0     0 (34.7,51.6]  0.000851 FALSE
 6       0.00960    0.0160           0     0 (34.7,51.6]  0.000912 FALSE
 7       0.00960    0.0120           0     0 (34.7,51.6]  0.000806 FALSE
 8       0.00960    0.0150           0     0 (34.7,51.6]  0.000206 FALSE
 9       0.00960    0.00901         58     1 (34.7,51.6]  0.000188 FALSE
10       0.00958    0.00701          0     1 (17.7,34.7]  0.000541 FALSE
# … with 59,130 more rows

There are a number of helper functions you can use within select():

See ?select for more details.

select() can also be used to rename variables, but it’s rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that aren’t explicitly mentioned:

dvst <- rename(dvst, total.SNPs = `total SNPs`,
       total.Bases = `total Bases`,
       unique.SNPs = `unique SNPs`,
       reference.Bases = `reference Bases`) #renaming all the columns with spaces!
colnames(dvst)
 [1] "start"           "end"             "total.SNPs"     
 [4] "total.Bases"     "depth"           "unique.SNPs"    
 [7] "dhSNPs"          "reference.Bases" "Theta"          
[10] "Pi"              "Heterozygosity"  "percent.GC"     
[13] "Recombination"   "Divergence"      "Constraint"     
[16] "SNPs"            "GC.binned"       "diversity"      
[19] "cent"           

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

select(dvst, cent, everything())
# A tibble: 59,140 x 19
   cent  start   end total.SNPs total.Bases depth unique.SNPs dhSNPs
   <lgl> <int> <int>      <int>       <int> <dbl>       <int>  <int>
 1 FALSE 55001 56000          0        1894  3.41           0      0
 2 FALSE 56001 57000          5        6683  6.68           2      2
 3 FALSE 57001 58000          1        9063  9.06           1      0
 4 FALSE 58001 59000          7       10256 10.3            3      2
 5 FALSE 59001 60000          4        8057  8.06           4      0
 6 FALSE 60001 61000          6        7051  7.05           2      1
 7 FALSE 61001 62000          7        6950  6.95           2      1
 8 FALSE 62001 63000          1        8834  8.83           1      0
 9 FALSE 63001 64000          1        9629  9.63           1      0
10 FALSE 64001 65000          3        7999  8              1      1
# … with 59,130 more rows, and 11 more variables: reference.Bases <int>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <int>, SNPs <int>,
#   GC.binned <fct>, diversity <dbl>

===

Use summarise() to make summaries

The last key verb is summarise(). It collapses a data frame to a single row:

summarise(dvst, GC = mean(percent.GC, na.rm = TRUE), averageSNPs=mean(total.SNPs, na.rm = TRUE), allSNPs=sum(total.SNPs))
# A tibble: 1 x 3
     GC averageSNPs allSNPs
  <dbl>       <dbl>   <int>
1  44.1        8.91  526693

summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by position related to centromere:

by_cent <- group_by(dvst, cent)
summarise(by_cent, GC = mean(percent.GC, na.rm = TRUE), averageSNPs=mean(total.SNPs, na.rm = TRUE), allSNPs=sum(total.SNPs))
# A tibble: 2 x 4
  cent     GC averageSNPs allSNPs
  <lgl> <dbl>       <dbl>   <int>
1 FALSE  44.2        8.87  518743
2 TRUE   39.7       11.6     7950

Subsetting columns can be a useful way to summarize data across two different conditions. For example, we might be curious if the average depth in a window (the depth column) differs between very high GC content windows (greater than 80%) and all other windows:

by_GC <- group_by(dvst, percent.GC >= 80)
summarize(by_GC, depth=mean(depth))
# A tibble: 2 x 2
  `percent.GC >= 80` depth
  <lgl>              <dbl>
1 FALSE               8.18
2 TRUE                2.24

This is a fairly large difference, but it’s important to consider how many windows this includes. Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data:

summarize(by_GC, mean_depth=mean(depth), n_rows=n())
# A tibble: 2 x 3
  `percent.GC >= 80` mean_depth n_rows
  <lgl>                   <dbl>  <int>
1 FALSE                    8.18  59131
2 TRUE                     2.24      9

Challenge 6

As another example, consider looking at Pi by windows that fall in the centromere and those that do not. Does the centromer have higher nucleotide diversity than other regions in these data?

Solutions to challenge 6

Because cent is a logical vector, we can group by it directly:

by_cent <- group_by(dvst, cent)
summarize(by_cent, nt_diversity=mean(diversity), min=which.min(diversity), n_rows=n())
# A tibble: 2 x 4
  cent  nt_diversity   min n_rows
  <lgl>        <dbl> <int>  <int>
1 FALSE      0.00123     1  58455
2 TRUE       0.00204     1    685

Indeed, the centromere does appear to have higher nucleotide diversity than other regions in this data.

Useful summary functions

Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:

Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.

Combining multiple operations with the pipe

It looks like we have been creating and naming several intermediate files in our anlysis, even though we didn’t care about them. Naming things is hard, so this slows down our analysis.

There’s another way to tackle the same problem with the pipe, %>%:

dvst %>%
  rename(GC.percent = percent.GC) %>%
  group_by(GC.percent >= 80) %>%
  summarize(mean_depth=mean(depth, na.rm = TRUE), n_rows=n())
# A tibble: 2 x 3
  `GC.percent >= 80` mean_depth n_rows
  <lgl>                   <dbl>  <int>
1 FALSE                    8.18  59131
2 TRUE                     2.24      9

This focuses on the transformations, not what’s being transformed. You can read it as a series of imperative statements: group, then summarise, then filter, where %>% stands for “then”.

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on.

Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. However, the next iteration of ggplot2, ggvis, will use the pipe.

RStudio Tip

A shortcut for %>% is available in the newest RStudio releases under the keybinding CTRL + SHIFT + M (or CMD + SHIFT + M for OSX).
You can also review the set of available keybindings when within RStudio with ALT + SHIFT + K. ```

Missing values

You may have wondered about the na.rm argument we used above. What happens if we don’t set it? We may get a lot of missing values! That’s because aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation: We could have also removed all the rows that have uknown position value prior to the analysis with the filter(!is.na()) command

Finally, one of the best features of dplyr is that all of these same methods also work with database connections. For example, you can manipulate a SQLite database with all of the same verbs we’ve used here.

Extra reading: tidy data

The collection tidyverse that we’ve been using is a universe of operation on tidy data. But what is tidy data?

Tidy data is a standard way of storing data where:

If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. Learn more about tidy data in vignette(“tidy-data”).

tidyr package within tidyverse helps you create tidy data

tidyr functions fall into five main categories:

Pivot Longer

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.
pivot_longer() is commonly needed to tidy wild-caught datasets as they often optimise for ease of data entry or ease of comparison rather than ease of analysis.

Here is an example of an untidy dataset:

relig_income
# A tibble: 18 x 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Agnostic      27        34        60        81        76       137
 2 Atheist       12        27        37        52        35        70
 3 Buddhist      27        21        30        34        33        58
 4 Catholic     418       617       732       670       638      1116
 5 Don’t k…      15        14        15        11        10        35
 6 Evangel…     575       869      1064       982       881      1486
 7 Hindu          1         9         7         9        11        34
 8 Histori…     228       244       236       238       197       223
 9 Jehovah…      20        27        24        24        21        30
10 Jewish        19        19        25        25        30        95
11 Mainlin…     289       495       619       655       651      1107
12 Mormon        29        40        48        51        56       112
13 Muslim         6         7         9        10         9        23
14 Orthodox      13        17        23        32        32        47
15 Other C…       9         7        11        13        13        14
16 Other F…      20        33        40        46        49        63
17 Other W…       5         2         3         4         2         7
18 Unaffil…     217       299       374       365       341       528
# … with 4 more variables: `$75-100k` <dbl>, `$100-150k` <dbl>,
#   `>150k` <dbl>, `Don't know/refused` <dbl>

This dataset contains three variables: religion, stored in the rows, income spread across the column names, and count stored in the cell values.

To tidy it we use pivot_longer():

relig_income %>% pivot_longer(-religion, names_to = "income", values_to = "count")
# A tibble: 180 x 3
   religion income             count
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# … with 170 more rows

See https://tidyr.tidyverse.org/articles/pivot.html for more info.

Key Points