
Data Transformation

Overview

Teaching: 60 min
Exercises: 30 min
Questions
  • How do I use dplyr to transform data in R?

  • How do I select columns?

  • How can I filter rows?

  • How can I join two data frames?

Objectives
  • Be able to explore a data frame

  • Be able to add, rename, and remove columns

  • Be able to filter rows

  • Be able to append two data frames

  • Be able to join two data frames

Data transformation

Introduction

Every data analysis you conduct will likely involve manipulating dataframes at some point. Quickly extracting, transforming, and summarizing information from dataframes is an essential R skill. In this part we’ll use Hadley Wickham’s dplyr package, which consolidates and simplifies many of the common operations we perform on dataframes. In addition, much of dplyr’s key functionality is written in C++ for speed.

Preliminaries

Install (if needed) and load two libraries, tidyverse and tidyr. A minimal setup sketch:
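
# install.packages("tidyverse")   # uncomment if the tidyverse is not yet installed
library(tidyverse)   # attaches the core packages, including dplyr, readr, and tidyr
library(tidyr)       # already attached by the tidyverse; loading it again is harmless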

Dataset

Following chapter 8 of Vince Buffalo’s book Bioinformatics Data Skills, we will use the dataset Dataset_S1.txt from the paper “The Influence of Recombination on Human Genetic Diversity”. The dataset contains estimates of population genetics statistics such as nucleotide diversity (e.g., the columns Pi and Theta), recombination (column Recombination), and sequence divergence as estimated by percent identity between human and chimpanzee genomes (column Divergence). Other columns contain information about the sequencing depth (depth) and GC content (percent.GC) for 1 kb windows in human chromosome 20. We’ll only work with a few columns in our examples; see the description of Dataset_S1.txt in the original paper for more details.

Download the dataset

We could download the Dataset_S1.txt file into a local folder of your choice using the download.file function; a minimal sketch (the destination filename is just an example):
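
download.file("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/Dataset_S1.txt",
              destfile = "Dataset_S1.txt")

Instead, we will read the file directly from the Internet with the read_csv function: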

dvst <- read_csv("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/Dataset_S1.txt")

── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  start = col_double(),
  end = col_double(),
  `total SNPs` = col_double(),
  `total Bases` = col_double(),
  depth = col_double(),
  `unique SNPs` = col_double(),
  dhSNPs = col_double(),
  `reference Bases` = col_double(),
  Theta = col_double(),
  Pi = col_double(),
  Heterozygosity = col_double(),
  `%GC` = col_double(),
  Recombination = col_double(),
  Divergence = col_double(),
  Constraint = col_double(),
  SNPs = col_double()
)

Notice the parsing of columns in the file. Also notice that some of the column names contain spaces and special characters. We’ll replace them soon to facilitate typing.

We can inspect the resulting data structure either with the str(dvst) command or simply by typing dvst:

dvst
# A tibble: 59,140 x 16
   start   end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <dbl> <dbl>        <dbl>         <dbl> <dbl>         <dbl>  <dbl>
 1 55001 56000            0          1894  3.41             0      0
 2 56001 57000            5          6683  6.68             2      2
 3 57001 58000            1          9063  9.06             1      0
 4 58001 59000            7         10256 10.3              3      2
 5 59001 60000            4          8057  8.06             4      0
 6 60001 61000            6          7051  7.05             2      1
 7 61001 62000            7          6950  6.95             2      1
 8 62001 63000            1          8834  8.83             1      0
 9 63001 64000            1          9629  9.63             1      0
10 64001 65000            3          7999  8                1      1
# … with 59,130 more rows, and 9 more variables: reference Bases <dbl>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>

It’s a tibble! Notice that the tibble only shows the first few rows and all the columns that fit on one screen.

Run View(dvst), which will open the dataset in the RStudio viewer, to see the whole dataframe.

Also notice that there are some additional variable types available:

Abbreviation   Variable type
int            integers
dbl            doubles (real numbers)
fctr           factors (categorical)

Types not present in this dataset:

chr            character vectors (strings)
dttm           date-times
lgl            logical
date           dates

dplyr basics

There are five key dplyr functions that allow you to solve the vast majority of data manipulation challenges:

Key dplyr functions

Function      Action
filter()      Pick observations by their values
arrange()     Reorder the rows
select()      Pick variables by their names
mutate()      Create new variables with functions of existing variables
summarise()   Collapse many values down to a single summary

None of these functions perform tasks you can’t accomplish with R’s base functions. But dplyr’s advantage is in the added consistency, speed, and versatility of its data manipulation interface. RStudio’s data transformation cheatsheet is a handy reminder of these and several additional functions.

group_by function

dplyr functions can be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.

Overall command structure

All dplyr functions work similarly:

  1. The first argument is a data frame (actually, a tibble).

  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

  3. The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
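
For instance, a call like the following takes the tibble dvst as its first argument, refers to the depth column by its bare name, and returns a new tibble, leaving dvst itself unchanged:

filter(dvst, depth > 10)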

Together with group_by(), these functions provide the verbs for a language of data manipulation. Let’s dive in and see how these verbs work.


Use filter() to filter rows

Let’s say we used summary to get an idea about the total number of SNPs in different windows:

summary(select(dvst, `total SNPs`)) # here we also use the select() function; we'll talk about it soon
   total SNPs    
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 7.000  
 Mean   : 8.906  
 3rd Qu.:12.000  
 Max.   :93.000  

The number varies considerably across all windows on chromosome 20 and the data are right-skewed: the third quartile is 12 SNPs, but the maximum is 93 SNPs. Often we want to investigate such outliers more closely.

We can use filter() to extract the rows of our dataframe based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

filter(dvst,`total SNPs` >= 85)
# A tibble: 3 x 16
     start      end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
     <dbl>    <dbl>        <dbl>         <dbl> <dbl>         <dbl>  <dbl>
1  2621001  2622000           93         11337  11.3            13     10
2 13023001 13024000           88         11784  11.8            11      1
3 47356001 47357000           87         12505  12.5             9      7
# … with 9 more variables: reference Bases <dbl>, Theta <dbl>, Pi <dbl>,
#   Heterozygosity <dbl>, %GC <dbl>, Recombination <dbl>, Divergence <dbl>,
#   Constraint <dbl>, SNPs <dbl>

Questions:

  • How many windows have >= 85 SNPs?
  • What is a SNP?

We can build more elaborate queries by using multiple filters:

filter(dvst, Pi > 16, `%GC` > 80)
# A tibble: 3 x 16
     start      end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
     <dbl>    <dbl>        <dbl>         <dbl> <dbl>         <dbl>  <dbl>
1 63097001 63098000            5           947  2.39             2      1
2 63188001 63189000            2          1623  3.21             2      0
3 63189001 63190000            5          1395  1.89             3      2
# … with 9 more variables: reference Bases <dbl>, Theta <dbl>, Pi <dbl>,
#   Heterozygosity <dbl>, %GC <dbl>, Recombination <dbl>, Divergence <dbl>,
#   Constraint <dbl>, SNPs <dbl>

Note that dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-:

new_df <- filter(dvst, Pi > 16, `%GC` > 80)

Tip

R either prints out the results, or saves them to a variable. If you want to do both, you can wrap the assignment in parentheses:

(new_df <- filter(dvst, Pi > 16, `%GC` > 80))

Comparisons

Use comparison operators to filter effectively: >, >=, <, <=, != (not equal), and == (equal). Remember that == does not work reliably with real (floating-point) numbers; use dplyr’s near() function instead!
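
For example, floating-point arithmetic makes exact comparison with == unreliable:

sqrt(2) ^ 2 == 2       # FALSE: floating-point representation error
near(sqrt(2) ^ 2, 2)   # TRUE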

Logical operators

Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ! is “not”. The figure below shows the complete set of Boolean operations.

[Figure: the complete set of Boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.]
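
For example, to find windows that satisfy either of the two conditions we previously combined with “and”:

filter(dvst, Pi > 16 | `%GC` > 80)   # high diversity OR high GC content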

x %in% y operator

If you are looking for multiple values within a certain column, use the x %in% y operator. This will select every row where x is one of the values in y:

filter(dvst, `total SNPs` %in% c(0,1,2))
# A tibble: 12,446 x 16
   start   end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <dbl> <dbl>        <dbl>         <dbl> <dbl>         <dbl>  <dbl>
 1 55001 56000            0          1894  3.41             0      0
 2 57001 58000            1          9063  9.06             1      0
 3 62001 63000            1          8834  8.83             1      0
 4 63001 64000            1          9629  9.63             1      0
 5 65001 66000            0          8376  8.38             0      0
 6 66001 67000            2          8431  8.43             2      0
 7 68001 69000            1          5381  5.38             1      0
 8 72001 73000            1          5276  5.28             1      0
 9 73001 74000            1          6648  6.65             1      0
10 80001 81000            1          6230  6.23             1      0
# … with 12,436 more rows, and 9 more variables: reference Bases <dbl>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>

Finally, whenever you start using complicated, multipart expressions in filter(), consider making them explicit variables instead. That makes it much easier to check your work. We’ll learn how to create new variables shortly.

Missing values

One important feature of R that can make comparisons tricky is missing values, or NAs (“not available”). NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

Examples:

 NA > 5
 [1] NA
 10 == NA
 [1] NA
 NA + 10
 [1] NA
 NA / 2
 [1] NA
 NA == NA
 [1] NA

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

filter(dvst, is.na(Divergence))
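
To keep the missing windows alongside other rows of interest, combine is.na() with |; the divergence cutoff here is only an illustration:

filter(dvst, is.na(Divergence) | Divergence > 0.02)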

Use mutate() to add new variables

It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate(). mutate() always adds new columns at the end of your dataset. Remember that when you’re in RStudio, the easiest way to see all the columns is with View().

Here we’ll add an additional column that indicates whether a window is in the centromere region (nucleotides 25,800,000 to 29,700,000, based on Giemsa banding; see README for details).

We’ll call it cent and make it logical (with TRUE/FALSE values):

mutate(dvst, cent = start >= 25800000 & end <= 29700000)
# A tibble: 59,140 x 17
   start   end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <dbl> <dbl>        <dbl>         <dbl> <dbl>         <dbl>  <dbl>
 1 55001 56000            0          1894  3.41             0      0
 2 56001 57000            5          6683  6.68             2      2
 3 57001 58000            1          9063  9.06             1      0
 4 58001 59000            7         10256 10.3              3      2
 5 59001 60000            4          8057  8.06             4      0
 6 60001 61000            6          7051  7.05             2      1
 7 61001 62000            7          6950  6.95             2      1
 8 62001 63000            1          8834  8.83             1      0
 9 63001 64000            1          9629  9.63             1      0
10 64001 65000            3          7999  8                1      1
# … with 59,130 more rows, and 10 more variables: reference Bases <dbl>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>,
#   cent <lgl>
dvst
# A tibble: 59,140 x 16
   start   end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <dbl> <dbl>        <dbl>         <dbl> <dbl>         <dbl>  <dbl>
 1 55001 56000            0          1894  3.41             0      0
 2 56001 57000            5          6683  6.68             2      2
 3 57001 58000            1          9063  9.06             1      0
 4 58001 59000            7         10256 10.3              3      2
 5 59001 60000            4          8057  8.06             4      0
 6 60001 61000            6          7051  7.05             2      1
 7 61001 62000            7          6950  6.95             2      1
 8 62001 63000            1          8834  8.83             1      0
 9 63001 64000            1          9629  9.63             1      0
10 64001 65000            3          7999  8                1      1
# … with 59,130 more rows, and 9 more variables: reference Bases <dbl>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>

Question:

Why don’t we see the column in the dataset?

In the dataset we are using, the diversity estimate Pi is measured per 1 kb sampling window and scaled up by 10x (see supplementary Text S1 for more details). It would be useful to rescale it to per-basepair nucleotide diversity, to make the scale more intuitive.

mutate(dvst, diversity = Pi / (10*1000)) # rescale, removing 10x and making per bp
# A tibble: 59,140 x 17
   start   end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
   <dbl> <dbl>        <dbl>         <dbl> <dbl>         <dbl>  <dbl>
 1 55001 56000            0          1894  3.41             0      0
 2 56001 57000            5          6683  6.68             2      2
 3 57001 58000            1          9063  9.06             1      0
 4 58001 59000            7         10256 10.3              3      2
 5 59001 60000            4          8057  8.06             4      0
 6 60001 61000            6          7051  7.05             2      1
 7 61001 62000            7          6950  6.95             2      1
 8 62001 63000            1          8834  8.83             1      0
 9 63001 64000            1          9629  9.63             1      0
10 64001 65000            3          7999  8                1      1
# … with 59,130 more rows, and 10 more variables: reference Bases <dbl>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>,
#   diversity <dbl>

Notice that we can combine the two operations in a single command (and also save the result):

dvst <- mutate(dvst, 
       diversity = Pi / (10*1000), 
       cent = start >= 25800000 & end <= 29700000
)

If you only want to keep the new variables, use transmute():

transmute(dvst,
  diversity = Pi / (10*1000), 
  cent = start >= 25800000 & end <= 29700000
)
# A tibble: 59,140 x 2
   diversity cent 
       <dbl> <lgl>
 1  0        FALSE
 2  0.00104  FALSE
 3  0.000199 FALSE
 4  0.000956 FALSE
 5  0.000851 FALSE
 6  0.000912 FALSE
 7  0.000806 FALSE
 8  0.000206 FALSE
 9  0.000188 FALSE
10  0.000541 FALSE
# … with 59,130 more rows

Other creation functions

There are many functions for creating new variables that you can use with mutate(). The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output. There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:

  • Arithmetic operators: +, -, *, /, ^. These are all vectorised, using the so-called “recycling rules”.

  • Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces.

  • Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive.

  • Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.

  • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.

  • Logical comparisons, <, <=, >, >=, !=, which you learned about earlier. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.

  • Ranking: min_rank() does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives the smallest values the smallest ranks; use desc(x) to give the largest values the smallest ranks. If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile(). See their help pages for more details.
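
As a quick sketch combining a few of these on our data (the new column names are just examples):

mutate(dvst,
       log_depth = log10(depth),                  # log-transform sequencing depth
       snp_rank  = min_rank(desc(`total SNPs`)),  # rank windows by SNP count, largest first
       snp_total = cumsum(`total SNPs`))          # running total of SNPs along the chromosome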

Your turn!

  1. Find all sampling windows …
    • with the lowest/highest GC content
    • that have 0 total SNPs
    • that are incomplete (<1000 bp)
    • that have the highest divergence with the chimp
    • that have less than 5% GC difference from the mean
  2. One useful dplyr filtering helper is between().
    • Can you figure out what it does?
    • Use it to simplify the code needed to answer the previous challenges.
  3. How many windows fall into the centromeric region?

Use arrange() to arrange rows

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(dvst, cent, `%GC`)
# A tibble: 59,140 x 18
      start      end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
      <dbl>    <dbl>        <dbl>         <dbl> <dbl>         <dbl>  <dbl>
 1 43448001 43449000            5           690  2.66             5      0
 2 14930001 14931000           19          3620  4.57             7      1
 3 37777001 37778000            3           615  1.12             3      0
 4 54885001 54886000            1          2298  3.69             1      0
 5 54888001 54889000           15          3823  3.87            11      1
 6 41823001 41824000           15          6950  6.95            10      4
 7 50530001 50531000           10          7293  7.29             2      2
 8 39612001 39613000           13          8254  8.25             5      3
 9 55012001 55013000            7          7028  7.03             7      0
10 22058001 22059000            4           281  1.2              4      0
# … with 59,130 more rows, and 11 more variables: reference Bases <dbl>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>,
#   diversity <dbl>, cent <lgl>

Use desc() to re-order by a column in descending order:

arrange(dvst, desc(cent), `%GC`)
# A tibble: 59,140 x 18
      start      end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
      <dbl>    <dbl>        <dbl>         <dbl> <dbl>         <dbl>  <dbl>
 1 25869001 25870000           10          7292  7.29             6      2
 2 29318001 29319000            2          6192  6.19             2      0
 3 29372001 29373000            5          7422  7.42             3      2
 4 26174001 26175000            1          7816  7.82             1      0
 5 26182001 26183000            2          6630  6.63             1      1
 6 25924001 25925000           30          7127  7.13            15      8
 7 26155001 26156000            0          9449  9.45             0      0
 8 25926001 25927000           21          8197  8.2             13      6
 9 29519001 29520000           15          8135  8.14            13      2
10 25901001 25902000            9          9126  9.13             5      2
# … with 59,130 more rows, and 11 more variables: reference Bases <dbl>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>,
#   diversity <dbl>, cent <lgl>

Note that missing values are always sorted at the end!

Use select() to select columns and rename() to rename them

select() is not terribly useful with our dataset because it has relatively few variables, but you can still get the general idea:

# Select columns by name
select(dvst, start, end, Divergence)
# A tibble: 59,140 x 3
   start   end Divergence
   <dbl> <dbl>      <dbl>
 1 55001 56000    0.00301
 2 56001 57000    0.0180 
 3 57001 58000    0.00701
 4 58001 59000    0.0120 
 5 59001 60000    0.0240 
 6 60001 61000    0.0160 
 7 61001 62000    0.0120 
 8 62001 63000    0.0150 
 9 63001 64000    0.00901
10 64001 65000    0.00701
# … with 59,130 more rows
# Select all columns between depth and Pi (inclusive)
select(dvst, depth:Pi)
# A tibble: 59,140 x 6
   depth `unique SNPs` dhSNPs `reference Bases` Theta    Pi
   <dbl>         <dbl>  <dbl>             <dbl> <dbl> <dbl>
 1  3.41             0      0               556  0     0   
 2  6.68             2      2              1000  8.01 10.4 
 3  9.06             1      0              1000  3.51  1.99
 4 10.3              3      2              1000  9.93  9.56
 5  8.06             4      0              1000 12.9   8.51
 6  7.05             2      1              1000  7.82  9.12
 7  6.95             2      1              1000  8.60  8.06
 8  8.83             1      0              1000  4.01  2.06
 9  9.63             1      0              1000  3.37  1.88
10  8                1      1              1000  4.16  5.41
# … with 59,130 more rows
# Select all columns except those from start to `%GC` (inclusive)
select(dvst, -(start:`%GC`))
# A tibble: 59,140 x 6
   Recombination Divergence Constraint  SNPs diversity cent 
           <dbl>      <dbl>      <dbl> <dbl>     <dbl> <lgl>
 1       0.00960    0.00301          0     0  0        FALSE
 2       0.00960    0.0180           0     0  0.00104  FALSE
 3       0.00960    0.00701          0     0  0.000199 FALSE
 4       0.00960    0.0120           0     0  0.000956 FALSE
 5       0.00960    0.0240           0     0  0.000851 FALSE
 6       0.00960    0.0160           0     0  0.000912 FALSE
 7       0.00960    0.0120           0     0  0.000806 FALSE
 8       0.00960    0.0150           0     0  0.000206 FALSE
 9       0.00960    0.00901         58     1  0.000188 FALSE
10       0.00958    0.00701          0     1  0.000541 FALSE
# … with 59,130 more rows

select() can also be used to rename variables, but it’s rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that aren’t explicitly mentioned:

dvst <- rename(dvst, total.SNPs = `total SNPs`,
       total.Bases = `total Bases`,
       unique.SNPs = `unique SNPs`,
       reference.Bases = `reference Bases`,
       percent.GC = `%GC`) # renaming all the columns whose names require backticks
colnames(dvst)
 [1] "start"           "end"             "total.SNPs"      "total.Bases"    
 [5] "depth"           "unique.SNPs"     "dhSNPs"          "reference.Bases"
 [9] "Theta"           "Pi"              "Heterozygosity"  "percent.GC"     
[13] "Recombination"   "Divergence"      "Constraint"      "SNPs"           
[17] "diversity"       "cent"           

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame:

select(dvst, cent, everything())
# A tibble: 59,140 x 18
   cent  start   end total.SNPs total.Bases depth unique.SNPs dhSNPs
   <lgl> <dbl> <dbl>      <dbl>       <dbl> <dbl>       <dbl>  <dbl>
 1 FALSE 55001 56000          0        1894  3.41           0      0
 2 FALSE 56001 57000          5        6683  6.68           2      2
 3 FALSE 57001 58000          1        9063  9.06           1      0
 4 FALSE 58001 59000          7       10256 10.3            3      2
 5 FALSE 59001 60000          4        8057  8.06           4      0
 6 FALSE 60001 61000          6        7051  7.05           2      1
 7 FALSE 61001 62000          7        6950  6.95           2      1
 8 FALSE 62001 63000          1        8834  8.83           1      0
 9 FALSE 63001 64000          1        9629  9.63           1      0
10 FALSE 64001 65000          3        7999  8              1      1
# … with 59,130 more rows, and 10 more variables: reference.Bases <dbl>,
#   Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>,
#   Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>,
#   diversity <dbl>

Use summarise() to make summaries

The last key verb is summarise(). It collapses a data frame to a single row:

summarise(dvst,
          GC = mean(percent.GC, na.rm = TRUE),
          averageSNPs = mean(total.SNPs, na.rm = TRUE),
          allSNPs = sum(total.SNPs))
# A tibble: 1 x 3
     GC averageSNPs allSNPs
  <dbl>       <dbl>   <dbl>
1  44.1        8.91  526693

summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame, they’ll be automatically applied “by group”. For example, here we apply exactly the same code to a data frame grouped by position relative to the centromere:

by_cent <- group_by(dvst, cent)
summarise(by_cent,
          GC = mean(percent.GC, na.rm = TRUE),
          averageSNPs = mean(total.SNPs, na.rm = TRUE),
          allSNPs = sum(total.SNPs))
# A tibble: 2 x 4
  cent     GC averageSNPs allSNPs
  <lgl> <dbl>       <dbl>   <dbl>
1 FALSE  44.2        8.87  518743
2 TRUE   39.7       11.6     7950

Grouping by a logical expression can be a useful way to summarize data across two different conditions. For example, we might be curious whether the average depth in a window (the depth column) differs between very high GC content windows (greater than 80%) and all other windows:

by_GC <- group_by(dvst, percent.GC >= 80)
summarize(by_GC, depth=mean(depth))
# A tibble: 2 x 2
  `percent.GC >= 80` depth
  <lgl>              <dbl>
1 FALSE               8.18
2 TRUE                2.24

This is a fairly large difference, but it’s important to consider how many windows this includes. Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data:

summarize(by_GC, mean_depth=mean(depth), n_rows=n())
# A tibble: 2 x 3
  `percent.GC >= 80` mean_depth n_rows
  <lgl>                   <dbl>  <int>
1 FALSE                    8.18  59131
2 TRUE                     2.24      9

Your turn

As another example, consider looking at Pi by windows that fall in the centromere and those that do not. Does the centromere have higher nucleotide diversity than other regions in these data?

Solutions

Because cent is a logical vector, we can group by it directly:

by_cent <- group_by(dvst, cent)
summarize(by_cent, nt_diversity=mean(diversity), min=which.min(diversity), n_rows=n())
# A tibble: 2 x 4
  cent  nt_diversity   min n_rows
  <lgl>        <dbl> <int>  <int>
1 FALSE      0.00123     1  58455
2 TRUE       0.00204     1    685

Indeed, the centromere does appear to have higher nucleotide diversity than other regions in these data.

Useful summary functions

Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:

  • Measures of location: we’ve used mean(x), but median(x) is also useful.
  • Measures of spread: sd(x), IQR(x), mad(x). The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
  • Measures of rank: min(x), quantile(x, 0.25), max(x). Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
  • Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements).
  • Counts: You’ve seen n(), which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).
  • Counts and proportions of logical values: sum(x > 10), mean(y == 0). When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
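
A sketch combining several of these on our grouped data (the summary column names are just examples):

by_cent <- group_by(dvst, cent)
summarize(by_cent,
          median_Pi = median(Pi),               # a robust measure of location
          iqr_Pi    = IQR(Pi),                  # a robust measure of spread
          n_windows = n(),                      # group size
          n_diverg  = sum(!is.na(Divergence)))  # non-missing divergence values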

Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.

Combining multiple operations with the pipe

It looks like we have been creating and naming several intermediate objects in our analysis (by_cent, by_GC), even though we didn’t care about them. Naming things is hard, and this slows down our analysis.

There’s another way to tackle the same problem with the pipe, %>%:

dvst %>%
  rename(GC.percent = percent.GC) %>%
  group_by(GC.percent >= 80) %>%
  summarize(mean_depth=mean(depth, na.rm = TRUE), n_rows=n())
# A tibble: 2 x 3
  `GC.percent >= 80` mean_depth n_rows
  <lgl>                   <dbl>  <int>
1 FALSE                    8.18  59131
2 TRUE                     2.24      9

This focuses on the transformations, not what’s being transformed. You can read it as a series of imperative statements: rename, then group, then summarise, where %>% stands for “then”.

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on.
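
For instance, the following two calls are equivalent (n_windows is just an example name):

summarize(group_by(dvst, cent), n_windows = n())         # nested form: g(f(x, y), z)
dvst %>% group_by(cent) %>% summarize(n_windows = n())   # piped form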

Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. However, the next iteration of ggplot2, ggvis, is supposed to use the pipe.

RStudio Tip

A shortcut for %>% is available in the newest RStudio releases under the keybinding CTRL + SHIFT + M (or CMD + SHIFT + M for OSX).
You can also review the set of available keybindings within RStudio with ALT + SHIFT + K.

Missing values

You may have wondered about the na.rm argument we used above. What happens if we don’t set it? We may get a lot of missing values! That’s because aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument, which removes the missing values prior to computation. Alternatively, we could remove all the rows with unknown values before the analysis, using filter() with !is.na().
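
A small illustration:

mean(c(1, 2, NA))                 # NA: the missing value is contagious
mean(c(1, 2, NA), na.rm = TRUE)   # 1.5: NA is removed before computing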

Finally, one of the best features of dplyr is that all of these same methods also work with database connections. For example, you can manipulate a SQLite database with all of the same verbs we’ve used here.
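
A minimal sketch, assuming the DBI, RSQLite, and dbplyr packages are installed, and that a hypothetical file windows.sqlite contains a table named windows with the same columns as dvst:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "windows.sqlite")  # hypothetical database file
tbl(con, "windows") %>%   # lazy reference to the (hypothetical) table
  filter(Pi > 16) %>%     # translated to SQL behind the scenes
  collect()               # run the query and pull the results into R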

Extra reading: tidy data

The tidyverse collection that we’ve been using is a universe of operations on tidy data. But what is tidy data?

Tidy data is a standard way of storing data where:

  • Every column is a variable.
  • Every row is an observation.
  • Every cell is a single value.

If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. Learn more about tidy data in vignette("tidy-data").

The tidyr package within the tidyverse helps you create tidy data

tidyr functions fall into five main categories:

  • Pivoting, which converts between long and wide forms. tidyr 1.0.0 introduced pivot_longer() and pivot_wider(), replacing the older gather() and spread() functions.
  • Rectangling, which turns deeply nested lists (as from JSON) into tidy tibbles.
  • Nesting converts grouped data to a form where each group becomes a single row containing a nested data frame, and unnesting does the opposite.
  • Splitting and combining character columns. Use separate() and extract() to pull a single character column into multiple columns; use unite() to combine multiple columns into a single character column.
  • Handling missing values: make implicit missing values explicit with complete(); make explicit missing values implicit with drop_na(); replace missing values with the next/previous value with fill(), or with a known value with replace_na().

See https://tidyr.tidyverse.org/ for more info.

Pivot Longer

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.
pivot_longer() is commonly needed to tidy wild-caught datasets, as they often optimise for ease of data entry or ease of comparison rather than ease of analysis.

Here is an example of an untidy dataset:

relig_income
# A tibble: 18 x 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# … with 3 more variables: $100-150k <dbl>, >150k <dbl>,
#   Don't know/refused <dbl>

This dataset contains three variables: religion, stored in the rows, income spread across the column names, and count stored in the cell values.

To tidy it we use pivot_longer():

relig_income %>% pivot_longer(-religion, names_to = "income", values_to = "count")
# A tibble: 180 x 3
   religion income             count
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# … with 170 more rows

  • The first argument is the dataset to reshape, relig_income.
  • The second argument describes which columns need to be reshaped. In this case, it’s every column apart from religion.
  • The names_to gives the name of the variable that will be created from the data stored in the column names, i.e. income.
  • The values_to gives the name of the variable that will be created from the data stored in the cell value, i.e. count.

See https://tidyr.tidyverse.org/articles/pivot.html for more info.

Key Points

  • Read in a CSV file using read_csv()

  • View a data frame with View()

  • Use filter() to pick observations by their values

  • Use arrange() to order the rows

  • Use select() to pick variables by their names

  • Use mutate() to create new variables with functions of existing variables

  • Use summarize() to collapse many values down to a single summary