Data Transformation
Overview
Teaching: 60 min
Exercises: 30 min

Questions
- How do I use dplyr to transform data in R?
- How do I select columns?
- How can I filter rows?
- How can I join two data frames?

Objectives
- Be able to explore a dataframe.
- Be able to add, rename, and remove columns.
- Be able to filter rows.
- Be able to append two data frames.
- Be able to join two data frames.
Introduction
Every data analysis you conduct will likely involve manipulating dataframes at some point. Quickly extracting, transforming, and summarizing information from dataframes is an essential R skill. In this part we’ll use Hadley Wickham’s dplyr package, which consolidates and simplifies many of the common operations we perform on dataframes. In addition, much of dplyr’s key functionality is written in C++ for speed.
Preliminaries
Install (if needed) and load two libraries, tidyverse and tidyr:
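One way to do this (the install step is only needed once; tidyr is installed as part of the tidyverse):

# Install once, if needed:
install.packages("tidyverse")
# Load for this session:
library(tidyverse)
library(tidyr)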
Dataset
Following chapter 8 of Buffalo’s book, we will use the dataset Dataset_S1.txt from the paper “The Influence of Recombination on Human Genetic Diversity”. The dataset contains estimates of population genetics statistics such as nucleotide diversity (e.g., the columns Pi and Theta), recombination (column Recombination), and sequence divergence as estimated by percent identity between human and chimpanzee genomes (column Divergence). Other columns contain information about the sequencing depth (depth) and GC content (percent.GC) for 1kb windows in human chromosome 20. We’ll only work with a few columns in our examples; see the description of Dataset_S1.txt in the original paper for more details.
Download the dataset
We could download the Dataset_S1.txt file into a local folder of your choice using the download.file function.
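For example (a sketch; the destination file name is our own choice, and the file lands in the current working directory):

# Not run here; saves a local copy of the dataset
download.file(
  "https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/Dataset_S1.txt",
  destfile = "Dataset_S1.txt")

Instead, we will read the file directly from the Internet with the read_csv function: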
dvst <- read_csv("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/Dataset_S1.txt")
── Column specification ────────────────────────────────────────────
cols(
start = col_double(),
end = col_double(),
`total SNPs` = col_double(),
`total Bases` = col_double(),
depth = col_double(),
`unique SNPs` = col_double(),
dhSNPs = col_double(),
`reference Bases` = col_double(),
Theta = col_double(),
Pi = col_double(),
Heterozygosity = col_double(),
`%GC` = col_double(),
Recombination = col_double(),
Divergence = col_double(),
Constraint = col_double(),
SNPs = col_double()
)
Notice the parsing of columns in the file. Also notice that some of the column names contain spaces and special characters. We’ll replace them soon to facilitate typing.
We can check the structure of the data we created either with the str(dvst) command or by typing dvst:
dvst
# A tibble: 59,140 x 16
start end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 55001 56000 0 1894 3.41 0 0
2 56001 57000 5 6683 6.68 2 2
3 57001 58000 1 9063 9.06 1 0
4 58001 59000 7 10256 10.3 3 2
5 59001 60000 4 8057 8.06 4 0
6 60001 61000 6 7051 7.05 2 1
7 61001 62000 7 6950 6.95 2 1
8 62001 63000 1 8834 8.83 1 0
9 63001 64000 1 9629 9.63 1 0
10 64001 65000 3 7999 8 1 1
# … with 59,130 more rows, and 9 more variables: reference Bases <dbl>,
# Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
# Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>
It’s a tibble! Notice that the tibble only shows the first few rows and all the columns that fit on one screen.
Run View(dvst), which will open the dataset in the RStudio viewer, to see the whole dataframe.
Also notice that there are some additional variable types available:
Abbreviation | Variable type
---|---
int | integers
dbl | doubles (real numbers)
fctr | factors (categorical)

Variable types not in this dataset:

Abbreviation | Variable type
---|---
chr | character vectors (strings)
dttm | date-time
lgl | logical
date | dates
dplyr basics
There are five key dplyr functions that allow you to solve the vast majority of data manipulation challenges:
Key dplyr functions

Function | Action
---|---
filter() | Pick observations by their values
arrange() | Reorder the rows
select() | Pick variables by their names
mutate() | Create new variables with functions of existing variables
summarise() | Collapse many values down to a single summary
None of these functions perform tasks you can’t accomplish with R’s base functions. But dplyr’s advantage is in the added consistency, speed, and versatility of its data manipulation interface. Get this cheatsheet to keep as a reminder of these and several additional functions.
The group_by function
dplyr functions can be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
Overall command structure
All dplyr functions work similarly:
- The first argument is a data frame (actually, a tibble).
- The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
- The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
These six functions provide the verbs for a language of data manipulation. Let’s dive in and see how these verbs work.
Use filter() to filter rows
Let’s say we used summary to get an idea about the total number of SNPs in different windows:
summary(select(dvst,`total SNPs`)) # Here we also used `select` function. We'll talk about it soon.
total SNPs
Min. : 0.000
1st Qu.: 3.000
Median : 7.000
Mean : 8.906
3rd Qu.:12.000
Max. :93.000
The number varies considerably across all windows on chromosome 20 and the data are right-skewed: the third quartile is 12 SNPs, but the maximum is 93 SNPs. Often we want to investigate such outliers more closely.
We can use filter() to extract the rows of our dataframe based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.
filter(dvst,`total SNPs` >= 85)
# A tibble: 3 x 16
start end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2621001 2622000 93 11337 11.3 13 10
2 13023001 13024000 88 11784 11.8 11 1
3 47356001 47357000 87 12505 12.5 9 7
# … with 9 more variables: reference Bases <dbl>, Theta <dbl>, Pi <dbl>,
# Heterozygosity <dbl>, %GC <dbl>, Recombination <dbl>, Divergence <dbl>,
# Constraint <dbl>, SNPs <dbl>
Questions:
- How many windows have >= 85 SNPs?
- What is a SNP?
We can build more elaborate queries by using multiple filters:
filter(dvst, Pi > 16, `%GC` > 80)
# A tibble: 3 x 16
start end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 63097001 63098000 5 947 2.39 2 1
2 63188001 63189000 2 1623 3.21 2 0
3 63189001 63190000 5 1395 1.89 3 2
# … with 9 more variables: reference Bases <dbl>, Theta <dbl>, Pi <dbl>,
# Heterozygosity <dbl>, %GC <dbl>, Recombination <dbl>, Divergence <dbl>,
# Constraint <dbl>, SNPs <dbl>
Note that dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-:
new_df <- filter(dvst, Pi > 16, `%GC` > 80)
Tip
R either prints out the results, or saves them to a variable. If you want to do both, you can wrap the assignment in parentheses:
(new_df <- filter(dvst, Pi > 16, `%GC` > 80))
Comparisons
Use comparison operators to filter effectively: >, >=, <, <=, != (not equal), and == (equal).
Remember that == does not work well with real numbers! Use the dplyr near() function instead.
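A quick illustration of why exact comparison fails with floating-point numbers:

sqrt(2) ^ 2 == 2
[1] FALSE
near(sqrt(2) ^ 2, 2)
[1] TRUE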
Logical operators
Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ! is “not”.
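For example, to select windows with either very low or very high GC content (the 30/80 cutoffs are arbitrary choices for illustration):

filter(dvst, `%GC` < 30 | `%GC` > 80)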
The x %in% y operator
If you are looking for multiple values within a certain column, use the x %in% y operator. This will select every row where x is one of the values in y:
filter(dvst, `total SNPs` %in% c(0,1,2))
# A tibble: 12,446 x 16
start end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 55001 56000 0 1894 3.41 0 0
2 57001 58000 1 9063 9.06 1 0
3 62001 63000 1 8834 8.83 1 0
4 63001 64000 1 9629 9.63 1 0
5 65001 66000 0 8376 8.38 0 0
6 66001 67000 2 8431 8.43 2 0
7 68001 69000 1 5381 5.38 1 0
8 72001 73000 1 5276 5.28 1 0
9 73001 74000 1 6648 6.65 1 0
10 80001 81000 1 6230 6.23 1 0
# … with 12,436 more rows, and 9 more variables: reference Bases <dbl>,
# Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
# Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>
Finally, whenever you start using complicated, multipart expressions in filter(), consider making them explicit variables instead. That makes it much easier to check your work. We’ll learn how to create new variables shortly.
Missing values
One important feature of R that can make comparisons tricky is missing values, or NAs (“not availables”). NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown. Examples:
NA > 5
[1] NA
10 == NA
[1] NA
NA + 10
[1] NA
NA / 2
[1] NA
NA == NA
[1] NA
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:
filter(dvst, is.na(Divergence))
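If you want both the missing and, say, the highly diverged windows, combine the two conditions explicitly (the 0.02 cutoff is an arbitrary choice for illustration):

filter(dvst, is.na(Divergence) | Divergence > 0.02)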
Use mutate() to add new variables
It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate(). mutate() always adds new columns at the end of your dataset. Remember that when you’re in RStudio, the easiest way to see all the columns is with View().
Here we’ll add an additional column that indicates whether a window is in the centromere region (nucleotides 25,800,000 to 29,700,000, based on Giemsa banding; see the README for details). We’ll call it cent and make it logical (with TRUE/FALSE values):
mutate(dvst, cent = start >= 25800000 & end <= 29700000)
# A tibble: 59,140 x 17
start end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 55001 56000 0 1894 3.41 0 0
2 56001 57000 5 6683 6.68 2 2
3 57001 58000 1 9063 9.06 1 0
4 58001 59000 7 10256 10.3 3 2
5 59001 60000 4 8057 8.06 4 0
6 60001 61000 6 7051 7.05 2 1
7 61001 62000 7 6950 6.95 2 1
8 62001 63000 1 8834 8.83 1 0
9 63001 64000 1 9629 9.63 1 0
10 64001 65000 3 7999 8 1 1
# … with 59,130 more rows, and 10 more variables: reference Bases <dbl>,
# Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
# Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>,
# cent <lgl>
dvst
# A tibble: 59,140 x 16
start end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 55001 56000 0 1894 3.41 0 0
2 56001 57000 5 6683 6.68 2 2
3 57001 58000 1 9063 9.06 1 0
4 58001 59000 7 10256 10.3 3 2
5 59001 60000 4 8057 8.06 4 0
6 60001 61000 6 7051 7.05 2 1
7 61001 62000 7 6950 6.95 2 1
8 62001 63000 1 8834 8.83 1 0
9 63001 64000 1 9629 9.63 1 0
10 64001 65000 3 7999 8 1 1
# … with 59,130 more rows, and 9 more variables: reference Bases <dbl>,
# Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
# Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>
Question:
Why don’t we see the column in the dataset?
In the dataset we are using, the diversity estimate Pi is measured per sampling window (1kb) and scaled up by 10x (see supplementary Text S1 for more details). It would be useful to have this rescaled as per-basepair nucleotide diversity (to make the scale more intuitive).
mutate(dvst, diversity = Pi / (10*1000)) # rescale, removing 10x and making per bp
# A tibble: 59,140 x 17
start end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 55001 56000 0 1894 3.41 0 0
2 56001 57000 5 6683 6.68 2 2
3 57001 58000 1 9063 9.06 1 0
4 58001 59000 7 10256 10.3 3 2
5 59001 60000 4 8057 8.06 4 0
6 60001 61000 6 7051 7.05 2 1
7 61001 62000 7 6950 6.95 2 1
8 62001 63000 1 8834 8.83 1 0
9 63001 64000 1 9629 9.63 1 0
10 64001 65000 3 7999 8 1 1
# … with 59,130 more rows, and 10 more variables: reference Bases <dbl>,
# Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
# Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>,
# diversity <dbl>
Notice that we can combine the two operations in a single command (and also save the result):
dvst <- mutate(dvst,
diversity = Pi / (10*1000),
cent = start >= 25800000 & end <= 29700000
)
If you only want to keep the new variables, use transmute():
transmute(dvst,
diversity = Pi / (10*1000),
cent = start >= 25800000 & end <= 29700000
)
# A tibble: 59,140 x 2
diversity cent
<dbl> <lgl>
1 0 FALSE
2 0.00104 FALSE
3 0.000199 FALSE
4 0.000956 FALSE
5 0.000851 FALSE
6 0.000912 FALSE
7 0.000806 FALSE
8 0.000206 FALSE
9 0.000188 FALSE
10 0.000541 FALSE
# … with 59,130 more rows
Other creation functions
There are many functions for creating new variables that you can use with mutate(). The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output. There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful (a short sketch combining a few of them follows the list):
- Arithmetic operators: +, -, *, /, ^. These are all vectorised, using the so-called “recycling rules”.
- Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
- Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive.
- Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.
- Logical comparisons: <, <=, >, >=, !=, which you learned about earlier. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.
- Ranking: min_rank() does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives the smallest values the smallest ranks; use desc(x) to give the largest values the smallest ranks. If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile(). See their help pages for more details.
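As promised, a sketch combining a few of these with mutate() (the new column names window_kb, snp_change, cum_SNPs, and diversity_rank are invented for this example):

mutate(dvst,
  window_kb      = (end - start + 1) %/% 1000,  # integer division: window size in kb
  snp_change     = SNPs - lag(SNPs),            # offset: difference from the previous window
  cum_SNPs       = cumsum(SNPs),                # cumulative aggregate: running SNP total
  diversity_rank = min_rank(desc(Pi)))          # ranking: 1 = most diverse window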
Your turn!
- Find all sampling windows …
  - with the lowest/highest GC content
  - that have 0 total SNPs
  - that are incomplete (<1000 bp)
  - that have the highest divergence with the chimp
  - that have less than 5% GC difference from the mean
- One useful dplyr filtering helper is between().
  - Can you figure out what it does?
  - Use it to simplify the code needed to answer the previous challenges.
- How many windows fall into the centromeric region?
Use arrange() to arrange rows
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(dvst, cent, `%GC`)
# A tibble: 59,140 x 18
start end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 43448001 43449000 5 690 2.66 5 0
2 14930001 14931000 19 3620 4.57 7 1
3 37777001 37778000 3 615 1.12 3 0
4 54885001 54886000 1 2298 3.69 1 0
5 54888001 54889000 15 3823 3.87 11 1
6 41823001 41824000 15 6950 6.95 10 4
7 50530001 50531000 10 7293 7.29 2 2
8 39612001 39613000 13 8254 8.25 5 3
9 55012001 55013000 7 7028 7.03 7 0
10 22058001 22059000 4 281 1.2 4 0
# … with 59,130 more rows, and 11 more variables: reference Bases <dbl>,
# Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
# Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>,
# diversity <dbl>, cent <lgl>
Use desc() to re-order by a column in descending order:
arrange(dvst, desc(cent), `%GC`)
# A tibble: 59,140 x 18
start end `total SNPs` `total Bases` depth `unique SNPs` dhSNPs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 25869001 25870000 10 7292 7.29 6 2
2 29318001 29319000 2 6192 6.19 2 0
3 29372001 29373000 5 7422 7.42 3 2
4 26174001 26175000 1 7816 7.82 1 0
5 26182001 26183000 2 6630 6.63 1 1
6 25924001 25925000 30 7127 7.13 15 8
7 26155001 26156000 0 9449 9.45 0 0
8 25926001 25927000 21 8197 8.2 13 6
9 29519001 29520000 15 8135 8.14 13 2
10 25901001 25902000 9 9126 9.13 5 2
# … with 59,130 more rows, and 11 more variables: reference Bases <dbl>,
# Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, %GC <dbl>,
# Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>,
# diversity <dbl>, cent <lgl>
Note that missing values are always sorted at the end!
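A quick sketch with a toy tibble (df is an invented name):

df <- tibble(x = c(5, 2, NA))
arrange(df, x)        # 2, 5, NA — ascending, NA last
arrange(df, desc(x))  # 5, 2, NA — descending, NA still last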
Use select() to select columns and rename() to rename them
select() is not terribly useful with our data because it has only a few variables, but you can still get the general idea:
# Select columns by name
select(dvst, start, end, Divergence)
# A tibble: 59,140 x 3
start end Divergence
<dbl> <dbl> <dbl>
1 55001 56000 0.00301
2 56001 57000 0.0180
3 57001 58000 0.00701
4 58001 59000 0.0120
5 59001 60000 0.0240
6 60001 61000 0.0160
7 61001 62000 0.0120
8 62001 63000 0.0150
9 63001 64000 0.00901
10 64001 65000 0.00701
# … with 59,130 more rows
# Select all columns between depth and Pi (inclusive)
select(dvst, depth:Pi)
# A tibble: 59,140 x 6
depth `unique SNPs` dhSNPs `reference Bases` Theta Pi
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.41 0 0 556 0 0
2 6.68 2 2 1000 8.01 10.4
3 9.06 1 0 1000 3.51 1.99
4 10.3 3 2 1000 9.93 9.56
5 8.06 4 0 1000 12.9 8.51
6 7.05 2 1 1000 7.82 9.12
7 6.95 2 1 1000 8.60 8.06
8 8.83 1 0 1000 4.01 2.06
9 9.63 1 0 1000 3.37 1.88
10 8 1 1 1000 4.16 5.41
# … with 59,130 more rows
# Select all columns except those from start to percent.GC (inclusive)
select(dvst, -(start:`%GC`))
# A tibble: 59,140 x 6
Recombination Divergence Constraint SNPs diversity cent
<dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 0.00960 0.00301 0 0 0 FALSE
2 0.00960 0.0180 0 0 0.00104 FALSE
3 0.00960 0.00701 0 0 0.000199 FALSE
4 0.00960 0.0120 0 0 0.000956 FALSE
5 0.00960 0.0240 0 0 0.000851 FALSE
6 0.00960 0.0160 0 0 0.000912 FALSE
7 0.00960 0.0120 0 0 0.000806 FALSE
8 0.00960 0.0150 0 0 0.000206 FALSE
9 0.00960 0.00901 58 1 0.000188 FALSE
10 0.00958 0.00701 0 1 0.000541 FALSE
# … with 59,130 more rows
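select() also accepts tidyselect helper functions such as starts_with(), ends_with(), and contains(). For example:

select(dvst, contains("SNPs"))  # every column whose name contains "SNPs"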
select() can also be used to rename variables, but it’s rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that aren’t explicitly mentioned:
dvst <- rename(dvst, total.SNPs = `total SNPs`,
               total.Bases = `total Bases`,
               unique.SNPs = `unique SNPs`,
               reference.Bases = `reference Bases`,
               percent.GC = `%GC`) # renaming all the columns whose names required backticks
colnames(dvst)
[1] "start" "end" "total.SNPs" "total.Bases"
[5] "depth" "unique.SNPs" "dhSNPs" "reference.Bases"
[9] "Theta" "Pi" "Heterozygosity" "percent.GC"
[13] "Recombination" "Divergence" "Constraint" "SNPs"
[17] "diversity" "cent"
Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame:
select(dvst, cent, everything())
# A tibble: 59,140 x 18
cent start end total.SNPs total.Bases depth unique.SNPs dhSNPs
<lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 FALSE 55001 56000 0 1894 3.41 0 0
2 FALSE 56001 57000 5 6683 6.68 2 2
3 FALSE 57001 58000 1 9063 9.06 1 0
4 FALSE 58001 59000 7 10256 10.3 3 2
5 FALSE 59001 60000 4 8057 8.06 4 0
6 FALSE 60001 61000 6 7051 7.05 2 1
7 FALSE 61001 62000 7 6950 6.95 2 1
8 FALSE 62001 63000 1 8834 8.83 1 0
9 FALSE 63001 64000 1 9629 9.63 1 0
10 FALSE 64001 65000 3 7999 8 1 1
# … with 59,130 more rows, and 10 more variables: reference.Bases <dbl>,
# Theta <dbl>, Pi <dbl>, Heterozygosity <dbl>, percent.GC <dbl>,
# Recombination <dbl>, Divergence <dbl>, Constraint <dbl>, SNPs <dbl>,
# diversity <dbl>
Use summarise() to make summaries
The last key verb is summarise(). It collapses a data frame to a single row:
summarise(dvst, GC = mean(percent.GC, na.rm = TRUE),
          averageSNPs = mean(total.SNPs, na.rm = TRUE),
          allSNPs = sum(total.SNPs))
# A tibble: 1 x 3
GC averageSNPs allSNPs
<dbl> <dbl> <dbl>
1 44.1 8.91 526693
summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, here is exactly the same code applied to a data frame grouped by position relative to the centromere:
by_cent <- group_by(dvst, cent)
summarise(by_cent, GC = mean(percent.GC, na.rm = TRUE), averageSNPs=mean(total.SNPs, na.rm = TRUE), allSNPs=sum(total.SNPs))
# A tibble: 2 x 4
cent GC averageSNPs allSNPs
<lgl> <dbl> <dbl> <dbl>
1 FALSE 44.2 8.87 518743
2 TRUE 39.7 11.6 7950
Grouping rows by a logical condition can be a useful way to summarize data across two different conditions. For example, we might be curious whether the average depth in a window (the depth column) differs between very high GC content windows (greater than 80%) and all other windows:
by_GC <- group_by(dvst, percent.GC >= 80)
summarize(by_GC, depth=mean(depth))
# A tibble: 2 x 2
`percent.GC >= 80` depth
<lgl> <dbl>
1 FALSE 8.18
2 TRUE 2.24
This is a fairly large difference, but it’s important to consider how many windows this includes. Whenever you do any aggregation, it’s always a good idea to include either a count (n()) or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data:
summarize(by_GC, mean_depth=mean(depth), n_rows=n())
# A tibble: 2 x 3
`percent.GC >= 80` mean_depth n_rows
<lgl> <dbl> <int>
1 FALSE 8.18 59131
2 TRUE 2.24 9
Your turn
As another example, consider looking at Pi by windows that fall in the centromere and those that do not. Does the centromere have higher nucleotide diversity than other regions in these data?
Solutions
Because cent is a logical vector, we can group by it directly:
by_cent <- group_by(dvst, cent)
summarize(by_cent, nt_diversity=mean(diversity), min=which.min(diversity), n_rows=n())
# A tibble: 2 x 4
  cent  nt_diversity   min n_rows
  <lgl>        <dbl> <int>  <int>
1 FALSE      0.00123     1  58455
2 TRUE       0.00204     1    685
Indeed, the centromere does appear to have higher nucleotide diversity than other regions in this data.
Useful summary functions
Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
- Measures of location: we’ve used mean(x), but median(x) is also useful.
- Measures of spread: sd(x), IQR(x), mad(x). The root mean squared deviation, or standard deviation (sd for short), is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
- Measures of rank: min(x), quantile(x, 0.25), max(x). Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
- Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements).
- Counts: you’ve seen n(), which takes no arguments and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).
- Counts and proportions of logical values: sum(x > 10), mean(y == 0). When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
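A sketch combining several of these on the grouped data frame from above (the summary column names are invented for this example):

summarize(by_cent,
  med_Pi    = median(Pi, na.rm = TRUE),     # measure of location
  iqr_Pi    = IQR(Pi, na.rm = TRUE),        # robust measure of spread
  n_windows = n(),                          # group size
  prop_zero = mean(total.SNPs == 0))        # proportion of windows with no SNPs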
Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.
Combining multiple operations with the pipe
It looks like we have been creating and naming several intermediate data frames in our analysis, even though we didn’t care about them. Naming things is hard, so this slows down our analysis. There’s another way to tackle the same problem: the pipe, %>%:
dvst %>%
rename(GC.percent = percent.GC) %>%
group_by(GC.percent >= 80) %>%
summarize(mean_depth=mean(depth, na.rm = TRUE), n_rows=n())
# A tibble: 2 x 3
`GC.percent >= 80` mean_depth n_rows
<lgl> <dbl> <int>
1 FALSE 8.18 59131
2 TRUE 2.24 9
This focuses on the transformations, not what’s being transformed. You can read it as a series of imperative statements: rename, then group, then summarise, where %>% stands for “then”.
Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z), and so on.
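For instance, the following two commands are equivalent (mean_diversity is an invented name):

summarize(group_by(dvst, cent), mean_diversity = mean(diversity))
dvst %>% group_by(cent) %>% summarize(mean_diversity = mean(diversity))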
Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. However, the next iteration of ggplot2, ggvis, is supposed to use the pipe.
RStudio Tip
A shortcut for %>% is available in the newest RStudio releases under the keybinding CTRL + SHIFT + M (or CMD + SHIFT + M for OSX). You can also review the set of available keybindings within RStudio with ALT + SHIFT + K.
Missing values
You may have wondered about the na.rm argument we used above. What happens if we don’t set it? We may get a lot of missing values! That’s because aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation. We could have also removed all the rows that have an unknown value prior to the analysis with a filter(!is.na()) command.
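A minimal illustration with a toy vector:

x <- c(1, 2, NA)
mean(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 1.5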
Finally, one of the best features of dplyr is that all of these same methods also work with database connections. For example, you can manipulate a SQLite database with all of the same verbs we’ve used here.
Extra reading: tidy data
The tidyverse collection that we’ve been using is a universe of operations on tidy data. But what is tidy data?
Tidy data is a standard way of storing data where:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. Learn more about tidy data in vignette("tidy-data").
The tidyr package within tidyverse helps you create tidy data. tidyr functions fall into five main categories:
- Pivoting, which converts between long and wide forms. tidyr 1.0.0 introduces pivot_longer() and pivot_wider(), replacing the older spread() and gather() functions.
- Rectangling, which turns deeply nested lists (as from JSON) into tidy tibbles.
- Nesting, which converts grouped data to a form where each group becomes a single row containing a nested data frame; unnesting does the opposite.
- Splitting and combining character columns. Use separate() and extract() to pull a single character column into multiple columns; use unite() to combine multiple columns into a single character column.
- Handling missing values. Make implicit missing values explicit with complete(); make explicit missing values implicit with drop_na(); replace missing values with the next/previous value with fill(), or with a known value with replace_na().
See https://tidyr.tidyverse.org/ for more info.

Pivot Longer
pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns. pivot_longer() is commonly needed to tidy wild-caught datasets as they often optimise for ease of data entry or ease of comparison rather than ease of analysis. Here is an example of an untidy dataset:
relig_income
# A tibble: 18 x 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# … with 3 more variables: $100-150k <dbl>, >150k <dbl>,
#   Don't know/refused <dbl>
This dataset contains three variables: religion, stored in the rows; income, spread across the column names; and count, stored in the cell values. To tidy it we use pivot_longer():
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "count")
# A tibble: 180 x 3
   religion income             count
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# … with 170 more rows
- The first argument is the dataset to reshape, relig_income.
- The second argument describes which columns need to be reshaped. In this case, it’s every column apart from religion.
- The names_to argument gives the name of the variable that will be created from the data stored in the column names, i.e. income.
- The values_to argument gives the name of the variable that will be created from the data stored in the cell values, i.e. count.
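pivot_wider() performs the inverse operation; piping the tidied result back through it recovers the original wide layout (a sketch):

relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "count") %>%
  pivot_wider(names_from = income, values_from = count)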
See https://tidyr.tidyverse.org/articles/pivot.html for more info.
Key Points
- Read in a csv file using read_csv()
- View a dataframe with View()
- Use filter() to pick observations by their values
- Use arrange() to order the rows
- Use select() to pick variables by their names
- Use mutate() to create new variables with functions of existing variables
- Use summarize() to collapse many values down to a single summary