+ - 0:00:00
Notes for current slide

Notes for the first slide!

Notes for next slide

Working with Range Data

Buffalo Chapter 9

1 / 139

Notes for the first slide!

Introduction

2 / 139

Introduction

Chromosomes and Coordinate Systems:

3 / 139

Introduction

Chromosomes and Coordinate Systems:

  • because chromosomes are linear strings of nucleotides, they can be represented with a coordinate system
4 / 139

Introduction

Chromosomes and Coordinate Systems:

  • because chromosomes are linear strings of nucleotides, they can be represented with a coordinate system

  • the coordinate system allows us to reference particular regions of a chromosome:

    • a gene
    • an exon
    • a transposable element
5 / 139

Introduction

Chromosomes and Coordinate Systems:

  • because chromosomes are linear strings of nucleotides, they can be represented with a coordinate system

  • the coordinate system allows us to reference particular regions of a chromosome:

    • a gene
    • an exon
    • a transposable element
  • once genomic datasets are represented as genomic ranges, a series of operations, particularly those involving overlap and proximity, can be carried out

6 / 139

Components of a Genomic Range

7 / 139

Components of a Genomic Range

Three pieces of information are necessary to specify a genomic region:

8 / 139

Components of a Genomic Range

Three pieces of information are necessary to specify a genomic region:

  • Chromosome (or scaffold/contig) name: naming conventions can vary (e.g., chr4, 4, scaffold_627)
9 / 139

Components of a Genomic Range

Three pieces of information are necessary to specify a genomic region:

  • Chromosome (or scaffold/contig) name: naming conventions can vary (e.g., chr4, 4, scaffold_627)

  • Range: the start and end site of a genomic region (e.g., 123,654,834 to 123,654,985)

10 / 139

Components of a Genomic Range

Three pieces of information are necessary to specify a genomic region:

  • Chromosome (or scaffold/contig) name: naming conventions can vary (e.g., chr4, 4, scaffold_627)

  • Range: the start and end site of a genomic region (e.g., 123,654,834 to 123,654,985)

  • Strand: DNA is double-stranded and features can be on the forward (positive) or reverse (negative) strand. Features such as proteins can be strand-specific.

11 / 139

Components of a Genomic Range

Three pieces of information are necessary to specify a genomic region:

  • Chromosome (or scaffold/contig) name: naming conventions can vary (e.g., chr4, 4, scaffold_627)

  • Range: the start and end site of a genomic region (e.g., 123,654,834 to 123,654,985)

  • Strand: DNA is double-stranded and features can be on the forward (positive) or reverse (negative) strand. Features such as proteins can be strand-specific.

Important Note: Ranges are always tied to a particular version of a genome

12 / 139

An example of genomic ranges:

13 / 139

An example of genomic ranges:

 

 

grep
14 / 139

Flavors of range systems:

15 / 139

Flavors of range systems:

  • 0-based systems: half-closed, half-open intervals
16 / 139

Flavors of range systems:

  • 0-based systems: half-closed, half-open intervals

    • Range width = end - start
17 / 139

Flavors of range systems:

  • 0-based systems: half-closed, half-open intervals

    • Range width = end - start
  • 1-based systems: closed intervals

18 / 139

Flavors of range systems:

  • 0-based systems: half-closed, half-open intervals

    • Range width = end - start
  • 1-based systems: closed intervals

    • Range width = end - start + 1
19 / 139

Flavors of range systems:

  • 0-based systems: half-closed, half-open intervals

    • Range width = end - start
  • 1-based systems: closed intervals

    • Range width = end - start + 1
grep
20 / 139

The Buffalo tutorials on genomic ranges use packages from the Bioconductor open source software project:

  • IRanges

  • GenomicRanges

21 / 139

The Buffalo tutorials on genomic ranges use packages from the Bioconductor open source software project:

  • IRanges

  • GenomicRanges

To Install Bioconductor's Primary Packages in R:

> source("http://bioconductor.org/biocLite.R")
> biocLite()
22 / 139

The Buffalo tutorials on genomic ranges use packages from the Bioconductor open source software project:

  • IRanges

  • GenomicRanges

To Install Bioconductor's Primary Packages in R:

> source("http://bioconductor.org/biocLite.R")
> biocLite()

GenomicRanges can be installed with:

> biocLite("GenomicRanges")
23 / 139

The Buffalo tutorials on genomic ranges use packages from the Bioconductor open source software project:

  • IRanges

  • GenomicRanges

To Install Bioconductor's Primary Packages in R:

> source("http://bioconductor.org/biocLite.R")
> biocLite()

GenomicRanges can be installed with:

> biocLite("GenomicRanges")

Bioconductor packages have great documentation:

http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html

24 / 139

Let's begin working with ranges in the IRanges program which was installed as a dependency of GenomicRanges:

> library(IRanges)
25 / 139

Let's begin working with ranges in the IRanges program which was installed as a dependency of GenomicRanges:

> library(IRanges)

Ranges are created as "IRange Objects" by specifying start and end sites:

> rng <- IRanges(start=4, end=13)
> rng
IRanges of length 1
start end width
[1] 4 13 10
26 / 139

Let's begin working with ranges in the IRanges program which was installed as a dependency of GenomicRanges:

> library(IRanges)

Ranges are created as "IRange Objects" by specifying start and end sites:

> rng <- IRanges(start=4, end=13)
> rng
IRanges of length 1
start end width
[1] 4 13 10

Is this 1-based or 0-based?

27 / 139

Let's begin working with ranges in the IRanges program which was installed as a dependency of GenomicRanges:

> library(IRanges)

Ranges are created as "IRange Objects" by specifying start and end sites:

> rng <- IRanges(start=4, end=13)
> rng
IRanges of length 1
start end width
[1] 4 13 10

Is this 1-based or 0-based?

Now try creating a range by specifying start site and width:

rng2 <- IRanges(start=4, width=3)
rng2
28 / 139

Ranges can also be created using vector arguments:

> x <- IRanges(start=c(4, 7, 2, 20), end=c(13, 7, 5, 23))
> x
29 / 139

Ranges can also be created using vector arguments:

> x <- IRanges(start=c(4, 7, 2, 20), end=c(13, 7, 5, 23))
> x

And we can give these ranges names:

> names(x) <- letters[1:4]
> x
30 / 139

Ranges can also be created using vector arguments:

> x <- IRanges(start=c(4, 7, 2, 20), end=c(13, 7, 5, 23))
> x

And we can give these ranges names:

> names(x) <- letters[1:4]
> x

Note that the ranges we create are a special object with class IRanges:

> class(x)
31 / 139

Ranges can also be created using vector arguments:

> x <- IRanges(start=c(4, 7, 2, 20), end=c(13, 7, 5, 23))
> x

And we can give these ranges names:

> names(x) <- letters[1:4]
> x

Note that the ranges we create are a special object with class IRanges:

> class(x)

Components of this object can be accessed with start(x), end(x), width(x)

32 / 139

Ranges can also be created using vector arguments:

> x <- IRanges(start=c(4, 7, 2, 20), end=c(13, 7, 5, 23))
> x

And we can give these ranges names:

> names(x) <- letters[1:4]
> x

Note that the ranges we create are a special object with class IRanges:

> class(x)

Components of this object can be accessed with start(x), end(x), width(x)

This can by handy when we want to increment one of these components:

> end(x) <- end(x) + 4
> x
33 / 139

The range() function can also be used to inspect the entire length of all ranges in an IRanges object:

> range(x)
34 / 139

The range() function can also be used to inspect the entire length of all ranges in an IRanges object:

> range(x)

We can also take subsets of ranges in an IRanges object using numeric, logical, and character indices

35 / 139

The range() function can also be used to inspect the entire length of all ranges in an IRanges object:

> range(x)

We can also take subsets of ranges in an IRanges object using numeric, logical, and character indices

For example, try the following:

> x[2:3]
> start(x) < 5
> x[start(x) < 5]
> x[width(x) > 8]
36 / 139

The range() function can also be used to inspect the entire length of all ranges in an IRanges object:

> range(x)

We can also take subsets of ranges in an IRanges object using numeric, logical, and character indices

For example, try the following:

> x[2:3]
> start(x) < 5
> x[start(x) < 5]
> x[width(x) > 8]

Ranges can also be merged, just as we've done with vectors using the c() command:

> a <- IRanges(start=7, width=4)
> b <- IRanges(start=2, end=5)
> c <- c(a, b)
> c
37 / 139

Basic Range Operations:

38 / 139

Basic Range Operations:

IRanges objects can be grown or shrunk using arithmetic operations such as +, -, and * (division is not supported since it makes little sense with ranges)

39 / 139

Basic Range Operations:

IRanges objects can be grown or shrunk using arithmetic operations such as +, -, and * (division is not supported since it makes little sense with ranges)

For example, try:

> x <- IRanges(start=c(40, 80), end=c(67, 114))
> x + 4L
> x - 10L
40 / 139

Basic Range Operations:

IRanges objects can be grown or shrunk using arithmetic operations such as +, -, and * (division is not supported since it makes little sense with ranges)

For example, try:

> x <- IRanges(start=c(40, 80), end=c(67, 114))
> x + 4L
> x - 10L

What's going on here?

41 / 139

Basic Range Operations:

IRanges objects can be grown or shrunk using arithmetic operations such as +, -, and * (division is not supported since it makes little sense with ranges)

For example, try:

> x <- IRanges(start=c(40, 80), end=c(67, 114))
> x + 4L
> x - 10L

What's going on here?

grep
42 / 139

Sometimes, rather than growing or shrinking ranges, we want to restrict them within particular bounds

43 / 139

Sometimes, rather than growing or shrinking ranges, we want to restrict them within particular bounds

In this situation, the function restrict() comes in handy:

> y <- IRanges(start=c(4, 6, 10, 12), width=13)
> y
> restrict(y, 5, 10)
44 / 139

Sometimes, rather than growing or shrinking ranges, we want to restrict them within particular bounds

In this situation, the function restrict() comes in handy:

> y <- IRanges(start=c(4, 6, 10, 12), width=13)
> y
> restrict(y, 5, 10)

What's happening here?

45 / 139

Sometimes, rather than growing or shrinking ranges, we want to restrict them within particular bounds

In this situation, the function restrict() comes in handy:

> y <- IRanges(start=c(4, 6, 10, 12), width=13)
> y
> restrict(y, 5, 10)

What's happening here?

grep
46 / 139

Perhaps you would like to create entirely new ranges based on their relative position to ranges you already have

47 / 139

Perhaps you would like to create entirely new ranges based on their relative position to ranges you already have

When might this be handy?

48 / 139

Perhaps you would like to create entirely new ranges based on their relative position to ranges you already have

When might this be handy?

For this application, you would use the flank() command:

> flank(x, width=7)
> flank(x, width=7, start=FALSE)
> promoters <- flank(x, width=20)
49 / 139

Perhaps you would like to create entirely new ranges based on their relative position to ranges you already have

When might this be handy?

For this application, you would use the flank() command:

> flank(x, width=7)
> flank(x, width=7, start=FALSE)
> promoters <- flank(x, width=20)

And here's a visual representation of these new ranges:

grep
50 / 139

We may also be interested in collapsing several reads into "super reads" that show the coverage extent of our ranges

51 / 139

We may also be interested in collapsing several reads into "super reads" that show the coverage extent of our ranges

For example, perhaps we have mapped sequence reads to a reference and we would like to see the proportion of genomic coverage

52 / 139

We may also be interested in collapsing several reads into "super reads" that show the coverage extent of our ranges

For example, perhaps we have mapped sequence reads to a reference and we would like to see the proportion of genomic coverage

Let's try a simple example:

> set.seed(0) # set the random number generator seed
> alns <- IRanges(start=sample(seq_len(50), 20), width=5)
> head(alns, 10)
53 / 139

We may also be interested in collapsing several reads into "super reads" that show the coverage extent of our ranges

For example, perhaps we have mapped sequence reads to a reference and we would like to see the proportion of genomic coverage

Let's try a simple example:

> set.seed(0) # set the random number generator seed
> alns <- IRanges(start=sample(seq_len(50), 20), width=5)
> head(alns, 10)

and now let's collapse these into super reads using the reduce() command:

> reduce(alns)
54 / 139

We may also be interested in collapsing several reads into "super reads" that show the coverage extent of our ranges

For example, perhaps we have mapped sequence reads to a reference and we would like to see the proportion of genomic coverage

Let's try a simple example:

> set.seed(0) # set the random number generator seed
> alns <- IRanges(start=sample(seq_len(50), 20), width=5)
> head(alns, 10)

and now let's collapse these into super reads using the reduce() command:

> reduce(alns)

Here's a visual representation of what reduce() is doing...

55 / 139
grep
56 / 139

Rather than extent of coverage, we may be interested in gaps between our ranges

57 / 139

Rather than extent of coverage, we may be interested in gaps between our ranges

Here, the gaps() command will come in handy:

> gaps(alns)
58 / 139

Rather than extent of coverage, we may be interested in gaps between our ranges

Here, the gaps() command will come in handy:

> gaps(alns)

Which produces new ranges that span gaps in an indicated IRanges object:

59 / 139

Rather than extent of coverage, we may be interested in gaps between our ranges

Here, the gaps() command will come in handy:

> gaps(alns)

Which produces new ranges that span gaps in an indicated IRanges object:

grep
60 / 139

IRange ranges can also be treated as a set of consecutive integers, so IRange(start=4, end=7) would be the integers 4,5,6,7

61 / 139

IRange ranges can also be treated as a set of consecutive integers, so IRange(start=4, end=7) would be the integers 4,5,6,7

Given this property, we can use various R commands on our IRange objects

62 / 139

IRange ranges can also be treated as a set of consecutive integers, so IRange(start=4, end=7) would be the integers 4,5,6,7

Given this property, we can use various R commands on our IRange objects

Create the following IRange objects:

> a <- IRanges(start=4, end=13)
> b <- IRanges(start=12, end=17)
63 / 139

IRange ranges can also be treated as a set of consecutive integers, so IRange(start=4, end=7) would be the integers 4,5,6,7

Given this property, we can use various R commands on our IRange objects

Create the following IRange objects:

> a <- IRanges(start=4, end=13)
> b <- IRanges(start=12, end=17)

and see if you can understand how the following R commands can be applied to these objects: setdiff(), union(), intersect()

64 / 139

IRange ranges can also be treated as a set of consecutive integers, so IRange(start=4, end=7) would be the integers 4,5,6,7

Given this property, we can use various R commands on our IRange objects

Create the following IRange objects:

> a <- IRanges(start=4, end=13)
> b <- IRanges(start=12, end=17)

and see if you can understand how the following R commands can be applied to these objects: setdiff(), union(), intersect()

grep
65 / 139

Finding Overlapping Ranges:

66 / 139

Finding Overlapping Ranges:

  • Up until now we have been manipulating ranges and creating new ranges based on relationships with existing ranges
67 / 139

Finding Overlapping Ranges:

  • Up until now we have been manipulating ranges and creating new ranges based on relationships with existing ranges

  • In many genomic applications, we may need to explicitly map overlaps between subject and query sets of ranges

68 / 139

Finding Overlapping Ranges:

  • Up until now we have been manipulating ranges and creating new ranges based on relationships with existing ranges

  • In many genomic applications, we may need to explicitly map overlaps between subject and query sets of ranges

  • There are many ways to assess overlap between a subject and query and it is important to be aware of the details of any given approach

69 / 139

Finding Overlapping Ranges:

  • Up until now we have been manipulating ranges and creating new ranges based on relationships with existing ranges

  • In many genomic applications, we may need to explicitly map overlaps between subject and query sets of ranges

  • There are many ways to assess overlap between a subject and query and it is important to be aware of the details of any given approach

Let's create subject and query IRange objects and assess overlap with the findOverlaps() function:

> qry <- IRanges(start=c(1, 26, 19, 11, 21, 7), end=c(16, 30, 19, 15, 24, 8),
names=letters[1:6])
> sbj <- IRanges(start=c(1, 19, 10), end=c(5, 29, 16), names=letters[24:26])
70 / 139

Finding Overlapping Ranges:

  • Up until now we have been manipulating ranges and creating new ranges based on relationships with existing ranges

  • In many genomic applications, we may need to explicitly map overlaps between subject and query sets of ranges

  • There are many ways to assess overlap between a subject and query and it is important to be aware of the details of any given approach

Let's create subject and query IRange objects and assess overlap with the findOverlaps() function:

> qry <- IRanges(start=c(1, 26, 19, 11, 21, 7), end=c(16, 30, 19, 15, 24, 8),
names=letters[1:6])
> sbj <- IRanges(start=c(1, 19, 10), end=c(5, 29, 16), names=letters[24:26])

Inspect the objects we've just created

71 / 139

Visually, these objects look like this:

grep
72 / 139

Visually, these objects look like this:

grep

If we use the findOverlaps() function with these, we create an object of class "hits":

> hts <- findOverlaps(qry, sbj)
73 / 139

Visually, these objects look like this:

grep

If we use the findOverlaps() function with these, we create an object of class "hits":

> hts <- findOverlaps(qry, sbj)

Inspect hts and see if you can understand its structure

74 / 139

Essentially, hts shows a mapping of query (qry) to subject (sbj):

grep
75 / 139

Essentially, hts shows a mapping of query (qry) to subject (sbj):

grep

If we want to see the names of each of these hits, we can access them in this way:

> names(qry)[queryHits(hts)]
> names(sbj)[subjectHits(hts)]
76 / 139

Essentially, hts shows a mapping of query (qry) to subject (sbj):

grep

If we want to see the names of each of these hits, we can access them in this way:

> names(qry)[queryHits(hts)]
> names(sbj)[subjectHits(hts)]

Based on Figure 9-11, how is findOverlaps() assessing overlap?

77 / 139

Essentially, hts shows a mapping of query (qry) to subject (sbj):

grep

If we want to see the names of each of these hits, we can access them in this way:

> names(qry)[queryHits(hts)]
> names(sbj)[subjectHits(hts)]

Based on Figure 9-11, how is findOverlaps() assessing overlap?

Let's tweak how findOverlaps() identifies overlap with the type argument:

hts_within <- findOverlaps(qry, sbj, type="within")
78 / 139

Essentially, hts shows a mapping of query (qry) to subject (sbj):

grep

If we want to see the names of each of these hits, we can access them in this way:

> names(qry)[queryHits(hts)]
> names(sbj)[subjectHits(hts)]

Based on Figure 9-11, how is findOverlaps() assessing overlap?

Let's tweak how findOverlaps() identifies overlap with the type argument:

hts_within <- findOverlaps(qry, sbj, type="within")

What has changed?

79 / 139

We can also limit our hits to a single match between subject and query using the argument select

80 / 139

We can also limit our hits to a single match between subject and query using the argument select

Try the following options for select and see if you can understand how findOverlaps() has changed:

> findOverlaps(qry, sbj, select="first")
> findOverlaps(qry, sbj, select="last")
> findOverlaps(qry, sbj, select="arbitrary")
81 / 139

We can also limit our hits to a single match between subject and query using the argument select

Try the following options for select and see if you can understand how findOverlaps() has changed:

> findOverlaps(qry, sbj, select="first")
> findOverlaps(qry, sbj, select="last")
> findOverlaps(qry, sbj, select="arbitrary")

While we are working with very small query and subject objects, imagine how computationally intensive this could become with larger data sets

82 / 139

We can also limit our hits to a single match between subject and query using the argument select

Try the following options for select and see if you can understand how findOverlaps() has changed:

> findOverlaps(qry, sbj, select="first")
> findOverlaps(qry, sbj, select="last")
> findOverlaps(qry, sbj, select="arbitrary")

While we are working with very small query and subject objects, imagine how computationally intensive this could become with larger data sets

We can limit the amount of hits that have to be checked by creating an "interval tree" and not checking queries if their end points are smaller than a subject range's start point

83 / 139

We can also limit our hits to a single match between subject and query using the argument select

Try the following options for select and see if you can understand how findOverlaps() has changed:

> findOverlaps(qry, sbj, select="first")
> findOverlaps(qry, sbj, select="last")
> findOverlaps(qry, sbj, select="arbitrary")

While we are working with very small query and subject objects, imagine how computationally intensive this could become with larger data sets

We can limit the amount of hits that have to be checked by creating an "interval tree" and not checking queries if their end points are smaller than a subject range's start point

> sbj_it <- IntervalTree(sbj)
> class(sbj_it)
> findOverlaps(qry, sbjit)
84 / 139

There is more we can do with the hits object as well

85 / 139

There is more we can do with the hits object as well

Try the following functions and see if you can understand what each is doing:

> as.matrix(hts)
> countQueryHits(hts)
> setNames(countQueryHits(hts), names(qry))
> countSubjectHits(hts)
> ranges(hts, qry, sbj)
86 / 139

There is more we can do with the hits object as well

Try the following functions and see if you can understand what each is doing:

> as.matrix(hts)
> countQueryHits(hts)
> setNames(countQueryHits(hts), names(qry))
> countSubjectHits(hts)
> ranges(hts, qry, sbj)
grep
87 / 139

Additional overlap functions can also be applied to ranges. For example:

> countOverlaps(qry, sbj)
> subsetByOverlaps(qry, sbj)
88 / 139

Additional overlap functions can also be applied to ranges. For example:

> countOverlaps(qry, sbj)
> subsetByOverlaps(qry, sbj)

What are these doing?

89 / 139

Finding Nearest Ranges and Calculating Distances:

90 / 139

Finding Nearest Ranges and Calculating Distances:

We can also use range functions to find subject ranges close to our query ranges

91 / 139

Finding Nearest Ranges and Calculating Distances:

We can also use range functions to find subject ranges close to our query ranges

For example, try the following:

> qry <- IRanges(start=6, end=13, name='query')
> sbj <- IRanges(start=c(2, 4, 18, 19), end=c(4, 7, 21, 24), names=1:4)
> nearest(qry, sbj)
> precede(qry, sbj)
> follow(qry, sbj)
92 / 139

Finding Nearest Ranges and Calculating Distances:

We can also use range functions to find subject ranges close to our query ranges

For example, try the following:

> qry <- IRanges(start=6, end=13, name='query')
> sbj <- IRanges(start=c(2, 4, 18, 19), end=c(4, 7, 21, 24), names=1:4)
> nearest(qry, sbj)
> precede(qry, sbj)
> follow(qry, sbj)

We can also directly calculate the distance between query and subject ranges:

> qry <- IRanges(sample(seq_len(1000), 5), width=10)
> sbj <- IRanges(sample(seq_len(1000), 5), width=10)
> distanceToNearest(qry, sbj)
> distance(qry, sbj)
93 / 139

Storing Genomic Ranges with GenomicRanges:

GRanges are very similar to IRanges but include information on chromosome, strand, and, optionally, metadata

94 / 139

Storing Genomic Ranges with GenomicRanges:

GRanges are very similar to IRanges but include information on chromosome, strand, and, optionally, metadata

Creation of GRanges is very similar to IRanges:

> library(GenomicRanges)
> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"))
> gr
95 / 139

Storing Genomic Ranges with GenomicRanges:

GRanges are very similar to IRanges but include information on chromosome, strand, and, optionally, metadata

Creation of GRanges is very similar to IRanges:

> library(GenomicRanges)
> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"))
> gr

And here with an example of metadata:

> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"), ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"), gc=round(runif(4), 3))
> gr
96 / 139

Storing Genomic Ranges with GenomicRanges:

GRanges are very similar to IRanges but include information on chromosome, strand, and, optionally, metadata

Creation of GRanges is very similar to IRanges:

> library(GenomicRanges)
> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"))
> gr

And here with an example of metadata:

> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"), ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"), gc=round(runif(4), 3))
> gr

Essentially any type of data can be stored in metadata columns which is what makes GRanges so powerful

97 / 139

We can also specify the lengths of each chromosome which are necessary for calculating coverage, gaps, etc...

> seqlens <- c(chr1=152, chr2=432, chr3=903)
> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"),
gc=round(runif(4), 3),
seqlengths=seqlens)
> gr
98 / 139

We can also specify the lengths of each chromosome which are necessary for calculating coverage, gaps, etc...

> seqlens <- c(chr1=152, chr2=432, chr3=903)
> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"),
gc=round(runif(4), 3),
seqlengths=seqlens)
> gr

Like IRanges objects, we can use accessor functions to pull out specific types of data from Granges:

> start(gr)
> end(gr)
> width(gr)
> seqnames(gr)
> strand(gr)
99 / 139

Names can also be given to individual elements within a GRanges object:

> names(gr) <- letters[1:length(gr)]
100 / 139

Names can also be given to individual elements within a GRanges object:

> names(gr) <- letters[1:length(gr)]

And GRanges also supports subsetting and additional R functions:

> start(gr) > 7
> gr[start(gr) > 7]
> table(seqnames(gr))
> gr[seqnames(gr) == "chr1"]
101 / 139

Names can also be given to individual elements within a GRanges object:

> names(gr) <- letters[1:length(gr)]

And GRanges also supports subsetting and additional R functions:

> start(gr) > 7
> gr[start(gr) > 7]
> table(seqnames(gr))
> gr[seqnames(gr) == "chr1"]

Finally, the mcols() function can be used to access metadata columns:

> mcols(gr)
> mcols(gr)$gc
102 / 139

Names can also be given to individual elements within a GRanges object:

> names(gr) <- letters[1:length(gr)]

And GRanges also supports subsetting and additional R functions:

> start(gr) > 7
> gr[start(gr) > 7]
> table(seqnames(gr))
> gr[seqnames(gr) == "chr1"]

Finally, the mcols() function can be used to access metadata columns:

> mcols(gr)
> mcols(gr)$gc

and can be combined with subsetting and other functions for advanced queries of data:

> mcols(gr[seqnames(gr) == "chr1"])$gc
> mean(mcols(gr[seqnames(gr) == "chr1"])$gc)
103 / 139

Grouping Data with GRangesList:

104 / 139

Grouping Data with GRangesList:

Very similar to the lists we learned about previously in R, GRanges has a data structure called GRangesList

105 / 139

Grouping Data with GRangesList:

Very similar to the lists we learned about previously in R, GRanges has a data structure called GRangesList

These can be built manually:

> gr1 <- GRanges(c("chr1", "chr2"), IRanges(start=c(32, 95), width=c(24, 123)))
> gr2 <- GRanges(c("chr8", "chr2"), IRanges(start=c(27, 12), width=c(42, 34)))
> grl <- GRangesList(gr1, gr2)
106 / 139

Grouping Data with GRangesList:

Very similar to the lists we learned about previously in R, GRanges has a data structure called GRangesList

These can be built manually:

> gr1 <- GRanges(c("chr1", "chr2"), IRanges(start=c(32, 95), width=c(24, 123)))
> gr2 <- GRanges(c("chr8", "chr2"), IRanges(start=c(27, 12), width=c(42, 34)))
> grl <- GRangesList(gr1, gr2)

And behave very similarly to typical lists in R:

> unlist(grl)
> doubled_grl <- c(grl, grl)
107 / 139

While above we've created GRangesLists manually, they are often produced by splitting GRanges:

> chrs <- c("chr3", "chr1", "chr2", "chr2", "chr3", "chr1")
> gr <- GRanges(chrs, IRanges(sample(1:100, 6, replace=TRUE),
width=sample(3:30, 6, replace=TRUE)))
> gr
> gr_split <- split(gr, seqnames(gr))
> gr_split[[1]]
> gr_split[[2]]
108 / 139

While above we've created GRangesLists manually, they are often produced by splitting GRanges:

> chrs <- c("chr3", "chr1", "chr2", "chr2", "chr3", "chr1")
> gr <- GRanges(chrs, IRanges(sample(1:100, 6, replace=TRUE),
width=sample(3:30, 6, replace=TRUE)))
> gr
> gr_split <- split(gr, seqnames(gr))
> gr_split[[1]]
> gr_split[[2]]

Splitting GRanges into GRangesLists facilitates easier access to subsets of GRanges

109 / 139

While above we've created GRangesLists manually, they are often produced by splitting GRanges:

> chrs <- c("chr3", "chr1", "chr2", "chr2", "chr3", "chr1")
> gr <- GRanges(chrs, IRanges(sample(1:100, 6, replace=TRUE),
width=sample(3:30, 6, replace=TRUE)))
> gr
> gr_split <- split(gr, seqnames(gr))
> gr_split[[1]]
> gr_split[[2]]

Splitting GRanges into GRangesLists facilitates easier access to subsets of GRanges

Splitting is also useful because we can then use lapply and sapply across individual elements of the list:

> lapply(gr_split, function(x) order(width(x)))
> sapply(gr_split, function(x) min(start(x)))
> sapply(gr_split, length)
110 / 139

Working with Annotation Data: GenomicFeatures and rtracklayer:

111 / 139

Working with Annotation Data: GenomicFeatures and rtracklayer:

  • Now let's work with GRanges in the context of real data while learning a few new Bioconductor packages
112 / 139

Working with Annotation Data: GenomicFeatures and rtracklayer:

  • Now let's work with GRanges in the context of real data while learning a few new Bioconductor packages

  • GenomicFeatures is a package for creating and working with transcript-based annotations in the form of TranscriptDb objects.

113 / 139

Working with Annotation Data: GenomicFeatures and rtracklayer:

  • Now let's work with GRanges in the context of real data while learning a few new Bioconductor packages

  • GenomicFeatures is a package for creating and working with transcript-based annotations in the form of TranscriptDb objects.

  • rtracklayer is designed for importing and exporting annotation data into a variety of different formats

114 / 139

Working with Annotation Data: GenomicFeatures and rtracklayer:

  • Now let's work with GRanges in the context of real data while learning a few new Bioconductor packages

  • GenomicFeatures is a package for creating and working with transcript-based annotations in the form of TranscriptDb objects.

  • rtracklayer is designed for importing and exporting annotation data into a variety of different formats

First, let's install GenomicFeatures:

> library(BiocInstaller)
> biocLite("GenomicFeatures")
115 / 139

Working with Annotation Data: GenomicFeatures and rtracklayer:

  • Now let's work with GRanges in the context of real data while learning a few new Bioconductor packages

  • GenomicFeatures is a package for creating and working with transcript-based annotations in the form of TranscriptDb objects.

  • rtracklayer is designed for importing and exporting annotation data into a variety of different formats

First, let's install GenomicFeatures:

> library(BiocInstaller)
> biocLite("GenomicFeatures")

And then, let's install an annotation package for the house mouse (Mus musculus):

> biocLite("TxDb.Mmusculus.UCSC.mm10.ensGene")
> library(TxDb.Mmusculus.UCSC.mm10.ensGene)
> txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene
116 / 139

We've just built an SQLite database for Mus musculus transcripts within R

117 / 139

We've just built an SQLite database for Mus musculus transcripts within R

Various types of annotations can be extracted using accessor functions:

> genes(txdb)
> transcripts(txdb)
> exons(txdb)
> promoters(txdb)
118 / 139

We've just built an SQLite database for Mus musculus transcripts within R

Various types of annotations can be extracted using accessor functions:

> genes(txdb)
> transcripts(txdb)
> exons(txdb)
> promoters(txdb)

Inspect a few of these using head(genes(txdb)) for example...what type of objects are produced?

119 / 139

We've just built an SQLite database for Mus musculus transcripts within R

Various types of annotations can be extracted using accessor functions:

> genes(txdb)
> transcripts(txdb)
> exons(txdb)
> promoters(txdb)

Inspect a few of these using head(genes(txdb)) for example...what type of objects are produced?

We can group these GRanges objects into GRangesLists using a number of different functions such as transcriptsBy(), exonsBy(), cdsBy(), intronsBy(), fiveUTRsByTranscript(), and threeUTRsByTranscript()

120 / 139

We've just built an SQLite database for Mus musculus transcripts within R

Various types of annotations can be extracted using accessor functions:

> genes(txdb)
> transcripts(txdb)
> exons(txdb)
> promoters(txdb)

Inspect a few of these using head(genes(txdb)) for example...what type of objects are produced?

We can group these GRanges objects into GRangesLists using a number of different functions such as transcriptsBy(), exonsBy(), cdsBy(), intronsBy(), fiveUTRsByTranscript(), and threeUTRsByTranscript()

For example:

> mm_exons_by_gn <- exonsBy(txdb, by="gene")
121 / 139

We can also control what portions of a genome are queried by limiting to a particular chromosome:

> seqlevels(txdb, force=TRUE) <- "chr1"
> chr1_exons <- exonsBy(txdb, by="gene")
> txdb <- restoreSeqlevels(txdb) # restore txdb so it queries all sequences
122 / 139

We can also control what portions of a genome are queried by limiting to a particular chromosome:

> seqlevels(txdb, force=TRUE) <- "chr1"
> chr1_exons <- exonsBy(txdb, by="gene")
> txdb <- restoreSeqlevels(txdb) # restore txdb so it queries all sequences

or by searching within a particular interval by using the "x"ByOverlaps family of functions:

> candidate_region <- GRanges("chr8", IRanges(123250562, 123567264))
> transcriptsByOverlaps(txdb, candidate_region)
123 / 139

We can also control what portions of a genome are queried by limiting to a particular chromosome:

> seqlevels(txdb, force=TRUE) <- "chr1"
> chr1_exons <- exonsBy(txdb, by="gene")
> txdb <- restoreSeqlevels(txdb) # restore txdb so it queries all sequences

or by searching within a particular interval by using the "x"ByOverlaps family of functions:

> candidate_region <- GRanges("chr8", IRanges(123250562, 123567264))
> transcriptsByOverlaps(txdb, candidate_region)

While the GenomicsFeatures package and transcriptDb objects are straight-forward, convenient and consistent across genome versions and species, this comes at a cost to flexibility

124 / 139

We can also control what portions of a genome are queried by limiting to a particular chromosome:

> seqlevels(txdb, force=TRUE) <- "chr1"
> chr1_exons <- exonsBy(txdb, by="gene")
> txdb <- restoreSeqlevels(txdb) # restore txdb so it queries all sequences

or by searching within a particular interval by using the "x"ByOverlaps family of functions:

> candidate_region <- GRanges("chr8", IRanges(123250562, 123567264))
> transcriptsByOverlaps(txdb, candidate_region)

While the GenomicsFeatures package and transcriptDb objects are straight-forward, convenient and consistent across genome versions and species, this comes at a cost to flexibility

The rtracklayer package is not quite as user-friendly, but is more flexible

125 / 139

Data can be imported and exported from/to a variety of formats using rtracklayer (e.g.,, GTF/GFF, BED, BED Graph, etc...)

126 / 139

Data can be imported and exported from/to a variety of formats using rtracklayer (e.g.,, GTF/GFF, BED, BED Graph, etc...)

Imported data are converted into GRanges objects and missling data and metadata columns are automatically dealt with

127 / 139

Data can be imported and exported from/to a variety of formats using rtracklayer (e.g.,, GTF/GFF, BED, BED Graph, etc...)

Imported data are converted into GRanges objects and missling data and metadata columns are automatically dealt with

As an example, let's try importing a file from the GitHub repository of the Buffalo book:

> library(rtracklayer)
> mm_gtf <- import('Mus_musculus.GRCm38.75_chr1.gtf.gz')
> colnames(mcols(mm_gtf)) # metadata columns read in
128 / 139

Data can be imported and exported from/to a variety of formats using rtracklayer (e.g.,, GTF/GFF, BED, BED Graph, etc...)

Imported data are converted into GRanges objects and missling data and metadata columns are automatically dealt with

As an example, let's try importing a file from the GitHub repository of the Buffalo book:

> library(rtracklayer)
> mm_gtf <- import('Mus_musculus.GRCm38.75_chr1.gtf.gz')
> colnames(mcols(mm_gtf)) # metadata columns read in

rtracklayer has automatically detected the imported file type and has brought this in as a GRanges object

129 / 139

We can also use rtracklayer to export subsets of the this data file to files of any format we choose:

> set.seed(0)
> pseudogene_i <- which(mm_gtf$gene_biotype == "pseudogene" &
mm_gtf$type == "gene")
> pseudogene_sample <- sample(pseudogene_i, 5)
> export(mm_gtf[pseudogene_sample], con="five_random_pseudogene.gtf",
format="GTF")
130 / 139

Now let's try using these packages and the commands we've learned in an example:

131 / 139

Now let's try using these packages and the commands we've learned in an example:

Let's first import a file with variants (SNPs, indels, etc...) from chr1 of Mus musculus:

> dbsnp137 <- import("mm10_snp137_chr1_trunc.bed.gz")
132 / 139

Now let's try using these packages and the commands we've learned in an example:

Let's first import a file with variants (SNPs, indels, etc...) from chr1 of Mus musculus:

> dbsnp137 <- import("mm10_snp137_chr1_trunc.bed.gz")

We want to find all variants within exons on this mouse chromosome. Let's first collapse all overlapping exons in the mouse TranscriptDb object we created earlier and create an object with only exons from chr1:

> collapsed_exons <- reduce(exons(txdb), ignore.strand=TRUE)
> chr1_collapsed_exons <- collapsed_exons[seqnames(collapsed_exons) == "chr1"]
133 / 139

Now let's try using these packages and the commands we've learned in an example:

Let's first import a file with variants (SNPs, indels, etc...) from chr1 of Mus musculus:

> dbsnp137 <- import("mm10_snp137_chr1_trunc.bed.gz")

We want to find all variants within exons on this mouse chromosome. Let's first collapse all overlapping exons in the mouse TranscriptDb object we created earlier and create an object with only exons from chr1:

> collapsed_exons <- reduce(exons(txdb), ignore.strand=TRUE)
> chr1_collapsed_exons <- collapsed_exons[seqnames(collapsed_exons) == "chr1"]

Before extracting variants in exons, let's first inspect our variant file:

> summary(width(dbsnp137))
134 / 139

If a variant has a width of 0, we cannot find its overlap with exon ranges, so we must adjust its width to do this:

> dbsnp137_resized <- dbsnp137
> zw_i <- width(dbsnp137_resized) == 0
> dbsnp137_resized[zw_i] <- resize(dbsnp137_resized[zw_i], width=1)
135 / 139

If a variant has a width of 0, we cannot find its overlap with exon ranges, so we must adjust its width to do this:

> dbsnp137_resized <- dbsnp137
> zw_i <- width(dbsnp137_resized) == 0
> dbsnp137_resized[zw_i] <- resize(dbsnp137_resized[zw_i], width=1)

We can now pull out those variants that overlap exons on chromosome 1 by creating a hits object:

> hits <- findOverlaps(dbsnp137_resized, chr1_collapsed_exons,
ignore.strand=TRUE)
136 / 139

If a variant has a width of 0, we cannot find its overlap with exon ranges, so we must adjust its width to do this:

> dbsnp137_resized <- dbsnp137
> zw_i <- width(dbsnp137_resized) == 0
> dbsnp137_resized[zw_i] <- resize(dbsnp137_resized[zw_i], width=1)

We can now pull out those variants that overlap exons on chromosome 1 by creating a hits object:

> hits <- findOverlaps(dbsnp137_resized, chr1_collapsed_exons,
ignore.strand=TRUE)

and determine the number of variants and the proportion of variants that are exonic:

> length(unique(queryHits(hits)))
> length(unique(queryHits(hits)))/length(dbsnp137_resized)
137 / 139

We can also use the countOverlaps() function to find the number of variants per exon (note we have to reverse the order of the query since we're finding values per exon now)

> var_counts <- countOverlaps(chr1_collapsed_exons, dbsnp137_resized, ignore.strand=TRUE)
138 / 139

We can also use the countOverlaps() function to find the number of variants per exon (note we have to reverse the order of the query since we're finding values per exon now)

> var_counts <- countOverlaps(chr1_collapsed_exons, dbsnp137_resized, ignore.strand=TRUE)

and we can append this to our GRanges object that includes exons:

> chr1_collapsed_exons$num_vars <- var_counts
139 / 139

Introduction

2 / 139
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow