Overview
Questions
What does Biopython do?
How does Biopython handle sequences?
How can I access sequences and data from Genbank?
Objectives
Learn about the
Seq
andSeqRecord
objects.Read in sequences from FASTA files.
Download a sequence record directly from Genbank using the NCBI E-utilities.
Biopython is a freely available package for working with molecular biological data. In this lesson, we will just cover some basics of working with Biopython. The developers of this package have written a comprehensive tutorial and cookbook.
The tutorial we are working with today was written by Dr. Iddo Friedberg and Dr. Stuart Brown
The documentation page for Biopython provides a list of the many different tools in the package:
This lesson will use the example files in the
biopython
folder of the course
files in the BCB546-Spring2022
repository.
Download these files and make sure they are in the same directory
where you are creating your Jupyter notebook.
The easiest way to install the Biopython tools is to use conda
. From your terminal, you simply need to execute the following:
$ conda install biopython
Now create a new Jupyter notebook for this lesson.
Seq
ObjectThe Seq
object class is simple and fundamental for a lot of
Biopython work. A Seq object can contain DNA, RNA, or protein.
It contains a string (the sequence) and a defined alphabet for that string.
The alphabets are actually defined objects such as IUPACAmbiguousDNA
or
IUPACProtein
. A Seq object with a DNA alphabet has some different methods than one with an Amino Acid alphabet.
First, import the Seq
object from Biopython
from Bio.Seq import Seq
Now we can create a Seq
object:
my_seq = Seq("AGTACACTGGT")
my_seq
Seq('AGTACACTGGT')
The nice thing about the sequence object is that it can be treated just like a Python string object.
print(my_seq[:3])
AGT
Seq
objects also have string methods like .count()
my_seq.count('AC')
2
And you can use functions that act on strings like len()
len(my_seq)
11
Seq
objects also have special methods. For example, you can get the reverse complement of a sequence:
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
print(my_seq.reverse_complement())
GCGATTTTCGATCCTATATAGGCCCATCGATC
Just like strings in Python, the Seq
object is immutable,
meaning you cannot change it. If you try to change one of the sites in this
sequence, you will get an error. If you want an editable sequence object, you
will need to create a MutableSeq
object.
from Bio.Seq import MutableSeq
mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
Now you can try changing the nucleotide at index 3 to 'G'
.
SeqRecord
ObjectBiopython’s SeqRecord
is a complex object that contains a
Seq
object as well as other fields for attributes of that
sequence (i.e., metadata). These attributes are also called
“annotation fields”:
.seq
.id
.name
.description
.letter_annotations
.annotations
.features
.dbxrefs
You can create a SeqRecord
by giving the constructor a Seq
object:
from Bio.SeqRecord import SeqRecord
simple_seq = Seq("GATC")
simple_seq_r = SeqRecord(simple_seq)
And you can provide attributes:
simple_seq_r.id = "AC12345"
simple_seq_r.description = "This sequence is pretend."
print(simple_seq_r)
ID: AC12345
Name: <unknown name>
Description: This sequence is pretend.
Number of features: 0
Seq('GATC')
SeqIO
enables reading in sequences from FASTA files and storing the data in a SeqRecord
. Addtionally SeqIO
provides tools for writing sequence data to a file.
We will read in the example file NC_005816.fna
using SeqIO
.
from Bio import SeqIO
record = SeqIO.read("NC_005816.fna", "fasta")
Find out more about this sequence
Use string methods and the
SeqRecord
attributes to get the length of the sequence and the species name.Solution
Get the length using
len()
.len(record.seq)
9609
The species name is given in the description of this FASTA file
record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
Using SeqIO
we can read in several sequences from a file and store
them in a list of SeqRecord
objects from a file. The file
example.fasta
looks like this:
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
>EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
>EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
With Biopython, we can use the SeqIO.parse()
function to obtain the three sequences in this file
handle = open("example.fasta", "r")
seq_list = list(SeqIO.parse(handle, "fasta"))
handle.close()
print(seq_list[0].seq)
CCCTTCTTGTCTTCAGCGTTTCTCC
In the example above, we open the file and assign it to the variable handle
which acts as a pointer to the file contents.
BioPython has modules that can directly access databases over the Internet using
the Entrez
module. This uses the NCBI Efetch service,
which works on many NCBI databases including protein and PubMed
literature citations.
With a few tweaks, this code could be used to download a list of
GenBank ID’s and save them as FASTA or GenBank files.
Before using the online NCBI resources, it is important to be aware of the user requirements. If you abuse their system (whether on purpose or on accident), they will block your access for some time. You can find the requirements in the NCBI E-utilities guide.
First, you are required to provide NCBI with your identity so that you can be contacted if there is a problem. This also limits abuse of this system so that their servers aren’t overwhelmed. If you are identified as someone who is excessively using the E-utilities, NCBI will contact you before you are blocked.
The quote below from the NCBI guide gives you a sense of what constitutes appropriate usage of the E-utility servers:
In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays.
Enter your own email address in place of <enter your email address>
:
from Bio import Entrez
Entrez.email = "<enter your email address>"
Now we can fetch a Genbank record:
handle = Entrez.efetch(db="nucleotide", id="DQ137224", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
print(record)
ID: DQ137224.1
Name: DQ137224
Description: Megadyptes antipodes voucher JD64A cytochrome b (cytb) gene, partial cds; mitochondrial
Number of features: 3
/molecule_type=DNA
/topology=linear
/data_file_division=VRT
/date=26-JUL-2016
/accessions=['DQ137224']
/sequence_version=1
/keywords=['']
/source=mitochondrion Megadyptes antipodes (Yellow-eyed penguin)
/organism=Megadyptes antipodes
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Archelosauria', 'Archosauria', 'Dinosauria', 'Saurischia', 'Theropoda', 'Coelurosauria', 'Aves', 'Neognathae', 'Sphenisciformes', 'Spheniscidae', 'Megadyptes']
/references=[Reference(title='Multiple gene evidence for expansion of extant penguins out of Antarctica due to global cooling', ...), Reference(title='Direct Submission', ...)]
Seq('ACACAAATTCTAACTGGCCTCCTACTGGCCGCCCACTACACTGCAGACACAACC...AGC', IUPACAmbiguousDNA())
BioPython makes it easy to work with NCBI’s BLAST. To run
blast over the internet, we can use qblast()
.
For this we must import the NCBIWWW
module:
from Bio.Blast import NCBIWWW
You can call the help()
function on NCBIWWW.qblast
to inspect how this
function works. This will return all of the parameters of qblast
so that
you can understand how to specify your query correctly.
help(NCBIWWW.qblast)
Next we can read in a sequence that is stored in a FASTA file called test.fasta
.
query = SeqIO.read("test.fasta", format="fasta")
The variable we created called query
is a sequence stored in a SeqRecord
object.
To run a BLAST search on the sequence from our FASTA file, we simply
have to specify the search program (blastn
) and the database (nt
).
The last argument is the Seq
object stored in our query
.
result_handle = NCBIWWW.qblast("blastn", "nt", query.seq)
Note that this might not work for everyone in class. It is possible for NCBI to throttle non-interactive users.
Once we have the results of our BLAST, we can store them in an XML file.
blast_file = open("my_blast.xml", "w")
blast_file.write(result_handle.read())
Once we have stored the results, it’s best to close all of our open file handles:
blast_file.close()
result_handle.close()
We created an XML file containing our BLAST results. This is now easy to
parse using the NCBIXML
tools:
from Bio.Blast import NCBIXML
handle = open("my_blast.xml")
blast_record = NCBIXML.read(handle)
Now that we have read in the file, we can print each of the hits:
for hit in blast_record.descriptions:
print(hit.title)
print(hit.e)
gi|1105484513|ref|XM_002284686.3| PREDICTED: Vitis vinifera cold-regulated 413 plasma membrane protein 2 (LOC100248690), mRNA
0.0
gi|1420088022|gb|MG722853.1| Vitis vinifera cold-regulated 413 inner membrane protein 2 mRNA, complete cds
0.0
gi|123704572|emb|AM483681.1| Vitis vinifera, whole genome shotgun sequence, contig VV78X045699.9, clone ENTAV 115
0.0
...
gi|1027107741|ref|XM_008238505.2| PREDICTED: Prunus mume cold-regulated 413 plasma membrane protein 1-like (LOC103335494), mRNA
1.04564e-118
We can also view the alignments for each of the BLAST hits:
for hit in blast_record.alignments:
for hsp in hit.hsps:
print(hit.title)
print(hsp.expect)
print(hsp.query[0:75] + '...')
print(hsp.match[0:75] + '...')
print(hsp.sbjct[0:75] + '...')
gi|1105484513|ref|XM_002284686.3| PREDICTED: Vitis vinifera cold-regulated 413 plasma membrane protein 2 (LOC100248690), mRNA
0.0
TACTCTACAGTCTCTGACTTTGTAAGCTTCGCGCTTCTTCTCCTTTTTCTCTCTGGGGAAAGATTTTCCCTTTCT...
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
TACTCTACAGTCTCTGACTTTGTAAGCTTCGCGCTTCTTCTCCTTTTTCTCTCTGGGGAAAGATTTTCCCTTTCT...
...
gi|1027107741|ref|XM_008238505.2| PREDICTED: Prunus mume cold-regulated 413 plasma membrane protein 1-like (LOC103335494), mRNA
1.04564e-118
ATTGAAGATGGGGAAAAAGGGTTACTTGGCGATGAGGACTGACACTGATACTACTGATTTGATCAGTTCTGATCT...
|||| |||||| ||| || | |||||| |||| |||||| | ||| | | ||| |||||| || |||||...
ATTGGAGATGGCAAAACAGAGCTACTTGAAAATGATGACTGAACCAGATGCAAATGAATTGATCCACTCCGATCT...
Often a BLAST search will return many matches for a single query, as is the
case for this example. This is why it is best to save them in an XML file.
Using NCBIXML.parse()
enables us to evaluate each of the BLAST records.
We can specify a threshold so that we can easily inspect the closest
matches.
E_VALUE_THRESH = 1e-150
for record in NCBIXML.parse(open("my_blast.xml")):
for align in record.alignments:
for hsp in align.hsps:
if hsp.expect < E_VALUE_THRESH:
print("MATCH: %s " % align.title[:60])
print(hsp.expect)
MATCH: gi|1105484513|ref|XM_002284686.3| PREDICTED: Vitis vinifera
0.0
MATCH: gi|1420088022|gb|MG722853.1| Vitis vinifera cold-regulated 4
0.0
MATCH: gi|123704572|emb|AM483681.1| Vitis vinifera, whole genome sh
0.0
MATCH: gi|1217007653|ref|XM_021787586.1| PREDICTED: Hevea brasilien
9.7761e-151
MATCH: gi|1217007651|ref|XM_021787585.1| PREDICTED: Hevea brasilien
9.7761e-151
Our BLAST search has matched our sequence with Vitis vinifera. Let’s check to see if it got it right:
query.description
Take-Home Challenge: Biopython
Experiment with features of biopython using the sequence in the variable
query
that we loaded fromtest.fasta
.
Print the reverse complement of the sequence to screen.
Write your own function to calculate the GC content of this sequence to screen.
Key Points
Biopython is a very useful toolbox for working with sequence data.