EEOB/BCB 546: Programming with Python

Introduction to Biopython

Overview

Questions
  • What does Biopython do?

  • How does Biopython handle sequences?

  • How can I access sequences and data from Genbank?


Objectives
  • Learn about the Seq and SeqRecord objects.

  • Read in sequences from FASTA files.

  • Download a sequence record directly from Genbank using the NCBI E-utilities.

Biopython Background

Biopython is a freely available package for working with molecular biological data. In this lesson, we will just cover some basics of working with Biopython. The developers of this package have written a comprehensive tutorial and cookbook.

The tutorial we are working with today was written by Dr. Iddo Friedberg and Dr. Stuart Brown

What can Biopython do?

The documentation page for Biopython provides a list of the many different tools in the package:

Getting Started

Download Example Files

This lesson will use the example files in the biopython folder of the course files in the BCB546-Spring2022 repository. Download these files and make sure they are in the same directory where you are creating your Jupyter notebook.

Install Biopython and Create a Jupyter Notebook

The easiest way to install the Biopython tools is to use conda. From your terminal, you simply need to execute the following:

$ conda install biopython

Now create a new Jupyter notebook for this lesson.

Working with Biopython

The Seq Object

The Seq object class is simple and fundamental for a lot of Biopython work. A Seq object can contain DNA, RNA, or protein. It contains a string (the sequence) and a defined alphabet for that string. The alphabets are actually defined objects such as IUPACAmbiguousDNA or IUPACProtein. A Seq object with a DNA alphabet has some different methods than one with an Amino Acid alphabet.

First, import the Seq object from Biopython

from Bio.Seq import Seq

Now we can create a Seq object:

my_seq = Seq("AGTACACTGGT")
my_seq
Seq('AGTACACTGGT')

The nice thing about the sequence object is that it can be treated just like a Python string object.

print(my_seq[:3])
AGT

Seq objects also have string methods like .count()

my_seq.count('AC')
2

And you can use functions that act on strings like len()

len(my_seq)
11

Seq objects also have special methods. For example, you can get the reverse complement of a sequence:

my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
print(my_seq.reverse_complement())
GCGATTTTCGATCCTATATAGGCCCATCGATC

Just like strings in Python, the Seq object is immutable, meaning you cannot change it. If you try to change one of the sites in this sequence, you will get an error. If you want an editable sequence object, you will need to create a MutableSeq object.

from Bio.Seq import MutableSeq
mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")

Now you can try changing the nucleotide at index 3 to 'G'.

The SeqRecord Object

Biopython’s SeqRecord is a complex object that contains a Seq object as well as other fields for attributes of that sequence (i.e., metadata). These attributes are also called “annotation fields”:

You can create a SeqRecord by giving the constructor a Seq object:

from Bio.SeqRecord import SeqRecord
simple_seq = Seq("GATC")
simple_seq_r = SeqRecord(simple_seq)

And you can provide attributes:

simple_seq_r.id = "AC12345"
simple_seq_r.description = "This sequence is pretend."
print(simple_seq_r)
ID: AC12345
Name: <unknown name>
Description: This sequence is pretend.
Number of features: 0
Seq('GATC')

Reading Sequences from FASTA files

SeqIO enables reading in sequences from FASTA files and storing the data in a SeqRecord. Addtionally SeqIO provides tools for writing sequence data to a file.

We will read in the example file NC_005816.fna using SeqIO.

from Bio import SeqIO
record = SeqIO.read("NC_005816.fna", "fasta")

Find out more about this sequence

Use string methods and the SeqRecord attributes to get the length of the sequence and the species name.

Solution

Get the length using len().

len(record.seq)
9609

The species name is given in the description of this FASTA file

record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

Using SeqIO we can read in several sequences from a file and store them in a list of SeqRecord objects from a file. The file example.fasta looks like this:

>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
>EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
>EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG

With Biopython, we can use the SeqIO.parse() function to obtain the three sequences in this file

handle = open("example.fasta", "r") 
seq_list = list(SeqIO.parse(handle, "fasta"))
handle.close()
print(seq_list[0].seq)
CCCTTCTTGTCTTCAGCGTTTCTCC

In the example above, we open the file and assign it to the variable handle which acts as a pointer to the file contents.

Direct Access to GenBank

BioPython has modules that can directly access databases over the Internet using the Entrez module. This uses the NCBI Efetch service, which works on many NCBI databases including protein and PubMed literature citations. With a few tweaks, this code could be used to download a list of GenBank ID’s and save them as FASTA or GenBank files.

Before using the online NCBI resources, it is important to be aware of the user requirements. If you abuse their system (whether on purpose or on accident), they will block your access for some time. You can find the requirements in the NCBI E-utilities guide.

First, you are required to provide NCBI with your identity so that you can be contacted if there is a problem. This also limits abuse of this system so that their servers aren’t overwhelmed. If you are identified as someone who is excessively using the E-utilities, NCBI will contact you before you are blocked.

The quote below from the NCBI guide gives you a sense of what constitutes appropriate usage of the E-utility servers:

In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays.

Enter your own email address in place of <enter your email address>:

from Bio import Entrez
Entrez.email = "<enter your email address>"

Now we can fetch a Genbank record:

handle = Entrez.efetch(db="nucleotide", id="DQ137224", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
print(record)
ID: DQ137224.1
Name: DQ137224
Description: Megadyptes antipodes voucher JD64A cytochrome b (cytb) gene, partial cds; mitochondrial
Number of features: 3
/molecule_type=DNA
/topology=linear
/data_file_division=VRT
/date=26-JUL-2016
/accessions=['DQ137224']
/sequence_version=1
/keywords=['']
/source=mitochondrion Megadyptes antipodes (Yellow-eyed penguin)
/organism=Megadyptes antipodes
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Archelosauria', 'Archosauria', 'Dinosauria', 'Saurischia', 'Theropoda', 'Coelurosauria', 'Aves', 'Neognathae', 'Sphenisciformes', 'Spheniscidae', 'Megadyptes']
/references=[Reference(title='Multiple gene evidence for expansion of extant penguins out of Antarctica due to global cooling', ...), Reference(title='Direct Submission', ...)]
Seq('ACACAAATTCTAACTGGCCTCCTACTGGCCGCCCACTACACTGCAGACACAACC...AGC', IUPACAmbiguousDNA())

BLAST

BioPython makes it easy to work with NCBI’s BLAST. To run blast over the internet, we can use qblast(). For this we must import the NCBIWWW module:

from Bio.Blast import NCBIWWW

You can call the help() function on NCBIWWW.qblast to inspect how this function works. This will return all of the parameters of qblast so that you can understand how to specify your query correctly.

help(NCBIWWW.qblast)

Next we can read in a sequence that is stored in a FASTA file called test.fasta.

query = SeqIO.read("test.fasta", format="fasta")

The variable we created called query is a sequence stored in a SeqRecord object.

To run a BLAST search on the sequence from our FASTA file, we simply have to specify the search program (blastn) and the database (nt). The last argument is the Seq object stored in our query.

result_handle = NCBIWWW.qblast("blastn", "nt", query.seq)

Note that this might not work for everyone in class. It is possible for NCBI to throttle non-interactive users.

Once we have the results of our BLAST, we can store them in an XML file.

blast_file = open("my_blast.xml", "w")
blast_file.write(result_handle.read())

Once we have stored the results, it’s best to close all of our open file handles:

blast_file.close()
result_handle.close()

We created an XML file containing our BLAST results. This is now easy to parse using the NCBIXML tools:

from Bio.Blast import NCBIXML
handle = open("my_blast.xml")
blast_record = NCBIXML.read(handle)

Now that we have read in the file, we can print each of the hits:

for hit in blast_record.descriptions: 
    print(hit.title)
    print(hit.e)
gi|1105484513|ref|XM_002284686.3| PREDICTED: Vitis vinifera cold-regulated 413 plasma membrane protein 2 (LOC100248690), mRNA
0.0
gi|1420088022|gb|MG722853.1| Vitis vinifera cold-regulated 413 inner membrane protein 2 mRNA, complete cds
0.0
gi|123704572|emb|AM483681.1| Vitis vinifera, whole genome shotgun sequence, contig VV78X045699.9, clone ENTAV 115
0.0

...

gi|1027107741|ref|XM_008238505.2| PREDICTED: Prunus mume cold-regulated 413 plasma membrane protein 1-like (LOC103335494), mRNA
1.04564e-118

We can also view the alignments for each of the BLAST hits:

for hit in blast_record.alignments:
    for hsp in hit.hsps:
      print(hit.title) 
      print(hsp.expect)
      print(hsp.query[0:75] + '...')
      print(hsp.match[0:75] + '...') 
      print(hsp.sbjct[0:75] + '...')
gi|1105484513|ref|XM_002284686.3| PREDICTED: Vitis vinifera cold-regulated 413 plasma membrane protein 2 (LOC100248690), mRNA
0.0
TACTCTACAGTCTCTGACTTTGTAAGCTTCGCGCTTCTTCTCCTTTTTCTCTCTGGGGAAAGATTTTCCCTTTCT...
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
TACTCTACAGTCTCTGACTTTGTAAGCTTCGCGCTTCTTCTCCTTTTTCTCTCTGGGGAAAGATTTTCCCTTTCT...

...

gi|1027107741|ref|XM_008238505.2| PREDICTED: Prunus mume cold-regulated 413 plasma membrane protein 1-like (LOC103335494), mRNA
1.04564e-118
ATTGAAGATGGGGAAAAAGGGTTACTTGGCGATGAGGACTGACACTGATACTACTGATTTGATCAGTTCTGATCT...
|||| ||||||  ||| || | ||||||   |||| ||||||  | ||| | | ||| ||||||   || |||||...
ATTGGAGATGGCAAAACAGAGCTACTTGAAAATGATGACTGAACCAGATGCAAATGAATTGATCCACTCCGATCT...

Often a BLAST search will return many matches for a single query, as is the case for this example. This is why it is best to save them in an XML file. Using NCBIXML.parse() enables us to evaluate each of the BLAST records. We can specify a threshold so that we can easily inspect the closest matches.

E_VALUE_THRESH = 1e-150
for record in NCBIXML.parse(open("my_blast.xml")):
    for align in record.alignments: 
        for hsp in align.hsps:
            if hsp.expect < E_VALUE_THRESH: 
                print("MATCH: %s " % align.title[:60]) 
                print(hsp.expect)
MATCH: gi|1105484513|ref|XM_002284686.3| PREDICTED: Vitis vinifera  
0.0
MATCH: gi|1420088022|gb|MG722853.1| Vitis vinifera cold-regulated 4 
0.0
MATCH: gi|123704572|emb|AM483681.1| Vitis vinifera, whole genome sh 
0.0
MATCH: gi|1217007653|ref|XM_021787586.1| PREDICTED: Hevea brasilien 
9.7761e-151
MATCH: gi|1217007651|ref|XM_021787585.1| PREDICTED: Hevea brasilien 
9.7761e-151

Our BLAST search has matched our sequence with Vitis vinifera. Let’s check to see if it got it right:

query.description

Take-Home Challenge: Biopython

Experiment with features of biopython using the sequence in the variable query that we loaded from test.fasta.

  1. Print the reverse complement of the sequence to screen.

  2. Write your own function to calculate the GC content of this sequence to screen.

Key Points