etalog
"The only difference between screwing around and science is writing it down."
Sunday, June 9, 2013
Homebrew chronicles
I've started up another blog on home brewing. The first post covers a Pumpkin Spice Porter and some important lessons learned about water.
Saturday, January 19, 2013
Overly honest methods / cheap tricks for multiple sequence alignment
What's the quickest way to get a so-so sequence alignment? Once you have this,
there are plenty of efficient ways to refine the alignment, infer a tree, build
a profile to search for related sequences, or any combination of these at once.
MUSCLE and MAFFT spend a surprising amount of time computing initial pairwise
distances and computing a guide tree; building a multiple alignment from this
point is quite efficient.
As part of my recent fascination with absurd algorithms and other unpublishable
methods, I looked into this a bit.
For each sequence, let's count the number of occurrences of each k-mer (k to be determined). The similarity between sequences is then the sum of absolute differences in k-mer counts between two sequences.
Let's set aside the evolutionary meaning of the alignment for the moment (gasp) and look for a fast way to determine a reasonable order for performing pairwise alignments of individual sequences and sub-alignments. (MUSCLE and MAFFT choose UPGMA over more "robust" NJ methods, despite or perhaps even because of long-branch attraction, with the same rationale.) Treat the collection of pairwise distances between sequences as a graph, and use Kruskal's algorithm to compute the minimum spanning tree in linear time.
Application: In the tmalign module of Fammer, I combine pairwise alignments of structures, obtained with TMalign in a subprocess, using a minimum spanning tree where the edge weights are the reciprocal of the TM-scores
Pairwise distances
Rigorously/naively, we could compute pairwise alignments with a dynamic programming algorithm like Smith-Waterman and take the pairwise score as the measure of similarity. Running all-versus-all BLAST and taking the best HSP score (similarity) or e-value (distance) would give roughly the same result. But we can do a crappier job of this if we put some thought into it.For each sequence, let's count the number of occurrences of each k-mer (k to be determined). The similarity between sequences is then the sum of absolute differences in k-mer counts between two sequences.
# Count distinct tripeptides in each sequence
from collections import Counter
from Bio import SeqIO
K = 3
def get_kmers(seqstr, k):
for i in range(len(seqstr) - k):
yield seqstr[i:i+k]
def kmer_dist(kmers1, kmers2):
"""Sum of absolute differences between k-mer counts."""
keys = set(kmers1.iterkeys()).union(set(kmers2.iterkeys()))
return sum(abs(kmers1.get(key, 0) - kmers2.get(key, 0))
for key in keys)
records = list(SeqIO.parse(fname, 'fasta'))
lookup = dict((rec, Counter(get_kmers(str(rec.seq), K)))
for rec in records)
# Example:
rec1, rec2 = records[:2]
a_distance = kmer_distance(lookup[rec1], lookup[rec2])
If some of the sequences have repetitive regions — think of paralogs from
closely related species, or within a species — this approach will
over-emphasize the content and size/number of those repeats at the expense of
less common but more meaningful k-mers. We could down-weight abundant k-mers by
taking the square root of each k-mer count before computing pairwise
differences.
from math import sqrt
def kmer_dist(kmers1, kmers2):
keys = set(kmers1.iterkeys()).union(set(kmers2.iterkeys()))
return sum(abs(sqrt(kmers1.get(key, 0)) - sqrt(kmers2.get(key, 0)))
for key in keys)
But why stop at the square root? Reducto ad awesome, let's drop the k-mer counts and simply track whether the k-mer is present in a sequence or not.
def kmer_dist(kmers1, kmers2):
"""Proportion of non-overlapping elements between two sets."""
return (len(kmers1.symmetric_difference(kmers2))
/ float(len(s1.intersection(s2)))
lookup = dict((rec, set(get_kmers(str(rec.seq), K)))
for rec in records)
This also opens up some bit-twiddling opportunities which I have not pursued. MUSCLE, MAFFT and ClustalW all have options to use this trick to quickly estimate pairwise distances for large sequence sets.Non-phylogenetic guide tree
The guide tree determines how the pairs of sequences will be combined. The tree is typically inferred by neighbor-joining based on the computed pairwise distances, an approach that's O(N3) using the standard algorithm (though rumor has it that quadratic time is possible). Pairs of sequences are aligned according to the tips of the tree, then sub-alignments are treated as internal nodes in the tree aligned to each other. (PRANK also uses the guide tree to infer indel states for characters, but that falls into the category of "good" algorithms which we're not interested in here.)Let's set aside the evolutionary meaning of the alignment for the moment (gasp) and look for a fast way to determine a reasonable order for performing pairwise alignments of individual sequences and sub-alignments. (MUSCLE and MAFFT choose UPGMA over more "robust" NJ methods, despite or perhaps even because of long-branch attraction, with the same rationale.) Treat the collection of pairwise distances between sequences as a graph, and use Kruskal's algorithm to compute the minimum spanning tree in linear time.
import networkx
G = networkx.Graph()
for i, rec1 in enumerate(records):
for rec2 in records[i+1:len(records)]:
dist = kmer_distance(lookup[rec1], lookup[rec2])
G.add_edge(rec1.id, rec2.id, weight=dist)
H = networkx.minimum_spanning_tree(G)
# Check: Are the sequences grouped together reasonably?
import pylab
nx.draw_graphviz(H, prog='twopi', node_size=0)
pylab.show()
The edges of the resulting graph give the order of alignment; start with nodes
of degree 1 as individual sequences to align with their only neighbor, then
treat nodes with degree >1 as sub-alignments to align to each other,
prioritizing shorter branches.Application: In the tmalign module of Fammer, I combine pairwise alignments of structures, obtained with TMalign in a subprocess, using a minimum spanning tree where the edge weights are the reciprocal of the TM-scores
Thursday, November 8, 2012
Journal article: Biopython's Bio.Phylo
We're not known for the timeliest reporting here at etalog, but FYI, our article on the Bio.Phylo module for phylogenetics in Biopython is now up in its final form on the BMC Bioinformatics:
Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython
It's a quick read. Here's one of the figures we cut from the manuscript which shows some of the key features of the module:
This manuscript gave us a chance to write about a few things we haven't had a good reason to write about elsewhere -- the design rationale, nice figures, performance, and a couple of real-world use cases.
To get started, though, I recommend the main documentation:
After GSoC ended, we decided that rather than plug the phyloXML module into Biopython as-is, we could do something akin to SeqIO and AlignIO -- wrap the format-specific parsers (NEXUS and Newick were already supported under Bio.Nexus) under a common interface, and share the core objects. At first we planned to create "TreeIO" and "Tree" modules like BioPerl, but as this could lead to confusion with other types of trees in bioinformatics (e.g. from clustering or other low-level algorithms), we settled on "Phylo", with due credit to Rutger Vos's Perl package Bio::Phylo.
The next summer I gave a talk at BOSC 2010 in Boston about this work. My professor was a bit weary of this open-source stuff by that point, so the travel award really helped. (And Airbnb. Boston is not cheap.) The rest is pretty well covered in the BOSC 2012 talk -- the hack continued, deftly shepherded by Peter Cock; Brandon Invergo arrived bearing gifts of pypaml, and we managed to get the module into a fairly stable state in between bouts of "real research", enough to write it up.
Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython
It's a quick read. Here's one of the figures we cut from the manuscript which shows some of the key features of the module:
This manuscript gave us a chance to write about a few things we haven't had a good reason to write about elsewhere -- the design rationale, nice figures, performance, and a couple of real-world use cases.
To get started, though, I recommend the main documentation:
- Biopython Tutorial (Phylo is currently chapter 12)
- Phylo wiki page
- Wiki cookbook
History / a bit of navel-gazing
This project stemmed from a Google Summer of Code project in 2009 to implement a phyloXML parser for Biopython, mentored by Christian Zmasek and Brad Chapman. This was administered by the National Evolutionary Synthesis Center (NESCent); the Open Bioinformatics Foundation (OBF) didn't start administering its own GSoC projects until the following year. It was a fun summer, and I got to become more involved in Biopython as a result.After GSoC ended, we decided that rather than plug the phyloXML module into Biopython as-is, we could do something akin to SeqIO and AlignIO -- wrap the format-specific parsers (NEXUS and Newick were already supported under Bio.Nexus) under a common interface, and share the core objects. At first we planned to create "TreeIO" and "Tree" modules like BioPerl, but as this could lead to confusion with other types of trees in bioinformatics (e.g. from clustering or other low-level algorithms), we settled on "Phylo", with due credit to Rutger Vos's Perl package Bio::Phylo.
The next summer I gave a talk at BOSC 2010 in Boston about this work. My professor was a bit weary of this open-source stuff by that point, so the travel award really helped. (And Airbnb. Boston is not cheap.) The rest is pretty well covered in the BOSC 2012 talk -- the hack continued, deftly shepherded by Peter Cock; Brandon Invergo arrived bearing gifts of pypaml, and we managed to get the module into a fairly stable state in between bouts of "real research", enough to write it up.
Saturday, October 27, 2012
Biopython project update at BOSC 2012
Not so hot off the presses, here are the slides from the talk I gave this summer at the Bioinformatics Open Source Conference (BOSC), a satellite conference of Intelligent Systems for Molecular Biology (ISMB). Since Peter Cock wasn't able to make it out to California this year, he suggested I fill in.
In addition to the usual coverage of new features, a big theme this year was the recurring successes we've had bring in new core developers via Google Summer of Code.
Jan Aerts has also posted the rest of the BOSC 2012 slides.
In addition to the usual coverage of new features, a big theme this year was the recurring successes we've had bring in new core developers via Google Summer of Code.
Jan Aerts has also posted the rest of the BOSC 2012 slides.
Labels:
biopython,
gsoc,
presentation
Wednesday, August 15, 2012
Code Harvest: The Refactoring
I've been hacking on bioinformatics code for four years now, but until now the only work I've really made available to "the community" is in Biopython, mainly Bio.Phylo.
The code I write in the lab is under one big Mercurial repository called esgb; there's a shell script to install everything, including a bunch of scripts, sub-projects and a sprawling Python library called esbglib. Most of my Python programs depend on some functionality in esbglib, and usually Biopython and sometimes SciPy as well.
Having signed the Science Code Manifesto, duty calls for me to bundle some of the programs I've written with the next paper I'm working on, and so I've begun a mighty refactoring of esbglib to extract the general-purpose, reusable components into Python packages. At the moment it looks like I'll end up with two: biofrills and biocma.
The code I write in the lab is under one big Mercurial repository called esgb; there's a shell script to install everything, including a bunch of scripts, sub-projects and a sprawling Python library called esbglib. Most of my Python programs depend on some functionality in esbglib, and usually Biopython and sometimes SciPy as well.
Having signed the Science Code Manifesto, duty calls for me to bundle some of the programs I've written with the next paper I'm working on, and so I've begun a mighty refactoring of esbglib to extract the general-purpose, reusable components into Python packages. At the moment it looks like I'll end up with two: biofrills and biocma.
Labels:
biopython,
programming,
python
Subscribe to:
Posts (Atom)
