Saturday, January 19, 2013

Overly honest methods / cheap tricks for multiple sequence alignment

What's the quickest way to get a so-so sequence alignment? Once you have this, there are plenty of efficient ways to refine the alignment, infer a tree, build a profile to search for related sequences, or any combination of these at once. MUSCLE and MAFFT spend a surprising amount of time computing initial pairwise distances and computing a guide tree; building a multiple alignment from this point is quite efficient. As part of my recent fascination with absurd algorithms and other unpublishable methods, I looked into this a bit.

Pairwise distances

Rigorously/naively, we could compute pairwise alignments with a dynamic programming algorithm like Smith-Waterman and take the pairwise score as the measure of similarity. Running all-versus-all BLAST and taking the best HSP score (similarity) or e-value (distance) would give roughly the same result. But we can do a crappier job of this if we put some thought into it.

For each sequence, let's count the number of occurrences of each k-mer (k to be determined). The similarity between sequences is then the sum of absolute differences in k-mer counts between two sequences.
# Count distinct tripeptides in each sequence

from collections import Counter
from Bio import SeqIO

K = 3

def get_kmers(seqstr, k):
    for i in range(len(seqstr) - k):
        yield seqstr[i:i+k]

def kmer_dist(kmers1, kmers2):
    """Sum of absolute differences between k-mer counts."""
    keys = set(kmers1.iterkeys()).union(set(kmers2.iterkeys()))
    return sum(abs(kmers1.get(key, 0) - kmers2.get(key, 0))
               for key in keys)

records = list(SeqIO.parse(fname, 'fasta'))
lookup = dict((rec, Counter(get_kmers(str(rec.seq), K)))
              for rec in records)

# Example:
rec1, rec2 = records[:2]
a_distance = kmer_distance(lookup[rec1], lookup[rec2])
If some of the sequences have repetitive regions — think of paralogs from closely related species, or within a species — this approach will over-emphasize the content and size/number of those repeats at the expense of less common but more meaningful k-mers. We could down-weight abundant k-mers by taking the square root of each k-mer count before computing pairwise differences.
from math import sqrt

def kmer_dist(kmers1, kmers2):
    keys = set(kmers1.iterkeys()).union(set(kmers2.iterkeys()))
    return sum(abs(sqrt(kmers1.get(key, 0)) - sqrt(kmers2.get(key, 0)))
               for key in keys)
But why stop at the square root? Reducto ad awesome, let's drop the k-mer counts and simply track whether the k-mer is present in a sequence or not.
def kmer_dist(kmers1, kmers2):
    """Proportion of non-overlapping elements between two sets."""
    return (len(kmers1.symmetric_difference(kmers2))
            / float(len(s1.intersection(s2)))

lookup = dict((rec, set(get_kmers(str(rec.seq), K)))
              for rec in records)
This also opens up some bit-twiddling opportunities which I have not pursued. MUSCLE, MAFFT and ClustalW all have options to use this trick to quickly estimate pairwise distances for large sequence sets.

Non-phylogenetic guide tree

The guide tree determines how the pairs of sequences will be combined. The tree is typically inferred by neighbor-joining based on the computed pairwise distances, an approach that's O(N3) using the standard algorithm (though rumor has it that quadratic time is possible). Pairs of sequences are aligned according to the tips of the tree, then sub-alignments are treated as internal nodes in the tree aligned to each other. (PRANK also uses the guide tree to infer indel states for characters, but that falls into the category of "good" algorithms which we're not interested in here.)

Let's set aside the evolutionary meaning of the alignment for the moment (gasp) and look for a fast way to determine a reasonable order for performing pairwise alignments of individual sequences and sub-alignments. (MUSCLE and MAFFT choose UPGMA over more "robust" NJ methods, despite or perhaps even because of long-branch attraction, with the same rationale.) Treat the collection of pairwise distances between sequences as a graph, and use Kruskal's algorithm to compute the minimum spanning tree in linear time.
import networkx
G = networkx.Graph()

for i, rec1 in enumerate(records):
    for rec2 in records[i+1:len(records)]:
        dist = kmer_distance(lookup[rec1], lookup[rec2])
        G.add_edge(,, weight=dist)

H = networkx.minimum_spanning_tree(G)

# Check: Are the sequences grouped together reasonably?
import pylab
nx.draw_graphviz(H, prog='twopi', node_size=0)
The edges of the resulting graph give the order of alignment; start with nodes of degree 1 as individual sequences to align with their only neighbor, then treat nodes with degree >1 as sub-alignments to align to each other, prioritizing shorter branches.

Application: In the tmalign module of Fammer, I combine pairwise alignments of 3D protein structures, obtained with TMalign in a subprocess, using a minimum spanning tree where the edge weights are the reciprocal of the TM-scores