Tuesday, December 24, 2013

The blinders of peer review

Does pre-publication peer-review isolate a finding from the field during the process? Sure, and that's partly the point of it, but it can lead to some inconveniences when two related papers from separate groups undergo peer review at the same time.

Earlier this year I published a bioinformatic analysis of the rhoptry kinases (ROPK), a lineage-specific family of signaling proteins involved in the invasion mechanisms of Toxoplasma gondii, Eimeria tenella and related eukaryotic parasites. During this study I found four T. gondii proteins (and their orthologs in other species) that have the hallmarks of ROPKs, including a predicted signal peptide, a protein kinase domain more similar to other ROPKs than to any other known kinases, and mRNA expression patterns matching those of other ROPKs. I named these genes numerically starting after the highest-numbered ROPK previously published (ROP46).

To informally reserve the names ahead of the publication of my own article, I posted notes on the corresponding ToxoDB gene pages: ROP47, ROP48, ROP49 and ROP50. My professor and I made some inquiries with other T. gondii researchers to see if it would be possible to confirm the localization of these proteins to the rhoptry organelle, in order to solidify our argument. Without a peer-reviewed publication to point to, though, this seemed to be the most we could do to promote the new gene names.

In parallel, another well-regarded lab that specializes in T. gondii rhoptry proteins, including but not limited to ROPKs, investigated the localization and function of three other proteins that mRNA expression had indicated were associated with other rhoptry proteins. It's great work. However, their paper and ours both passed through peer review at roughly the same time (earlier this year); we both followed the same numerical naming scheme for rhoptry proteins, starting after ROP46; and unfortunately, we ended up assigning the names ROP47 and ROP48 to different T. gondii proteins.

Crud.

How could this confusing situation have been avoided? EuPathDB is widely used, but it's not the primary source for gene names and accessions, and a user-submitted comment alone has fairly limited visibility. I presented a poster at the 2012 Molecular Parasitology Meeting, where many of the active Toxo enthusiasts gather each year, but the choice of new gene names was a minor detail on the poster. Heck, I even had breakfast with the other group's PI, but we only talked about curious features of established rhoptry proteins, not the novel ROPs we were each about to propose.

The only way to really claim a gene name is with a peer-reviewed publication.


* * *

Until now I didn't really grasp the importance of public preprint servers like arXiv, BioRxiv and PeerJ PrePrints — at least in the life sciences where a good article can be published outside a glamor mag within a few months. (In physics and mathematics, peer review and publication typically take much longer.) It was hard enough to get people I knew to review my articles before submitting them to a journal; would anyone really leave useful comments out of the blue if I posted an unreviewed paper on a preprint server? Answer: Maybe, but there's more to preprints than that.

"Competitors" have their own projects, usually planned around their own grants. They could drop everything and copy your idea if they saw it. More likely, they will do the same thing they'll do when they see your final published paper, which is to take this new information into account as they pursue their own projects. You do want to make an impact on the field, don't you?

Pre-publication peer-review is a well-established system for gathering detailed suggestions from uninvolved colleagues, a useful stick to force authors to improve their manuscripts, and sometimes a filter for junk. F1000 has an innovative process of publishing submissions first after a cursory screening, then collecting peer reviews and allowing authors to revise the manuscript at their leisure, apparently. Once a manuscript has been reviewed, revised and approved, it receives a tag indicating that it has been properly peer-reviewed. PeerJ takes a more conservative approach, hosting a preprint server alongside but separate from their peer-reviewed articles. Are either of these the way forward?

F1000 is new on the scene, and it may be too soon to tell if this is going to be a success. For one thing, will authors be motivated enough to correct their manuscripts promptly? PLoS One once fought a mighty battle against the perception that they weren't peer-reviewed. That stigma came out of thin air, and has been overcome — but will F1000 have to fight the same battle again, since their articles really are a mix of various states of peer-review? I hope not, because many scientists could benefit from having a few holes poked in the wall of pre-publication peer review.

Tuesday, October 8, 2013

Old Cajun wisdom for the young scientist

During the summer after I started grad school, Paul Graham posted an interesting article: "Ramen Profitable." I thought it was inspiring in two ways:
  1. It made the point that a startup founder is in a much safer, more comfortable and more productive position once enough revenue is coming in to support minimal cost-of-living; raising additional funding is no longer the most important thing. Replace "founder" with "grad student"/"postdoc"/"omg it never ends," and "funding" with "funding" (well, it sounds completely different when you change the context), and you've explained why applying for long-shot grants is such a resource sink, yet we all do it anyway, and why a grim but reliable stipend is an acceptable equilibrium for many academics.
  2. The article included a vague but basically great recipe for beans and rice in the footnotes.
I cooked variations of this recipe through the rest of grad school, and now as a postdoc. It hits the sweet spots for flavor, cost, ease of prep, and sheer volume of leftovers.  Here's the specific recipe I converged on over the years.

Red and Black Beans and Rice (SEC)

— or —

Life of the Mind Beans and Rice (other conferences)

In a rice cooker, mix together:
  • 1 c. dry rice (white, brown or parboiled)
  • 1/2 c. quinoa
  • 2 1/2 to 3 c. water (see rice packaging and past experience)
  • (optional: 1 Tbsp. white vinegar)
  • (optional: 1 tsp. garlic powder)

Start the rice cooker and let it do its thing. Heat in a large pan over medium:
  • 1-2 Tbsp. vegetable oil

Add:
  • 1 medium/large or 2 small yellow onion(s), chopped
  • 4-6 cloves garlic, chopped; or 2 Tbsp. garlic paste

Cook until onions are translucent, about 3 minutes. Add:
  • 2-4 oz. andouille sausage; or spicy pork sausage; or other spicy sausage; chopped
  • 2 stalks celery, chopped
  • 1 green bell pepper, chopped
  • 1/4 c. okra, chopped
  • (optional: 1/2 to 1 jalapeno pepper, chopped)

Stir casually for 4-5 minutes. (This is a good time to grab a beer from the fridge.) Season with:
  • 1 tsp. black pepper
  • 1/2 to 1 Tbsp. paprika
  • 1/2 to 1 Tbsp. cumin
  • 1 "serving" condensed chicken broth; or 1 cube chicken boullion, crumbled
  • (optional: 1/4 to 1/2 tsp. red/cayenne pepper)

Stir for 1 minute to mix thoroughly. Open:
  • 1 14-oz can black beans
  • 1 14-oz can red beans; or kidney beans; or more black beans

Pour in the liquid and about 1/3 to 1/2 of the beans from each can into the pan, and stir to mix. Put the rest of the beans in a small bowl or sturdy (Pyrex) beaker and mash somewhat. Add the semi-mashed beans to the pan. Stir for another 3-5 minutes to let the stew thicken.

(When the rice cooker finishes, take off the glass lid and drape a cheesecloth or thin towel over it for a few minutes while the rice cools a little.)

Serve the beans over the rice/quinoa blend in a shallow bowl.

Leftovers are even better.

Sunday, June 9, 2013

Homebrew chronicles

I've started up another blog on home brewing. The first post covers a Pumpkin Spice Porter and some important lessons learned about water.

Saturday, January 19, 2013

Overly honest methods / cheap tricks for multiple sequence alignment

What's the quickest way to get a so-so sequence alignment? Once you have this, there are plenty of efficient ways to refine the alignment, infer a tree, build a profile to search for related sequences, or any combination of these at once. MUSCLE and MAFFT spend a surprising amount of time computing initial pairwise distances and computing a guide tree; building a multiple alignment from this point is quite efficient. As part of my recent fascination with absurd algorithms and other unpublishable methods, I looked into this a bit.

Pairwise distances

Rigorously/naively, we could compute pairwise alignments with a dynamic programming algorithm like Smith-Waterman and take the pairwise score as the measure of similarity. Running all-versus-all BLAST and taking the best HSP score (similarity) or e-value (distance) would give roughly the same result. But we can do a crappier job of this if we put some thought into it.

For each sequence, let's count the number of occurrences of each k-mer (k to be determined). The similarity between sequences is then the sum of absolute differences in k-mer counts between two sequences.
# Count distinct tripeptides in each sequence

from collections import Counter
from Bio import SeqIO

K = 3

def get_kmers(seqstr, k):
    for i in range(len(seqstr) - k):
        yield seqstr[i:i+k]

def kmer_dist(kmers1, kmers2):
    """Sum of absolute differences between k-mer counts."""
    keys = set(kmers1.iterkeys()).union(set(kmers2.iterkeys()))
    return sum(abs(kmers1.get(key, 0) - kmers2.get(key, 0))
               for key in keys)

records = list(SeqIO.parse(fname, 'fasta'))
lookup = dict((rec, Counter(get_kmers(str(rec.seq), K)))
              for rec in records)

# Example:
rec1, rec2 = records[:2]
a_distance = kmer_distance(lookup[rec1], lookup[rec2])
If some of the sequences have repetitive regions — think of paralogs from closely related species, or within a species — this approach will over-emphasize the content and size/number of those repeats at the expense of less common but more meaningful k-mers. We could down-weight abundant k-mers by taking the square root of each k-mer count before computing pairwise differences.
from math import sqrt

def kmer_dist(kmers1, kmers2):
    keys = set(kmers1.iterkeys()).union(set(kmers2.iterkeys()))
    return sum(abs(sqrt(kmers1.get(key, 0)) - sqrt(kmers2.get(key, 0)))
               for key in keys)
But why stop at the square root? Reducto ad awesome, let's drop the k-mer counts and simply track whether the k-mer is present in a sequence or not.
def kmer_dist(kmers1, kmers2):
    """Proportion of non-overlapping elements between two sets."""
    return (len(kmers1.symmetric_difference(kmers2))
            / float(len(s1.intersection(s2)))

lookup = dict((rec, set(get_kmers(str(rec.seq), K)))
              for rec in records)
This also opens up some bit-twiddling opportunities which I have not pursued. MUSCLE, MAFFT and ClustalW all have options to use this trick to quickly estimate pairwise distances for large sequence sets.

Non-phylogenetic guide tree

The guide tree determines how the pairs of sequences will be combined. The tree is typically inferred by neighbor-joining based on the computed pairwise distances, an approach that's O(N3) using the standard algorithm (though rumor has it that quadratic time is possible). Pairs of sequences are aligned according to the tips of the tree, then sub-alignments are treated as internal nodes in the tree aligned to each other. (PRANK also uses the guide tree to infer indel states for characters, but that falls into the category of "good" algorithms which we're not interested in here.)

Let's set aside the evolutionary meaning of the alignment for the moment (gasp) and look for a fast way to determine a reasonable order for performing pairwise alignments of individual sequences and sub-alignments. (MUSCLE and MAFFT choose UPGMA over more "robust" NJ methods, despite or perhaps even because of long-branch attraction, with the same rationale.) Treat the collection of pairwise distances between sequences as a graph, and use Kruskal's algorithm to compute the minimum spanning tree in linear time.
import networkx
G = networkx.Graph()

for i, rec1 in enumerate(records):
    for rec2 in records[i+1:len(records)]:
        dist = kmer_distance(lookup[rec1], lookup[rec2])
        G.add_edge(rec1.id, rec2.id, weight=dist)

H = networkx.minimum_spanning_tree(G)

# Check: Are the sequences grouped together reasonably?
import pylab
nx.draw_graphviz(H, prog='twopi', node_size=0)
pylab.show()
The edges of the resulting graph give the order of alignment; start with nodes of degree 1 as individual sequences to align with their only neighbor, then treat nodes with degree >1 as sub-alignments to align to each other, prioritizing shorter branches.

Application: In the tmalign module of Fammer, I combine pairwise alignments of structures, obtained with TMalign in a subprocess, using a minimum spanning tree where the edge weights are the reciprocal of the TM-scores

Thursday, November 8, 2012

Journal article: Biopython's Bio.Phylo

We're not known for the timeliest reporting here at etalog, but FYI, our article on the Bio.Phylo module for phylogenetics in Biopython is now up in its final form on the BMC Bioinformatics:

Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython

It's a quick read. Here's one of the figures we cut from the manuscript which shows some of the key features of the module:


This manuscript gave us a chance to write about a few things we haven't had a good reason to write about elsewhere -- the design rationale, nice figures, performance, and a couple of real-world use cases.

To get started, though, I recommend the main documentation:

History / a bit of navel-gazing

This project stemmed from a Google Summer of Code project in 2009 to implement a phyloXML parser for Biopython, mentored by Christian Zmasek and Brad Chapman. This was administered by the National Evolutionary Synthesis Center (NESCent); the Open Bioinformatics Foundation (OBF) didn't start administering its own GSoC projects until the following year. It was a fun summer, and I got to become more involved in Biopython as a result.

After GSoC ended, we decided that rather than plug the phyloXML module into Biopython as-is, we could do something akin to SeqIO and AlignIO -- wrap the format-specific parsers (NEXUS and Newick were already supported under Bio.Nexus) under a common interface, and share the core objects. At first we planned to create "TreeIO" and "Tree" modules like BioPerl, but as this could lead to confusion with other types of trees in bioinformatics (e.g. from clustering or other low-level algorithms), we settled on "Phylo", with due credit to Rutger Vos's Perl package Bio::Phylo.

The next summer I gave a talk at BOSC 2010 in Boston about this work. My professor was a bit weary of this open-source stuff by that point, so the travel award really helped. (And Airbnb. Boston is not cheap.) The rest is pretty well covered in the BOSC 2012 talk -- the hack continued, deftly shepherded by Peter Cock; Brandon Invergo arrived bearing gifts of pypaml, and we managed to get the module into a fairly stable state in between bouts of "real research", enough to write it up.