Thursday, November 8, 2012

Journal article: Biopython's Bio.Phylo

We're not known for the timeliest reporting here at etalog, but FYI, our article on the Bio.Phylo module for phylogenetics in Biopython is now up in its final form at BMC Bioinformatics:

Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython

It's a quick read. Here's one of the figures we cut from the manuscript; it shows some of the key features of the module:


This manuscript gave us a chance to write about a few things we haven't had a good reason to write about elsewhere -- the design rationale, nice figures, performance, and a couple of real-world use cases.

To get started, though, I recommend the main documentation:

History / a bit of navel-gazing

This project stemmed from a Google Summer of Code project in 2009 to implement a phyloXML parser for Biopython, mentored by Christian Zmasek and Brad Chapman. This was administered by the National Evolutionary Synthesis Center (NESCent); the Open Bioinformatics Foundation (OBF) didn't start administering its own GSoC projects until the following year. It was a fun summer, and I got to become more involved in Biopython as a result.

After GSoC ended, we decided that rather than plug the phyloXML module into Biopython as-is, we could do something akin to SeqIO and AlignIO -- wrap the format-specific parsers (NEXUS and Newick were already supported under Bio.Nexus) under a common interface, and share the core objects. At first we planned to create "TreeIO" and "Tree" modules as in BioPerl, but since those names could be confused with other types of trees in bioinformatics (e.g. from clustering or other low-level algorithms), we settled on "Phylo", with due credit to Rutger Vos's Perl package Bio::Phylo.
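For readers who haven't seen the SeqIO-style pattern, here is a rough sketch of the idea in plain Python: one read() function dispatches on a format name, and every parser returns the same tree object. This is not Bio.Phylo's actual code. The names Clade, read, and the "indented" format are made up for illustration, and the Newick reader is a toy that ignores branch lengths, comments, and quoting.

```python
import io
from dataclasses import dataclass, field


@dataclass
class Clade:
    """Shared core object (hypothetical): every parser returns this type."""
    name: str = ""
    children: list = field(default_factory=list)

    def terminals(self):
        """Return leaf names in left-to-right order."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.terminals()]


def _parse_newick(text):
    """Toy Newick reader: names and nesting only, no branch lengths."""
    text = text.strip().rstrip(";")
    pos = 0

    def clade():
        nonlocal pos
        node = Clade()
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            node.children.append(clade())
            while text[pos] == ",":
                pos += 1  # skip ","
                node.children.append(clade())
            pos += 1  # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        node.name = text[start:pos].strip()
        return node

    return clade()


def _parse_indented(text):
    """Made-up second format: one node per line, two spaces per level."""
    root = Clade()
    stack = [(-1, root)]  # (depth, node) pairs along the current path
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // 2
        node = Clade(name=line.strip())
        while stack[-1][0] >= depth:
            stack.pop()  # climb back up to this node's parent
        stack[-1][1].children.append(node)
        stack.append((depth, node))
    return root


# The common interface: format name -> format-specific parser.
_PARSERS = {"newick": _parse_newick, "indented": _parse_indented}


def read(handle, fmt):
    """Parse one tree from a file-like handle in the named format."""
    try:
        parser = _PARSERS[fmt]
    except KeyError:
        raise ValueError(f"Unknown format: {fmt!r}") from None
    return parser(handle.read())


tree_a = read(io.StringIO("(A,B,(C,D));"), "newick")
tree_b = read(io.StringIO("A\nB\ninner\n  C\n  D\n"), "indented")
print(tree_a.terminals())
print(tree_b.terminals())
```

The point of the design is that code downstream of read() never needs to know which file format the tree came from; adding a format means adding one parser and one dictionary entry.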

The next summer I gave a talk at BOSC 2010 in Boston about this work. My professor was a bit wary of this open-source stuff by that point, so the travel award really helped. (And Airbnb. Boston is not cheap.) The rest is pretty well covered in the BOSC 2012 talk -- the hack continued, deftly shepherded by Peter Cock, and Brandon Invergo arrived bearing gifts of pypaml. In between bouts of "real research", we managed to get the module into a fairly stable state, enough to write it up.