a frictionless surface: 2012

Thursday, November 8, 2012

Journal article: Biopython's Bio.Phylo

We're not known for the timeliest reporting here at etalog, but FYI, our article on the Bio.Phylo module for phylogenetics in Biopython is now up in its final form at BMC Bioinformatics:

Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython

It's a quick read. Here's one of the figures we cut from the manuscript, which shows some of the key features of the module:

This manuscript gave us a chance to write about a few things we haven't had a good reason to write about elsewhere -- the design rationale, nice figures, performance, and a couple of real-world use cases.

To get started, though, I recommend the main documentation:

Biopython Tutorial (Phylo is currently chapter 12)
Phylo wiki page
Wiki cookbook

History / a bit of navel-gazing

This project stemmed from a Google Summer of Code project in 2009 to implement a phyloXML parser for Biopython, mentored by Christian Zmasek and Brad Chapman. This was administered by the National Evolutionary Synthesis Center (NESCent); the Open Bioinformatics Foundation (OBF) didn't start administering its own GSoC projects until the following year. It was a fun summer, and I got to become more involved in Biopython as a result.

After GSoC ended, we decided that rather than plug the phyloXML module into Biopython as-is, we could do something akin to SeqIO and AlignIO -- wrap the format-specific parsers (NEXUS and Newick were already supported under Bio.Nexus) under a common interface, and share the core objects. At first we planned to create "TreeIO" and "Tree" modules like BioPerl, but as this could lead to confusion with other types of trees in bioinformatics (e.g. from clustering or other low-level algorithms), we settled on "Phylo", with due credit to Rutger Vos's Perl package Bio::Phylo.
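The "common interface" idea can be sketched in a few lines. This is only an illustration of the pattern, not Bio.Phylo's actual code: the `Clade` class, the `PARSERS` registry, the `read` function and the toy Newick parser below are all hypothetical names made up for this example, and the parser handles names and nesting only (no branch lengths, comments or quoting).

```python
from io import StringIO


class Clade:
    """Minimal shared tree node (hypothetical; not Biopython's class)."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = children or []

    def terminals(self):
        """Return the leaf nodes below this clade, left to right."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.terminals()]


def parse_newick(handle):
    """Toy Newick parser: names and nesting only, no branch lengths."""
    text = handle.read().strip().rstrip(";")
    pos = 0

    def parse_clade():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_clade())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_clade())
            pos += 1  # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return Clade(text[start:pos] or None, children)

    return parse_clade()


# Registry mapping a format name to its parser. Every parser returns the
# same Clade objects, so downstream code never needs to know the format.
PARSERS = {"newick": parse_newick}


def read(handle, fmt):
    """Single entry point that dispatches on the format name."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"Unknown format: {fmt!r}") from None
    return parser(handle)


tree = read(StringIO("((A,B),C);"), "newick")
leaf_names = [leaf.name for leaf in tree.terminals()]
```

Supporting another format in this sketch means writing one more parser that returns `Clade` objects and adding it to `PARSERS`; callers keep using `read` unchanged.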

The next summer I gave a talk at BOSC 2010 in Boston about this work. My professor was a bit weary of this open-source stuff by that point, so the travel award really helped. (And Airbnb. Boston is not cheap.) The rest is pretty well covered in the BOSC 2012 talk -- the hack continued, deftly shepherded by Peter Cock; Brandon Invergo arrived bearing gifts of pypaml, and we managed to get the module into a fairly stable state in between bouts of "real research", enough to write it up.

Saturday, October 27, 2012

Biopython project update at BOSC 2012

Not so hot off the presses, here are the slides from the talk I gave this summer at the Bioinformatics Open Source Conference (BOSC), a satellite conference of Intelligent Systems for Molecular Biology (ISMB). Since Peter Cock wasn't able to make it out to California this year, he suggested I fill in.

In addition to the usual coverage of new features, a big theme this year was the recurring success we've had bringing in new core developers via Google Summer of Code.

Biopython Project Update (BOSC 2012) from Eric Talevich

Jan Aerts has also posted the rest of the BOSC 2012 slides.

Wednesday, August 15, 2012

Code Harvest: The Refactoring

I've been hacking on bioinformatics code for four years now, but so far the only work I've really made available to "the community" is in Biopython, mainly Bio.Phylo.

The code I write in the lab is under one big Mercurial repository called esgb; there's a shell script to install everything, including a bunch of scripts, sub-projects and a sprawling Python library called esbglib. Most of my Python programs depend on some functionality in esbglib, and usually Biopython and sometimes SciPy as well.

Having signed the Science Code Manifesto, duty calls for me to bundle some of the programs I've written with the next paper I'm working on, and so I've begun a mighty refactoring of esbglib to extract the general-purpose, reusable components into Python packages. At the moment it looks like I'll end up with two: biofrills and biocma.

The well-organized data science project

Someone recently asked me about the basic setup a computational scientist needs to conduct research efficiently. I'm pretty satisfied with my current arrangement, which was inspired by this: "A Quick Guide to Organizing Computational Biology Projects"

My work is organized into individual "projects" which are each supposed to become papers at some point. I keep each project in Dropbox to ensure everything is synced and backed up remotely all the time -- no file left behind. I also use Mendeley, with a folder for each project's references. Mendeley can generate a project-specific BibTeX file from a folder.

A well-organized project might look like this:

Building an analysis: How to avoid repeating intermediate tasks in a computational pipeline

In my projects, I tend to start with a simple analysis of a limited dataset, then incrementally expand on it with more data and deeper analyses. This means each time I update the data (e.g. add another species' protein sequences) or add another step to the analysis pipeline, the simplest thing to do is re-run everything -- even though only a small part of the pipeline actually needs to be re-run.

This is a common problem in bioinformatics:
http://biostar.stackexchange.com/questions/79/how-to-organize-a-pipeline-of-small-scripts-together

How can we automate a pipeline like this, without running it all from scratch each time? This is the same problem faced when compiling large programs, and that particular case has been solved fairly well by build tools.
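The core trick build tools use is simple enough to sketch in Python: re-run a step only if its output is missing or older than any of its inputs. This is a minimal sketch of that timestamp check, not a replacement for a real build tool; the function names `is_stale`, `build` and `concatenate` are made up for this example, and it assumes each step reads some input files and writes one output file.

```python
import os


def is_stale(target, sources):
    """True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build(target, sources, action):
    """Run action(sources, target) only if target is out of date.

    Returns True if the step was run, False if it was skipped.
    """
    if is_stale(target, sources):
        action(sources, target)
        return True
    return False


def concatenate(sources, target):
    """Example pipeline step (hypothetical): join input files into one."""
    with open(target, "w") as out:
        for src in sources:
            with open(src) as infile:
                out.write(infile.read())
```

A pipeline is then a sequence of `build` calls in dependency order, e.g. `build("all.fasta", ["human.fasta", "mouse.fasta"], concatenate)` (file names here are placeholders). Touching one input re-runs only the steps downstream of it. Note that modification times are a heuristic: a file that is rewritten with identical contents still triggers a rebuild.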
