Wednesday, August 15, 2012

Code Harvest: The Refactoring

I've been hacking on bioinformatics code for four years, but so far the only work I've really made available to "the community" is in Biopython, mainly Bio.Phylo.

The code I write in the lab lives in one big Mercurial repository called esgb; there's a shell script to install everything, including a bunch of scripts, sub-projects and a sprawling Python library called esbglib. Most of my Python programs depend on some functionality in esbglib, usually on Biopython, and sometimes on SciPy as well.

Having signed the Science Code Manifesto, duty calls for me to bundle some of the programs I've written with the next paper I'm working on, and so I've begun a mighty refactoring of esbglib to extract the general-purpose, reusable components into Python packages. At the moment it looks like I'll end up with two: biofrills and biocma.



BioFrills

This package contains general-purpose sequence analysis functions that I can't merge into Biopython, for one reason or another:
  • Research-grade approaches (e.g. heuristic solutions for problems with no efficient exact solution, like estimating the effective number of independent sequences in an alignment)
  • Cython versions of some functions for speed-up (currently just one, summing BLOSUM62 scores for a pairwise alignment)
  • Code that supports only Python 2.6 or 2.7 (this includes recent PyPy)
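
To give a feel for the kind of function that gets the Cython treatment, here is a rough pure-Python sketch of summing substitution scores over a pairwise alignment. It is not the biofrills code: the function names, the three-residue matrix excerpt (values taken from BLOSUM62), and the choice to skip gapped columns are all assumptions made for illustration.

```python
# Tiny excerpt of BLOSUM62 covering only A, R and N, for illustration.
# A real run would use the full 20x20 matrix.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("A", "R"): -1, ("A", "N"): -2,
    ("R", "R"): 5, ("R", "N"): 0,
    ("N", "N"): 6,
}


def pair_score(a, b, matrix):
    """Look up a symmetric substitution score, trying both key orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def sum_pairwise_scores(seq1, seq2, matrix, gap_char="-"):
    """Sum substitution scores over the columns of a pairwise alignment.

    Assumption: columns containing a gap contribute nothing to the total.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(seq1, seq2):
        if a == gap_char or b == gap_char:
            continue  # hypothetical gap policy: ignore gapped columns
        total += pair_score(a, b, matrix)
    return total
```

The inner loop is a dictionary lookup and an integer addition per column, which is the sort of tight loop where a compiled version tends to pay off.
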
I gave the library a silly name to keep it from being somehow mistaken for a Biopython replacement, or even a well-maintained, well-thought-out supplementary library. But in order to make the packages that depend on it easy_installable, I guess I'll need to upload it to PyPI at some point.

One function that I had in esbglib has already been independently invented and merged into Biopython as Bio.File.as_handle (mine was esbglib.sugar.maybe_open).
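
The idea behind both functions is to let calling code accept either a filename or an already-open file handle. The following is a standard-library-only sketch of that pattern, not the actual source of either function; the body shown for maybe_open here is an assumption about how such a helper might look.

```python
import contextlib


@contextlib.contextmanager
def maybe_open(path_or_handle, mode="r"):
    """Yield a file handle, opening path_or_handle only if it is a path.

    Sketch only: a string is treated as a filename and is opened and
    closed here; anything else is assumed to be a file-like object and
    is passed through untouched, so the caller stays responsible for
    closing it.
    """
    if isinstance(path_or_handle, str):
        with open(path_or_handle, mode) as handle:
            yield handle
    else:
        yield path_or_handle
```

A parser can then wrap its input in `with maybe_open(source) as handle:` and read from `handle` without caring which kind of argument it was given.
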

Check out BioFrills on GitHub.

BioCMA

This package is for working with MAPGAPS, CHAIN and mcBPPS alignments. These programs use an undocumented, bespoke multiple alignment format called CMA, which I assume means either "Consensus Multiple Alignment" or "CHAIN Multiple Alignment". There are a few tools by the same author for processing this format, but again, we have neither documentation nor source code for them. For years I've felt uneasy about relying on these quirky binaries to manage our most important sequence data... so, I did something about it. Ta-da.

Check out BioCMA on GitHub.
