The code I write in the lab lives in one big Mercurial repository called esgb; there's a shell script to install everything, including a bunch of scripts, sub-projects and a sprawling Python library called esbglib. Most of my Python programs depend on some functionality in esbglib, usually on Biopython, and sometimes on SciPy as well.
Having signed the Science Code Manifesto, duty calls for me to bundle some of the programs I've written with the next paper I'm working on, and so I've begun a mighty refactoring of esbglib to extract the general-purpose, reusable components into Python packages. At the moment it looks like I'll end up with two: biofrills and biocma.
BioFrills
This package contains general-purpose sequence analysis functions that I can't merge into Biopython, for one reason or another:
- Research-grade approaches (e.g. heuristic solutions for problems with no efficient exact solution, like estimating the effective number of independent sequences in an alignment)
- Cython versions of some functions for speed-up (currently just one, summing BLOSUM62 scores for a pairwise alignment)
- Python 2.6 or 2.7 only; includes recent PyPy
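To give a feel for the score-summing function mentioned in the list above, here is a rough pure-Python sketch. It is not the BioFrills code: the name `sum_pair_scores`, the choice to skip gap columns, and the tiny matrix excerpt are all assumptions made for illustration, and a real run would use the full BLOSUM62 table.

```python
# Hypothetical sketch, not the BioFrills implementation.
# Only a few BLOSUM62 entries are listed here; the real matrix
# covers every pair of the 20 standard amino acids.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,
    ("A", "R"): -1,
    ("R", "R"): 5,
    ("C", "C"): 9,
    ("W", "W"): 11,
}


def sum_pair_scores(seq_a, seq_b, matrix, gap_chars="-."):
    """Sum substitution scores over the columns of a pairwise alignment.

    Columns where either sequence has a gap are skipped (an assumption;
    other gap treatments are possible).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("Aligned sequences must have equal length")
    total = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue
        # The matrix is symmetric, so accept either key order.
        if (a, b) in matrix:
            total += matrix[(a, b)]
        else:
            total += matrix[(b, a)]
    return total


score = sum_pair_scores("ARC-W", "RRCAW", BLOSUM62_EXCERPT)
```

A tight loop over dictionary lookups like this is the sort of thing that tends to benefit from a Cython rewrite, which is presumably why it is the first function to get one.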
One function that I had in esbglib has already been independently invented and merged into Biopython as Bio.File.as_handle (mine was esbglib.sugar.maybe_open).
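The idea behind a helper like this is that a function can accept either a filename or an already-open file handle. The sketch below is a stdlib-only re-creation of that idea, not the original esbglib code and not Biopython's implementation; the body of `maybe_open` and the `count_fasta_records` example are assumptions for illustration.

```python
import contextlib
import io


@contextlib.contextmanager
def maybe_open(path_or_handle, mode="r"):
    """Yield an open file handle, given either a path or a handle.

    Hypothetical re-creation of the idea; not the original code.
    """
    if isinstance(path_or_handle, str):
        # A path: open it here and close it again on exit.
        with open(path_or_handle, mode) as handle:
            yield handle
    else:
        # Already a handle: pass it through; the caller closes it.
        yield path_or_handle


def count_fasta_records(source):
    """Count '>' header lines from a filename or an open handle."""
    with maybe_open(source) as handle:
        return sum(1 for line in handle if line.startswith(">"))


n_records = count_fasta_records(io.StringIO(">seq1\nACGT\n>seq2\nGGCC\n"))
```

With this pattern, parsing functions don't need separate code paths for paths and handles, and a handle passed in by the caller is left open for the caller to manage.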
Check out BioFrills on GitHub.
BioCMA
For working with MAPGAPS, CHAIN and mcBPPS alignments. These programs use an undocumented, bespoke multiple alignment format called CMA, which I assume means either "Consensus Multiple Alignment" or "CHAIN Multiple Alignment". There are a few tools by the same author for processing this format, but again, we have neither documentation nor source code for them. For years I've felt uneasy about relying on these quirky binaries to manage our most important sequence data... so, I did something about it. Ta-da.
Check out BioCMA on GitHub.