Wednesday, August 15, 2012

Code Harvest: The Refactoring

I've been hacking on bioinformatics code for four years, but so far the only work I've really made available to "the community" is in Biopython, mainly Bio.Phylo.

The code I write in the lab lives in one big Mercurial repository called esgb; there's a shell script to install everything, including a bunch of scripts, sub-projects and a sprawling Python library called esbglib. Most of my Python programs depend on some functionality in esbglib, usually on Biopython, and sometimes on SciPy as well.

Having signed the Science Code Manifesto, duty calls for me to bundle some of the programs I've written with the next paper I'm working on, and so I've begun a mighty refactoring of esbglib to extract the general-purpose, reusable components into Python packages. At the moment it looks like I'll end up with two: biofrills and biocma.

Wednesday, August 1, 2012

The well-organized data science project

Someone recently asked me about the basic setup a computational scientist needs to conduct research efficiently. I'm pretty satisfied with my current arrangement, which was inspired by the article "A Quick Guide to Organizing Computational Biology Projects".

My work is organized into individual "projects", each of which is supposed to become a paper at some point. I keep each project in Dropbox to ensure everything is synced and backed up remotely all the time -- no file left behind. I also use Mendeley, with a folder for each project's references. Mendeley can generate a project-specific BibTeX file from a folder.
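To keep that arrangement honest, a small check can flag projects that don't yet have a BibTeX file exported into them. This is only a sketch under assumptions of my own: that all projects sit as sub-folders of one root directory, and that each Mendeley export ends in `.bib`. The function name `projects_missing_bibtex` is made up for illustration.

```python
from pathlib import Path


def projects_missing_bibtex(projects_root):
    """Return the names of project folders that contain no .bib file.

    Assumes (hypothetically) that every sub-folder of `projects_root`
    is one project, and that the BibTeX export lives somewhere inside it.
    """
    root = Path(projects_root)
    missing = []
    # Only directories count as projects; loose files in the root are ignored.
    for project in sorted(p for p in root.iterdir() if p.is_dir()):
        # rglob searches the project folder recursively for BibTeX files.
        if not any(project.rglob("*.bib")):
            missing.append(project.name)
    return missing
```

Calling it on the projects root (for example, a `projects` folder inside Dropbox, if that is where they live) returns a sorted list of the project names that still need a reference export.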

A well-organized project might look like this: