Monday, January 16, 2012

Building an analysis: How to avoid repeating intermediate tasks in a computational pipeline

In my projects, I tend to start with a simple analysis of a limited dataset, then incrementally expand on it with more data and deeper analyses. This means each time I update the data (e.g. add another species' protein sequences) or add another step to the analysis pipeline, the whole pipeline must be re-run from the top -- even though only a small part of it actually needs to be re-run.

This is a common problem in bioinformatics:

How can we automate a pipeline like this, without running it all from scratch each time? This is the same problem faced when compiling large programs, and that particular case has been solved fairly well by build tools.


The make utility is normally used to determine which portions of a program need to be recompiled, based on the dependencies between them. It includes special features for managing C code, but since actions are specified as ordinary shell commands, you can just as well use it to express dependencies between arbitrary tasks.
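For instance, a minimal Makefile expressing two dependent pipeline steps might look like this (the filenames and tool invocations below are purely illustrative, not from any real pipeline):

```make
# Hypothetical targets; substitute your own files and commands.
all: tree.nwk

alignment.fasta: proteins.fasta
	muscle -in proteins.fasta -out alignment.fasta

tree.nwk: alignment.fasta
	fasttree alignment.fasta > tree.nwk
```

Running `make` rebuilds `tree.nwk` only if `alignment.fasta` is newer than it, and rebuilds `alignment.fasta` only if `proteins.fasta` has changed -- exactly the "re-run only what's needed" behavior we want.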

Giovanni Dall'Olio has posted some helpful instructional material:

(One of his links is broken -- here's Software Carpentry's current course on make.)

Rake, a Really Awesome Make

If your analysis is already organized as a pipeline of stand-alone scripts, Rake is a more-or-less ideal solution:

At some point in your project, you'll likely need to do some string processing; this is where make falls down. Ruby happens to be great for this task. Don't worry if you don't know Ruby -- the basic string methods in Ruby, Python and Perl are very similar, and regular expressions work roughly the same way. You can look up everything you need to know on the fly. (I did.)

Also, Rake's mini-language for specifying tasks is both concise and intuitive. There's a built-in facility for maintaining short descriptions of each step, which is enticing from a "lab notebook" perspective.
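A minimal Rakefile along the same lines (again, the filenames and shell commands are made up for illustration) shows both the `file` task syntax and the `desc` annotations:

```ruby
# Hypothetical Rakefile; filenames and commands are illustrative only.
desc "Align the protein sequences"
file "alignment.fasta" => "proteins.fasta" do
  sh "muscle -in proteins.fasta -out alignment.fasta"
end

desc "Build the tree from the alignment"
file "tree.nwk" => "alignment.fasta" do
  sh "fasttree alignment.fasta > tree.nwk"
end

task :default => "tree.nwk"
```

Running `rake -T` lists every task that has a description, which is what makes the `desc` lines useful as a running lab notebook.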

How about a Python solution?

Inspired by Rake, I wrote a small module of my own. Here's the code, plus a simple script to demonstrate:

This isn't just for the sake of Python evangelism; I have a bunch of special-purpose modules written in Python (for my lab) that I would like to use in an elaborate pipeline of my own. It's silly to put each of these steps in a separate, throwaway Python script that then gets called in a separate process by Rake. Instead, I can import the module and use its dependency-scheduling features directly in my own programs.
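As a rough sketch of the idea -- the `Task` class, `run` function, and all the names below are illustrative assumptions, not the module's actual API -- a minimal dependency scheduler of this kind could look like:

```python
class Task:
    """A named pipeline step: an action, its dependencies, optional cleanup."""
    registry = {}

    def __init__(self, name, action, depends=(), cleanup=None, doc=""):
        self.name = name            # label used to refer to this step
        self.action = action        # callable that performs the step
        self.depends = list(depends)
        self.cleanup = cleanup      # callable to remove intermediate files
        self.doc = doc              # short "lab notebook" description
        Task.registry[name] = self

def run(target, done=None):
    """Run `target` after its dependencies, executing each task only once."""
    done = set() if done is None else done
    task = Task.registry[target]
    for dep in task.depends:
        run(dep, done)
    if target not in done:
        task.action()
        done.add(target)
    return done

# Demo: three steps sharing a dependency; each action runs exactly once.
log = []
Task("clean", lambda: log.append("clean"), doc="Strip bad records")
Task("align", lambda: log.append("align"), depends=["clean"])
Task("tree", lambda: log.append("tree"), depends=["align", "clean"])
run("tree")
print(log)  # ['clean', 'align', 'tree']
```

A real version would also compare file timestamps, the way make does, to skip steps whose outputs are already up to date.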

Key differences from other build systems (e.g. SCons, waf, ruffus):
  • This module is not meant to be run from the command line -- it's meant to be imported and used within specific parts of your own code
  • The implementation of each processing step is separate from the dependency specification (although single-command tasks can still be defined inline with a lambda expression). Separate the algorithm from the data, I always say.
  • Cleanup is specified within the task, not in a separate area of the Rakefile/Makefile. This makes more sense for a project with heterogeneous processing steps & intermediate files.
To be clear, this module is not better than Rake or make for arranging a set of scripts that you've already written. I didn't even implement concurrent jobs (because most of the CPU-intensive steps I use are calls to programs that are already multicore-aware; though feel free to add that feature yourself if you'd like).
