This is a common problem in bioinformatics:
How can we automate a pipeline like this, without running it all from scratch each time? This is the same problem faced when compiling large programs, and that particular case has been solved fairly well by build tools.
Make
The make utility is normally used to determine which portions of a program need to be recompiled, and the dependencies between them; it includes special features for managing C code, but actions are specified in terms of shell commands. You can just as well use it to specify dependencies between arbitrary tasks.
Giovanni Dall'Olio has posted some helpful instructional material: http://bioinfoblog.it/?p=29
(One of his links is broken -- here's Software Carpentry's current course on make.)
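To make the idea concrete, here is a minimal Python sketch of the check make performs for each target: rebuild if the target file is missing or older than any of its dependencies. The name needs_rebuild is a hypothetical helper for illustration, not part of make or of any library, and real make does considerably more (pattern rules, phony targets and so on).

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency.

    Hypothetical helper; assumes every path in dependencies exists.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any dependency modified after the target makes the target stale
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

A pipeline step would then run its command only when needs_rebuild(output_file, input_files) is true, which is what lets a rerun skip the steps whose inputs have not changed.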
Rake, a Really Awesome Make
If your pipeline is already organized as a series of stand-alone scripts, Rake is a more-or-less ideal solution: http://rake.rubyforge.org/
At some point in your project, you'll likely need to do some string processing; this is where make falls down. Ruby happens to be great for this task. Don't worry if you don't know Ruby -- the basic string methods in Ruby, Python and Perl are very similar, and regular expressions work roughly the same way. You can look up everything you need to know on the fly. (I did.)
Also, Rake's mini-language for specifying tasks is both concise and intuitive. There's a built-in facility for maintaining short descriptions of each step, which is enticing from a "lab notebook" perspective.
How about a Python solution?
Inspired by Rake, I wrote a small module called tasks.py. Here's the code, plus a simple script to demonstrate:
This isn't just for the sake of Python evangelism; I have a bunch of special-purpose modules written in Python (for my lab) that I would like to use in an elaborate pipeline of my own. It's silly to put each of these steps in a separate, throwaway Python script that then gets called in a separate process by Rake. Instead, I can import tasks.py into a Python script and include the dependency-scheduling features in my own programs.
Key differences from other build systems (e.g. SCons, waf, ruffus):
- This module is not meant to be run from the command line -- it is meant to be imported and used in specific parts of your own code
- The implementation of each processing step is separate from the dependency specification (although single-command tasks can still be defined inline with a lambda expression). Separate the algorithm from the data, I always say.
- Cleanup is specified within the task, not in a separate area of the Rakefile/Makefile. This makes more sense for a project with heterogeneous processing steps & intermediate files.
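The three points above can be illustrated with a rough sketch. To be clear, this is not the actual tasks.py API: the Task class, its depends and cleans parameters, and the to_upper step are all hypothetical names, made up here to show the shape of the idea (importable from your own script, processing code kept apart from the dependency declaration, cleanup declared on the task itself).

```python
import os
import shutil
import tempfile


class Task:
    """Hypothetical task: a target file, an action that creates it,
    its prerequisites, and the files to delete on cleanup."""

    def __init__(self, target, action, depends=(), cleans=()):
        self.target = target
        self.action = action            # callable(target, input_paths)
        self.depends = list(depends)    # Task objects or plain file paths
        self.cleans = list(cleans)      # files this task is responsible for removing

    def inputs(self):
        """Paths of all prerequisites, whether tasks or plain files."""
        return [d.target if isinstance(d, Task) else d for d in self.depends]

    def build(self):
        """Build prerequisites first, then run the action only if the target
        is missing or older than an input. Returns True if the action ran."""
        for dep in self.depends:
            if isinstance(dep, Task):
                dep.build()
        paths = self.inputs()
        if os.path.exists(self.target):
            built_at = os.path.getmtime(self.target)
            if all(os.path.getmtime(p) <= built_at for p in paths):
                return False
        self.action(self.target, paths)
        return True

    def clean(self):
        """Remove this task's own files, then clean its prerequisites."""
        for path in self.cleans:
            if os.path.exists(path):
                os.remove(path)
        for dep in self.depends:
            if isinstance(dep, Task):
                dep.clean()


def to_upper(target, inputs):
    """A processing step, written with no knowledge of the dependency graph."""
    with open(inputs[0]) as src, open(target, "w") as out:
        out.write(src.read().upper())


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    raw = os.path.join(workdir, "raw.txt")
    with open(raw, "w") as handle:
        handle.write("acgt\n")

    upper_path = os.path.join(workdir, "upper.txt")
    upper = Task(upper_path, to_upper, depends=[raw], cleans=[upper_path])
    # A single-command step defined inline with a lambda
    final = Task(os.path.join(workdir, "final.txt"),
                 lambda target, inputs: shutil.copy(inputs[0], target),
                 depends=[upper])

    final.build()   # first call runs both steps
    final.build()   # second call finds nothing stale and does no work
    final.clean()   # removes the intermediate upper.txt, keeps final.txt
```

Because the dependency check lives in ordinary objects, a larger program can call build() on just the part of the pipeline it needs, without spawning a separate process per step.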