Monday, January 16, 2012

Building an analysis: How to avoid repeating intermediate tasks in a computational pipeline

In my projects, I tend to start with a simple analysis of a limited dataset, then incrementally expand on it with more data and deeper analyses. This means each time I update the data (e.g. add another species' protein sequences) or add another step to the analysis pipeline, everything must be re-run -- but only a small part of the pipeline actually needs to be re-run.

This is a common problem in bioinformatics:
http://biostar.stackexchange.com/questions/79/how-to-organize-a-pipeline-of-small-scripts-together

How can we automate a pipeline like this, without running it all from scratch each time? This is the same problem faced when compiling large programs, and that particular case has been solved fairly well by build tools.