Wednesday, August 1, 2012

The well-organized data science project

Someone recently asked me about the basic setup a computational scientist needs to conduct research efficiently. I'm pretty satisfied with my current arrangement, which was inspired by the article "A Quick Guide to Organizing Computational Biology Projects".

My work is organized into individual "projects" which are each supposed to become papers at some point. I keep each project in Dropbox to ensure everything is synced and backed up remotely all the time -- no file left behind. I also use Mendeley, with a folder for each project's references. Mendeley can generate a project-specific BibTeX file from a folder.

A well-organized project might look like this:

  • projects/$project_name/
    • README -- notes on current progress, observations, to-do items, vague ideas for the future -- a starting point for "what was I doing here?" A very sloppy lab notebook by itself, but combined with the Mercurial revision history, it captures the essentials.
    • data/
      • Files sent from collaborators, downloaded and processed data sets. Big databases are kept on the lab's file server, not here.
    • src/
      • The main code base, if there's a core software component to the project. Throwaway scripts will also be littered throughout the work/ directory tree.
    • work/
      • For each new "idea", create a new directory under here, then hack away, freely generating intermediate files. Copying and modifying files is common; only a small portion of this is eventually shared or used in the manuscript.
      • If things look good and the procedure is worth remembering (i.e. redoing with updated data later), organize the idea-related scripts and Bash history into a Makefile.
    • results/
      • Interesting outputs, preliminary figures and tables that have been or will be sent to collaborators or used in a presentation (e.g. lab weekly update meetings)
    • manuscript/
    • Top level: the manuscript in progress -- in my case usually LaTeX and BibTeX files and a Makefile; also potentially .doc files sent by collaborators, etc. Eventually, also a cover letter for the first submission.
      • figures/
        • The "final" images that will be used in the manuscript. These may be stitched together from several components, e.g. several plots or diagrams generated from programs.
      • response/
        • Reviewer comments and the response(s) to reviewers being drafted
      • proofs/
        • PDF copies of the manuscript exactly as it was submitted to the journal, at each stage of the publication process.
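The layout above can be scaffolded in a few shell commands. A minimal sketch, where "example-project" is a placeholder name:

```shell
#!/bin/sh
# Create the skeleton of a new project as described above.
# "example-project" is a placeholder -- substitute your own name.
proj="projects/example-project"

mkdir -p "$proj/data" "$proj/src" "$proj/work" "$proj/results" \
         "$proj/manuscript/figures" "$proj/manuscript/response" \
         "$proj/manuscript/proofs"

# Start the running lab-notebook file at the project root.
touch "$proj/README"
```

Dropping this into a small `mkproject` script makes each new project start from the same shape.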

I also start a Mercurial repo at the base of each project for local snapshotting and logging. For the manuscript, it turns out that diffs of LaTeX documents are often not that helpful after an editing frenzy because of automatic line wrapping, but it's better than nothing. (Git's index is too much of a hassle for this kind of work -- the benefit of a "clean" patch series is outweighed by the distraction of explicitly adding files before each commit. If I didn't already have Dropbox to sync everything, Bzr's bound repos would be a decent alternative.)


Academic research is not like engineering, where a team sets their sights on a certain goal (perhaps refined over time), divides up the component tasks, and marches onward. We don't know that a project will "work" (in science: improve on previous knowledge or methods) even if we execute the idea the "right" way.

It's more like investigative journalism, where you latch onto a potential story, follow every promising lead, and seek out multiple lines of evidence to support your claims.

Because of this, I've adopted a development "approach" that could be called Results-Driven Development, or maybe Failure-Driven Development. The idea is that I need fast results to prove the value of pursuing a line of inquiry, i.e. whatever is going on in a subdirectory of work/. It's only worthwhile to invest in proper engineering practices once it looks like there's a "lead". This differs from typical software engineering projects, where you usually know from the beginning what you're trying to achieve. In research, the first steps in a new direction often quickly force a rethinking of the problem, and it's necessary to drop the current work and take a different approach immediately. In startups, "fail fast" means weeks or months; in research it can mean minutes or hours.

The takeaway is that technical debt is OK during the first few hours of pursuing an idea. If it turns out I've hacked up something reusable, I copy the important bits into my main utility library or script collection and manage them in a reasonable way from then on. But normally the result is a slightly gnarly Bash history and a small set of one-off Bash and Python scripts that are completely specific to that inquiry, and should just stay there.

Also, note that my style of research might be completely different from yours. My lab is fairly small, and each project starts off solo or with a few data sets from a remote collaborator. If you're providing bioinformatics services for a group driven by high-throughput experiments, your infrastructure needs are much different from mine. (But I still recommend Mendeley and version control.)

Further reading

Michael Barton has written an excellent series called Decomplected Workflows on this topic. My favorite so far is the post on Makefiles, where he shows how a realistic workflow can be managed with Make, something I've wrestled with in the past. I'll note that where he advises switching from Rake to Make, two changes are happening at once: (1) switching from Rake to Make, which is technically much inferior but far more widespread, and (2) writing the workflow logic in standalone scripts rather than Ruby functions embedded in the Rakefile. If you use your Rakefile only to control the workflow and call out to standalone scripts from Rake (which is very easy), then I think Rake is still the better choice.
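The same scripts-plus-orchestration pattern works in plain Make, too. A minimal, runnable sketch of a Makefile that holds only the dependency graph while the logic lives in a standalone script -- all filenames here are hypothetical:

```shell
#!/bin/sh
# Sketch: a workflow Makefile that only orchestrates standalone scripts.
mkdir -p demo

# A stand-in "analysis step" -- in real use this would be its own
# Bash or Python script checked into src/.
printf '%s\n' '#!/bin/sh' 'tr a-z A-Z < "$1" > "$2"' > demo/clean.sh
chmod +x demo/clean.sh

echo 'raw input' > demo/input.txt

# The Makefile records dependencies; no analysis logic is embedded in it.
{
  printf 'all: output.txt\n\n'
  printf 'output.txt: input.txt clean.sh\n'
  printf '\t./clean.sh input.txt output.txt\n'
} > demo/Makefile

make -C demo
```

Because `output.txt` depends on both the data and the script, editing either one and rerunning `make` rebuilds only what changed -- the property that makes a remembered workflow cheap to redo with updated data.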


Admin said...


Thanks for taking the time to write this up; I've found it to be extremely helpful.



Peter Zsoldos said...


Thanks for writing this up - I was surprised to see how similar it is to some concepts I came across in the software field, such as lean startups/companies and experimental financial-trading strategies, to name a few. Just hack together a minimum viable feature, throw it out into the wild, see how it performs, and only if it has proven itself, stabilize it (code it up properly).

However, I cannot help but wonder when the first book will be published about the parallels between investigative journalism, research, and software startups :)
