Friday, October 3, 2014

On the awesomeness of the BOSC/OpenBio Codefest 2014

This summer I was in Boston for a bundle of conferences: Intelligent Systems in Molecular Biology (ISMB), the Bioinformatics Open Source Conference (BOSC) before that, and a very special Open Bioinformatics Codefest before all of it.

The Codefest was novel, so I'm writing about the highlights here.

A Galaxy Tool for CNVkit

I spent a good portion of the last year developing CNVkit, a software package for calling copy number variants from targeted DNA sequencing. At the Codefest I wanted to work on making CNVkit compatible with existing bioinformatics pipelines and workflow managements systems, in particular Galaxy and bcbio-nextgen. (I had no prior development experience with either platform.)

Galaxy is a popular open-source framework for building reusable/repeatable bioinformatic workflows through a web browser interface. In particular, existing software can be wrapped for the Galaxy platform and distributed through the Galaxy Tool Shed.  With help from Peter Cock, Matt Shirley and other members of the Galaxy team, I managed to build and successfully run a Galaxy Tool wrapping CNVkit. It's currently visible in the Test Tool Shed and in the main CNVkit source code repository on GitHub. I still need to finalize the Tool, and sometime after that it will hopefully be accepted into the main Tool Shed, making it easily available to all Galaxy users.

bcbio-nextgen

Brad Chapman, in addition to his involvement in developing Biopython and Cloud BioLinux and organizing the Codefest itself, is currently leading the development of bcbio-nextgen, a framework to implement and evaluate best-practice pipelines for high-throughput sequencing analyses. Recent work on this project considered structural variants; next steps will consider cancer samples and targeted or whole-exome sequencing, where CNVkit could be a useful component of the analysis pipeline.

I didn't produce any code for bcbio-nextgen at the Codefest, but I did get a chance to talk to Brad about it a little, and work is now progressing.  A goal of the bcbio-nextgen project is to produce a pipeline that not only works, but works as well as possible. To achieve this, we'll need to develop good benchmarks for evaluating structural variant and copy number variant calls on cancer samples, something of an open problem at the moment.

Arvados and Curoverse

Arvados is a robust, open-source platform for conducting large-scale genomic analyses. The project originated in George Church's group at Harvard and the Personal Genome Project. Curoverse is a startup that has built a user-friendly workflow management system (similar to DNAnexus and Appistry, conceptually) on top of Arvados. The Curoverse front-end can be installed and run locally, and jobs can also be seamlessly dispatched to distributed computing services (like the Amazon cloud); some of Galaxy and bcbio-nextgen run on Curoverse already.

Curoverse kindly sponsored the Codefest, and a few of the Arvados/Curoverse folks were in attendance and shared some of their work (and stickers, and free trial accounts and compute time) with the rest of us. The Codefest was also blessed with Amazon Web Services bucks, which we could use toward running Cloud BioLinux or Curoverse.  Anyway, Curoverse looks cool, and worth keeping an eye on.

Biopythoneers of the world unite

The core developers of Biopython are distributed globally, and BOSC is a fairly unique opportunity for any of us to meet in person. The Codefest provided a nice setting for Peter Cock, Wibowo "Bow" Arindrarto and I to get together, stake out a table and hack on Biopython for a couple days.

We started with a survey of the issue tracker and addressed some long-standing bugs. Bow then moved on to explore an idea for splitting the Biopython distribution into smaller, separately installable modules, while I cleaned some dark corners of Bio.Phylo and enabled automatic testing of the code examples in the Phylo chapter of the Biopython Tutorial and Cookbook.  Peter worked on his new BGZIP module and SAM/BAM branch in Biopython, and at some point stated that Biopython will have native (pure-Python) SAM/BAM support soon.

The scene

We met up at hack/reduce, a hackspace next to MIT and Kendall Square -- a fairly unassuming low-rise brick building, converted from an industrial space and retrofitted with good Wi-Fi, coffee urns and other essentials.

The environment inside was friendly and helpful.  Note the distinction between "codefest" and "hackathon": This one was collaborative, not competitive, and welcomed both newcomers and veterans of open-source projects.  In addition to the Biopythoneers, Galaxy was well represented, with John Chilton and Michael Crusoe conveniently within hollering distance of Team Biopython. Groups from Arvados/Curoverse, Cloud BioLinux, and individuals who are involved in a variety of other projects were there, too. Some people just came to meet up and network. Chalalai Chaihirunkarn from Carnegie Mellon University was there to study the dynamics of the Codefest itself, and she will report on it at some point.

At BOSC, kicking off the second day of the conference, Brad summarized our accomplishments at the Codefest:
I recommend attending the next OpenBio Codefest to anyone who is interested in it. Even if you aren't currently involved in an open-source project, BOSC and the Codefest are unique, useful opportunities for personal education and professional development. In any case it's an interesting and fun experience.

2 comments:

Tiago said...

Where there any ideas on the issue of modularization?

etal said...

Yes! Here is Bow's proof-of-concept branch for a modularized Biopython:
https://github.com/bow/poc_biopy

This includes a clever mechanism for importing separate biopython-related packages as if they were part of the same top-level package.