Tuesday, November 4, 2014

Preview and preprint: CNVkit, copy number detection for targeted sequencing

I've posted a preprint of the CNVkit manuscript on bioRxiv. If you think this software or method might suit your needs, please take a look and let me know what you think of it!

What is CNVkit?

CNVkit is a software toolkit for detecting and visualizing germline copy number variants and somatic copy number alterations in targeted or whole-exome DNA sequencing data. (Source code | Documentation)

The method implemented in CNVkit takes advantage of the sparse, nonspecifically captured off-target reads present in hybrid capture sequencing output to supplement on-target read depths. The program also uses a series of normalizations and bias corrections so it can be used with or without a normal-sample copy number reference to accurately call CNVs. The overall resolution and copy ratio values are very close to those obtained with 180K array CGH.
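
For a rough feel of the core quantity involved, here is a minimal Python sketch that turns per-bin read depths into median-centered log2 copy ratios against a reference. It is not CNVkit's actual code: the function name, the pseudocount, and the plain median centering are assumptions made for illustration, and the bias corrections mentioned above are left out. The same calculation could in principle be applied to on-target and off-target bins alike.

```python
import math
from statistics import median


def log2_ratios(sample_depths, reference_depths, pseudocount=0.5):
    """Return median-centered log2(sample / reference) for each bin.

    Hypothetical helper for illustration only; not part of CNVkit.
    The pseudocount keeps bins with zero depth from causing a
    division by zero or a log of zero.
    """
    if len(sample_depths) != len(reference_depths):
        raise ValueError("sample and reference must cover the same bins")
    raw = [
        math.log2((s + pseudocount) / (r + pseudocount))
        for s, r in zip(sample_depths, reference_depths)
    ]
    if not raw:
        return []
    # Centering on the median assumes most bins are copy-number neutral.
    center = median(raw)
    return [value - center for value in raw]
```

With made-up depths, a bin at half the reference depth comes out near -1 (a single-copy loss in a diploid genome) and a bin at twice the reference depth comes out near +1.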

We have used CNVkit at UCSF to assess clinical samples for several research projects over the past year.

Putting it in your pipeline

See the Quick Start page for basic usage. The software package is modular so, in addition to the simple "batch" calling style, the underlying commands can be run directly to support your workflow.

I've attempted to make CNVkit compatible with other software and easy to integrate into sequencing analysis pipelines. The following are currently supported or in development:
  • bcbio-nextgen -- in progress
  • Galaxy -- a basic wrapper is in the development Tool Shed
  • THetA2 -- CNVkit segmentation output can be used directly as input to THetA
  • Integrative Genomics Viewer -- export segments as SEG, then load in IGV to view tracks as a heatmap
  • BioDiscovery Nexus Copy Number -- export files to the Nexus "basic" format
  • Java TreeView -- export CDT or .jtv tabular files, then load in JTV for a microarray-like viewing experience
If you would like to see CNVkit play nicely with another existing program, and/or support another standard output format, or just want some help getting set up, please let me know on SeqAnswers.
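
As an illustration of the SEG export mentioned in the list above, here is a minimal Python sketch that writes segments in the tab-delimited layout IGV reads for SEG files: a header row followed by sample ID, chromosome, start, end, number of markers, and segment mean. The function name and the example values in the note below are hypothetical, and this is not how CNVkit itself performs the export.

```python
import csv

# Conventional SEG header; IGV reads the six columns by position.
SEG_COLUMNS = ["ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean"]


def write_seg(segments, handle):
    """Write (sample_id, chrom, start, end, n_markers, mean) tuples as SEG.

    Hypothetical helper for illustration only.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(SEG_COLUMNS)
    for segment in segments:
        if len(segment) != len(SEG_COLUMNS):
            raise ValueError("each segment needs exactly six fields")
        writer.writerow(segment)
```

For example, calling `write_seg([("Sample1", "chr1", 1000, 50000, 40, -0.85)], open("example.seg", "w", newline=""))` would produce a two-line file that can be loaded as a track in IGV.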

Friday, October 3, 2014

On the awesomeness of the BOSC/OpenBio Codefest 2014

This summer I was in Boston for a bundle of conferences: Intelligent Systems in Molecular Biology (ISMB), the Bioinformatics Open Source Conference (BOSC) before that, and a very special Open Bioinformatics Codefest before all of it.

The Codefest was novel, so I'm writing about the highlights here.

A Galaxy Tool for CNVkit

I spent a good portion of the last year developing CNVkit, a software package for calling copy number variants from targeted DNA sequencing. At the Codefest I wanted to work on making CNVkit compatible with existing bioinformatics pipelines and workflow management systems, in particular Galaxy and bcbio-nextgen. (I had no prior development experience with either platform.)

Galaxy is a popular open-source framework for building reusable/repeatable bioinformatic workflows through a web browser interface. In particular, existing software can be wrapped for the Galaxy platform and distributed through the Galaxy Tool Shed.  With help from Peter Cock, Matt Shirley and other members of the Galaxy team, I managed to build and successfully run a Galaxy Tool wrapping CNVkit. It's currently visible in the Test Tool Shed and in the main CNVkit source code repository on GitHub. I still need to finalize the Tool, and sometime after that it will hopefully be accepted into the main Tool Shed, making it easily available to all Galaxy users.


Brad Chapman, in addition to his involvement in developing Biopython and Cloud BioLinux and organizing the Codefest itself, is currently leading the development of bcbio-nextgen, a framework to implement and evaluate best-practice pipelines for high-throughput sequencing analyses. Recent work on this project considered structural variants; next steps will consider cancer samples and targeted or whole-exome sequencing, where CNVkit could be a useful component of the analysis pipeline.

I didn't produce any code for bcbio-nextgen at the Codefest, but I did get a chance to talk to Brad about it a little, and work is now progressing.  A goal of the bcbio-nextgen project is to produce a pipeline that not only works, but works as well as possible. To achieve this, we'll need to develop good benchmarks for evaluating structural variant and copy number variant calls on cancer samples, something of an open problem at the moment.

Arvados and Curoverse

Arvados is a robust, open-source platform for conducting large-scale genomic analyses. The project originated in George Church's group at Harvard and the Personal Genome Project. Curoverse is a startup that has built a user-friendly workflow management system (similar to DNAnexus and Appistry, conceptually) on top of Arvados. The Curoverse front-end can be installed and run locally, and jobs can also be seamlessly dispatched to distributed computing services (like the Amazon cloud); some of Galaxy and bcbio-nextgen run on Curoverse already.

Curoverse kindly sponsored the Codefest, and a few of the Arvados/Curoverse folks were in attendance and shared some of their work (and stickers, and free trial accounts and compute time) with the rest of us. The Codefest was also blessed with Amazon Web Services bucks, which we could use toward running Cloud BioLinux or Curoverse.  Anyway, Curoverse looks cool, and worth keeping an eye on.

Biopythoneers of the world unite

The core developers of Biopython are distributed globally, and BOSC is a fairly unique opportunity for any of us to meet in person. The Codefest provided a nice setting for Peter Cock, Wibowo "Bow" Arindrarto and me to get together, stake out a table and hack on Biopython for a couple days.

We started with a survey of the issue tracker and addressed some long-standing bugs. Bow then moved on to explore an idea for splitting the Biopython distribution into smaller, separately installable modules, while I cleaned some dark corners of Bio.Phylo and enabled automatic testing of the code examples in the Phylo chapter of the Biopython Tutorial and Cookbook.  Peter worked on his new BGZIP module and SAM/BAM branch in Biopython, and at some point stated that Biopython will have native (pure-Python) SAM/BAM support soon.

The scene

We met up at hack/reduce, a hackspace next to MIT and Kendall Square -- a fairly unassuming low-rise brick building, converted from an industrial space and retrofitted with good Wi-Fi, coffee urns and other essentials.

The environment inside was friendly and helpful.  Note the distinction between "codefest" and "hackathon": This one was collaborative, not competitive, and welcomed both newcomers and veterans of open-source projects.  In addition to the Biopythoneers, Galaxy was well represented, with John Chilton and Michael Crusoe conveniently within hollering distance of Team Biopython. Groups from Arvados/Curoverse, Cloud BioLinux, and individuals who are involved in a variety of other projects were there, too. Some people just came to meet up and network. Chalalai Chaihirunkarn from Carnegie Mellon University was there to study the dynamics of the Codefest itself, and she will report on it at some point.

At BOSC, kicking off the second day of the conference, Brad summarized our accomplishments at the Codefest.

I recommend attending the next OpenBio Codefest to anyone who is interested in it. Even if you aren't currently involved in an open-source project, BOSC and the Codefest are unique, useful opportunities for personal education and professional development. In any case it's an interesting and fun experience.

Saturday, June 28, 2014

Tomorrowland never dies: A clever rapid transit system sees life in Tel Aviv

A skyTran demo track will be installed in Tel Aviv, Israel during the next year, and a larger commercial installation is scheduled for 2016. If this system works well, it will put every other city's mass transit options to shame.

The skyTran concept should be easy to grasp if you've been to Disneyland, specifically Tomorrowland:
  1. Start with the monorail and split it into small autonomous cars seating two people (like the now-defunct PeopleMover, if you remember it).
  2. Flip it upside down, hanging from the track rather than on top of it, so less can go wrong (a hint of SkyWay, now).
  3. Modernize the design: aerodynamic shape, a maglev track, and sensors to guide and space the cars and allow them to brake quickly if there's a problem ahead.
The cars are shunted off from the main track to allow passengers to get on or off without interrupting the flow of traffic, like a freeway or express train. The tight spacing, automatic routing, and lack of stops along the way allow the system to carry large numbers of people to their destinations much more efficiently than a standard highway or even a train.

* * *

I found Doug Malewicki's website for skyTran in late 2005, fresh out of college. The idea looked solid and I was enthusiastic about it, but the website didn't do justice to the engineering team behind it. I volunteered to revamp the website, and met with Doug a couple of times to go over ideas. I put together a simple static site as a demo, and after much exertion, got the CSS to look all right in both Firefox and IE (because that's what people used at the time, grandkids) on a variety of screen sizes and default font sizes. But for our next meeting, Doug printed out the webpages, and, due to some combination of hardware and software that I will never know, the fonts tripled in size and the layout turned to garbage. Doug was displeased; I was helpless. In conclusion, I don't have what it takes to do front-end web development. It was probably this experience more than any other single event that motivated me to go to grad school.

Anyway, I'm delighted to see that skyTran is still moving forward.

When I first saw the skyTran design in 2005, smartphones were not popular yet, so instead of using a smartphone app to summon a car and pay for it, passengers would carry a keychain-sized RFID dongle for payment (e.g. FasTrak), and simply queue up at a raised platform to catch a car. The design I saw also indicated a maximum speed of 150 mph, making it suitable for most medium-distance travel in a metro area, and probably competitive with California's long-delayed high-speed rail, but not as fast as airlines for long-distance travel.

Obviously, this would be great in a spread-out US city like Los Angeles or San Jose, but implementing it would be politically impossible. At the time Doug told me about it, he said he was going to pursue privately funded initiatives, and he had a team in Seattle building a proper prototype. I thought a good candidate to try the technology would be a city-state like Singapore, with a strong centralized authority and a keen interest in efficient, scalable civic development. It seems Tel Aviv has the same will and ability to develop new infrastructure. So, is there any good reason California can't do the same?