Tuesday, November 4, 2014

Preview and preprint: CNVkit, copy number detection for targeted sequencing

I've posted a preprint of the CNVkit manuscript on bioRxiv. If you think this software or method might suit your needs, please take a look and let me know what you think of it!

What is CNVkit?

CNVkit is a software toolkit for detecting and visualizing germline copy number variants and somatic copy number alterations in targeted or whole-exome DNA sequencing data. (Source code | Documentation)

The method implemented in CNVkit takes advantage of the sparse, nonspecifically captured off-target reads present in hybrid capture sequencing output to supplement on-target read depths. The program also uses a series of normalizations and bias corrections so it can be used with or without a normal-sample copy number reference to accurately call CNVs. The overall resolution and copy ratio values are very close to those obtained with 180K array CGH.
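The core quantity here is the log2 copy ratio of observed read depth to a reference. As a toy illustration only (this is not CNVkit's implementation, which adds binning, GC and other bias corrections, and segmentation), the basic calculation looks like:

```python
import math

def log2_ratio(sample_depth, reference_depth, min_depth=0.01):
    """Log2 copy ratio for one bin, clipping tiny depths to avoid log(0)."""
    s = max(sample_depth, min_depth)
    r = max(reference_depth, min_depth)
    return math.log2(s / r)

# A heterozygous deletion (1 copy instead of the usual 2) halves the depth,
# giving a log2 ratio near -1; a normal diploid region sits near 0.
print(round(log2_ratio(25.0, 50.0), 3))   # -1.0
print(round(log2_ratio(50.0, 50.0), 3))   # 0.0
```

The `min_depth` floor stands in for the more careful handling of low-coverage bins that a real caller needs.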

We have used CNVkit at UCSF to assess clinical samples for several research projects over the past year.

Putting it in your pipeline

See the Quick Start page for basic usage. The software package is modular so, in addition to the simple "batch" calling style, the underlying commands can be run directly to support your workflow.

I've attempted to make CNVkit compatible with other software and easy to integrate into sequencing analysis pipelines. The following are currently supported or in development:
  • bcbio-nextgen -- in progress
  • Galaxy -- a basic wrapper is in the development Tool Shed
  • THetA2 -- CNVkit segmentation output can be used directly as input to THetA
  • Integrative Genomics Viewer -- export segments as SEG, then load in IGV to view tracks as a heatmap
  • BioDiscovery Nexus Copy Number -- export files to the Nexus "basic" format
  • Java TreeView -- export CDT or .jtv tabular files, then load in JTV for a microarray-like viewing experience
If you would like to see CNVkit work smoothly with another existing program, support another standard output format, or just want some help getting set up, please let me know on SeqAnswers.
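For a sense of what the SEG export produces for IGV: SEG is just a tab-delimited table. The sketch below writes one by hand; the column names follow IGV's convention for .seg files (an assumption on my part; CNVkit's own export command takes care of this for you).

```python
import csv
import io

def write_seg(sample_id, segments, handle):
    """Write segments as a minimal SEG-style tab-delimited table.

    Each segment is a tuple: (chromosome, start, end, log2_ratio).
    Column names follow IGV's convention for .seg files.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "chrom", "loc.start", "loc.end", "seg.mean"])
    for chrom, start, end, log2 in segments:
        writer.writerow([sample_id, chrom, start, end, log2])

# Write two hypothetical segments to an in-memory buffer
buf = io.StringIO()
write_seg("Sample1",
          [("chr1", 1, 2000000, -0.25), ("chr2", 1, 1500000, 0.1)],
          buf)
print(buf.getvalue())
```

Loading a file like this in IGV renders each sample as a row in a heatmap track, colored by `seg.mean`.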

Friday, October 3, 2014

On the awesomeness of the BOSC/OpenBio Codefest 2014

This summer I was in Boston for a bundle of conferences: Intelligent Systems in Molecular Biology (ISMB), the Bioinformatics Open Source Conference (BOSC) before that, and a very special Open Bioinformatics Codefest before all of it.

The Codefest was novel, so I'm writing about the highlights here.

A Galaxy Tool for CNVkit

I spent a good portion of the last year developing CNVkit, a software package for calling copy number variants from targeted DNA sequencing. At the Codefest I wanted to work on making CNVkit compatible with existing bioinformatics pipelines and workflow management systems, in particular Galaxy and bcbio-nextgen. (I had no prior development experience with either platform.)

Galaxy is a popular open-source framework for building reusable/repeatable bioinformatic workflows through a web browser interface. In particular, existing software can be wrapped for the Galaxy platform and distributed through the Galaxy Tool Shed.  With help from Peter Cock, Matt Shirley and other members of the Galaxy team, I managed to build and successfully run a Galaxy Tool wrapping CNVkit. It's currently visible in the Test Tool Shed and in the main CNVkit source code repository on GitHub. I still need to finalize the Tool, and sometime after that it will hopefully be accepted into the main Tool Shed, making it easily available to all Galaxy users.

bcbio-nextgen

Brad Chapman, in addition to his involvement in developing Biopython and Cloud BioLinux and organizing the Codefest itself, currently leads the development of bcbio-nextgen, a framework for implementing and evaluating best-practice pipelines for high-throughput sequencing analyses. Recent work on the project has addressed structural variants; next steps will cover cancer samples and targeted or whole-exome sequencing, where CNVkit could be a useful component of the analysis pipeline.

I didn't produce any code for bcbio-nextgen at the Codefest, but I did get a chance to talk to Brad about it a little, and work is now progressing.  A goal of the bcbio-nextgen project is to produce a pipeline that not only works, but works as well as possible. To achieve this, we'll need to develop good benchmarks for evaluating structural variant and copy number variant calls on cancer samples, something of an open problem at the moment.

Arvados and Curoverse

Arvados is a robust, open-source platform for conducting large-scale genomic analyses. The project originated in George Church's group at Harvard and the Personal Genome Project. Curoverse is a startup that has built a user-friendly workflow management system (similar to DNAnexus and Appistry, conceptually) on top of Arvados. The Curoverse front-end can be installed and run locally, and jobs can also be seamlessly dispatched to distributed computing services (like the Amazon cloud); parts of Galaxy and bcbio-nextgen already run on Curoverse.

Curoverse kindly sponsored the Codefest, and a few of the Arvados/Curoverse folks were in attendance and shared some of their work (and stickers, and free trial accounts and compute time) with the rest of us. The Codefest was also blessed with Amazon Web Services bucks, which we could use toward running Cloud BioLinux or Curoverse.  Anyway, Curoverse looks cool, and worth keeping an eye on.

Biopythoneers of the world unite

The core developers of Biopython are distributed globally, and BOSC is a fairly unique opportunity for any of us to meet in person. The Codefest provided a nice setting for Peter Cock, Wibowo "Bow" Arindrarto and me to get together, stake out a table and hack on Biopython for a couple of days.

We started with a survey of the issue tracker and addressed some long-standing bugs. Bow then moved on to explore an idea for splitting the Biopython distribution into smaller, separately installable modules, while I cleaned some dark corners of Bio.Phylo and enabled automatic testing of the code examples in the Phylo chapter of the Biopython Tutorial and Cookbook.  Peter worked on his new BGZIP module and SAM/BAM branch in Biopython, and at some point stated that Biopython will have native (pure-Python) SAM/BAM support soon.

The scene

We met up at hack/reduce, a hackspace next to MIT and Kendall Square -- a fairly unassuming low-rise brick building, converted from an industrial space and retrofitted with good Wi-Fi, coffee urns and other essentials.

The environment inside was friendly and helpful.  Note the distinction between "codefest" and "hackathon": This one was collaborative, not competitive, and welcomed both newcomers and veterans of open-source projects.  In addition to the Biopythoneers, Galaxy was well represented, with John Chilton and Michael Crusoe conveniently within hollering distance of Team Biopython. Groups from Arvados/Curoverse, Cloud BioLinux, and individuals who are involved in a variety of other projects were there, too. Some people just came to meet up and network. Chalalai Chaihirunkarn from Carnegie Mellon University was there to study the dynamics of the Codefest itself, and she will report on it at some point.

At BOSC, kicking off the second day of the conference, Brad summarized our accomplishments at the Codefest.
I recommend attending the next OpenBio Codefest to anyone who is interested in it. Even if you aren't currently involved in an open-source project, BOSC and the Codefest are unique, useful opportunities for personal education and professional development. In any case it's an interesting and fun experience.

Saturday, June 28, 2014

Tomorrowland never dies: A clever rapid transit system sees life in Tel Aviv

A skyTran demo track will be installed in Tel Aviv, Israel during the next year, and a larger commercial installation is scheduled for 2016. If this system works well, it will put every other city's mass transit options to shame.

The skyTran concept should be easy to grasp if you've been to Disneyland, specifically Tomorrowland:
  1. Start with the monorail and split it into small autonomous cars seating two people (like the now-defunct PeopleMover, if you remember it).
  2. Flip it upside down, hanging from the track rather than on top of it, so less can go wrong (a hint of SkyWay, now).
  3. Modernize the design: aerodynamic shape, a maglev track, and sensors to guide and space the cars and allow them to brake quickly if there's a problem ahead.
The cars are shunted off from the main track to allow passengers to get on or off without interrupting the flow of traffic  — like a freeway or express train. The tight spacing, automatic routing, and lack of stops along the way allow the system to carry large numbers of people to their destinations much more efficiently than a standard highway or even a train.

* * *

I found Doug Malewicki's website for skyTran in late 2005, fresh out of college. The idea looked solid and I was enthusiastic about it, but the website didn't do justice to the engineering team behind it. I volunteered to revamp the website, and met with Doug a couple of times to go over ideas. I put together a simple static site as a demo, and after much exertion, got the CSS to look all right in both Firefox and IE (because that's what people used at the time, grandkids) on a variety of screen sizes and default font sizes. But for our next meeting, Doug printed out the webpages, and — due to some combination of hardware and software that I will never know — the fonts tripled in size and the layout turned to garbage. Doug was displeased; I was helpless. In conclusion, I don't have what it takes to do front-end web development. It was probably this experience more than any other single event that motivated me to go to grad school.

Anyway, I'm delighted to see that skyTran is still moving forward.

When I first saw the skyTran design in 2005, smartphones were not popular yet, so instead of using a smartphone app to summon a car and pay for it, passengers would carry a keychain-sized RFID dongle for payment (e.g. FasTrak), and simply queue up at a raised platform to catch a car. The design I saw also indicated a maximum speed of 150 mph, making it suitable for most medium-distance travel in a metro area, and probably competitive with California's long-delayed high-speed rail, but not as fast as airlines for long-distance travel.

Obviously, this would be great in a spread-out US city like Los Angeles or San Jose, but implementing it would be politically impossible. At the time Doug told me about it, he said he was going to pursue privately funded initiatives, and he had a team in Seattle building a proper prototype. I thought a good candidate to try the technology would be a city-state like Singapore, with a strong centralized authority and a keen interest in efficient, scalable civic development. It seems Tel Aviv has the same will and ability to develop new infrastructure. So, is there any good reason California can't do the same?

Tuesday, December 24, 2013

The blinders of peer review

Does pre-publication peer review isolate a finding from the field during the process? Sure, and that's partly the point of it, but it can lead to some inconveniences when two related papers from separate groups undergo peer review at the same time.

Earlier this year I published a bioinformatic analysis of the rhoptry kinases (ROPK), a lineage-specific family of signaling proteins involved in the invasion mechanisms of Toxoplasma gondii, Eimeria tenella and related eukaryotic parasites. During this study I found four T. gondii proteins (and their orthologs in other species) that have the hallmarks of ROPKs, including a predicted signal peptide, a protein kinase domain more similar to other ROPKs than to any other known kinases, and mRNA expression patterns matching those of other ROPKs. I named these genes numerically starting after the highest-numbered ROPK previously published (ROP46).

To informally reserve the names ahead of the publication of my own article, I posted notes on the corresponding ToxoDB gene pages: ROP47, ROP48, ROP49 and ROP50. My professor and I made some inquiries with other T. gondii researchers to see if it would be possible to confirm the localization of these proteins to the rhoptry organelle, in order to solidify our argument. Without a peer-reviewed publication to point to, though, this seemed to be the most we could do to promote the new gene names.

In parallel, another well-regarded lab that specializes in T. gondii rhoptry proteins, including but not limited to ROPKs, investigated the localization and function of three other proteins whose mRNA expression patterns had associated them with known rhoptry proteins. It's great work. However, their paper and ours passed through peer review at roughly the same time (earlier this year); we both followed the same numerical naming scheme for rhoptry proteins, starting after ROP46; and unfortunately, we ended up assigning the names ROP47 and ROP48 to different T. gondii proteins.

Crud.

How could this confusing situation have been avoided? EuPathDB is widely used, but it's not the primary source for gene names and accessions, and a user-submitted comment alone has fairly limited visibility. I presented a poster at the 2012 Molecular Parasitology Meeting, where many of the active Toxo enthusiasts gather each year, but the choice of new gene names was a minor detail on the poster. Heck, I even had breakfast with the other group's PI, but we only talked about curious features of established rhoptry proteins, not the novel ROPs we were each about to propose.

The only way to really claim a gene name is with a peer-reviewed publication.


* * *

Until now I didn't really grasp the importance of public preprint servers like arXiv, BioRxiv and PeerJ PrePrints — at least in the life sciences where a good article can be published outside a glamor mag within a few months. (In physics and mathematics, peer review and publication typically take much longer.) It was hard enough to get people I knew to review my articles before submitting them to a journal; would anyone really leave useful comments out of the blue if I posted an unreviewed paper on a preprint server? Answer: Maybe, but there's more to preprints than that.

"Competitors" have their own projects, usually planned around their own grants. They could drop everything and copy your idea if they saw it. More likely, they will do the same thing they'll do when they see your final published paper, which is to take this new information into account as they pursue their own projects. You do want to make an impact on the field, don't you?

Pre-publication peer review is a well-established system for gathering detailed suggestions from uninvolved colleagues, a useful stick to force authors to improve their manuscripts, and sometimes a filter for junk. F1000 has an innovative process: submissions are published first after a cursory screening, then peer reviews are collected and authors can apparently revise the manuscript at their leisure. Once a manuscript has been reviewed, revised and approved, it receives a tag indicating that it has been properly peer-reviewed. PeerJ takes a more conservative approach, hosting a preprint server alongside but separate from its peer-reviewed articles. Is either of these the way forward?

F1000 is new on the scene, and it may be too soon to tell if this is going to be a success. For one thing, will authors be motivated enough to correct their manuscripts promptly? PLoS One once fought a mighty battle against the perception that they weren't peer-reviewed. That stigma came out of thin air, and has been overcome — but will F1000 have to fight the same battle again, since their articles really are a mix of various states of peer-review? I hope not, because many scientists could benefit from having a few holes poked in the wall of pre-publication peer review.

Tuesday, October 8, 2013

Old Cajun wisdom for the young scientist

During the summer after I started grad school, Paul Graham posted an interesting article: "Ramen Profitable." I thought it was inspiring in two ways:
  1. It made the point that a startup founder is in a much safer, more comfortable and more productive position once enough revenue is coming in to support minimal cost-of-living; raising additional funding is no longer the most important thing. Replace "founder" with "grad student"/"postdoc"/"omg it never ends," and "funding" with "funding" (well, it sounds completely different when you change the context), and you've explained why applying for long-shot grants is such a resource sink, yet we all do it anyway, and why a grim but reliable stipend is an acceptable equilibrium for many academics.
  2. The article included a vague but basically great recipe for beans and rice in the footnotes.
I cooked variations of this recipe through the rest of grad school, and now as a postdoc. It hits the sweet spots for flavor, cost, ease of prep, and sheer volume of leftovers.  Here's the specific recipe I converged on over the years.

Red and Black Beans and Rice (SEC)

— or —

Life of the Mind Beans and Rice (other conferences)

In a rice cooker, mix together:
  • 1 c. dry rice (white, brown or parboiled)
  • 1/2 c. quinoa
  • 2 1/2 to 3 c. water (see rice packaging and past experience)
  • (optional: 1 Tbsp. white vinegar)
  • (optional: 1 tsp. garlic powder)

Start the rice cooker and let it do its thing. Heat in a large pan over medium:
  • 1-2 Tbsp. vegetable oil

Add:
  • 1 medium/large or 2 small yellow onion(s), chopped
  • 4-6 cloves garlic, chopped; or 2 Tbsp. garlic paste

Cook until onions are translucent, about 3 minutes. Add:
  • 2-4 oz. andouille sausage; or spicy pork sausage; or other spicy sausage; chopped
  • 2 stalks celery, chopped
  • 1 green bell pepper, chopped
  • 1/4 c. okra, chopped
  • (optional: 1/2 to 1 jalapeno pepper, chopped)

Stir casually for 4-5 minutes. (This is a good time to grab a beer from the fridge.) Season with:
  • 1 tsp. black pepper
  • 1/2 to 1 Tbsp. paprika
  • 1/2 to 1 Tbsp. cumin
  • 1 "serving" condensed chicken broth; or 1 cube chicken bouillon, crumbled
  • (optional: 1/4 to 1/2 tsp. red/cayenne pepper)

Stir for 1 minute to mix thoroughly. Open:
  • 1 14-oz can black beans
  • 1 14-oz can red beans; or kidney beans; or more black beans

Pour the liquid and about 1/3 to 1/2 of the beans from each can into the pan, and stir to mix. Put the rest of the beans in a small bowl or sturdy (Pyrex) beaker and mash somewhat. Add the semi-mashed beans to the pan. Stir for another 3-5 minutes to let the stew thicken.

(When the rice cooker finishes, take off the glass lid and drape a cheesecloth or thin towel over it for a few minutes while the rice cools a little.)

Serve the beans over the rice/quinoa blend in a shallow bowl.

Leftovers are even better.