Thursday, February 2, 2017

What is the UCSF500?

Precision medicine is more than genome sequencing, but molecular profiling is an essential part of it. Recognizing this, in 2014 UC San Francisco launched the Genomic Medicine Initiative to bring high-throughput DNA and RNA sequencing techniques into routine clinical care. The first product of this effort is the UCSF500, a targeted cancer genome sequencing service that is now available to patients at the UCSF Medical Center.

The UCSF500 service is provided by the Clinical Cancer Genomic Lab (CCGL). This group is directed by Dr. Boris C. Bastian, and the informatics group, where I work, is led by Dr. Iwei Yeh. The pilot program from 2014 to 2016 focused on local patients with metastatic disease, including children and patients with rare or poorly understood cancer types.

Several published studies from this program have already given us insight into cancer mechanisms and treatment options.

Targeted sequencing

The UCSF500 assay is a targeted panel of approximately 500 genes and other genomic regions relevant to cancer diagnosis, prognosis and treatment.

The sequenced tissue samples are typically a solid tumor biopsy and a matched normal sample, either a blood draw or a buccal swab. The matched normal is optional, and hematologic (blood) tumors such as leukemia and lymphoma can also be sequenced. In cases of tumor recurrence, the results from a previously sequenced normal sample are reused in the new analysis to save time and cost.

CCGL staff perform DNA extraction, library preparation and hybridization on-site at UCSF. The custom target panel consists of:
  • Exonic regions of about 500 (initially 510, now 480) cancer-associated genes;
  • Selected introns of about 40 genes;
  • Microsatellite sequences, for detecting microsatellite instability (MSI);
  • Scattered SNP sites ("CGH probes") to detect loss of heterozygosity and mutation burden -- these probes are more concentrated near genes where copy number status is known to be actionable.
The captured DNA is then sequenced on two Illumina HiSeq 2500 systems at the UCSF Genomics Core.
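The LOH check that those SNP probes enable can be illustrated with a toy example. This is a simplification for illustration only, not the pipeline's actual logic; the function names and thresholds below are invented. The idea: a SNP site that is heterozygous in the normal sample (variant allele fraction near 0.5) but strongly skewed in the tumor suggests one parental allele was lost.

```python
# Hypothetical sketch of LOH detection at SNP probe sites.
# Thresholds and names are illustrative, not the pipeline's actual values.

def is_het(vaf, lo=0.4, hi=0.6):
    """Treat a normal-sample variant allele fraction near 0.5 as heterozygous."""
    return lo <= vaf <= hi

def loh_sites(pairs, skew=0.25):
    """pairs: list of (site, normal_vaf, tumor_vaf) tuples.
    Return sites heterozygous in the normal whose tumor VAF deviates
    from 0.5 by more than `skew` -- candidate LOH events."""
    return [site for site, n_vaf, t_vaf in pairs
            if is_het(n_vaf) and abs(t_vaf - 0.5) > skew]

sites = [
    ("chr9:21000000", 0.48, 0.05),   # het in normal, one allele lost in tumor
    ("chr9:22000000", 0.52, 0.92),   # het in normal, the other allele lost
    ("chr7:55000000", 0.50, 0.47),   # still balanced -> no LOH
    ("chr17:7500000", 0.02, 0.01),   # homozygous in normal -> uninformative
]
print(loh_sites(sites))  # -> ['chr9:21000000', 'chr9:22000000']
```

In practice allele fractions are also confounded by tumor purity and copy number, which is one reason the real analysis considers these probes jointly with the copy number calls.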

The target panel and analysis approach originated in the Bastian lab at UCSF. Elements of this approach can be seen in published papers on melanoma progression, the genetic drivers of desmoplastic melanoma, and our copy number caller CNVkit.

Analysis on the cloud

CCGL's custom-built pipeline for variant detection and analysis runs on the DNAnexus platform. Analysis of a typical sequencing run -- 14 patient samples, sequenced to an on-target coverage depth of 400x, plus 2 control samples -- takes about 4.5 hours.

The pipeline detects:
  • Small/single nucleotide variants (SNV): GATK HaplotypeCaller and UnifiedGenotyper, FreeBayes, Mutect -- combined into a single "unified" VCF for annotation
  • Structural variants (SV): Pindel, DELLY -- run independently to detect potential gene fusions; small indels from Pindel are also included in the SNV VCF
  • Copy number variants (CNV): CNVkit
  • MSI detection: MSIsensor
  • Various validity checks and quality metrics.
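The "unified" VCF idea can be sketched as a union of each caller's small-variant calls, keyed by position and alleles, with the supporting callers recorded for downstream review. This is a simplification for illustration, not CCGL's actual merge code; the names and tuple keys are mine.

```python
# Hypothetical sketch: union of SNV calls from multiple callers,
# keyed by (chrom, pos, ref, alt), recording which callers support each.

from collections import defaultdict

def unify(calls_by_caller):
    """calls_by_caller: dict mapping caller name -> iterable of
    (chrom, pos, ref, alt) tuples. Returns {variant: sorted caller names}."""
    support = defaultdict(set)
    for caller, calls in calls_by_caller.items():
        for variant in calls:
            support[variant].add(caller)
    return {v: sorted(callers) for v, callers in support.items()}

unified = unify({
    "HaplotypeCaller": [("chr7", 140453136, "A", "T")],
    "FreeBayes":       [("chr7", 140453136, "A", "T"),
                        ("chr12", 25398284, "C", "A")],
    "Mutect":          [("chr7", 140453136, "A", "T")],
})
for variant, callers in sorted(unified.items()):
    print(variant, callers)
```

A real merge also has to reconcile representation differences between callers (e.g. indel left-alignment) before two records can be treated as the same variant.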
The design of the pipeline was inspired by bcbio-nextgen, but does not share code, due to technical constraints. DNAnexus now provides bcbio-nextgen as an app; this is a recent addition and may be a tempting option for new clinical sequencing services.

Reporting to the oncologist

The completed analysis results are automatically pulled to an on-site server hosting the signout software. This server and software are used by CCGL staff to review and approve each run's final QC metrics, and by CCGL clinical geneticists (mainly UCSF pathologists) to review the results and generate a PDF report, which is entered in the patient's medical record and returned to the ordering oncologist.

The UCSF500 report highlights clinically relevant genomic features:
  • Somatic SNVs annotated as "Pathogenic" or "Likely Pathogenic" in ClinVar
  • Copy number alterations
  • Fusion genes
  • Microsatellite stability/instability status
  • Pathogenic germline variants relevant to oncology (incidental)
The clinical geneticist also writes a free-text interpretation of the genomic features in light of the patient's disease and medical context, including literature references. The report includes a complete table of detected somatic SNVs of unknown significance as an appendix.
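The split between highlighted variants and the appendix can be sketched as a simple triage on the ClinVar annotation. This is an illustration of the reporting logic described above, not the signout software's actual code; the field names are invented.

```python
# Hypothetical sketch: partition annotated somatic SNVs into report
# highlights (ClinVar pathogenic calls) and the appendix of variants
# of unknown significance. Field names are illustrative.

REPORTABLE = {"Pathogenic", "Likely pathogenic"}

def triage(variants):
    """variants: list of dicts with 'gene' and 'clinvar' keys.
    Returns (highlights, appendix)."""
    highlights = [v for v in variants if v["clinvar"] in REPORTABLE]
    appendix = [v for v in variants if v["clinvar"] not in REPORTABLE]
    return highlights, appendix

variants = [
    {"gene": "BRAF", "clinvar": "Pathogenic"},
    {"gene": "TP53", "clinvar": "Likely pathogenic"},
    {"gene": "TTN",  "clinvar": "Uncertain significance"},
]
highlights, appendix = triage(variants)
print([v["gene"] for v in highlights])  # -> ['BRAF', 'TP53']
```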

Finally, the clinical geneticist responsible for the report and the ordering oncologist responsible for the patient discuss therapeutic strategies at the next molecular tumor board meeting.

This is the UCSF500's key advantage, and the reason this service can only be offered by a medical center, not a startup: CCGL clinical geneticists continue to work with oncologists after delivering the final report so that the patient's medical history and treatment plan can be considered together with genetics and other diagnostic tests. This approach allows us to not only identify relevant therapies and clinical trials, but also to reconsider the initial diagnosis, quickly order follow-up lab tests, and consider germline findings that may affect the patient's family members.

Tuesday, November 4, 2014

Preview and preprint: CNVkit, copy number detection for targeted sequencing

I've posted a preprint of the CNVkit manuscript on bioRxiv. If you think this software or method might suit your needs, please take a look and let me know what you think of it!

What is CNVkit?

CNVkit is a software toolkit for detecting and visualizing germline copy number variants and somatic copy number alterations in targeted or whole-exome DNA sequencing data. (Source code | Documentation)

The method implemented in CNVkit takes advantage of the sparse, nonspecifically captured off-target reads present in hybrid capture sequencing output to supplement on-target read depths. The program also uses a series of normalizations and bias corrections so it can be used with or without a normal-sample copy number reference to accurately call CNVs. The overall resolution and copy ratio values are very close to those obtained with 180K array CGH.
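The core normalization idea -- compare each bin's relative read depth against a reference and take the log2 ratio -- can be sketched in a few lines. This is a deliberate simplification, not CNVkit's implementation, which layers several additional bias corrections (GC content, target size, repeat masking) on top of this.

```python
# Minimal sketch of log2 copy-ratio calculation (not CNVkit's actual code):
# normalize each sample's per-bin depth by its own mean, then compare.

import math

def log2_ratios(sample_depths, reference_depths):
    """Per-bin log2 copy ratio of sample vs. reference, after normalizing
    each to its own mean depth."""
    s_mean = sum(sample_depths) / len(sample_depths)
    r_mean = sum(reference_depths) / len(reference_depths)
    return [math.log2((s / s_mean) / (r / r_mean))
            for s, r in zip(sample_depths, reference_depths)]

# Four bins; the last has double the relative coverage, so its log2
# ratio is about +0.68 while the others dip slightly below zero.
ratios = log2_ratios([100, 100, 100, 200], [100, 100, 100, 100])
print([round(r, 2) for r in ratios])  # -> [-0.32, -0.32, -0.32, 0.68]
```

Segmentation then groups adjacent bins with similar ratios into larger regions of constant copy number.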

We have used CNVkit at UCSF to assess clinical samples for several research projects over the past year.

Putting it in your pipeline

See the Quick Start page for basic usage. The software package is modular so, in addition to the simple "batch" calling style, the underlying commands can be run directly to support your workflow.

I've attempted to make CNVkit compatible with other software and easy to integrate into sequencing analysis pipelines. The following are currently supported or in development:
  • bcbio-nextgen -- in progress
  • Galaxy -- a basic wrapper is in the development Tool Shed
  • THetA2 -- CNVkit segmentation output can be used directly as input to THetA
  • Integrative Genomics Viewer -- export segments as SEG, then load in IGV to view tracks as a heatmap
  • BioDiscovery Nexus Copy Number -- export files to the Nexus "basic" format
  • Java TreeView -- export CDT or .jtv tabular files, then load in JTV for a microarray-like viewing experience
If you would like to see CNVkit play nicely with another existing program, and/or support another standard output format, or just want some help getting set up, please let me know on SeqAnswers.

Friday, October 3, 2014

On the awesomeness of the BOSC/OpenBio Codefest 2014

This summer I was in Boston for a bundle of conferences: Intelligent Systems in Molecular Biology (ISMB), the Bioinformatics Open Source Conference (BOSC) before that, and a very special Open Bioinformatics Codefest before all of it.

The Codefest was novel, so I'm writing about the highlights here.

A Galaxy Tool for CNVkit

I spent a good portion of the last year developing CNVkit, a software package for calling copy number variants from targeted DNA sequencing. At the Codefest I wanted to work on making CNVkit compatible with existing bioinformatics pipelines and workflow management systems, in particular Galaxy and bcbio-nextgen. (I had no prior development experience with either platform.)

Galaxy is a popular open-source framework for building reusable/repeatable bioinformatic workflows through a web browser interface. In particular, existing software can be wrapped for the Galaxy platform and distributed through the Galaxy Tool Shed.  With help from Peter Cock, Matt Shirley and other members of the Galaxy team, I managed to build and successfully run a Galaxy Tool wrapping CNVkit. It's currently visible in the Test Tool Shed and in the main CNVkit source code repository on GitHub. I still need to finalize the Tool, and sometime after that it will hopefully be accepted into the main Tool Shed, making it easily available to all Galaxy users.


CNVkit in bcbio-nextgen

Brad Chapman, in addition to his involvement in developing Biopython and Cloud BioLinux and organizing the Codefest itself, is currently leading the development of bcbio-nextgen, a framework to implement and evaluate best-practice pipelines for high-throughput sequencing analyses. Recent work on this project considered structural variants; next steps will consider cancer samples and targeted or whole-exome sequencing, where CNVkit could be a useful component of the analysis pipeline.

I didn't produce any code for bcbio-nextgen at the Codefest, but I did get a chance to talk to Brad about it a little, and work is now progressing.  A goal of the bcbio-nextgen project is to produce a pipeline that not only works, but works as well as possible. To achieve this, we'll need to develop good benchmarks for evaluating structural variant and copy number variant calls on cancer samples, something of an open problem at the moment.

Arvados and Curoverse

Arvados is a robust, open-source platform for conducting large-scale genomic analyses. The project originated in George Church's group at Harvard and the Personal Genome Project. Curoverse is a startup that has built a user-friendly workflow management system (conceptually similar to DNAnexus and Appistry) on top of Arvados. The Curoverse front-end can be installed and run locally, jobs can be seamlessly dispatched to distributed computing services (like the Amazon cloud), and parts of Galaxy and bcbio-nextgen already run on Curoverse.

Curoverse kindly sponsored the Codefest, and a few of the Arvados/Curoverse folks were in attendance and shared some of their work (and stickers, and free trial accounts and compute time) with the rest of us. The Codefest was also blessed with Amazon Web Services bucks, which we could use toward running Cloud BioLinux or Curoverse.  Anyway, Curoverse looks cool, and worth keeping an eye on.

Biopythoneers of the world unite

The core developers of Biopython are distributed globally, and BOSC is one of the few opportunities for any of us to meet in person. The Codefest provided a nice setting for Peter Cock, Wibowo "Bow" Arindrarto and me to get together, stake out a table and hack on Biopython for a couple of days.

We started with a survey of the issue tracker and addressed some long-standing bugs. Bow then moved on to explore an idea for splitting the Biopython distribution into smaller, separately installable modules, while I cleaned some dark corners of Bio.Phylo and enabled automatic testing of the code examples in the Phylo chapter of the Biopython Tutorial and Cookbook. Peter worked on his new BGZF module and SAM/BAM branch in Biopython, and at some point stated that Biopython will have native (pure-Python) SAM/BAM support soon.

The scene

We met up at hack/reduce, a hackspace next to MIT and Kendall Square -- a fairly unassuming low-rise brick building, converted from an industrial space and retrofitted with good Wi-Fi, coffee urns and other essentials.

The environment inside was friendly and helpful. Note the distinction between "codefest" and "hackathon": this one was collaborative, not competitive, and welcomed both newcomers and veterans of open-source projects. In addition to the Biopythoneers, Galaxy was well represented, with John Chilton and Michael Crusoe conveniently within hollering distance of Team Biopython. Groups from Arvados/Curoverse and Cloud BioLinux were there too, along with individuals involved in a variety of other projects, and some people just came to meet up and network. Chalalai Chaihirunkarn from Carnegie Mellon University was there to study the dynamics of the Codefest itself, and she will report on it at some point.

At BOSC, kicking off the second day of the conference, Brad summarized our accomplishments at the Codefest.
I recommend attending the next OpenBio Codefest to anyone who is interested in it. Even if you aren't currently involved in an open-source project, BOSC and the Codefest are unique, useful opportunities for personal education and professional development. In any case it's an interesting and fun experience.

Saturday, June 28, 2014

Tomorrowland never dies: A clever rapid transit system sees life in Tel Aviv

A skyTran demo track will be installed in Tel Aviv, Israel during the next year, and a larger commercial installation is scheduled for 2016. If this system works well, it will put every other city's mass transit options to shame.

The skyTran concept should be easy to grasp if you've been to Disneyland, specifically Tomorrowland:
  1. Start with the monorail and split it into small autonomous cars seating two people (like the now-defunct PeopleMover, if you remember it).
  2. Flip it upside down, hanging from the track rather than on top of it, so less can go wrong (a hint of SkyWay, now).
  3. Modernize the design: aerodynamic shape, a maglev track, and sensors to guide and space the cars and allow them to brake quickly if there's a problem ahead.
The cars are shunted off from the main track to allow passengers to get on or off without interrupting the flow of traffic  — like a freeway or express train. The tight spacing, automatic routing, and lack of stops along the way allow the system to carry large numbers of people to their destinations much more efficiently than a standard highway or even a train.
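That capacity claim is easy to sanity-check with a back-of-envelope calculation. All numbers below are my own assumptions for illustration, not published skyTran specifications.

```python
# Back-of-envelope guideway capacity: passengers per hour past a point
# equals cars per hour times seats per car. All inputs are assumptions.

def passengers_per_hour(headway_s, seats_per_car, load_factor=1.0):
    """Throughput of a single guideway given the time gap between cars."""
    cars_per_hour = 3600 / headway_s
    return cars_per_hour * seats_per_car * load_factor

# With a 2-second headway and fully loaded 2-seat cars:
print(passengers_per_hour(2, 2))  # -> 3600.0 passengers/hour
```

For comparison, a freeway lane carries on the order of 2,000 vehicles per hour, and cars average well under two occupants, so even these conservative assumptions put a single guideway ahead of a highway lane -- and since nobody stops along the way, all of that throughput is moving at full speed.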

* * *

I found Doug Malewicki's website for skyTran in late 2005, fresh out of college. The idea looked solid and I was enthusiastic about it, but the website didn't do justice to the engineering team behind it. I volunteered to revamp the website, and met with Doug a couple of times to go over ideas. I put together a simple static site as a demo, and after much exertion, got the CSS to look all right in both Firefox and IE (because that's what people used at the time, grandkids) on a variety of screen sizes and default font sizes. But for our next meeting, Doug printed out the webpages, and — due to some combination of hardware and software that I will never know — the fonts tripled in size and the layout turned to garbage. Doug was displeased, I was helpless. In conclusion, I don't have what it takes to do front-end web development. It was probably this experience more than any other single event that motivated me to go to grad school.

Anyway, I'm delighted to see that skyTran is still moving forward.

When I first saw the skyTran design in 2005, smartphones were not popular yet, so instead of using a smartphone app to summon a car and pay for it, passengers would carry a keychain-sized RFID dongle for payment (e.g. FasTrak), and simply queue up at a raised platform to catch a car. The design I saw also indicated a maximum speed of 150 mph, making it suitable for most medium-distance travel in a metro area, and probably competitive with California's long-delayed high-speed rail, but not as fast as airlines for long-distance travel.

Obviously, this would be great in a spread-out US city like Los Angeles or San Jose, but implementing it would be politically impossible. At the time Doug told me about it, he said he was going to pursue privately funded initiatives, and he had a team in Seattle building a proper prototype. I thought a good candidate to try the technology would be a city-state like Singapore, with a strong centralized authority and a keen interest in efficient, scalable civic development. It seems Tel Aviv has the same will and ability to develop new infrastructure. So, is there any good reason California can't do the same?

Tuesday, December 24, 2013

The blinders of peer review

Does pre-publication peer-review isolate a finding from the field during the process? Sure, and that's partly the point of it, but it can lead to some inconveniences when two related papers from separate groups undergo peer review at the same time.

Earlier this year I published a bioinformatic analysis of the rhoptry kinases (ROPK), a lineage-specific family of signaling proteins involved in the invasion mechanisms of Toxoplasma gondii, Eimeria tenella and related eukaryotic parasites. During this study I found four T. gondii proteins (and their orthologs in other species) that have the hallmarks of ROPKs, including a predicted signal peptide, a protein kinase domain more similar to other ROPKs than to any other known kinases, and mRNA expression patterns matching those of other ROPKs. I named these genes numerically starting after the highest-numbered ROPK previously published (ROP46).

To informally reserve the names ahead of the publication of my own article, I posted notes on the corresponding ToxoDB gene pages: ROP47, ROP48, ROP49 and ROP50. My professor and I made some inquiries with other T. gondii researchers to see if it would be possible to confirm the localization of these proteins to the rhoptry organelle, in order to solidify our argument. Without a peer-reviewed publication to point to, though, this seemed to be the most we could do to promote the new gene names.

In parallel, another well-regarded lab that specializes in T. gondii rhoptry proteins, including but not limited to ROPKs, investigated the localization and function of three other proteins whose mRNA expression patterns had indicated an association with known rhoptry proteins. It's great work. However, their paper and ours both passed through peer review at roughly the same time (earlier this year); we both followed the same numerical naming scheme for rhoptry proteins, starting after ROP46; and unfortunately, we ended up assigning the names ROP47 and ROP48 to different T. gondii proteins.


How could this confusing situation have been avoided? EuPathDB is widely used, but it's not the primary source for gene names and accessions, and a user-submitted comment alone has fairly limited visibility. I presented a poster at the 2012 Molecular Parasitology Meeting, where many of the active Toxo enthusiasts gather each year, but the choice of new gene names was a minor detail on the poster. Heck, I even had breakfast with the other group's PI, but we only talked about curious features of established rhoptry proteins, not the novel ROPs we were each about to propose.

The only way to really claim a gene name is with a peer-reviewed publication.

* * *

Until now, I hadn't really grasped the importance of public preprint servers like arXiv, bioRxiv and PeerJ PrePrints — at least in the life sciences, where a good article can be published outside a glamor mag within a few months. (In physics and mathematics, peer review and publication typically take much longer.) It was hard enough to get people I knew to review my articles before submitting them to a journal; would anyone really leave useful comments out of the blue if I posted an unreviewed paper on a preprint server? Answer: Maybe, but there's more to preprints than that.

"Competitors" have their own projects, usually planned around their own grants. They could drop everything and copy your idea if they saw it. More likely, they will do the same thing they'll do when they see your final published paper, which is to take this new information into account as they pursue their own projects. You do want to make an impact on the field, don't you?

Pre-publication peer-review is a well-established system for gathering detailed suggestions from uninvolved colleagues, a useful stick to force authors to improve their manuscripts, and sometimes a filter for junk. F1000 has an innovative process: submissions are published after a cursory screening, then peer reviews are collected and the authors revise the manuscript at their leisure. Once a manuscript has been reviewed, revised and approved, it receives a tag indicating that it has been properly peer-reviewed. PeerJ takes a more conservative approach, hosting a preprint server alongside but separate from their peer-reviewed articles. Is either of these the way forward?

F1000 is new on the scene, and it may be too soon to tell if this is going to be a success. For one thing, will authors be motivated enough to correct their manuscripts promptly? PLoS One once fought a mighty battle against the perception that they weren't peer-reviewed. That stigma came out of thin air, and has been overcome — but will F1000 have to fight the same battle again, since their articles really are a mix of various states of peer-review? I hope not, because many scientists could benefit from having a few holes poked in the wall of pre-publication peer review.