a frictionless surface

What is the UCSF500?

2017-02-02T16:39:00.000-05:00

Precision medicine is more than genome sequencing, but molecular profiling is an essential part of it. Recognizing this, in 2014 UC San Francisco launched the Genomic Medicine Initiative to bring high-throughput DNA and RNA sequencing techniques into routine clinical care. The first product of this effort is the UCSF500, a targeted cancer genome sequencing service that is now available to patients at the UCSF Medical Center.

The UCSF500 service is provided by the Clinical Cancer Genomic Lab (CCGL). This group is directed by Dr. Boris C. Bastian, and the informatics group, where I work, is led by Dr. Iwei Yeh. The pilot program from 2014 to 2016 focused on local patients with metastatic disease, including children and patients with rare or poorly understood cancer types.

Several published studies from this program have already given us insight into cancer mechanisms and treatment options:

Childhood gliomas and other brain tumors
Malignant phyllodes tumors of the breast
Peritoneal mesothelioma, a rare cancer of the abdomen
Anaplastic pleomorphic xanthoastrocytoma (PXA), a rare neurological tumor

Targeted sequencing

The UCSF500 assay is a targeted panel of approximately 500 genes and other genomic regions relevant to cancer diagnosis, prognosis and treatment.

The sequenced tissue samples are typically a solid tumor biopsy and a matched normal sample, either blood draw or buccal swab. The matched normal is optional, and hematologic (blood) tumors such as leukemia and lymphoma can also be sequenced. In cases of tumor recurrence a previously sequenced normal sample's results will be reused in the new analysis to save time and cost.

CCGL staff perform DNA extraction, library preparation and hybridization on-site at UCSF. The custom target panel consists of:

Exonic regions of about 500 (initially 510, now 480) cancer-associated genes;
Selected introns of about 40 genes;
Microsatellite sequences, for detecting microsatellite instability (MSI);
Scattered SNP sites ("CGH probes") to detect loss of heterozygosity and mutation burden -- these probes are more concentrated near genes where copy number status is known to be actionable.

The captured DNA is then sequenced on two Illumina HiSeq 2500 systems at the UCSF Genomics Core.

The target panel and analysis approach originated in the Bastian lab at UCSF. Elements of this approach can be seen in published papers on melanoma progression, the genetic drivers of desmoplastic melanoma, and our copy number caller CNVkit.

Analysis on the cloud

CCGL's custom-built pipeline for variant detection and analysis runs on the DNAnexus platform. Analysis of a typical sequencing run -- 14 patient samples, sequenced to an on-target coverage depth of 400x, plus 2 control samples -- takes about 4.5 hours.

The pipeline detects:

Small/single nucleotide variants (SNV): GATK HaplotypeCaller and UnifiedGenotyper, FreeBayes, Mutect -- combined into a single "unified" VCF for annotation
Structural variants (SV): Pindel, DELLY -- run independently to detect potential gene fusions; small indels from Pindel are also included in the SNV VCF
Copy number variants (CNV): CNVkit
MSI detection: MSIsensor
Various validity checks and quality metrics.

The design of the pipeline was inspired by bcbio-nextgen, but does not share code, due to technical constraints. DNAnexus now provides bcbio-nextgen as an app; this is a recent addition and may be a tempting option for new clinical sequencing services.

Reporting to the oncologist

The completed analysis results are automatically pulled to an on-site server hosting the signout software. This server and software is used by CCGL staff to review and approve each run's final QC metrics, and by CCGL clinical geneticists (mainly UCSF pathologists) to review the results and generate a PDF report to be entered in the patient's medical record and returned to the ordering oncologist.

The UCSF500 report highlights clinically relevant genomic features:

Somatic SNVs annotated as "Pathogenic" or "Likely Pathogenic" in ClinVar
Copy number alterations
Fusion genes
Microsatellite stability/instability status
Pathogenic germline variants relevant to oncology (incidental)

The clinical geneticist also writes a free-text interpretation of the genomic features in light of the patient's disease and medical context, including literature references. The report includes a complete table of detected somatic SNVs of unknown significance as an appendix.

Finally, the clinical geneticist responsible for the report and the ordering oncologist responsible for the patient discuss therapeutic strategies at the next molecular tumor board meeting.

This is the UCSF500's key advantage, and the reason this service can only be offered by a medical center, not a startup: CCGL clinical geneticists continue to work with oncologists after delivering the final report so that the patient's medical history and treatment plan can be considered together with genetics and other diagnostic tests. This approach allows us to not only identify relevant therapies and clinical trials, but also to reconsider the initial diagnosis, quickly order follow-up lab tests, and consider germline findings that may affect the patient's family members.

Preview and preprint: CNVkit, copy number detection for targeted sequencing

2014-11-04T01:30:00.001-05:00

I've posted a preprint of the CNVkit manuscript on bioRxiv. If you think this software or method might suit your needs, please take a look and let me know what you think of it!

What is CNVkit?

CNVkit is a software toolkit for detecting and visualizing germline copy number variants and somatic copy number alterations in targeted or whole-exome DNA sequencing data. (Source code | Documentation)

The method implemented in CNVkit takes advantage of the sparse, nonspecifically captured off-target reads present in hybrid capture sequencing output to supplement on-target read depths. The program also uses a series of normalizations and bias corrections so it can be used with or without a normal-sample copy number reference to accurately call CNVs. The overall resolution and copy ratio values are very close to those obtained with 180K array CGH.

We have used CNVkit at UCSF to assess clinical samples for several research projects over the past year.

Putting it in your pipeline

See the Quick Start page for basic usage. The software package is modular so, in addition to the simple "batch" calling style, the underlying commands can be run directly to support your workflow.

I've attempted to make CNVkit compatible with other software and easy to integrate into sequencing analysis pipelines. The following are currently supported or in development:

bcbio-nextgen -- in progress
Galaxy -- a basic wrapper is in the development Tool Shed
THetA2 -- CNVkit segmentation output can be used directly as input to THetA
Integrative Genomics Viewer -- export segments as SEG, then load in IGV to view tracks as a heatmap
BioDiscovery Nexus Copy Number -- export files to the Nexus "basic" format
Java TreeView -- export CDT or .jtv tabular files, then load in JTV for a microarray-like viewing experience

If you would like to see CNVkit play nicely with another existing program, and/or support another standard output format, or just want some help getting set up, please let me know on SeqAnswers.

On the awesomeness of the BOSC/OpenBio Codefest 2014

2014-10-03T02:21:00.001-04:00

This summer I was in Boston for a bundle of conferences: Intelligent Systems in Molecular Biology (ISMB), the Bioinformatics Open Source Conference (BOSC) before that, and a very special Open Bioinformatics Codefest before all of it.

The Codefest was novel, so I'm writing about the highlights here.

A Galaxy Tool for CNVkit

I spent a good portion of the last year developing CNVkit, a software package for calling copy number variants from targeted DNA sequencing. At the Codefest I wanted to work on making CNVkit compatible with existing bioinformatics pipelines and workflow managements systems, in particular Galaxy and bcbio-nextgen. (I had no prior development experience with either platform.)

Galaxy is a popular open-source framework for building reusable/repeatable bioinformatic workflows through a web browser interface. In particular, existing software can be wrapped for the Galaxy platform and distributed through the Galaxy Tool Shed. With help from Peter Cock, Matt Shirley and other members of the Galaxy team, I managed to build and successfully run a Galaxy Tool wrapping CNVkit. It's currently visible in the Test Tool Shed and in the main CNVkit source code repository on GitHub. I still need to finalize the Tool, and sometime after that it will hopefully be accepted into the main Tool Shed, making it easily available to all Galaxy users.

bcbio-nextgen

Brad Chapman, in addition to his involvement in developing Biopython and Cloud BioLinux and organizing the Codefest itself, is currently leading the development of bcbio-nextgen, a framework to implement and evaluate best-practice pipelines for high-throughput sequencing analyses. Recent work on this project considered structural variants; next steps will consider cancer samples and targeted or whole-exome sequencing, where CNVkit could be a useful component of the analysis pipeline.

I didn't produce any code for bcbio-nextgen at the Codefest, but I did get a chance to talk to Brad about it a little, and work is now progressing. A goal of the bcbio-nextgen project is to produce a pipeline that not only works, but works as well as possible. To achieve this, we'll need to develop good benchmarks for evaluating structural variant and copy number variant calls on cancer samples, something of an open problem at the moment.

Arvados and Curoverse

Arvados is a robust, open-source platform for conducting large-scale genomic analyses. The project originated in George Church's group at Harvard and the Personal Genome Project. Curoverse is a startup that has built a user-friendly workflow management system (similar to DNAnexus and Appistry, conceptually) on top of Arvados. The Curoverse front-end can be installed and run locally, and jobs can also be seamlessly dispatched to distributed computing services (like the Amazon cloud); some of Galaxy and bcbio-nextgen run on Curoverse already.

Curoverse kindly sponsored the Codefest, and a few of the Arvados/Curoverse folks were in attendance and shared some of their work (and stickers, and free trial accounts and compute time) with the rest of us. The Codefest was also blessed with Amazon Web Services bucks, which we could use toward running Cloud BioLinux or Curoverse. Anyway, Curoverse looks cool, and worth keeping an eye on.

Biopythoneers of the world unite

The core developers of Biopython are distributed globally, and BOSC is a fairly unique opportunity for any of us to meet in person. The Codefest provided a nice setting for Peter Cock, Wibowo "Bow" Arindrarto and I to get together, stake out a table and hack on Biopython for a couple days.

We started with a survey of the issue tracker and addressed some long-standing bugs. Bow then moved on to explore an idea for splitting the Biopython distribution into smaller, separately installable modules, while I cleaned some dark corners of Bio.Phylo and enabled automatic testing of the code examples in the Phylo chapter of the Biopython Tutorial and Cookbook. Peter worked on his new BGZIP module and SAM/BAM branch in Biopython, and at some point stated that Biopython will have native (pure-Python) SAM/BAM support soon.

The scene

We met up at hack/reduce, a hackspace next to MIT and Kendall Square -- a fairly unassuming low-rise brick building, converted from an industrial space and retrofitted with good Wi-Fi, coffee urns and other essentials.

The environment inside was friendly and helpful. Note the distinction between "codefest" and "hackathon": This one was collaborative, not competitive, and welcomed both newcomers and veterans of open-source projects. In addition to the Biopythoneers, Galaxy was well represented, with John Chilton and Michael Crusoe conveniently within hollering distance of Team Biopython. Groups from Arvados/Curoverse, Cloud BioLinux, and individuals who are involved in a variety of other projects were there, too. Some people just came to meet up and network. Chalalai Chaihirunkarn from Carnegie Mellon University was there to study the dynamics of the Codefest itself, and she will report on it at some point.

At BOSC, kicking off the second day of the conference, Brad summarized our accomplishments at the Codefest:

Video
Slides
Collected notes from before, during and after the Codefest

I recommend attending the next OpenBio Codefest to anyone who is interested in it. Even if you aren't currently involved in an open-source project, BOSC and the Codefest are unique, useful opportunities for personal education and professional development. In any case it's an interesting and fun experience.

Tomorrowland never dies: A clever rapid transit system sees life in Tel Aviv

2014-06-28T13:39:00.002-04:00

A skyTran demo track will be installed in Tel Aviv, Israel during the next year, and a larger commercial installation is scheduled for 2016. If this system works well, will put every other city's mass transit options to shame.

The skyTran concept should be easy to grasp if you've been to Disneyland, specifically Tomorrowland:

Start with the monorail and split it into small autonomous cars seating two people (like the now-defunct PeopleMover, if you remember it).
Flip it upside down, hanging from the track rather than on top of it, so less can go wrong (a hint of SkyWay, now).
Modernize the design: aerodynamic shape, a maglev track, and sensors to guide and space the cars and allow them to brake quickly if there's a problem ahead.

The cars are shunted off from the main track to allow passengers to get on or off without interrupting the flow of traffic — like a freeway or express train. The tight spacing, automatic routing, and lack of stops along the way allow the system to carry large numbers of people to their destinations much more efficiently than a standard highway or even a train.

* * *

I found Doug Malewicki's website for skyTran in late 2005, fresh out of college. The idea looked solid and I was enthusiastic about it, but the website didn't do justice to the engineering team behind it. I volunteered to revamp the website, and met with Doug a couple of times to go over ideas. I put together a simple static site as a demo, and after much exertion, got the CSS to look all right in both Firefox and IE (because that's what people used at the time, grandkids) on a variety of screen sizes and default font sizes. But for our next meeting, Doug printed out the webpages, and — due to some combination of hardware and software that I will never know — the fonts tripled in size and the layout turned to garbage. Doug was displeased, I was helpless. In conclusion, I don't have what it takes to do front-end web development. It was probably this experience more than any other single event that motivated me to go to grad school.

Anyway, I'm delighted to see that skyTran is still moving forward.

When I first saw the skyTran design in 2005, smartphones were not popular yet, so instead of using a smartphone app to summon a car and pay for it, passengers would carry a keychain-sized RFID dongle for payment (e.g. FasTrak), and simply queue up at a raised platform to catch a car. The design I saw also indicated a maximum speed of 150 mph, making it suitable for most medium-distance travel in a metro area, and probably competitive with California's long-delayed high-speed rail, but not as fast as airlines for long-distance travel.

Obviously, this would be great in a spread-out US city like Los Angeles or San Jose, but implementing it would be politically impossible. At the time Doug told me about it, he said he was going to pursue privately funded initiatives, and he had a team in Seattle building a proper prototype. I thought a good candidate to try the technology would be a city-state like Singapore, with a strong centralized authority and a keen interest in efficient, scalable civic development. It seems Tel Aviv has the same will and ability to develop new infrastructure. So, is there any good reason California can't do the same?

The blinders of peer review

2013-12-24T19:18:00.001-05:00

Does pre-publication peer-review isolate a finding from the field during the process? Sure, and that's partly the point of it, but it can lead to some inconveniences when two related papers from separate groups undergo peer review at the same time.

Earlier this year I published a bioinformatic analysis of the rhoptry kinases (ROPK), a lineage-specific family of signaling proteins involved in the invasion mechanisms of Toxoplasma gondii, Eimeria tenella and related eukaryotic parasites. During this study I found four T. gondii proteins (and their orthologs in other species) that have the hallmarks of ROPKs, including a predicted signal peptide, a protein kinase domain more similar to other ROPKs than to any other known kinases, and mRNA expression patterns matching those of other ROPKs. I named these genes numerically starting after the highest-numbered ROPK previously published (ROP46).

To informally reserve the names ahead of the publication of my own article, I posted notes on the corresponding ToxoDB gene pages: ROP47, ROP48, ROP49 and ROP50. My professor and I made some inquiries with other T. gondii researchers to see if it would be possible to confirm the localization of these proteins to the rhoptry organelle, in order to solidify our argument. Without a peer-reviewed publication to point to, though, this seemed to be the most we could do to promote the new gene names.

In parallel, another well-regarded lab that specializes in T. gondii rhoptry proteins, including but not limited to ROPKs, investigated the localization and function of three other proteins that mRNA expression had indicated were associated with other rhoptry proteins. It's great work. However, their paper and ours both passed through peer review at roughly the same time (earlier this year); we both followed the same numerical naming scheme for rhoptry proteins, starting after ROP46; and unfortunately, we ended up assigning the names ROP47 and ROP48 to different T. gondii proteins.

Crud.

How could this confusing situation have been avoided? EuPathDB is widely used, but it's not the primary source for gene names and accessions, and a user-submitted comment alone has fairly limited visibility. I presented a poster at the 2012 Molecular Parasitology Meeting, where many of the active Toxo enthusiasts gather each year, but the choice of new gene names was a minor detail on the poster. Heck, I even had breakfast with the other group's PI, but we only talked about curious features of established rhoptry proteins, not the novel ROPs we were each about to propose.

The only way to really claim a gene name is with a peer-reviewed publication.

* * *
Until now I didn't really grasp the importance of public preprint servers like arXiv, BioRxiv and PeerJ PrePrints — at least in the life sciences where a good article can be published outside a glamor mag within a few months. (In physics and mathematics, peer review and publication typically take much longer.) It was hard enough to get people I knew to review my articles before submitting them to a journal; would anyone really leave useful comments out of the blue if I posted an unreviewed paper on a preprint server? Answer: Maybe, but there's more to preprints than that.

"Competitors" have their own projects, usually planned around their own grants. They could drop everything and copy your idea if they saw it. More likely, they will do the same thing they'll do when they see your final published paper, which is to take this new information into account as they pursue their own projects. You do want to make an impact on the field, don't you?

Pre-publication peer-review is a well-established system for gathering detailed suggestions from uninvolved colleagues, a useful stick to force authors to improve their manuscripts, and sometimes a filter for junk. F1000 has an innovative process of publishing submissions first after a cursory screening, then collecting peer reviews and allowing authors to revise the manuscript at their leisure, apparently. Once a manuscript has been reviewed, revised and approved, it receives a tag indicating that it has been properly peer-reviewed. PeerJ takes a more conservative approach, hosting a preprint server alongside but separate from their peer-reviewed articles. Are either of these the way forward?

F1000 is new on the scene, and it may be too soon to tell if this is going to be a success. For one thing, will authors be motivated enough to correct their manuscripts promptly? PLoS One once fought a mighty battle against the perception that they weren't peer-reviewed. That stigma came out of thin air, and has been overcome — but will F1000 have to fight the same battle again, since their articles really are a mix of various states of peer-review? I hope not, because many scientists could benefit from having a few holes poked in the wall of pre-publication peer review.

Old Cajun wisdom for the young scientist

2013-10-08T02:26:00.000-04:00

During the summer after I started grad school, Paul Graham posted an interesting article: "Ramen Profitable." I thought it was inspiring in two ways:

It made the point that a startup founder is in a much safer, more comfortable and more productive position once enough revenue is coming in to support minimal cost-of-living; raising additional funding is no longer the most important thing. Replace "founder" with "grad student"/"postdoc"/"omg it never ends," and "funding" with "funding" (well, it sounds completely different when you change the context), and you've explained why applying for long-shot grants is such a resource sink, yet we all do it anyway, and why a grim but reliable stipend is an acceptable equilibrium for many academics.
The article included a vague but basically great recipe for beans and rice in the footnotes.

I cooked variations of this recipe through the rest of grad school, and now as a postdoc. It hits the sweet spots for flavor, cost, ease of prep, and sheer volume of leftovers. Here's the specific recipe I converged on over the years.

Red and Black Beans and Rice (SEC)

— or —

Life of the Mind Beans and Rice (other conferences)

In a rice cooker, mix together:

1 c. dry rice (white, brown or parboiled)
1/2 c. quinoa
2 1/2 to 3 c. water (see rice packaging and past experience)
(optional: 1 Tbsp. white vinegar)
(optional: 1 tsp. garlic powder)

Start the rice cooker and let it do its thing. Heat in a large pan over medium:

1-2 Tbsp. vegetable oil

Add:

1 medium/large or 2 small yellow onion(s), chopped
4-6 cloves garlic, chopped; or 2 Tbsp. garlic paste

Cook until onions are translucent, about 3 minutes. Add:

2-4 oz. andouille sausage; or spicy pork sausage; or other spicy sausage; chopped
2 stalks celery, chopped
1 green bell pepper, chopped
1/4 c. okra, chopped
(optional: 1/2 to 1 jalapeno pepper, chopped)

Stir casually for 4-5 minutes. (This is a good time to grab a beer from the fridge.) Season with:

1 tsp. black pepper
1/2 to 1 Tbsp. paprika
1/2 to 1 Tbsp. cumin
1 "serving" condensed chicken broth; or 1 cube chicken boullion, crumbled
(optional: 1/4 to 1/2 tsp. red/cayenne pepper)

Stir for 1 minute to mix thoroughly. Open:

1 14-oz can black beans
1 14-oz can red beans; or kidney beans; or more black beans

Pour in the liquid and about 1/3 to 1/2 of the beans from each can into the pan, and stir to mix. Put the rest of the beans in a small bowl or sturdy (Pyrex) beaker and mash somewhat. Add the semi-mashed beans to the pan. Stir for another 3-5 minutes to let the stew thicken.

(When the rice cooker finishes, take off the glass lid and drape a cheesecloth or thin towel over it for a few minutes while the rice cools a little.)

Serve the beans over the rice/quinoa blend in a shallow bowl.

Leftovers are even better.

Homebrew chronicles

2013-06-09T21:25:00.001-04:00

I've started up another blog on home brewing. The first post covers a Pumpkin Spice Porter and some important lessons learned about water.

Overly honest methods / cheap tricks for multiple sequence alignment

2013-01-19T16:56:00.003-05:00

What's the quickest way to get a so-so sequence alignment? Once you have this, there are plenty of efficient ways to refine the alignment, infer a tree, build a profile to search for related sequences, or any combination of these at once. MUSCLE and MAFFT spend a surprising amount of time computing initial pairwise distances and computing a guide tree; building a multiple alignment from this point is quite efficient. As part of my recent fascination with absurd algorithms and other unpublishable methods, I looked into this a bit.

Pairwise distances

Rigorously/naively, we could compute pairwise alignments with a dynamic programming algorithm like Smith-Waterman and take the pairwise score as the measure of similarity. Running all-versus-all BLAST and taking the best HSP score (similarity) or e-value (distance) would give roughly the same result. But we can do a crappier job of this if we put some thought into it.

For each sequence, let's count the number of occurrences of each k-mer (k to be determined). The similarity between sequences is then the sum of absolute differences in k-mer counts between two sequences.

# Count distinct tripeptides in each sequence

from collections import Counter
from Bio import SeqIO

K = 3

def get_kmers(seqstr, k):
    for i in range(len(seqstr) - k):
        yield seqstr[i:i+k]

def kmer_dist(kmers1, kmers2):
    """Sum of absolute differences between k-mer counts."""
    keys = set(kmers1.iterkeys()).union(set(kmers2.iterkeys()))
    return sum(abs(kmers1.get(key, 0) - kmers2.get(key, 0))
               for key in keys)

records = list(SeqIO.parse(fname, 'fasta'))
lookup = dict((rec, Counter(get_kmers(str(rec.seq), K)))
              for rec in records)

# Example:
rec1, rec2 = records[:2]
a_distance = kmer_distance(lookup[rec1], lookup[rec2])

If some of the sequences have repetitive regions — think of paralogs from closely related species, or within a species — this approach will over-emphasize the content and size/number of those repeats at the expense of less common but more meaningful k-mers. We could down-weight abundant k-mers by taking the square root of each k-mer count before computing pairwise differences.

from math import sqrt

def kmer_dist(kmers1, kmers2):
    keys = set(kmers1.iterkeys()).union(set(kmers2.iterkeys()))
    return sum(abs(sqrt(kmers1.get(key, 0)) - sqrt(kmers2.get(key, 0)))
               for key in keys)

But why stop at the square root? Reducto ad awesome, let's drop the k-mer counts and simply track whether the k-mer is present in a sequence or not.

def kmer_dist(kmers1, kmers2):
    """Proportion of non-overlapping elements between two sets."""
    return (len(kmers1.symmetric_difference(kmers2))
            / float(len(s1.intersection(s2)))

lookup = dict((rec, set(get_kmers(str(rec.seq), K)))
              for rec in records)

This also opens up some bit-twiddling opportunities which I have not pursued. MUSCLE, MAFFT and ClustalW all have options to use this trick to quickly estimate pairwise distances for large sequence sets.

Non-phylogenetic guide tree

The guide tree determines how the pairs of sequences will be combined. The tree is typically inferred by neighbor-joining based on the computed pairwise distances, an approach that's O(N³) using the standard algorithm (though rumor has it that quadratic time is possible). Pairs of sequences are aligned according to the tips of the tree, then sub-alignments are treated as internal nodes in the tree aligned to each other. (PRANK also uses the guide tree to infer indel states for characters, but that falls into the category of "good" algorithms which we're not interested in here.)

Let's set aside the evolutionary meaning of the alignment for the moment (gasp) and look for a fast way to determine a reasonable order for performing pairwise alignments of individual sequences and sub-alignments. (MUSCLE and MAFFT choose UPGMA over more "robust" NJ methods, despite or perhaps even because of long-branch attraction, with the same rationale.) Treat the collection of pairwise distances between sequences as a graph, and use Kruskal's algorithm to compute the minimum spanning tree in linear time.

import networkx
G = networkx.Graph()

for i, rec1 in enumerate(records):
    for rec2 in records[i+1:len(records)]:
        dist = kmer_distance(lookup[rec1], lookup[rec2])
        G.add_edge(rec1.id, rec2.id, weight=dist)

H = networkx.minimum_spanning_tree(G)

# Check: Are the sequences grouped together reasonably?
import pylab
nx.draw_graphviz(H, prog='twopi', node_size=0)
pylab.show()

The edges of the resulting graph give the order of alignment; start with nodes of degree 1 as individual sequences to align with their only neighbor, then treat nodes with degree >1 as sub-alignments to align to each other, prioritizing shorter branches.

Application: In the tmalign module of Fammer, I combine pairwise alignments of 3D protein structures, obtained with TMalign in a subprocess, using a minimum spanning tree where the edge weights are the reciprocal of the TM-scores

Journal article: Biopython's Bio.Phylo

2012-11-08T22:19:00.004-05:00

We're not known for the timeliest reporting here at etalog, but FYI, our article on the Bio.Phylo module for phylogenetics in Biopython is now up in its final form on the BMC Bioinformatics:

Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython

It's a quick read. Here's one of the figures we cut from the manuscript which shows some of the key features of the module:

This manuscript gave us a chance to write about a few things we haven't had a good reason to write about elsewhere -- the design rationale, nice figures, performance, and a couple of real-world use cases.

To get started, though, I recommend the main documentation:

Biopython Tutorial (Phylo is currently chapter 12)
Phylo wiki page
Wiki cookbook

History / a bit of navel-gazing

This project stemmed from a Google Summer of Code project in 2009 to implement a phyloXML parser for Biopython, mentored by Christian Zmasek and Brad Chapman. This was administered by the National Evolutionary Synthesis Center (NESCent); the Open Bioinformatics Foundation (OBF) didn't start administering its own GSoC projects until the following year. It was a fun summer, and I got to become more involved in Biopython as a result.

After GSoC ended, we decided that rather than plug the phyloXML module into Biopython as-is, we could do something akin to SeqIO and AlignIO -- wrap the format-specific parsers (NEXUS and Newick were already supported under Bio.Nexus) under a common interface, and share the core objects. At first we planned to create "TreeIO" and "Tree" modules like BioPerl, but as this could lead to confusion with other types of trees in bioinformatics (e.g. from clustering or other low-level algorithms), we settled on "Phylo", with due credit to Rutger Vos's Perl package Bio::Phylo.

The next summer I gave a talk at BOSC 2010 in Boston about this work. My professor was a bit weary of this open-source stuff by that point, so the travel award really helped. (And Airbnb. Boston is not cheap.) The rest is pretty well covered in the BOSC 2012 talk -- the hack continued, deftly shepherded by Peter Cock; Brandon Invergo arrived bearing gifts of pypaml, and we managed to get the module into a fairly stable state in between bouts of "real research", enough to write it up.

Biopython project update at BOSC 2012

2012-10-27T14:42:00.001-04:00

Not so hot off the presses, here are the slides from the talk I gave this summer at the Bioinformatics Open Source Conference (BOSC), a satellite conference of Intelligent Systems for Molecular Biology (ISMB). Since Peter Cock wasn't able to make it out to California this year, he suggested I fill in.

In addition to the usual coverage of new features, a big theme this year was the recurring successes we've had bring in new core developers via Google Summer of Code.

Biopython Project Update (BOSC 2012) from Eric Talevich

Jan Aerts has also posted the rest of the BOSC 2012 slides.

Code Harvest: The Refactoring

2012-08-15T17:39:00.001-04:00

I've been hacking on bioinformatics code for four years now, but until now the only work I've really made available to "the community" is in Biopython, mainly Bio.Phylo.

The code I write in the lab is under one big Mercurial repository called esgb; there's a shell script to install everything, including a bunch of scripts, sub-projects and a sprawling Python library called esbglib. Most of my Python programs depend on some functionality in esbglib, and usually Biopython and sometimes SciPy as well.

Having signed the Science Code Manifesto, duty calls for me to bundle some of the programs I've written with the next paper I'm working on, and so I've begun a mighty refactoring of esbglib to extract the general-purpose, reusable components into Python packages. At the moment it looks like I'll end up with two: biofrills and biocma.

BioFrills

This package contains general-purpose sequence analysis functions that I can't merge into Biopython, for one reason or another:

Research-grade approaches (e.g. heuristic solutions for problems with no efficient exact solution, like estimating the effective number of independent sequences in an alignment)
Cython versions of some functions for speed-up (currently just one, summing BLOSUM62 scores for a pairwise alignment)
Python 2.6 or 2.7 only; includes recent PyPy

I gave the library a silly name to keep it from being somehow mistaken for a Biopython replacement, or even a well-maintained, well-thought-out supplementary library. But in order to make the packages that depend on it easy_installable, I guess I'll need to upload it to PyPI at some point.

One function that I had in esbglib has already been independently invented and merged into Biopython as Bio.File.as_handle (Mine was esbglib.sugar.maybe_open).

Check out BioFrills on GitHub.

BioCMA

For working with MAPGAPS, CHAIN and mcBPPS alignments. These programs use an undocumented, bespoke multiple alignment format called CMA, which I assume means either "Consensus Multiple Alignment" or "CHAIN Multiple Alignment". There are a few tools by the same author for processing this format, but again, we have neither documentation nor source code for them. For years I've felt uneasy about relying on these quirky binaries to manage our most important sequence data... so, I did something about it. Ta-da.

Check out BioCMA on GitHub.

The well-organized data science project

2012-08-01T12:47:00.000-04:00

Someone recently asked me about the basic setup a computational scientist needs to conduct research efficiently. I'm pretty satisfied with my current arrangement, which was inspired by this: "A Quick Guide to Organizing Computational Biology Projects"

My work is organized into individual "projects" which are each supposed to become papers at some point. I keep each project in Dropbox to ensure everything is synced and backed up remotely all the time -- no file left behind. I also use Mendeley, with a folder for each project's references. Mendeley can generate a project-specific BibTex file from a folder.

A well-organized project might look like this:

projects/$project_name/

README -- notes on current progress, observations, to-do items, vague ideas for the future -- a starting point for "what was I doing here?" A very sloppy lab notebook by itself, but combined with the Mercurial revision history, it captures the essentials.
data/

Files sent from collaborators, downloaded and processed data sets. Big databases are kept on the lab's file server, not here.

src/

The main code base, if there's a core software component to the project. Throwaway scripts will also be littered throughout the work/ directory tree.

work/

For each new "idea", create a new directory under here, then hack away, freely generating intermediate files. Copying and modifying files is common; only a small portion of this is eventually shared or used in the manuscript.
If things look good and the procedure is worth remembering (i.e. redoing with updated data later), organize the idea-related scripts and Bash history into a Makefile.

results/

Interesting outputs, preliminary figures and tables that have been or will be sent to collaborators or used in a presentation (e.g. lab weekly update meetings)

manuscript/

Top level: the manuscript in progress -- in my case usually LaTeX and BibTex files and a Makefile; also potentially .doc files sent by collaborators, etc. Eventually, also a cover letter for the first submission.
figures/

The "final" images that will be used in the manuscript. These may be stitched together from several components, e.g. several plots or diagrams generated from programs.

response/

Reviewer comments and the response(s) to reviewers being drafted

proofs/

PDF copies of the manuscript exactly as it was submitted to the journal, at each stage of the publication process.

I also start a Mercurial repo at the base of each project for local snapshotting and logging. For the manuscript, it turns out that diffs of LaTeX documents are often not that helpful after an editing frenzy because of automatic line wrapping, but it's better than nothing. (Git's index is too much of a hassle for this kind of work -- the benefit of a "clean" patch series is outweighed by the distraction of explicitly adding files before each commit. If I didn't already have Dropbox to sync everything, Bzr's bound repos would be a decent alternative.)

Zen

Academic research is not like engineering, where a team sets their sights on a certain goal (perhaps refined over time), divides up the component tasks, and marches onward. We don't know that a project will "work" (in science: improve on previous knowledge or methods) even if we execute the idea the "right" way.

It's more like investigative journalism, where you latch onto a potential story, follow every promising lead, and seek out multiple lines of evidence to support your claims.

Because of this, I've adopted a development "approach" that could be called Results-Driven Development, or maybe Failure-Driven Development. The idea is that I need fast results to prove the value in pursuing a line of inquiry, i.e. whatever is going on in a subdirectory of work/. It's only worthwhile to invest in proper engineering practices once it looks like there's a "lead". This is different from typical software engineering projects because much of the time in research the first steps in a new direction quickly lead to a rethinking of the problem, and it's necessary to drop the current work and take a different approach immediately, whereas in software engineering you typically already know what you're trying to achieve from the beginning. In startups, "fail fast" means weeks or months; in research it can mean minutes or hours.

The takeaway is that technical debt is OK during the first few hours of pursuing an idea. If it turns out I've hacked up something reusable, I copy the important bits into my main utility library or script collection and manage them in a reasonable way from then on. But normally the result is a slightly gnarly Bash history and a small set of one-off Bash and Python scripts that are completely specific to that inquiry, and should just stay there.

Also, note that my style of research might be completely different from yours. My lab is fairly small, and each project starts off solo or with a few data sets from a remote collaborator. If you're providing bioinformatics services for a group driven by high-throughput experiments, your infrastructure needs are much different from mine. (But I still recommend Mendeley and version control.)

Building an analysis: How to avoid repeating intermediate tasks in a computational pipeline

2012-01-16T17:26:00.000-05:00

In my projects, I tend to start with a simple analysis of a limited dataset, then incrementally expand on it with more data and deeper analyses. This means each time I update the data (e.g. add another species' protein sequences) or add another step to the analysis pipeline, everything must be re-run -- but only a small part of the pipeline actually needs to be re-run.

This is a common problem in bioinformatics:
http://biostar.stackexchange.com/questions/79/how-to-organize-a-pipeline-of-small-scripts-together

How can we automate a pipeline like this, without running it all from scratch each time? This is the same problem faced when compiling large programs, and that particular case has been solved fairly well by build tools.

Make

The make utility is normally used to determine which portions of a program need to be recompiled, and the dependencies between them; it includes special features for managing C code, but actions are specified in terms of shell commands. You can just as well use it to specify dependencies between arbitrary tasks.

Giovanni Dall'Olio has posted some helpful instructional material: http://bioinfoblog.it/?p=29

(One of his links is broken -- here's Software Carpentry's current course on make.)

Rake, a Really Awesome Make

If your pipeline is already organized as a pipeline of stand-alone scripts, Rake is a more-or-less ideal solution: http://rake.rubyforge.org/

At some point in your project, you'll likely need to do some string processing; this is where make falls down. Ruby happens to be great for this task. Don't worry if you don't know Ruby -- the basic string methods in Ruby, Python and Perl are very similar, and regular expressions work roughly the same way. You can look up everything you need to know on the fly. (I did.)

Also, Rake's the mini-language for specifying tasks is both concise and intuitive. There's a built-in facility for maintaining short descriptions of each step, which is enticing from a "lab notebook" perspective.

How about a Python solution?

Inspired by Rake, I wrote a small module called tasks.py. Here's the code, plus a simple script to demonstrate:
https://gist.github.com/1623117

This isn't just for the sake of Python evangelism; I have a bunch of special-purpose modules written in Python (for my lab) that I would like to use in an elaborate pipeline of my own. It's silly to put each of these steps in a separate, throwaway Python script that then gets called in a separate process by Rake. Instead, I can import tasks.py into a Python script and include the dependency-scheduling features in my own programs.

Key differences from other build systems (e.g. SCons, waf, ruffus):

This module is not meant to be run from the command-line -- only for specific parts of your own code
The implementation of each processing step is separate from the dependency specification (although single-command tasks can still be defined inline with a lambda expression). Separate the algorithm from the data, I always say.
Cleanup is specified within the task, not in a separate area of the Rakefile/Makefile. This makes more sense for a project with heterogenous processing steps & intermediate files.

To be clear, tasks.py is not better than Rake or make for arranging a set of scripts that you've already written. I didn't even implement concurrent jobs (because most of the CPU-intensive steps I use are calls to programs that are already multicore-aware; though try adding that feature yourself if you'd like).

Journal article: Comparative kinomics of the malaria pathogen and its relatives

2011-11-02T14:42:00.000-04:00

Hot off the presses!
Structural and evolutionary divergence of eukaryotic protein kinases in Apicomplexa

It's a thorough paper, so I'll cover the highlights here.

Why we study apicomplexans

Apicomplexans are a group of related single-celled organisms which are exclusively parasitic. The best-known member is Plasmodium falciparum, which causes the most virulent form of malaria. Another well-studied species is Toxoplasma gondii, which primarily lives in cats but can infect most mammals.

It's a hugely diverse group. But overall, we know very little about them.

Our main motivation for studying apicomplexan proteins is to find what features make them distinct from human proteins, so we can then design drugs to target those features specifically -- the drug will identify and disable the parasite protein without the risk of affecting the host proteins, too. We study protein kinases, in particular, because a number of drugs have already been designed to inhibit kinases in cancer. The same or similar compounds could be used to treat parasitic diseases, potentially.

From an evolutionary biologist's perspective, apicomplexans are also interesting to study because they belong to an evolutionary branch that is quite divergent from the animals, plants and fungi more familiar to us. By learning about apicomplexan biology, and comparing to other model organisms, we can learn more about eukaryotic diversity and the origin of eukaryotes.

Another perspective on the tree of life

Many people, including scientists, think of evolution as a ladder, with single-celled organisms at the bottom and humans at the top. Different lineages, like green plants and fungi, each branch off the ladder at some intermediate point, but evolution is nonetheless mistakenly thought of as a directed progression from bacteria to protists to fish to humans.

That's wrong. It leads to mistakes, such as considering all protists (single-celled eukaryotes) to be closely related to each other. But even within Apicomplexa, the evolutionary distance between Plasmodium falciparum and Toxoplasma gondii is as great as the distance between humans and mosquitoes.

I'm particularly proud of Figure 1 in the paper, which includes a species tree that inverts the traditional view: The closest human relative, yeast, is at the bottom, and layers of increasingly strange and unfamiliar protists build up to the Plasmodium genus.

Interesting features of proteins and genomes

When apicomplexan parasites invade a host, they secrete a mixture of dozens of different proteins into a protective vacuole formed from the host cell membrane. We'd expect that some of these proteins are essential for invasion and virulence, and therefore good targets for inhibition or diagnosis.

Two apicomplexan-specific protein kinase families are known to be exported. The FIKK family appears in 1 copy in most apicomplexans, but is amplified to 21 copies in P. falciparum and 6 copies in P. reichenowi, and does not appear in any species outside the Apicomplexa. Another family, called rhoptry kinases (ROPK) after the apicomplexan organelle they're localized to, appears in dozens of copies in coccidians (T. gondii, Neospora caninum, Eimeria tenella, Sarcocystis neurona), but not in any other lineage of Apicomplexa. Plasmodium and others still contain rhoptries, but there are no kinases in the protein cocktail those rhoptries contain.

As obligate parasites, apicomplexans evolve under different evolutionary contraints than free-living organisms like yeast and humans. Many genes are no longer necessary, and some may even be a liability if they interact with the host's own biochemical pathways. Because of this, we see widespread gene loss and overall compaction of apicomplexan genomes.

One especially curious case is the loss of upstream regulators of the MAPK cascade -- a signaling pathway found in almost all eukaryotes, consisting of 3 or 4 protein kinases each activating the next in a sort of biochemical relay. Apicomplexans contain 2 to 3 copies of the downstream protein kinase, MAPK, but the rest of the pathway components (STE7, STE11, STE20) are generally lost, and none of the surveyed apicomplexans had a complete MAPK cascade. So there's an open question: What other proteins take the place of the STEs in this important pathway, or have MAPK-like features? Is there an Achilles heel to be discovered?

The project

We:

Identified and classified the full set of protein kinases in each of the 17 apicomplexan proteomes available
Devised a pipeline to identify apicomplexan-specific ortholog groups in known protein kinase families
Compared these ortholog groups to the typical members of the kinase family to find specific sequence motifs that distinguish the divergent ortholog group
Mapped these motifs onto protein structures; reviewed the literature to understand possible functions and functional differences related to these motifs

Read about what we found here.

Journal article: Our insights into the structure and activation mechanism of ErbB/EGFR protein kinases

2011-10-28T10:32:00.000-04:00

Here's an article my lab published in PLoS One:
Co-Conserved Features Associated with cis Regulation of ErbB Tyrosine Kinases

I'll give a quick summary of it here. (Don't worry, this isn't a new direction for this blog.)

This is a study of the structural mechanisms of a certain protein family, called ErbB or EGFR (epidermal growth factor receptor), which is frequently involved in cancer. This family belongs to a protein superfamily called protein kinases.

Biochemistry background

Kinases are enzymes which perform a type of post-translational modification, phosphorylation: The kinase transfers a phosphate group from adenosine triphosphate (ATP) to another substrate molecule, leaving adenosine diphosphate (ADP) and the phosphorylated substrate.

Protein kinases are kinases that act on protein substrates, i.e. the phosphorylated molecule is another protein. The substrate could even be another protein kinase, so activation of the first protein kinase causes it to phosphorylate and activate another protein kinase, and so on. This is a type of signal transduction.

Signal transduction is how the cell senses and reacts to its environment, and also its own internal conditions. In the case of ErbB and other receptor tyrosine kinases, the signal starts at the surface of the cell (e.g. epidermal growth factor binds to the extracellular portion of EGFR) and activates the kinase, which then begins sending these phosphorylation signals. These signals are then relayed throughout the cell to trigger other activities, such as cell division or the transcription of certain genes.

What happens if a protein kinase gets "locked" into the active state, somehow? In the case of EGFR, it's as if the cell thinks it's constantly receiving the growth factor. If this signal isn't blocked by another "gatekeeper" in the cell, then the cell will grow uncontrollably -- and become cancer.

How the enzyme works

Protein kinases (PKs) all consist of two large lobes connected by a flexible hinge. Between the lobes is a binding pocket for ATP; this molecule binds inside the smaller lobe (N-terminal lobe, or N-lobe). The larger lobe (C-terminal lobe or C-lobe) provides a binding site for another protein, which will be the kinase's substrate.

The general mechanism of all protein kinases goes like this:

The kinase is initially in an inactive state, with the hinge "open" and the two lobes a bit further apart. Since ATP binds in the N-lobe and the substrate binds to the C-lobe, no phosphate is transferred when the two lobes are apart like this.
By some mechanism (it varies between different kinase families), the two lobes are brought closer together, and the kinase becomes active.
ATP binds to the ATP binding pocket, a substrate binds to the C-lobe, some amino acids shift, and a phosphate group is detached from ATP and reattached to a specific amino acid on the substrate.
The ADP and phosphorylated substrate are released.

Step 2 is the part we're interested in. How do some recurring, cancer-associated mutations cause EGFR to become "locked" in the active conformation? And, can we reverse it?

How we think ErbB kinases work

In the ErbB family, it's not just the two lobes of the kinase domain that are involved in activating the enzyme -- the adjacent sections of the protein, outside the kinase domain, are also involved.

The long C-terminal tail wraps back around the entire kinase domain and associates with the N-lobe, tethered in place by a few residues in the N-lobe and the other N-terminal flanking region (the juxtamembrane segment, between the kinase domain and the cell membrane). The C-tail is placed so that it can influence the movement and relative positioning of the N- and C-lobes, and therefore regulate the activation of the kinase.

We also examined the locations of two EGFR mutations (S768I and L861Q) that have been previously identified as occurring frequently in cancers, mapping them onto the structure. These mutations appear in locations that would disrupt the switching mechanism we proposed -- breaking necessary interactions, or forming new interactions that shouldn't be there for proper EGFR function.

If you'd like to know more, read about it here.

The statistics of the "Like" button

2011-07-07T15:28:00.002-04:00

The launch of Google+ reminded me of a question I've had about Facebook and YouTube for a while: What happens when you click the "Like" button?

Facebook isn't so much about sharing content as sharing content. But YouTube and many other sites like it recommend content based on users' response to the content itself, rather than the shape of the surrounding social network. If you're building an application like this yourself, this article is for you.

Think of a collection of user-generated pages with a "Like" or "+1" button on each. Users can browse pages at will, or arrive at them randomly from an external site, and after viewing a page will either click the "Like" button or do nothing. I'll refer to the number of times viewers do nothing on a page as "Don't care". I'll also assume you have a site-wide "Top Pages" chart that users can view to see the highest-scoring pages and jump to them.

1. Counting "Likes"

The very simplest way to score a page is to count the number of times it's been "Liked":

score = page_likes

The main problem with this method is inertia. Old "champion" pages accumulate votes over time, and dominate the rankings. New pages don't have a chance to unseat the champs, even temporarily, to gain visibility for themselves. The site appears stagnant.

2. "Likes" versus views

To make good recommendations, you want to measure the quality of a page — the chance that a user will like the recommended page themselves:

score = page_likes / page_views

In the long run, this is correct — you estimate the probability based on the frequency of "Likes" in previous views. Old "champion" pages will be unseated in the rankings if a newer page earns a better proportion of likes.

But for new or little-viewed pages, there's an issue of sample size.

A page where the first view is "Liked" (probably by the creator/uploader) scores 100% and shoots to the top of the rankings. If a few friends all immediately "Like" the page, it becomes difficult to unseat. There's a lot of noise at the top of the rankings.

A page where the first view is not "Liked" scores 0% and sinks to the bottom of the rankings. If you have some mechanism for purging bad content from your site (i.e. deleting low-scoring pages that are likely spam, trolls or just lame), then this makes that task more difficult.

Intuitively, a page with 80 likes out of 100 views is more likely to be good than a page with 4 likes out of 5 views. A page with zero likes out of 100 views is almost certainly junk, but 5 views without any likes may not mean much at all.

So, your next goal: Make the best possible estimate of a page's likeability based on the first few views and likes, using some prior knowledge. After that, all reasonable methods should converge on the same score (probability of liking). If a meme catches, people will be able to find that page through other means, and your own rankings will be less crucial to its success.

3. Pseudocounts

A pseudocount is a prior estimate of the probability of an event. To make an estimate of actual probabilities based on a small number of samples (the problem at the end of Step 2), add the pseudocounts to the actual counts of each event.

I'll demonstrate.

The events here are (a) "Like" and (b) "Don't care". I'm going to use b to represent the pseudocount for "Like". For this section, I choose the probabilities:

like = b = .1
dontcare = (1 - b) = .9

The two probabilities should sum to 1.

How do you get these values? Since b represents the probability that a random user will like an arbitrary page, taking the site-wide average of likes versus views is a good choice:

b = all_likes / all_views

To use the pseudocounts, add them to the counts in the formula in step 2:

score = (page_likes + b) / (page_views + 1)

(Recall: views = likes + dontcares; after adding pseudocounts, (likes + b) + (dontcares + 1 - b) = likes + dontcares + 1 = views + 1.)

If the database-wide sums of likes and views are large numbers, this won't significantly affect the "average" score, all_likes / all_views. But it smoothes out the initial scoring for new pages.

Example: Assume 10% of all views result in a "Like" (b = 0.1).

A single view without a "Like" places the page slightly below the global average, but not too much. (Odds are, 90% of pages will start out this way.) Additional views without a "Like" slowly sink the page score toward 0.

Likes:Views	Calculation	Percentage
0:1	0.1 / 2	5.0%
0:2	0.1 / 3	3.3%
0:3	0.1 / 4	2.5%

A single view with a "Like" gives the page a boost, but not to 100%. This can help it gain traction, but probably won't put it in the top rankings (yet). If subsequent views are also liked, the score continues to rise:

Likes:Views	Calculation	Percentage
1:1	1.1 / 2	55.0%
2:2	2.1 / 3	70.0%
3:3	3.1 / 4	77.5%
4:4	4.1 / 5	82.0%

Away from the extremes (0% or 100% liked), the effect of pseudocounts is less dramatic, and a mix of "Like" and "Don't Care" (viewed without liking) results in a score closer to what you'd see without pseudocounts — just shifted slightly toward the site-wide average. Notice that a page with two "Likes" out of three views (2:3) is scored almost as well as one "Like" and one view (1:1 above).

Likes:Views	Calculation	Percentage
1:2	1.1 / 3	36.7%
1:3	1.1 / 4	27.5%
2:3	2.1 / 4	52.5%
2:4	2.1 / 5	42.0%

To increase the effect of pseudocounts, you can put a higher weight on the prior by multiplying the pseudocounts by some constant. If the weighting factor is w, then the calculation is:

score = (page_likes + (b * w)) / (page_views + w)

Think of this as the number of "imaginary" users you have rating each page before any real users see it. The calculations above use a weight of 1, equivalent to one user giving a fractional score of .1 to every page before it goes live, and you can see the effect of it. Play with it a bit to see how it affects your rankings.

Update: Evan Miller has a great writeup on a Bayesian approach to modelling this problem, if you'd like to go much further down the rabbit hole.

4. Statistical significance

You now have a score for each page in your database and a top-to-bottom ranking. Where do you draw the line for "recommending" a page?

Quantiles: Best and worst

Having sorted all the pages by score, take the top 5% as the "best" and bottom 5% as the "worst". Or choose a fixed number, like 25. It's really up to you.

The "best" ranking is for users, especially new visitors. Depending on your application, some users might also be interested in the "worst" pages — how else would we find gems like "Friday"?

Contingency: Is the score meaningful?

Another challenge is to determine when a page's score is statistically meaningful — i.e. the difference between a score of 55% based on 1000 views versus a single view. Using pseudocounts addresses this to some extent at the extremes, but it's still possible for pages with low view counts to score highly. You may also want to purge "junk" content with horribly low rankings — but only once it's been given a fair chance.

With the like and dontcare counts, site-wide and per-page, set up a 2x2 contingency table:

	like	dontcare
Page	A	B
Global	C	D

To evaluate the significance, use a Chi-square test with one degree of freedom (df=1), or if you're picky, Fisher's exact test.

The chi-square test, in R:

> abcd = matrix(c(4, 10, 1000, 10000), nrow=2, byrow=T)
> chisq.test(abcd)

        Pearson's Chi-squared test with Yates' continuity correction

data:  abcd
X-squared = 4.2691, df = 1, p-value = 0.03881

With the common p-value cutoff ("alpha") of 0.05, we'd say this page with 4 likes out of 10 views is significant — for that cutoff, at least. And if we applied the same test across all pages in the database, we'd be wrong.

I'll try to be quick about this, because it matters.

Remember: A p-value of 0.05 means the given like/view ratio will occur by chance 1 in 20 times. Since the same test is being applied to every page in your database, you need to account for multiple hypothesis testing, or else many pages will meet the cutoff by chance.

If you only have a few pages — say, less than 40 — then you can divide alpha by the number of pages (e.g. 20) and use that in place of the original cutoff (0.05 / 40 = 0.00125, so the previous p-value of 0.03881 would not be significant).

More likely, you have many more pages than that — hence the need to use grown-up statistics in the first place. Bonferroni correction (described above) would produce a cutoff that's much too stringent, so you'll need a more powerful method.

R makes this easy. Starting with a single array of the p-values from Chi-square tests of each page:

> pvals = sapply(chisq.test, contingencytables)

Adjust these raw p-values for multiple testing (using the familywise error rate, by default — read the help page for p.adjust for all the details):

> pvals.adj = p.adjust(pvals)

What do you after this depends on your own code. You can get a boolean array signifying which adjusted p-values are now smaller than alpha, which is useful for selecting "significant" pages from the original page list:

> significant = pvals.adj < 0.05

Note that this selects both significantly liked and significantly disliked pages at the same time. To distinguish between the two, just compare each page's like/view ratio to the global average and select higher or lower.

Another note about the contingency table: Once your application has counted a very large number of site-wide likes and views (cells C and D), this test will register significance for almost any page. You might have better results by replacing the global view and "Like" counts with a per-month or per-user average. And, you can cache these values and update them only occasionally.

Calculating p-values is a lot more work than selecting the top and bottom quantiles. If you've put in the extra effort, here's another feature you can support: a list of newly significant winners and losers.

Each day (or hour or so), perform the chi-square test (described above) across all pages and note which ones cross the significance threshold. Compare this to the previous run's results to see which pages have crossed over, and add these newly significant hits to a separate chart — "Trending", I'll call it.

This chart shows the pages that have just recently been determined to be likeable, but (probably) haven't accumulated enough votes to reach the "Top Pages" chart. It's a timelier list than "Top Pages", though the average quality of the "Trending" pages is not as high. This is the place where memes show up first. If they're truly good content, they'll eventually make it onto "Top Pages" — but that's not usually the case with memes.

I'd treat the "Trending" chart as a queue, adding newly trending pages to the top at the end of each run and dropping pages from the bottom as space permits. Or just keep it rolling by week, like a blog. By adjusting alpha you can tune the number of newly significant pages found in each run, and therefore the turnover rate of your "Trending" queue.

5. But is it general?

Under the Google Whatever model (YouTube, Picasa, etc.), the ratio of Likes to total views for any given page is small. The statistics here will work in other cases, though — for example, an "Approve" button which clicked most of the time, or a "Dislike" button in place of the "Like" button. In the case where users have to click either "Like" or "Dislike" (Yes/No, Yay/Nay, or any other two options), this is also fine; just pick one option to count, and count "views" as the sum of likes and dislikes.

Update: The "Like" event can also be something implicit, like downloads or completed interactions (out of started interactions). This opens up options for sites that don't require users to register.

What about sites with 5-star ratings, like Amazon? Well, there's an easy way and a hard way. Easy: count the ratings as fractional Likes ([0, .25, .5, .75, 1] if you allow 0 stars, [0, .33, .67, 1] if you don't), and use the pseudocounts just like before. The hard way is to treat each star ranking as a separate event category — but that's going to have to wait for a later post.

Bio.Phylo: A unified phylogenetics toolkit for Biopython

2010-10-05T11:23:00.000-04:00

I presented this at the Bioinformatics Open Source Conference (BOSC 2010) in early July, but somehow forgot to post it here too. It's an overview of my somewhat new sub-package for working with phylogenetic trees in Biopython, based on my Google Summer of Code 2009 project (a phyloXML parser in Biopython).

In a nutshell, Bio.Phylo is a library for manipulating finished phylogenetic trees and integrating them into a Biopython-based workflow. It can handle the standard file formats — Newick, Nexus and phyloXML, with the current exception of NeXML — and has particularly good support for phyloXML.

This presentation walks through an example of loading a Newick tree, viewing it a few different ways, adding branch colors, and saving it as a phyloXML file.

Bio.Phylo: Phylogenetics in Biopython (BOSC 2010)

View more presentations from Eric Talevich.

The conference abstract is here. I also recommend the main documentation in the Biopython Tutorial (see chapter 12) and the wiki page.

Google Summer of Code 2010: The final draft

2010-04-08T17:12:00.003-04:00

The Google Summer of Code 2010 application period is in its final 24 hours.

I volunteered to mentor with two organizations this year, OBF and NESCent. Last month I posted a couple of ideas with each org:

The applications that have come in have been pretty good; the only thing I can complain about is that nobody has followed through with my MIAPA project -- we got a nibble from one student, but nothing after that.

Since we're doing the last round of application reviews now before the deadline, here's some general guidance on what mentors are looking for in a student application.

First, a couple of outside references:

The Zen of GSoC

Google Summer of Code is a program to recruit and foster new long-term open-source contributors.

Broadly, the mentoring organizations are asking three questions:

Are you motivated enough about this work to continue contributing after the summer?
Can you write useful code on your own?
Do you interact well with the community, so that we can work with you to merge your work cleanly into the trunk and rely on you to maintain the codebase?

You can get a sense of what Google and the mentoring orgs are looking for from the applications the orgs themselves submit to Google. For example: NESCent's 2010 app.

Here are some specific tips for demonstrating that you have some committer in you.

Put your previous work online

It's remarkable how many ostensible programmers just can't write decent code. They'll have a list of successful past projects they worked on, maybe a legitimate degree in computer science, but their code itself was clearly never fully understood by anyone, original programmer included. (Remember, programming languages exist for humans to understand -- the computer itself runs on machine code.) The only way we can be sure you can write code we can use is if we can look at something you've written previously.

Biopython uses GitHub for development, so putting a project of your own on GitHub demonstrates two useful things: you can write functioning code, and you're already up to speed with the build tools that Biopython uses.

If the most relevant code you've written is tied up in some way -- say, it's part of a research project still being prepared for publication -- see if you can use at least a few snippets of it. So far, it seems most professors have been willing to allow that.

Subscribe to your mentoring organization's mailing list

I know, e-mail mailing lists seem at least a decade behind the times. But open-source projects like to have a permanent public record of the discussions that happen, and everyone has an e-mail account. We also have IRC channels and Twitter tags (#phylosoc and #obfsoc), but project proposals are generally more than 140 characters so it's best to use e-mail at some point.

Plus, you'll be able to read all the advice the other students are getting -- mentors get fatigued as the application season wears on, and once we've written the same thing a few times we start skipping details.

Write a weekly project schedule

The GSoC application has fields for pointing to external info. Create a Google document or spreadsheet (or README.md on GitHub if you're fancy) detailing your project plan week-by-week.

Suggested fields:

Date, or week number for referencing later
GSoC events and guidelines (see the official timeline)
Deliverables for the week — what's produced, e.g. documentation sections, unit tests, classes, modules
Approach for each of these tasks, in a few words
Potential problems that could occur, specific to the tasks — perhaps a dependency turns out to be inadequate, or an integration step is required
Proposed mitigation for each of the foreseen issues

(If you want to estimate the number of hours or days each task will take, that's cool too.)

Here are the examples from previous GSoC projects that we've been sharing on the mailing lists:

Respect the deadlines

Submit a draft of your application to Google at least a day before the deadline, April 9. There are thousands of applicants each year, and Google has no reason to let the deadline slide — an important function of the application process itself is to screen out students who won't deliver by the stated deadlines. In effect, if your application isn't submitted to Google by noon PST on April 9, then you didn't apply.

BUT: If you submit something even partially complete, we can contact you later during the review stage and get the remaining information from you. And if you included a link to your weekly plan (as a separate online document), you can edit that after the deadline too.

Best of luck!

Python workshop #2: Biopython

2010-02-24T12:29:00.003-05:00

As promised, here are the slides from Monday's Biopython programming workshop:

Biopython programming workshop at UGA

View more presentations from Eric Talevich.

This was another 2-hour session, with a short snack break in the middle this time -- which was also a nice opportunity to ask everyone about the pacing, and see if who's been following along with the examples in IPython (versus staring at a BSOD or lolcats -- which I didn't notice any of).

This went well:

Pacing
Using IPython to inspect objects and display documentation -- this lets some people "read ahead" and perhaps answer their own minor questions, leading to other, better questions
The general introductory pattern of:
1. Demonstrate how to import a module and instantiate the basic class
2. Review, in English, the core features of the module and why they exist
3. Walk through a short script that uses real data to accomplish some simple but useful task(s)
4. Display the result, completing the mental pipeline of input -> transformation -> output

Room for improvement:

I didn't always execute the final draft of each example, so there were a couple typos -- inconvenient for those following along in Python. (I've fixed them in the slides here.)
Consequently, I didn't have an output file to show at the end of each example -- so I had to describe or draft one on the spot.
The PDB module was the coolest part of the workshop, and I rushed it a bit. I was afraid the visitors from Genetics and Plant Bio would be bored with it, but I don't think they were, and the Bioinformatics folks were left wanting more.

I'm planning to host both Python workshops again in the next academic year, either 1 per semester (as it was this year) or both each semester, maybe 2 weeks apart. The Biopython workshop in particular will be different next time because Bio.Phylo will finally be included with the main Biopython distribution -- evolution is cool, and more of the pretty is always a good thing to have in a programming workshop.

Python workshop #1, now on SlideShare

2010-02-19T15:52:00.004-05:00

Last November I hosted a workshop on basic Python programming at UGA. The attendees were mostly from the bioinformatics department, but this workshop didn't go into science at all -- just practical Python usage. Today I finally got around to cleaning up the slides and uploading them to SlideShare:

Python workshop #1 at UGA

View more presentations from etalevich.

It looks like LaTeX Beamer and SlideShare's PDF/Flash converter don't play well together. Meh, it's still easy enough to read.

I'm working on a Biopython-specific followup right now for a workshop on Monday, 2/22. I'll post that here when it's done, too, with reasonable haste.

Faster string concatenation in Python

2009-07-20T20:52:00.001-04:00

Nick Matzke pointed me to this discussion of string concatenation approaches in Python:

Efficient String Concatenation in Python

The issue here is whether adding strings together in a for loop is inefficient enough to be worth working around. Python strings are immutable, so this:

s = 'om'
for i in xrange(1000):
    s += 'nom'

means doing this 1000 times:

Translate the assignment to "s = s + 'nom'"

Allocate another string, 'nom'. (Or reuse a reference if it's already interred.)

Call the __add__ method on s, with 'nom' as the argument

Allocate the new string created by __add__

Assign a reference to that string back to s

So, using the + operator 1000 times in a loop has to create 1000 ever-larger string objects, but only the last one gets used outside the loop. There are good reasons Python works this way, but still, there's a trap here in an operation that gets used a lot in ordinary Python code.

There are a few ways to cope:

Use a mutable object to build the string as a sequence of bytes (or whatever) and then convert it back to a Python string in one shot at the end. Reasonable intermediate objects are array and StringIO (preferably cStringIO).

Let the string object's join method do the dirty work -- strings are a basic Python type that's been optimized already, so this method probably drops down to a lower level (C/bytecode in the CPython interpreter, not sure about the details) where full allocation of each intermediate string isn't necessary.

Build a single format string and interpolate with the % operator (or the format method, if you're fancy) to fill it in, under the same rationale as with the join method. This fits real-world scenarios better — filling in a template of a plain-text table or paragraph with computed values, either all at once with % or incrementally with string addition. It could be a performance bottleneck, and it's not obvious which approach would be better.

The original article gives a nice analysis and comes out in favor of intermediate cStringIO objects, with a list comprehension inside the string join method as a strong alternative. But it was written in 2004, and Python has changed since then. Also, it doesn't include interpolation among the tested methods, and that was the one I was the most curious about.

Methods

I downloaded and updated the script included with that article, and ran it with Python 2.6 and 2.5 to get some new results. (Source code here.)

First, a changelog:

The method numbers are different, and there are a couple more. Method #2 is for the % operator, in which I build a gigantic format string and a gigantic tuple out of the number list, then smash them together. It trades memory for CPU time, basically. Method #8 uses map instead of a list comprehension or generator expression; no lambda is required and the necessary function (str()) is already available, so this is a good candidate.

I used the standard lib's time.clock() to measure CPU time around just the relevant loop for each string concatenation method.

Fetching the process memory size is similar but uses the subprocess module and different options.

Docstrings are (ab)used to identify the output.

For example, the string addition method now looks like this:

def method1():
    """1. string addition"""
    start = clock()
    out_str = ''
    for num in NUMS:
        out_str += str(num)
    cpu = clock() - start
    return (out_str, cpu, memsize())

Results

Each method concatenates the string representation of the numbers 0 through 999,999. The methods were run sequentially in separate processes, via a for loop in the shell, for Python versions 2.5 and 2.6. The best of three runs for each method are shown below.
Python 2.6:

1. string addition   CPU (s): 1.99   Mem (K): 11.7
2. %-interpolation   CPU (s): 2.42   Mem (K): 23.0
3. array object      CPU (s): 3.42   Mem (K): 17.3
4. cStringIO object  CPU (s): 3.24   Mem (K): 19.7
5. join + for loop   CPU (s): 2.29   Mem (K): 48.0
6. join + list comp  CPU (s): 1.93   Mem (K): 11.6
7. join + gen expr   CPU (s): 2.08   Mem (K): 11.6
8. join + str map    CPU (s): 1.47   Mem (K): 11.6

The winner is map, with string addition, the list comprehension, and the generator expression also doing well. String addition in a loop did much better than would be expected from reading the original article; the Python developers have put effort into making this less of a trap. Specifically, there's a flag on string objects internally that indicates whether the string is the result of an addition operation. This helps the interpreter identify when a string is being concatenated in a loop, and optimize that case by performing in-place concatenation. Nice. So really, there's no need to worry about the quadratic time behavior that we expected — at least in Python 2.6.

The array object, a sequence of packed bytes, is supposed to be a low-level but high-performance workhorse. It was embedded in the minds of performance-conscious Python programmers by this essay by Guido van Rossum:

Python Patterns — An Optimization Anecdote

At a glance, that problem looks similar to this one. However, converting ints to chars is a problem that can be described well in bytes. Converting integers to their string representation is not — we're not even using any features of the array object related to byte representation. Going low-level doesn't help us here; as Guido indicates in his conclusion, if you keep it short and simple, Python will reward you. The StringIO object in method 5 performs similar duties, and the shape of both functions is the same; the only difference in performance seems to be that cStringIO trades some memory space for CPU time.

The string join method is recommended by the Python standard library documentation for string concatenation with well-behaved performance characteristics. Conveniently, str.join() accepts any iterable object, including lists and generator expressions. Method 5 is the dump approach: build a list in a for loop, pass it to join. Method 6 pushes the looping operation deeper into the interpreter via list comprehension; it saves some bytecode, variable and function lookups, and a substantial number of memory allocations.

Using a generator expression in method 7 instead of a list comprehension should have been equivalent or faster, by avoiding the up-front creation of a list object. But memory usage is the same, and the list comprehension runs faster by a small but consistent amount. Maybe join isn't able to take advantage of lazy evaluation, or is helped by knowing the size of the list object early on... I'm not sure. Interesting, though. In Python 3, the list comprehension is equivalent to building a list object from a generator expression, so results would probably be different there.

Finally, in method 8, map allows the interpreter to look up the str constructor just once, rather than for each item in the given sequence. This is the only approach that gives an impressive speedup over string addition in a loop. So how portable is this result?

Python 2.5:

1. string addition   CPU (s): 3.77   Mem (K): 10.8
2. %-interpolation   CPU (s): 2.43   Mem (K): 22.0
3. array object      CPU (s): 5.16   Mem (K): 16.4
4. cStringIO object  CPU (s): 4.93   Mem (K): 18.7
5. join + for loop   CPU (s): 3.98   Mem (K): 47.1
6. join + list comp  CPU (s): 3.30   Mem (K): 10.5
7. join + gen expr   CPU (s): 3.59   Mem (K): 10.5
8. join + str map    CPU (s): 2.72   Mem (K): 10.5

Python 2.6.2 has had the benefit of additional development time, and in notably the Unladen Swallow project's first quarter of interpreter optimizations, with impressive improvements across the board. By comparison, Python 2.5 uses generally less memory and more CPU time. String interpolation, however, seems to already have been optimized to the max in Python 2.5, and actually wins the performance shootout here! String addition, on the other hand, is slightly less adept at optimizing in a loop. It still avoids the quadratic-time issue (that enhancement was added in Python 2.4), and memory usage is quite respectable.

Conclusion

The recommendations at the end of Guido's essay are still exactly right. In general, Python performs best with code that "looks right", with abstractions that fit the problem and a minimum of branching and explicit looping.

Adding strings together in simple expressions will be optimized properly in recent Pythons, but could bite you in older ones

Using string interpolation or templates plays well enough with more complex formatting

Going too low-level can deprive you of Python's own optimizations

If built-in functions can do what you need, use them, and basic Haskell-style functional expressions can make your code very concise

There's more discussion on Stack Overflow.

Mnemosyne: Getting Things Memorized

2009-03-08T18:50:00.000-04:00

It had been bothering me since I joined this lab that I couldn't confidently just read a protein sequence and understand what it meant — naming the residues, picturing the side chain structures, and understanding the significance of replacing one residue with another. I expected that I'd just pick it up naturally from working with sequences and structures, and that did happen somewhat. But I wanted it to be as easy as reading English, and that level of completeness doesn't happen without some rote memorization.

That brought to mind a Wired article about Piotr Wozniak and his spaced-repetition memorization program, SuperMemo. When I originally read the article I wasn't in grad school and didn't have an urge to memorize any particular list of things. Anyway, SuperMemo appeared to be Windows-only software, and an algorithm like this would be more fun to code from scratch anyway. Enough fun, really, that there had to be one or two open-source implementations floating around.

Mnemosyne popped up as the closest match in an Ubuntu package search, so I'm running with that. Putting the flash cards together was pretty simple; I was able to do it in a few minutes from inside the program and export it in the standard XML format. I zipped it up with a quick plain-text README and uploaded it to the project home page as the Amino Acids card set.

The content came from a slide in a lecture, and I did a quick sanity check on Wikipedia before uploading. The notation for the 20 standard amino acids is complete, and that was the main goal of this. The assignment of amino acid "groups" seems to be a little arbitrary, depending on the source (by structure, functional groups, chemical properties, etc.), and I tried to make the categories complete without too much overlap -- there's a small deviation from my slide here. I also added another category for "side chain properties", pH and polarity. Another enhancement might be the standard codons for each amino acid, though I'm not sure I want to deal with that yet.

Carrots and sticks

2009-02-07T14:36:00.000-05:00

What's old is new again:

Professor makes his mark, but it costs him his job

In Zen and the Art of Motorcycle Maintenance, Robert Pirsig mentions his own experiment in withholding grades at a university. He didn’t just announce on the first day that everyone would get an A+ — that seems gimmicky. Instead, since it was a class on rhetoric, he spent the course developing an argument for eliminating the grades-and-degrees system and discussing it with his students.

Initially, most students were unenthusiastic or opposed — grades and degrees are what they came for. Nonetheless, Prof. Pirsig assigned, collected and graded papers, but returned them to students with only the comments, not the grade.

At first:

A-students felt annoyed by the uncertainty of the situation, but did the work anyway;
B-C students blew off some assignments; and
C-D students usually skipped class.

He observed this and changed nothing. If students acted up, he let it slide.

Around 3–4 weeks into it:

A-students got nervous and pushed themselves harder, in class and in papers;
B-C students saw what the A-students were doing and returned to the usual level of effort; and
C-D students who had made a routine of skipping class would occasionally show up out of curiosity.

And finally:

A-students relaxed and began enjoying the class as active participants. In a final essay, still not knowing what their grades were, these students favored eliminating grades by 2–1.
B-C students saw this, panicked, and began putting an unusual amount of effort into their work. Eventually, they joined the A students in engaging class discussions. Ultimately, these were evenly divided over the issue of eliminating grades.
C-D students — or those who attended — also saw this and began trying to hand in reasonable work. Those who couldn’t hack it freaked even more, and remained in a state of Kafkaesque terror until the quarter mercifully ended. Naturally, in the final essay these students were unanimously opposed to eliminating grades.

Interesting as this result was, Pirsig reverted to the regular grading system the next quarter because he couldn’t provide any alternate goal for students — those who can recognize quality in their own work don’t need the university; those who can’t need something to work toward, or they don’t progress.

Psychology research with Mechanical Turk

2008-10-25T12:29:00.000-04:00

Elevator pitch: There's a missing gear in the machine of psychology research. Every significant human study requires weeks or months of data collection, and more time coding that data in a form that can be analyzed statistically. This makes it infeasible to do the sort of fast, iterative refinement of models that biology has seen in recent years.

Amazon's Mechanical Turk provides the missing piece. It provides an accessible interface for building a survey, interactive test, or other psychological measure, pushing it out to thousands of participants, quickly returning the results to the researcher in electronic form, and screening out unusable data. It's flexible enough to allow screening and debriefing, and gives access to a vastly larger pool of participants than Experimetrix. And it's cheap.

Background

First, take a look at this: Ten Thousand Cents.

When I first heard about bioinformatics I was under the impression that it was the exponentially increasing power of computers that made it irresistible to start using them for biological research. But actually, it was pretty much the reverse -- high-throughput experimental methods like gene sequencing, mass spectrometry and X-ray crystallography generated too much data for humans to process manually. Computers were only barely able to handle this workload in 1986, when the human genome project started -- scientists just did what was needed to move around the mountains of data coming out of their experiments. Similarly, new computational research is coming out of the Large Hadron Collider project now.

Psychology researchers (especially in social psychology) currently spend semesters at a time gathering data for their studies and converting it into data that can be quantitatively analyzed. High-throughput experimental methods are scarce and expensive, so there's no "data glut" driving the development of better information-management methods. Progress in the field is slow and lossy -- since there's not much demand for the raw data, conclusions are described qualitatively, which makes it hard to use prior results as a solid foundation for future work.

With Mechanical Turk, it's possible to do in one shot a study that would otherwise require a meta-analysis of several studies across particular locations or demographics. With more consistent data and larger populations, data can be reusable.

How it could work

If it fits behind a web interface, or can be described and completed with plain language or pictures, it can be done with Mechanical Turk. Necessarily, a form of consent can precede the main task, and a blurb of debriefing can finish.

To get a feel for how it's done, read this article: The Faces of Mechanical Turk

Naturally, the first study done this way should be something to determine how the population of Turkers corresponds to the general population and the student populations that have already been characterized in previous studies. A public-domain measure of the Big Five or something like the Narcissistic Personality Inventory would be good candidates. Then, let slip the hounds of statistics. Are Turkers as representative of the general population as psychology undergrads? More so?

Some research along these lines has already been blogged here: Mechanical Turk Demographics

Now, let's try some examples.

Surveys: You craft a survey, Turkers take it, and you retrieve and filter the results through the Mechanical Turk interface. Pretty straightforward, no?

Interactive tasks: This is what Mechanical Turk is designed for; only, the focus was expected to be the task, not the Turker. Anyway, the data's yours. An example of a task like this would be a simple, unbounded game (Flash or JavaScript) that the participant can quit any time (possibly paired with another stimulus). The returned data would be the play duration alongside any personal or demographic information requested.

Coding visual or audio data: Following the original intent of Mechanical Turk more closely, this application of the service distributes a repetitive task normally performed over several weeks or months by the researcher or a group of grad students. Rather collect new data about a participant, this simply boils down a vast quantity of data that's already been generated -- this is a problem we want to have. A two-step example: (1) run a Mechanical Turk task in which participants draw or assemble an arbitrary image; (2) run a second task with a different set of participants who look at these images and code (type or select) the relevant traits they see in the images.

Measure development: One of the more uncomfortable questions in social psychology research is the validity of personality measures. Devise a series of questions and a method for tabulating the results; run it on some participants; analyze the results to get some answers. But, what's really being tested here -- the population, or the measure? Tragically, there's no time to refine the measure very much; if the results are useful, you run with it. But! With Mechanical Turk, collecting survey results is cheap and quick; and since the general format of the survey isn't changing between revisions, the same set of statistical transformations can be applied programatically to each iteration of the survey.

This is a great way to build a psychological measure that you can be confident in: Push an initial draft of the measure out to Turkers, receive some results, perform a statistical analysis and save the operations as an R or SPSS script. Then, manually refine the measure, put it back on Turk, filter the new results through your analysis script, and repeat until it looks good. This can get as advanced as you'd like -- start with several times as many questions as you'd like to see in the final survey, then automatically dispatch random subsets of the question list to Turk, filter through your automatic analysis to get some scores indicating quality, and use a Bayesian classifier to narrow down the best possible subset of questions.

Update: Here's a conference paper on the same topic.
http://www-users.cs.umn.edu/~echi/papers/2008-CHI2008/2008-02-mech-turk-online-experiments-chi1049-kittur.pdf

Vimming your way to the top

2008-03-07T15:48:00.002-05:00

Here's the Vim syntax file I use for highlighting my to-do list. It's based on the syntax file for YAML.
http://www.vim.org/scripts/script.php?script_id=2599

Benefits:

Different colors for lines ending in ':', or starting with '*' or '{'
Assign keywords to be automatically highlighted, like important locations, coworkers' names, customers, taquerias, etc.
Start sections with a line of underscores and a heading beginning with the '{' character. The heading stands out (red with GVim's "desert" color scheme), and you can jump between sections just like C blocks using ]] and [[ keystrokes.
Ordinary text (i.e. not specifically formatted for this syntax) looks sane.

Normally I have a line in my .vimrc assigning the filetype "todolist" to the file where I keep my permanent todolist, but another way to add this highlighting to a text file is to add vim: ft=todolist to the end of a file. It's harmless.

Update (4/2/09): I uploaded the script to vim.org, where it will be easier to track and update.

Update (1/1/10): Here's an example of how to use this color scheme for course notes.

60 underscores (my preference) and a curly brace indicate a new section
Subsection lines end with a colon (generally followed by bullet points)
Special or out-of-context notes start with an asterisk
For separation, or to display a different sort of sub-heading, play with asterisks: '* * *' centered, or '** OLD **' for example

At school, I run a shell script for each new class that creates a new directory from the course name, copies a skeleton of this example text to a file called lecture-notes.txt, etc., and adds the directory to Mercurial -- so while there's some boilerplate involved with this plugin, it's easy to automate and plays well with Vim's text-munging capabilities.

I've also picked up the habit of putting @contexts above unsorted items at the top of my main to-do list, inspired by the GTD approach. The syntax plugin doesn't take advantage of this yet; I'll post another update when that's done.