Monday, January 16, 2012

Building an analysis: How to avoid repeating intermediate tasks in a computational pipeline

In my projects, I tend to start with a simple analysis of a limited dataset, then incrementally expand on it with more data and deeper analyses. This means each time I update the data (e.g. add another species' protein sequences) or add another step to the analysis pipeline, everything must be re-run -- but only a small part of the pipeline actually needs to be re-run.

This is a common problem in bioinformatics:
http://biostar.stackexchange.com/questions/79/how-to-organize-a-pipeline-of-small-scripts-together

How can we automate a pipeline like this, without running it all from scratch each time? This is the same problem faced when compiling large programs, and that particular case has been solved fairly well by build tools.

Make

The make utility is normally used to determine which portions of a program need to be recompiled, and the dependencies between them; it includes special features for managing C code, but actions are specified in terms of shell commands. You can just as well use it to specify dependencies between arbitrary tasks.

Giovanni Dall'Olio has posted some helpful instructional material: http://bioinfoblog.it/?p=29

(One of his links is broken -- here's Software Carpentry's current course on make.)

Rake, a Really Awesome Make

If your pipeline is already organized as a pipeline of stand-alone scripts, Rake is a more-or-less ideal solution: http://rake.rubyforge.org/

 At some point in your project, you'll likely need to do some string processing; this is where make falls down. Ruby happens to be great for this task. Don't worry if you don't know Ruby -- the basic string methods in Ruby, Python and Perl are very similar, and regular expressions work roughly the same way. You can look up everything you need to know on the fly. (I did.)

Also, Rake's the mini-language for specifying tasks is both concise and intuitive. There's a built-in facility for maintaining short descriptions of each step, which is enticing from a "lab notebook" perspective.

How about a Python solution?

Inspired by Rake, I wrote a small module called tasks.py. Here's the code, plus a simple script to demonstrate:
https://gist.github.com/1623117

This isn't just for the sake of Python evangelism; I have a bunch of special-purpose modules written in Python (for my lab) that I would like to use in an elaborate pipeline of my own. It's silly to put each of these steps in a separate, throwaway Python script that then gets called in a separate process by Rake. Instead, I can import tasks.py into a Python script and include the dependency-scheduling features in my own programs.

Key differences from other build systems (e.g. SCons, waf, ruffus):
  • This module is not meant to be run from the command-line -- only for specific parts of your own code
  • The implementation of each processing step is separate from the dependency specification (although single-command tasks can still be defined inline with a lambda expression). Separate the algorithm from the data, I always say.
  • Cleanup is specified within the task, not in a separate area of the Rakefile/Makefile. This makes more sense for a project with heterogenous processing steps & intermediate files.
To be clear, tasks.py is not better than Rake or make for arranging a set of scripts that you've already written. I didn't even implement concurrent jobs (because most of the CPU-intensive steps I use are calls to programs that are already multicore-aware; though try adding that feature yourself if you'd like).

Wednesday, November 2, 2011

Journal article: Comparative kinomics of the malaria pathogen and its relatives

Hot off the presses!
Structural and evolutionary divergence of eukaryotic protein kinases in Apicomplexa

It's a thorough paper, so I'll cover the highlights here.

Why we study apicomplexans

Apicomplexans are a group of related single-celled organisms which are exclusively parasitic. The best-known member is Plasmodium falciparum, which causes the most virulent form of malaria. Another well-studied species is Toxoplasma gondii, which primarily lives in cats but can infect most mammals.

It's a hugely diverse group. But overall, we know very little about them.

Our main motivation for studying apicomplexan proteins is to find what features make them distinct from human proteins, so we can then design drugs to target those features specifically -- the drug will identify and disable the parasite protein without the risk of affecting the host proteins, too. We study protein kinases, in particular, because a number of drugs have already been designed to inhibit kinases in cancer. The same or similar compounds could be used to treat parasitic diseases, potentially.

From an evolutionary biologist's perspective, apicomplexans are also interesting to study because they belong to an evolutionary branch that is quite divergent from the animals, plants and fungi more familiar to us. By learning about apicomplexan biology, and comparing to other model organisms, we can learn more about eukaryotic diversity and the origin of eukaryotes.

Another perspective on the tree of life

Many people, including scientists, think of evolution as a ladder, with single-celled organisms at the bottom and humans at the top. Different lineages, like green plants and fungi, each branch off the ladder at some intermediate point, but evolution is nonetheless mistakenly thought of as a directed progression from bacteria to protists to fish to humans.

That's wrong. It leads to mistakes, such as considering all protists (single-celled eukaryotes) to be closely related to each other. But even within Apicomplexa, the evolutionary distance between Plasmodium falciparum and Toxoplasma gondii is as great as the distance between humans and sponges.

I'm particularly proud of Figure 1 in the paper, which includes a species tree that inverts the traditional view: The closest human relative, yeast, is at the bottom, and layers of increasingly strange and unfamiliar protists build up to the Plasmodium genus.

Interesting features of proteins and genomes

When apicomplexan parasites invade a host, they secrete a mixture of dozens of different proteins into a protective vacuole formed from the host cell membrane. We'd expect that some of these proteins are essential for invasion and virulence, and therefore good targets for inhibition or diagnosis.

Two apicomplexan-specific protein kinase families are known to be exported. The FIKK family appears in 1 copy in most apicomplexans, but is amplified to 21 copies in P. falciparum and 6 copies in P. reichenowi, and does not appear in any species outside the Apicomplexa. Another family, called rhoptry kinases (ROPK) after the apicomplexan organelle they're localized to, appears in dozens of copies in coccidians (T. gondii, Neospora caninum, Eimeria tenella, Sarcocystis neurona), but not in any other lineage of Apicomplexa. Plasmodium and others still contain rhoptries, but there are no kinases in the protein cocktail those rhoptries contain.

As obligate parasites, apicomplexans evolve under different evolutionary contraints than free-living organisms like yeast and humans. Many genes are no longer necessary, and some may even be a liability if they interact with the host's own biochemical pathways. Because of this, we see widespread gene loss and overall compaction of apicomplexan genomes.

One especially curious case is the loss of upstream regulators of the MAPK cascade -- a signaling pathway found in almost all eukaryotes, consisting of 3 or 4 protein kinases each activating the next in a sort of biochemical relay. Apicomplexans contain 2 to 3 copies of the downstream protein kinase, MAPK, but the rest of the pathway components (STE7, STE11, STE20) are generally lost, and none of the surveyed apicomplexans had a complete MAPK cascade. So there's an open question: What other proteins take the place of the STEs in this important pathway, or have MAPK-like features? Is there an Achilles heel to be discovered?

The project

We:
  1. Identified and classified the full set of protein kinases in each of the 17 apicomplexan proteomes available
  2. Devised a pipeline to identify apicomplexan-specific ortholog groups in known protein kinase families
  3. Compared these ortholog groups to the typical members of the kinase family to find specific sequence motifs that distinguish the divergent ortholog group
  4. Mapped these motifs onto protein structures; reviewed the literature to understand possible functions and functional differences related to these motifs
Read about what we found here.

Friday, October 28, 2011

Journal article: Our insights into the structure and activation mechanism of ErbB/EGFR protein kinases

Here's an article my lab published in PLoS One:
Co-Conserved Features Associated with cis Regulation of ErbB Tyrosine Kinases

I'll give a quick summary of it here. (Don't worry, this isn't a new direction for this blog.)

This is a study of the structural mechanisms of a certain protein family, called ErbB or EGFR (epidermal growth factor receptor), which is frequently involved in cancer. This family belongs to a protein superfamily called protein kinases.

Biochemistry background

Kinases are enzymes which perform a type of post-translational modification, phosphorylation: The kinase transfers a phosphate group from adenosine triphosphate (ATP) to another substrate molecule, leaving adenosine diphosphate (ADP) and the phosphorylated substrate.

Protein kinases are kinases that act on protein substrates, i.e. the phosphorylated molecule is another protein. The substrate could even be another protein kinase, so activation of the first protein kinase causes it to phosphorylate and activate another protein kinase, and so on. This is a type of signal transduction.

Signal transduction is how the cell senses and reacts to its environment, and also its own internal conditions. In the case of ErbB and other receptor tyrosine kinases, the signal starts at the surface of the cell (e.g. epidermal growth factor binds to the extracellular portion of EGFR) and activates the kinase, which then begins sending these phosphorylation signals. These signals are then relayed throughout the cell to trigger other activities, such as cell division or the transcription of certain genes.

What happens if a protein kinase gets "locked" into the active state, somehow? In the case of EGFR, it's as if the cell thinks it's constantly receiving the growth factor. If this signal isn't blocked by another "gatekeeper" in the cell, then the cell will grow uncontrollably -- and become cancer.

How the enzyme works

Protein kinases (PKs) all consist of two large lobes connected by a flexible hinge. Between the lobes is a binding pocket for ATP; this molecule binds inside the smaller lobe (N-terminal lobe, or N-lobe). The larger lobe (C-terminal lobe or C-lobe) provides a binding site for another protein, which will be the kinase's substrate.

The general mechanism of all protein kinases goes like this:
  1. The kinase is initially in an inactive state, with the hinge "open" and the two lobes a bit further apart. Since ATP binds in the N-lobe and the substrate binds to the C-lobe, no phosphate is transferred when the two lobes are apart like this.
  2. By some mechanism (it varies between different kinase families), the two lobes are brought closer together, and the kinase becomes active.
  3. ATP binds to the ATP binding pocket, a substrate binds to the C-lobe, some amino acids shift, and a phosphate group is detached from ATP and reattached to a specific amino acid on the substrate.
  4. The ADP and phosphorylated substrate are released.
Step 2 is the part we're interested in. How do some recurring, cancer-associated mutations cause EGFR to become "locked" in the active conformation? And, can we reverse it?

How we think ErbB kinases work

In the ErbB family, it's not just the two lobes of the kinase domain that are involved in activating the enzyme -- the adjacent sections of the protein, outside the kinase domain, are also involved.

The long C-terminal tail wraps back around the entire kinase domain and associates with the N-lobe, tethered in place by a few residues in the N-lobe and the other N-terminal flanking region (the juxtamembrane segment, between the kinase domain and the cell membrane). The C-tail is placed so that it can influence the movement and relative positioning of the N- and C-lobes, and therefore regulate the activation of the kinase.

We also examined the locations of two EGFR mutations (S768I and L861Q) that have been previously identified as occurring frequently in cancers, mapping them onto the structure. These mutations appear in locations that would disrupt the switching mechanism we proposed -- breaking necessary interactions, or forming new interactions that shouldn't be there for proper EGFR function.

If you'd like to know more, read about it here.

Thursday, July 7, 2011

The statistics of the "Like" button

The launch of Google+ reminded me of a question I've had about Facebook and YouTube for a while: What happens when you click the "Like" button?

Facebook isn't so much about sharing content as sharing content. But YouTube and many other sites like it recommend content based on users' response to the content itself, rather than the shape of the surrounding social network. If you're building an application like this yourself, this article is for you.

Think of a collection of user-generated pages with a "Like" or "+1" button on each. Users can browse pages at will, or arrive at them randomly from an external site, and after viewing a page will either click the "Like" button or do nothing. I'll refer to the number of times viewers do nothing on a page as "Don't care". I'll also assume you have a site-wide "Top Pages" chart that users can view to see the highest-scoring pages and jump to them.

1. Counting "Likes"

The very simplest way to score a page is to count the number of times it's been "Liked":
score = page_likes
The main problem with this method is inertia. Old "champion" pages accumulate votes over time, and dominate the rankings. New pages don't have a chance to unseat the champs, even temporarily, to gain visibility for themselves. The site appears stagnant.

2. "Likes" versus views

To make good recommendations, you want to measure the quality of a page — the chance that a user will like the recommended page themselves:
score = page_likes / page_views
In the long run, this is correct — you estimate the probability based on the frequency of "Likes" in previous views. Old "champion" pages will be unseated in the rankings if a newer page earns a better proportion of likes.

But for new or little-viewed pages, there's an issue of sample size.
  • A page where the first view is "Liked" (probably by the creator/uploader) scores 100% and shoots to the top of the rankings. If a few friends all immediately "Like" the page, it becomes difficult to unseat. There's a lot of noise at the top of the rankings.
  • A page where the first view is not "Liked" scores 0% and sinks to the bottom of the rankings. If you have some mechanism for purging bad content from your site (i.e. deleting low-scoring pages that are likely spam, trolls or just lame), then this makes that task more difficult.
Intuitively, a page with 80 likes out of 100 views is more likely to be good than a page with 4 likes out of 5 views. A page with zero likes out of 100 views is almost certainly junk, but 5 views without any likes may not mean much at all.

So, your next goal: Make the best possible estimate of a page's likeability based on the first few views and likes, using some prior knowledge. After that, all reasonable methods should converge on the same score (probability of liking). If a meme catches, people will be able to find that page through other means, and your own rankings will be less crucial to its success.

3. Pseudocounts

A pseudocount is a prior estimate of the probability of an event. To make an estimate of actual probabilities based on a small number of samples (the problem at the end of Step 2), add the pseudocounts to the actual counts of each event.

I'll demonstrate.

The events here are (a) "Like" and (b) "Don't care". I'm going to use b to represent the pseudocount for "Like". For this section, I choose the probabilities:
like = b = .1
dontcare = (1 - b) = .9
The two probabilities should sum to 1.

How do you get these values? Since b represents the probability that a random user will like an arbitrary page, taking the site-wide average of likes versus views is a good choice:
b = all_likes / all_views
To use the pseudocounts, add them to the counts in the formula in step 2:
score = (page_likes + b) / (page_views + 1)
(Recall: views = likes + dontcares; after adding pseudocounts, (likes + b) + (dontcares + 1 - b) = likes + dontcares + 1 = views + 1.)

If the database-wide sums of likes and views are large numbers, this won't significantly affect the "average" score, all_likes / all_views. But it smoothes out the initial scoring for new pages.

Example: Assume 10% of all views result in a "Like" (b = 0.1).

A single view without a "Like" places the page slightly below the global average, but not too much. (Odds are, 90% of pages will start out this way.) Additional views without a "Like" slowly sink the page score toward 0.

Likes:Views Calculation Percentage
0:1 0.1 / 2 5.0%
0:2 0.1 / 3 3.3%
0:3 0.1 / 4 2.5%

A single view with a "Like" gives the page a boost, but not to 100%. This can help it gain traction, but probably won't put it in the top rankings (yet). If subsequent views are also liked, the score continues to rise:

Likes:Views Calculation Percentage
1:1 1.1 / 2 55.0%
2:2 2.1 / 3 70.0%
3:3 3.1 / 4 77.5%
4:4 4.1 / 5 82.0%

Away from the extremes (0% or 100% liked), the effect of pseudocounts is less dramatic, and a mix of "Like" and "Don't Care" (viewed without liking) results in a score closer to what you'd see without pseudocounts — just shifted slightly toward the site-wide average. Notice that a page with two "Likes" out of three views (2:3) is scored almost as well as one "Like" and one view (1:1 above).

Likes:Views Calculation Percentage
1:2 1.1 / 3 36.7%
1:3 1.1 / 4 27.5%
2:3 2.1 / 4 52.5%
2:4 2.1 / 5 42.0%

To increase the effect of pseudocounts, you can put a higher weight on the prior by multiplying the pseudocounts by some constant. If the weighting factor is w, then the calculation is:
score = (page_likes + (b * w)) / (page_views + w)
Think of this as the number of "imaginary" users you have rating each page before any real users see it. The calculations above use a weight of 1, equivalent to one user giving a fractional score of .1 to every page before it goes live, and you can see the effect of it. Play with it a bit to see how it affects your rankings.

4. Statistical significance

You now have a score for each page in your database and a top-to-bottom ranking. Where do you draw the line for "recommending" a page?

Quantiles: Best and worst

Having sorted all the pages by score, take the top 5% as the "best" and bottom 5% as the "worst". Or choose a fixed number, like 25. It's really up to you.

The "best" ranking is for users, especially new visitors. Depending on your application, some users might also be interested in the "worst" pages — how else would we find gems like "Friday"?

Contingency: Is the score meaningful?

Another challenge is to determine when a page's score is statistically meaningful — i.e. the difference between a score of 55% based on 1000 views versus a single view. Using pseudocounts addresses this to some extent at the extremes, but it's still possible for pages with low view counts to score highly. You may also want to purge "junk" content with horribly low rankings — but only once it's been given a fair chance.

With the like and dontcare counts, site-wide and per-page, set up a 2x2 contingency table:

like dontcare
Page A B
Global C D

To evaluate the significance, use a Chi-square test with one degree of freedom (df=1), or if you're picky, Fisher's exact test.

The chi-square test, in R:
> abcd = matrix(c(4, 10, 1000, 10000), nrow=2, byrow=T)
> chisq.test(abcd)

        Pearson's Chi-squared test with Yates' continuity correction

data:  abcd
X-squared = 4.2691, df = 1, p-value = 0.03881
With the common p-value cutoff ("alpha") of 0.05, we'd say this page with 4 likes out of 10 views is significant — for that cutoff, at least. And if we applied the same test across all pages in the database, we'd be wrong.

I'll try to be quick about this, because it matters.

Remember: A p-value of 0.05 means the given like/view ratio will occur by chance 1 in 20 times. Since the same test is being applied to every page in your database, you need to account for multiple hypothesis testing, or else many pages will meet the cutoff by chance.

If you only have a few pages — say, less than 40 — then you can divide alpha by the number of pages (e.g. 20) and use that in place of the original cutoff (0.05 / 40 = 0.00125, so the previous p-value of 0.03881 would not be significant).

More likely, you have many more pages than that — hence the need to use grown-up statistics in the first place. Bonferroni correction (described above) would produce a cutoff that's much too stringent, so you'll need a more powerful method.

R makes this easy. Starting with a single array of the p-values from Chi-square tests of each page:
> pvals = sapply(chisq.test, contingencytables)
Adjust these raw p-values for multiple testing (using the familywise error rate, by default — read the help page for p.adjust for all the details):
> pvals.adj = p.adjust(pvals)
What do you after this depends on your own code. You can get a boolean array signifying which adjusted p-values are now smaller than alpha, which is useful for selecting "significant" pages from the original page list:
> significant = pvals.adj < 0.05
Note that this selects both significantly liked and significantly disliked pages at the same time. To distinguish between the two, just compare each page's like/view ratio to the global average and select higher or lower.

Another note about the contingency table: Once your application has counted a very large number of site-wide likes and views (cells C and D), this test will register significance for almost any page. You might have better results by replacing the global view and "Like" counts with a per-month or per-user average. And, you can cache these values and update them only occasionally.

Trending

Calculating p-values is a lot more work than selecting the top and bottom quantiles. If you've put in the extra effort, here's another feature you can support: a list of newly significant winners and losers.

Each day (or hour or so), perform the chi-square test (described above) across all pages and note which ones cross the significance threshold. Compare this to the previous run's results to see which pages have crossed over, and add these newly significant hits to a separate chart — "Trending", I'll call it.

This chart shows the pages that have just recently been determined to be likeable, but (probably) haven't accumulated enough votes to reach the "Top Pages" chart. It's a timelier list than "Top Pages", though the average quality of the "Trending" pages is not as high. This is the place where memes show up first. If they're truly good content, they'll eventually make it onto "Top Pages" — but that's not usually the case with memes.

I'd treat the "Trending" chart as a queue, adding newly trending pages to the top at the end of each run and dropping pages from the bottom as space permits. Or just keep it rolling by week, like a blog. By adjusting alpha you can tune the number of newly significant pages found in each run, and therefore the turnover rate of your "Trending" queue.

5. But is it general?

Under the Google Whatever model (YouTube, Picasa, etc.), the ratio of Likes to total views for any given page is small. The statistics here will work in other cases, though — for example, an "Approve" button which clicked most of the time, or a "Dislike" button in place of the "Like" button. In the case where users have to click either "Like" or "Dislike" (Yes/No, Yay/Nay, or any other two options), this is also fine; just pick one option to count, and count "views" as the sum of likes and dislikes.

Update: The "Like" event can also be something implicit, like downloads or completed streams (out of started streams). This opens up options for sites that don't require users to register.

What about sites with 5-star ratings, like Amazon? Well, there's an easy way and a hard way. Easy: count the ratings as fractional Likes ([0, .25, .5, .75, 1] if you allow 0 stars, [0, .33, .67, 1] if you don't), and use the pseudocounts just like before. The hard way is to treat each star ranking as a separate event category — but that's going to have to wait for a later post.

Tuesday, October 5, 2010

Bio.Phylo: A unified phylogenetics toolkit for Biopython

I presented this at the Bioinformatics Open Source Conference (BOSC 2010) in early July, but somehow forgot to post it here too. It's an overview of my somewhat new sub-package for working with phylogenetic trees in Biopython, based on my Google Summer of Code 2009 project (a phyloXML parser in Biopython).

In a nutshell, Bio.Phylo is a library for manipulating finished phylogenetic trees and integrating them into a Biopython-based workflow. It can handle the standard file formats — Newick, Nexus and phyloXML, with the current exception of NeXML — and has particularly good support for phyloXML.

This presentation walks through an example of loading a Newick tree, viewing it a few different ways, adding branch colors, and saving it as a phyloXML file.


The conference abstract is here. I also recommend the main documentation in the Biopython Tutorial (see chapter 12) and the wiki page.