Thursday, July 7, 2011

The statistics of the "Like" button

The launch of Google+ reminded me of a question I've had about Facebook and YouTube for a while: What happens when you click the "Like" button?

Facebook isn't so much about the content itself as about the social network it travels through. But YouTube and many other sites like it recommend content based on users' response to the content itself, rather than the shape of the surrounding social network. If you're building an application like this yourself, this article is for you.

Think of a collection of user-generated pages with a "Like" or "+1" button on each. Users can browse pages at will, or arrive at them randomly from an external site, and after viewing a page will either click the "Like" button or do nothing. I'll refer to the number of times viewers do nothing on a page as "Don't care". I'll also assume you have a site-wide "Top Pages" chart that users can view to see the highest-scoring pages and jump to them.

1. Counting "Likes"

The very simplest way to score a page is to count the number of times it's been "Liked":
score = page_likes
The main problem with this method is inertia. Old "champion" pages accumulate votes over time, and dominate the rankings. New pages don't have a chance to unseat the champs, even temporarily, to gain visibility for themselves. The site appears stagnant.

2. "Likes" versus views

To make good recommendations, you want to measure the quality of a page — the chance that a user will like the recommended page themselves:
score = page_likes / page_views
In the long run, this is correct — you estimate the probability based on the frequency of "Likes" in previous views. Old "champion" pages will be unseated in the rankings if a newer page earns a better proportion of likes.

But for new or little-viewed pages, there's an issue of sample size.
  • A page where the first view is "Liked" (probably by the creator/uploader) scores 100% and shoots to the top of the rankings. If a few friends all immediately "Like" the page, it becomes difficult to unseat. There's a lot of noise at the top of the rankings.
  • A page where the first view is not "Liked" scores 0% and sinks to the bottom of the rankings. If you have some mechanism for purging bad content from your site (i.e. deleting low-scoring pages that are likely spam, trolls or just lame), then this makes that task more difficult.
Intuitively, a page with 80 likes out of 100 views is more likely to be good than a page with 4 likes out of 5 views. A page with zero likes out of 100 views is almost certainly junk, but 5 views without any likes may not mean much at all.

So, your next goal: Make the best possible estimate of a page's likeability based on the first few views and likes, using some prior knowledge. After that, all reasonable methods should converge on the same score (probability of liking). If a meme catches on, people will be able to find that page through other means, and your own rankings will be less crucial to its success.
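To see the sample-size problem concretely, here's a quick check in R, using the example counts mentioned above:

# Plain like/view ratios for four hypothetical pages
page_likes = c(80, 4, 0, 0)
page_views = c(100, 5, 100, 5)
page_likes / page_views
# [1] 0.8 0.8 0.0 0.0

The raw ratio ranks 80-out-of-100 and 4-out-of-5 identically, and likewise 0-out-of-100 and 0-out-of-5, even though the larger samples deserve far more trust.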

3. Pseudocounts

A pseudocount is an imaginary count that encodes a prior estimate of the probability of an event. To estimate actual probabilities from a small number of samples (the problem at the end of Step 2), add the pseudocounts to the observed counts of each event.

I'll demonstrate.

The two events here are "Like" and "Don't care". I'm going to use b to represent the pseudocount for "Like". For this section, I choose the probabilities:
like = b = .1
dontcare = (1 - b) = .9
The two probabilities should sum to 1.

How do you get these values? Since b represents the probability that a random user will like an arbitrary page, taking the site-wide average of likes versus views is a good choice:
b = all_likes / all_views
To use the pseudocounts, add them to the counts in the formula in step 2:
score = (page_likes + b) / (page_views + 1)
(Recall: views = likes + dontcares; after adding pseudocounts, (likes + b) + (dontcares + 1 - b) = likes + dontcares + 1 = views + 1.)

These pseudocounts are small compared to real counts: once a page has accumulated a significant number of views, its score is essentially the plain ratio, and the site-wide average, all_likes / all_views, is unaffected. But they smooth out the initial scoring for new pages.

Example: Assume 10% of all views result in a "Like" (b = 0.1).

A single view without a "Like" places the page slightly below the global average, but not too much. (Odds are, 90% of pages will start out this way.) Additional views without a "Like" slowly sink the page score toward 0.

Likes:Views Calculation Percentage
0:1 0.1 / 2 5.0%
0:2 0.1 / 3 3.3%
0:3 0.1 / 4 2.5%

A single view with a "Like" gives the page a boost, but not to 100%. This can help it gain traction, but probably won't put it in the top rankings (yet). If subsequent views are also liked, the score continues to rise:

Likes:Views Calculation Percentage
1:1 1.1 / 2 55.0%
2:2 2.1 / 3 70.0%
3:3 3.1 / 4 77.5%
4:4 4.1 / 5 82.0%

Away from the extremes (0% or 100% liked), the effect of pseudocounts is less dramatic, and a mix of "Like" and "Don't Care" (viewed without liking) results in a score closer to what you'd see without pseudocounts — just shifted slightly toward the site-wide average. Notice that a page with two "Likes" out of three views (2:3) is scored almost as well as one "Like" and one view (1:1 above).

Likes:Views Calculation Percentage
1:2 1.1 / 3 36.7%
1:3 1.1 / 4 27.5%
2:3 2.1 / 4 52.5%
2:4 2.1 / 5 42.0%

To increase the effect of pseudocounts, you can put a higher weight on the prior by multiplying the pseudocounts by some constant. If the weighting factor is w, then the calculation is:
score = (page_likes + (b * w)) / (page_views + w)
Think of this as the number of "imaginary" users you have rating each page before any real users see it. The calculations above use a weight of 1, equivalent to one imaginary user giving a fractional "Like" of .1 to every page before it goes live. Play with the weight to see how it affects your rankings.
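Here's a minimal sketch of the weighted score in R; pseudo_score is just a name I've made up, and the defaults mirror the running example (b = 0.1, w = 1):

# Pseudocount-smoothed score: (page_likes + b*w) / (page_views + w)
# b = site-wide like rate, w = weight (number of "imaginary" prior views)
pseudo_score = function(page_likes, page_views, b = 0.1, w = 1) {
  (page_likes + b * w) / (page_views + w)
}

pseudo_score(1, 1)         # 0.55   -- the 1:1 row in the table above
pseudo_score(0, 3)         # 0.025  -- the 0:3 row
pseudo_score(2, 3, w = 5)  # 0.3125 -- a heavier prior pulls the score toward b

With w = 1 this reproduces the tables above; larger weights hold new pages closer to the site-wide average until real votes accumulate.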

Update: Evan Miller has a great writeup on a Bayesian approach to modelling this problem, if you'd like to go much further down the rabbit hole.

4. Statistical significance

You now have a score for each page in your database and a top-to-bottom ranking. Where do you draw the line for "recommending" a page?

Quantiles: Best and worst

Having sorted all the pages by score, take the top 5% as the "best" and bottom 5% as the "worst". Or choose a fixed number, like 25. It's really up to you.

The "best" ranking is for users, especially new visitors. Depending on your application, some users might also be interested in the "worst" pages — how else would we find gems like "Friday"?

Contingency: Is the score meaningful?

Another challenge is to determine when a page's score is statistically meaningful — i.e. the difference between a score of 55% based on 1000 views versus a single view. Using pseudocounts addresses this to some extent at the extremes, but it's still possible for pages with low view counts to score highly. You may also want to purge "junk" content with horribly low rankings — but only once it's been given a fair chance.

With the like and dontcare counts, site-wide and per-page, set up a 2x2 contingency table:

          like    dontcare
Page      A       B
Global    C       D

To evaluate the significance, use a Chi-square test with one degree of freedom (df=1), or if you're picky, Fisher's exact test.

The chi-square test, in R (rows of the matrix: this page, then global; columns: like, then dontcare):
> abcd = matrix(c(4, 10, 1000, 10000), nrow=2, byrow=T)
> chisq.test(abcd)

        Pearson's Chi-squared test with Yates' continuity correction

data:  abcd
X-squared = 4.2691, df = 1, p-value = 0.03881
With the common p-value cutoff ("alpha") of 0.05, we'd say this page, with 4 likes against 10 "Don't cares", is significant, for that cutoff at least. And if we applied the same test across all pages in the database, we'd often be wrong.

I'll try to be quick about this, because it matters.

Remember: A p-value of 0.05 means a like/view split at least this extreme will occur by chance about 1 in 20 times, even for a perfectly average page. Since the same test is being applied to every page in your database, you need to account for multiple hypothesis testing, or else many pages will meet the cutoff by chance.

If you only have a few pages (say, fewer than 40), then you can divide alpha by the number of pages and use that in place of the original cutoff: with 40 pages, 0.05 / 40 = 0.00125, so the previous p-value of 0.03881 would no longer be significant.

More likely, you have many more pages than that — hence the need to use grown-up statistics in the first place. Bonferroni correction (described above) would produce a cutoff that's much too stringent, so you'll need a more powerful method.

R makes this easy. Starting with a single array of the p-values from Chi-square tests of each page (contingencytables here is a list holding the 2x2 table for each page):
> pvals = sapply(contingencytables, function(tbl) chisq.test(tbl)$p.value)
Adjust these raw p-values for multiple testing (using the familywise error rate, by default — read the help page for p.adjust for all the details):
> pvals.adj = p.adjust(pvals)
What you do after this depends on your own code. You can get a boolean array signifying which adjusted p-values are now smaller than alpha, which is useful for selecting "significant" pages from the original page list:
> significant = pvals.adj < 0.05
Note that this selects both significantly liked and significantly disliked pages at the same time. To distinguish between the two, just compare each page's like/view ratio to the global average and select higher or lower.
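Putting those pieces together, here's a hedged end-to-end sketch in R. The per-page and global counts are made-up example numbers standing in for whatever your database returns, and contingencytables matches the placeholder used above:

# Per-page counts and site-wide totals (example numbers)
page_likes = c(4, 120, 0, 55)
page_views = c(14, 300, 40, 800)
all_likes  = 1000
all_views  = 11000

# One 2x2 table per page: rows = (page, global), columns = (like, dontcare)
contingencytables = lapply(seq_along(page_likes), function(i)
  matrix(c(page_likes[i], page_views[i] - page_likes[i],
           all_likes,     all_views - all_likes),
         nrow=2, byrow=T))

# Raw p-values, adjusted for multiple testing (p.adjust defaults to Holm)
# chisq.test may warn about small expected counts; fisher.test is the picky alternative
pvals = sapply(contingencytables, function(tbl) chisq.test(tbl)$p.value)
pvals.adj = p.adjust(pvals)
significant = pvals.adj < 0.05

# Separate the significantly liked from the significantly disliked
global_rate = all_likes / all_views
liked    = significant & (page_likes / page_views > global_rate)
disliked = significant & (page_likes / page_views < global_rate)

The first page here (4 likes, 10 "Don't cares") is the same example as the transcript above, so its raw p-value comes out at 0.0388 before adjustment.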

Another note about the contingency table: Once your application has counted a very large number of site-wide likes and views (cells C and D), this test will register significance for almost any page. You might have better results by replacing the global view and "Like" counts with a per-month or per-user average. And, you can cache these values and update them only occasionally.

Trending

Calculating p-values is a lot more work than selecting the top and bottom quantiles. If you've put in the extra effort, here's another feature you can support: a list of newly significant winners and losers.

Each day (or every hour or so), perform the chi-square test (described above) across all pages and note which ones cross the significance threshold. Compare this to the previous run's results to see which pages have crossed over, and add these newly significant hits to a separate chart, which I'll call "Trending".

This chart shows the pages that have just recently been determined to be likeable, but (probably) haven't accumulated enough votes to reach the "Top Pages" chart. It's a timelier list than "Top Pages", though the average quality of the "Trending" pages is not as high. This is the place where memes show up first. If they're truly good content, they'll eventually make it onto "Top Pages" — but that's not usually the case with memes.

I'd treat the "Trending" chart as a queue, adding newly trending pages to the top at the end of each run and dropping pages from the bottom as space permits. Or just keep it rolling by week, like a blog. By adjusting alpha you can tune the number of newly significant pages found in each run, and therefore the turnover rate of your "Trending" queue.
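If you store the set of significant page IDs from each run, the diff is a one-liner; here's a tiny sketch with made-up IDs (significant_before and significant_now are placeholder names):

# Pages that passed the significance test in the previous run and in this one
significant_before = c("page17", "page42")
significant_now    = c("page17", "page42", "page99", "page123")

# Newly significant pages go onto the "Trending" chart
newly_trending = setdiff(significant_now, significant_before)
# [1] "page99"  "page123"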

5. But is it general?

Under the Google Whatever model (YouTube, Picasa, etc.), the ratio of Likes to total views for any given page is small. The statistics here will work in other cases, though: for example, an "Approve" button which is clicked most of the time, or a "Dislike" button in place of the "Like" button. In the case where users have to click either "Like" or "Dislike" (Yes/No, Yay/Nay, or any other two options), this is also fine; just pick one option to count, and count "views" as the sum of likes and dislikes.

Update: The "Like" event can also be something implicit, like downloads or completed interactions (out of started interactions). This opens up options for sites that don't require users to register.

What about sites with 5-star ratings, like Amazon? Well, there's an easy way and a hard way. Easy: count the ratings as fractional Likes ([0, .2, .4, .6, .8, 1] if you allow 0 stars, [0, .25, .5, .75, 1] if you don't), and use the pseudocounts just like before. The hard way is to treat each star rating as a separate event category, but that's going to have to wait for a later post.
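For the easy way, here's a short sketch in R, assuming a 1-to-5 star scale with no 0-star option; star_to_like is a helper name of my own:

# Map a 1-5 star rating onto a fractional "Like" in [0, 1]
star_to_like = function(stars) (stars - 1) / 4

ratings = c(5, 4, 4, 2)                  # example ratings for one page
page_likes = sum(star_to_like(ratings))  # 1 + 0.75 + 0.75 + 0.25 = 2.75
page_views = length(ratings)             # each rating also counts as a view

b = 0.1                                  # site-wide average like rate, as in step 3
score = (page_likes + b) / (page_views + 1)   # 2.85 / 5 = 0.57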