Thursday, July 7, 2011

The statistics of the "Like" button

The launch of Google+ reminded me of a question I've had about Facebook and YouTube for a while: What happens when you click the "Like" button?

Facebook isn't so much about sharing content as sharing content. But YouTube and many other sites like it recommend content based on users' response to the content itself, rather than the shape of the surrounding social network. If you're building an application like this yourself, this article is for you.

Think of a collection of user-generated pages with a "Like" or "+1" button on each. Users can browse pages at will, or arrive at them randomly from an external site, and after viewing a page will either click the "Like" button or do nothing. I'll refer to the number of times viewers do nothing on a page as "Don't care". I'll also assume you have a site-wide "Top Pages" chart that users can view to see the highest-scoring pages and jump to them.

1. Counting "Likes"

The very simplest way to score a page is to count the number of times it's been "Liked":
score = page_likes
The main problem with this method is inertia. Old "champion" pages accumulate votes over time, and dominate the rankings. New pages don't have a chance to unseat the champs, even temporarily, to gain visibility for themselves. The site appears stagnant.

2. "Likes" versus views

To make good recommendations, you want to measure the quality of a page — the chance that a user will like the recommended page themselves:
score = page_likes / page_views
In the long run, this is correct — you estimate the probability based on the frequency of "Likes" in previous views. Old "champion" pages will be unseated in the rankings if a newer page earns a better proportion of likes.

But for new or little-viewed pages, there's an issue of sample size.
  • A page where the first view is "Liked" (probably by the creator/uploader) scores 100% and shoots to the top of the rankings. If a few friends all immediately "Like" the page, it becomes difficult to unseat. There's a lot of noise at the top of the rankings.
  • A page where the first view is not "Liked" scores 0% and sinks to the bottom of the rankings. If you have some mechanism for purging bad content from your site (i.e. deleting low-scoring pages that are likely spam, trolls or just lame), then this makes that task more difficult.
Intuitively, a page with 80 likes out of 100 views is more likely to be good than a page with 4 likes out of 5 views. A page with zero likes out of 100 views is almost certainly junk, but 5 views without any likes may not mean much at all.

So, your next goal: Make the best possible estimate of a page's likeability based on the first few views and likes, using some prior knowledge. After that, all reasonable methods should converge on the same score (probability of liking). If a meme catches, people will be able to find that page through other means, and your own rankings will be less crucial to its success.

3. Pseudocounts

A pseudocount is a prior estimate of the probability of an event. To make an estimate of actual probabilities based on a small number of samples (the problem at the end of Step 2), add the pseudocounts to the actual counts of each event.

I'll demonstrate.

The events here are (a) "Like" and (b) "Don't care". I'm going to use b to represent the pseudocount for "Like". For this section, I choose the probabilities:
like = b = .1
dontcare = (1 - b) = .9
The two probabilities should sum to 1.

How do you get these values? Since b represents the probability that a random user will like an arbitrary page, taking the site-wide average of likes versus views is a good choice:
b = all_likes / all_views
To use the pseudocounts, add them to the counts in the formula in step 2:
score = (page_likes + b) / (page_views + 1)
(Recall: views = likes + dontcares; after adding pseudocounts, (likes + b) + (dontcares + 1 - b) = likes + dontcares + 1 = views + 1.)

If the database-wide sums of likes and views are large numbers, this won't significantly affect the "average" score, all_likes / all_views. But it smoothes out the initial scoring for new pages.

Example: Assume 10% of all views result in a "Like" (b = 0.1).

A single view without a "Like" places the page slightly below the global average, but not too much. (Odds are, 90% of pages will start out this way.) Additional views without a "Like" slowly sink the page score toward 0.

Likes:Views Calculation Percentage
0:1 0.1 / 2 5.0%
0:2 0.1 / 3 3.3%
0:3 0.1 / 4 2.5%

A single view with a "Like" gives the page a boost, but not to 100%. This can help it gain traction, but probably won't put it in the top rankings (yet). If subsequent views are also liked, the score continues to rise:

Likes:Views Calculation Percentage
1:1 1.1 / 2 55.0%
2:2 2.1 / 3 70.0%
3:3 3.1 / 4 77.5%
4:4 4.1 / 5 82.0%

Away from the extremes (0% or 100% liked), the effect of pseudocounts is less dramatic, and a mix of "Like" and "Don't Care" (viewed without liking) results in a score closer to what you'd see without pseudocounts — just shifted slightly toward the site-wide average. Notice that a page with two "Likes" out of three views (2:3) is scored almost as well as one "Like" and one view (1:1 above).

Likes:Views Calculation Percentage
1:2 1.1 / 3 36.7%
1:3 1.1 / 4 27.5%
2:3 2.1 / 4 52.5%
2:4 2.1 / 5 42.0%

To increase the effect of pseudocounts, you can put a higher weight on the prior by multiplying the pseudocounts by some constant. If the weighting factor is w, then the calculation is:
score = (page_likes + (b * w)) / (page_views + w)
Think of this as the number of "imaginary" users you have rating each page before any real users see it. The calculations above use a weight of 1, equivalent to one user giving a fractional score of .1 to every page before it goes live, and you can see the effect of it. Play with it a bit to see how it affects your rankings.

4. Statistical significance

You now have a score for each page in your database and a top-to-bottom ranking. Where do you draw the line for "recommending" a page?

Quantiles: Best and worst

Having sorted all the pages by score, take the top 5% as the "best" and bottom 5% as the "worst". Or choose a fixed number, like 25. It's really up to you.

The "best" ranking is for users, especially new visitors. Depending on your application, some users might also be interested in the "worst" pages — how else would we find gems like "Friday"?

Contingency: Is the score meaningful?

Another challenge is to determine when a page's score is statistically meaningful — i.e. the difference between a score of 55% based on 1000 views versus a single view. Using pseudocounts addresses this to some extent at the extremes, but it's still possible for pages with low view counts to score highly. You may also want to purge "junk" content with horribly low rankings — but only once it's been given a fair chance.

With the like and dontcare counts, site-wide and per-page, set up a 2x2 contingency table:

like dontcare
Page A B
Global C D

To evaluate the significance, use a Chi-square test with one degree of freedom (df=1), or if you're picky, Fisher's exact test.

The chi-square test, in R:
> abcd = matrix(c(4, 10, 1000, 10000), nrow=2, byrow=T)
> chisq.test(abcd)

        Pearson's Chi-squared test with Yates' continuity correction

data:  abcd
X-squared = 4.2691, df = 1, p-value = 0.03881
With the common p-value cutoff ("alpha") of 0.05, we'd say this page with 4 likes out of 10 views is significant — for that cutoff, at least. And if we applied the same test across all pages in the database, we'd be wrong.

I'll try to be quick about this, because it matters.

Remember: A p-value of 0.05 means the given like/view ratio will occur by chance 1 in 20 times. Since the same test is being applied to every page in your database, you need to account for multiple hypothesis testing, or else many pages will meet the cutoff by chance.

If you only have a few pages — say, less than 40 — then you can divide alpha by the number of pages (e.g. 20) and use that in place of the original cutoff (0.05 / 40 = 0.00125, so the previous p-value of 0.03881 would not be significant).

More likely, you have many more pages than that — hence the need to use grown-up statistics in the first place. Bonferroni correction (described above) would produce a cutoff that's much too stringent, so you'll need a more powerful method.

R makes this easy. Starting with a single array of the p-values from Chi-square tests of each page:
> pvals = sapply(chisq.test, contingencytables)
Adjust these raw p-values for multiple testing (using the familywise error rate, by default — read the help page for p.adjust for all the details):
> pvals.adj = p.adjust(pvals)
What do you after this depends on your own code. You can get a boolean array signifying which adjusted p-values are now smaller than alpha, which is useful for selecting "significant" pages from the original page list:
> significant = pvals.adj < 0.05
Note that this selects both significantly liked and significantly disliked pages at the same time. To distinguish between the two, just compare each page's like/view ratio to the global average and select higher or lower.

Another note about the contingency table: Once your application has counted a very large number of site-wide likes and views (cells C and D), this test will register significance for almost any page. You might have better results by replacing the global view and "Like" counts with a per-month or per-user average. And, you can cache these values and update them only occasionally.

Trending

Calculating p-values is a lot more work than selecting the top and bottom quantiles. If you've put in the extra effort, here's another feature you can support: a list of newly significant winners and losers.

Each day (or hour or so), perform the chi-square test (described above) across all pages and note which ones cross the significance threshold. Compare this to the previous run's results to see which pages have crossed over, and add these newly significant hits to a separate chart — "Trending", I'll call it.

This chart shows the pages that have just recently been determined to be likeable, but (probably) haven't accumulated enough votes to reach the "Top Pages" chart. It's a timelier list than "Top Pages", though the average quality of the "Trending" pages is not as high. This is the place where memes show up first. If they're truly good content, they'll eventually make it onto "Top Pages" — but that's not usually the case with memes.

I'd treat the "Trending" chart as a queue, adding newly trending pages to the top at the end of each run and dropping pages from the bottom as space permits. Or just keep it rolling by week, like a blog. By adjusting alpha you can tune the number of newly significant pages found in each run, and therefore the turnover rate of your "Trending" queue.

5. But is it general?

Under the Google Whatever model (YouTube, Picasa, etc.), the ratio of Likes to total views for any given page is small. The statistics here will work in other cases, though — for example, an "Approve" button which clicked most of the time, or a "Dislike" button in place of the "Like" button. In the case where users have to click either "Like" or "Dislike" (Yes/No, Yay/Nay, or any other two options), this is also fine; just pick one option to count, and count "views" as the sum of likes and dislikes.

Update: The "Like" event can also be something implicit, like downloads or completed streams (out of started streams). This opens up options for sites that don't require users to register.

What about sites with 5-star ratings, like Amazon? Well, there's an easy way and a hard way. Easy: count the ratings as fractional Likes ([0, .25, .5, .75, 1] if you allow 0 stars, [0, .33, .67, 1] if you don't), and use the pseudocounts just like before. The hard way is to treat each star ranking as a separate event category — but that's going to have to wait for a later post.

2 comments:

Jason F. said...

You bring up some thoughtful points, but I don't know that its possible to really analyze the true popularity of something from the Like button statistics in all but the most general way. Taking into account page views would help, but there's more to it than just that.

* My real life friends and my work friends are my friends on Facebook and I don't want to announce to all of them that I spend all day looking at nerdy articles online or wasting time at work. So even if I like something online, I'm very hesitant about clicking the Like button. Thats where I think that Google+ with its more pronounced Circles feature might have an edge over Facebook with regards to sharing.

* Placement of the Like button probably affects clicks greatly and its hard to directly compare two different articles on two different sites with different place.

* There are a bunch of companies that actually sell Facebook fans and likes (citation) and the results are probably skewed because of that

* Even if somebody doesn't go and buy likes, there are people who sit there all day and beg their friends to like their blog and stuff. And that friends pressure skews the result somewhat. There are people who self-promote themselves endlessly and people who do not so this also affects the number of Likes.

There are also probably other factors involved that I'm not thinking of right now.

I think a better analysis of popularity would have to involve data from multiple sources, semantic analysis of comments from various streams of data like Twitter, popularity data from social news sites like Reddit, and more than I could ever think of.

What is really interesting to me is if Facebook ever adds a Dislike button. They would be unlikely to ever do that, but that would open up a pretty interesting additional set of data for analysis regardless.

Eric Talevich said...

Right, the design of Facebook makes it hard to properly recommend arbitrary Facebook pages -- I'm not sure how they do things internally, but I assume their recommendations are based on what your friends have liked, and which pages tend to be liked by the same people.

The stats here are designed for separate sites, where an equivalent "Like"/"+1"/"Cool" button isn't connected to sharing off-site (though maybe it goes on your profile within the site), and presumably the layout of every "content" page is the same -- something like YouTube or deviantART.

(I agree Google+'s circles open up a lot of good opportunities for data mining.)

"Like" can also be something implied, like downloads or completed streams (scrobbles?). That would fit the model of some sites that don't even require users to register.