Saturday, October 25, 2008

Psychology research with Mechanical Turk

Elevator pitch: There's a missing gear in the machine of psychology research. Every significant human study requires weeks or months of data collection, and more time coding that data into a form that can be analyzed statistically. This makes it infeasible to do the sort of fast, iterative refinement of models that biology has seen in recent years.

Amazon's Mechanical Turk provides the missing piece: an accessible interface for building a survey, interactive test, or other psychological measure, pushing it out to thousands of participants, quickly returning the results to the researcher in electronic form, and screening out unusable data. It's flexible enough to allow screening and debriefing, and it gives access to a vastly larger pool of participants than Experimetrix. And it's cheap.

Background

First, take a look at this: Ten Thousand Cents.

When I first heard about bioinformatics, I was under the impression that it was the exponentially increasing power of computers that made it irresistible to start using them for biological research. Actually, it was pretty much the reverse -- high-throughput experimental methods like gene sequencing, mass spectrometry and X-ray crystallography generated too much data for humans to process manually. Computers were only barely able to handle this workload in 1986, when the Human Genome Project was first getting started -- scientists just did what was needed to move around the mountains of data coming out of their experiments. Similarly, new computational research is coming out of the Large Hadron Collider project now.

Psychology researchers (especially in social psychology) currently spend semesters at a time gathering data for their studies and converting it into a form that can be quantitatively analyzed. High-throughput experimental methods are scarce and expensive, so there's no "data glut" driving the development of better information-management methods. Progress in the field is slow and lossy -- since there's not much demand for the raw data, conclusions are described only qualitatively, which makes it hard to use prior results as a solid foundation for future work.

With Mechanical Turk, it's possible to do in one shot a study that would otherwise require a meta-analysis of several studies across particular locations or demographics. With more consistent methods and larger populations, the data becomes reusable.

How it could work

If it fits behind a web interface, or can be described and completed with plain language or pictures, it can be done with Mechanical Turk. Where required, a consent form can precede the main task, and a debriefing blurb can close it out.

To get a feel for how it's done, read this article: The Faces of Mechanical Turk

Naturally, the first study done this way should be something to determine how the population of Turkers corresponds to the general population and the student populations that have already been characterized in previous studies. A public-domain measure of the Big Five or something like the Narcissistic Personality Inventory would be good candidates. Then, let slip the hounds of statistics. Are Turkers as representative of the general population as psychology undergrads? More so?
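
As a minimal sketch of what that comparison could look like (assuming each sample's trait scores have been exported to CSV; the file names and the 'extraversion' column are hypothetical), the first pass might be a simple two-sample t-test:

    # Hypothetical sketch: compare Turkers and undergrads on one Big
    # Five trait. File names and column labels here are made up.
    import csv
    from scipy import stats

    def scores(path, column):
        with open(path) as f:
            return [float(row[column]) for row in csv.DictReader(f)]

    turkers = scores('turker_results.csv', 'extraversion')
    students = scores('undergrad_results.csv', 'extraversion')

    # Welch's t-test: are the two groups' means distinguishable?
    t, p = stats.ttest_ind(turkers, students, equal_var=False)
    print('t = %.3f, p = %.4f' % (t, p))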

Some research along these lines has already been blogged here: Mechanical Turk Demographics

Now, let's try some examples.

Surveys: You craft a survey, Turkers take it, and you retrieve and filter the results through the Mechanical Turk interface. Pretty straightforward, no?
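
As a rough sketch of the retrieve-and-filter step (assuming the batch results were downloaded as a CSV file; the catch-question column name and its accepted answer are hypothetical):

    # Sketch: screen a downloaded Mechanical Turk results file, keeping
    # only rows that passed an embedded catch question. The column name
    # 'Answer.catch_question' and the accepted answer are hypothetical.
    import csv

    def usable_rows(path):
        with open(path) as f:
            for row in csv.DictReader(f):
                if row['Answer.catch_question'] == 'strongly agree':
                    yield row

    kept = list(usable_rows('survey_batch_results.csv'))
    print('%d usable responses' % len(kept))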

Interactive tasks: This is what Mechanical Turk was designed for; only, the focus was expected to be on the task, not the Turker. Anyway, the data's yours. An example of a task like this would be a simple, unbounded game (Flash or JavaScript) that the participant can quit at any time (possibly paired with another stimulus). The returned data would be the play duration alongside any personal or demographic information requested.
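
A sketch of what analyzing that returned data might look like (the field names are purely illustrative, not anything Mechanical Turk mandates):

    # Sketch: summarize play duration by stimulus condition. Each dict
    # stands in for one completed HIT; field names are illustrative.
    from collections import defaultdict

    results = [
        {'condition': 'music', 'duration_sec': 412, 'age': 24},
        {'condition': 'silence', 'duration_sec': 251, 'age': 31},
        # ... one entry per participant
    ]

    by_condition = defaultdict(list)
    for r in results:
        by_condition[r['condition']].append(r['duration_sec'])

    for condition, durations in sorted(by_condition.items()):
        mean = sum(durations) / float(len(durations))
        print('%s: n=%d, mean play time %.1f s'
              % (condition, len(durations), mean))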

Coding visual or audio data: Following the original intent of Mechanical Turk more closely, this application of the service distributes a repetitive task normally performed over several weeks or months by the researcher or a group of grad students. Rather than collecting new data about a participant, this simply boils down a vast quantity of data that's already been generated -- this is a problem we want to have. A two-step example: (1) run a Mechanical Turk task in which participants draw or assemble an arbitrary image; (2) run a second task with a different set of participants who look at these images and code (type or select) the relevant traits they see in them.
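
Since the value of crowd-coded data depends on coders agreeing with each other, a quality check belongs in step (2). Here's a small sketch computing Cohen's kappa between two Turkers who coded the same images (the trait labels are invented):

    # Sketch: inter-rater agreement (Cohen's kappa) between two Turkers
    # who coded the same set of images. The trait labels are invented.
    from collections import Counter

    def cohens_kappa(coder_a, coder_b):
        n = float(len(coder_a))
        observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
        freq_a, freq_b = Counter(coder_a), Counter(coder_b)
        expected = sum(freq_a[label] * freq_b[label]
                       for label in set(coder_a) | set(coder_b)) / (n * n)
        return (observed - expected) / (1.0 - expected)

    a = ['happy', 'sad', 'happy', 'neutral', 'happy']
    b = ['happy', 'sad', 'neutral', 'neutral', 'happy']
    print('kappa = %.2f' % cohens_kappa(a, b))  # 1.0 = perfect agreement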

Measure development: One of the more uncomfortable questions in social psychology research is the validity of personality measures. Devise a series of questions and a method for tabulating the results; run it on some participants; analyze the results to get some answers. But, what's really being tested here -- the population, or the measure? Tragically, there's no time to refine the measure very much; if the results are useful, you run with it. But! With Mechanical Turk, collecting survey results is cheap and quick; and since the general format of the survey isn't changing between revisions, the same set of statistical transformations can be applied programmatically to each iteration of the survey.

This is a great way to build a psychological measure that you can be confident in: Push an initial draft of the measure out to Turkers, receive some results, perform a statistical analysis and save the operations as an R or SPSS script. Then, manually refine the measure, put it back on Turk, filter the new results through your analysis script, and repeat until it looks good. This can get as advanced as you'd like -- start with several times as many questions as you'd like to see in the final survey, then automatically dispatch random subsets of the question list to Turk, filter through your automatic analysis to get some scores indicating quality, and use a Bayesian classifier to narrow down the best possible subset of questions.
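
To make that loop concrete, here's a minimal sketch of the scoring step, with Cronbach's alpha (a standard internal-consistency statistic) standing in for the saved analysis script, and plain random search standing in for the Bayesian-classifier step; the response matrix is a stand-in for real Turk results:

    # Sketch: score random subsets of survey items by internal
    # consistency (Cronbach's alpha) and keep the best. 'matrix' holds
    # one row of item scores per participant; a real pipeline would
    # dispatch each candidate subset to Turk and pull responses back.
    import random

    def variance(xs):
        m = sum(xs) / float(len(xs))
        return sum((x - m) ** 2 for x in xs) / float(len(xs))

    def cronbach_alpha(matrix):
        k = len(matrix[0])
        item_vars = [variance([row[i] for row in matrix]) for i in range(k)]
        total_var = variance([sum(row) for row in matrix])
        return (k / (k - 1.0)) * (1.0 - sum(item_vars) / total_var)

    def best_subset(matrix, size, trials=200):
        best_alpha, best_items = None, None
        for _ in range(trials):
            items = random.sample(range(len(matrix[0])), size)
            candidate = [[row[i] for i in items] for row in matrix]
            alpha = cronbach_alpha(candidate)
            if best_alpha is None or alpha > best_alpha:
                best_alpha, best_items = alpha, sorted(items)
        return best_alpha, best_items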


Update: Here's a conference paper on the same topic.
http://www-users.cs.umn.edu/~echi/papers/2008-CHI2008/2008-02-mech-turk-online-experiments-chi1049-kittur.pdf

Friday, March 7, 2008

Vimming your way to the top

Here's the Vim syntax file I use for highlighting my to-do list. It's based on the syntax file for YAML.
http://www.vim.org/scripts/script.php?script_id=2599

Benefits:
  • Different colors for lines ending in ':', or starting with '*' or '{'
  • Assign keywords to be automatically highlighted, like important locations, coworkers' names, customers, taquerias, etc.
  • Start sections with a line of underscores and a heading beginning with the '{' character. The heading stands out (red with GVim's "desert" color scheme), and you can jump between sections just like C blocks using ]] and [[ keystrokes.
  • Ordinary text (i.e. not specifically formatted for this syntax) looks sane.
Normally I have a line in my .vimrc assigning the filetype "todolist" to the file where I keep my permanent to-do list, but another way to add this highlighting to a text file is to put "vim: ft=todolist" on a line at the end of the file. It's harmless.

Update (4/2/09): I uploaded the script to vim.org, where it will be easier to track and update.

Update (1/1/10): Here's an example of how to use this color scheme for course notes.


  • 60 underscores (my preference) and a curly brace indicate a new section
  • Subsection lines end with a colon (generally followed by bullet points)
  • Special or out-of-context notes start with an asterisk
  • For separation, or to display a different sort of sub-heading, play with asterisks: '* * *' centered, or '** OLD **' for example
At school, I run a shell script for each new class that creates a new directory from the course name, copies a skeleton of this example text to a file called lecture-notes.txt, etc., and adds the directory to Mercurial -- so while there's some boilerplate involved with this plugin, it's easy to automate and plays well with Vim's text-munging capabilities.
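
For what it's worth, here's a rough Python stand-in for that setup script; the skeleton path and file names are invented for illustration:

    # Rough stand-in for the per-course setup script described above.
    # The skeleton path and file names are invented for illustration.
    import os
    import shutil
    import subprocess
    import sys

    SKELETON = os.path.expanduser('~/templates/lecture-notes-skeleton.txt')

    def new_course(name):
        os.makedirs(name)
        shutil.copy(SKELETON, os.path.join(name, 'lecture-notes.txt'))
        subprocess.check_call(['hg', 'add', name])  # track it in Mercurial

    if __name__ == '__main__':
        new_course(sys.argv[1])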

I've also picked up the habit of putting @contexts above unsorted items at the top of my main to-do list, inspired by the GTD approach. The syntax plugin doesn't take advantage of this yet; I'll post another update when that's done.

Saturday, January 19, 2008

On Blub

There's an interesting (old) discussion thread at Raganwald.com:
http://weblog.raganwald.com/2006/10/are-we-blub-programmers.html

Blub Theory evolves over the course of the comments. First off, the requirements for Blubbiness are defined as:
  1. There's at least one language that's worse for the task at hand, and the programmer realizes (validly) that it's less suited for the task than Blub.
  2. There's at least one language that's better for the task, and the programmer doesn't realize that it's better for the task than Blub.
That's Blub from the programmer's perspective. For the adventurous programmer, Blub is the language you're fighting against when you try to introduce Python or OCaml to your programming team. (Citations: Beating the Averages, The Python Paradox)

Another commenter touches on the management perspective of Blub:
"I misread 'Blub' to be 'bulb'. As in when a programmer burns out his employer just throws him away and screws in a new one."
'Nuff said.

When we're not trying to pin it down too carefully, we know exactly what Blub is. The first Blub was Fortran, and in some circles it's still Blub. Guy Steele Jr. (who was involved in the design of Scheme, Common Lisp, and Java in previous efforts to take down previous Blubs) is currently working on Fortress with the same goal. Fortress looks nothing like Fortran, but the name's close enough to get the point across: it's intended to be much better suited for the tasks where scientific programmers instinctively reach for Fortran. C++ was the Blub of the '90s, and since real Blubs never die, it's still the Blub of choice for most performance-critical stuff. Java out-Blubbed C++, and now Java and C# are splitting the Blub market. However, examples of Blub code are still generally a simplified C++, since the equivalent in Java would take too much boilerplate to be worth the column space.

(Sidebar: Modern Fortran looks almost nothing like the original Fortran. C++, especially the variety found in Visual Studio now, with the 200X extensions, is also a mutant, and it's not a superset of C99 either, defeating the original purpose of the language. Visual Basic, VBA and VB.NET are not compatible, despite the naming. How can a language take its users for such a ride over the years, when another language with a different name might be the more logical next step? JavaScript and C# surely got a bit more mileage out of name recognition. All the evidence points to human psychology still working on programmers.)

But the argument breaks down when we try to explain why another language is better suited for the task than Blub. There are C++ programmers out there who can bust out a better program than you can write in any other language. Of the last 10 years of the ICFP Programming Contest, 3 winners used Haskell, 3 used OCaml, and 2 used C++, and this is a contest arranged by functional-programming gurus. Python and Ruby have never received any prizes, though a second-place team using Perl was recognized last year.

I see two axes to evaluate languages on, something like a front end and a back end: semantics and implementation. Both are labeled "power", but for the front end that means what the language does for the programmer, and for the back end it means what the program does for the machine.

Example 1: Ruby has an excellent front end. It's one of the most expressive languages available; that's probably why 37signals picked it for Rails. The best semantic ideas from Smalltalk, Common Lisp and Perl are all in there; most of the famous Design Patterns are built in either implicitly or explicitly. (It's not that the language makes them all obsolete; the language designer just had the foresight to implement the tricky ones for you.)

But the back end has been playing catch-up. Performance lagged well behind even Python and Perl until the very recent v1.9, and there's no native-code compiler. I could be misinformed, but I've also heard that: threading suffers the same issue as Python, with the interpreter process limited to a single processor; there's no built-in foreign-function interface a la Haskell or Python's ctypes module; Unicode support has rough spots; and large-scale parallelism and concurrency basically mean running a bunch of separate Ruby processes.

There's a lot for a Blub programmer to pick on.

Example 2: Delphi, a.k.a. Object Pascal, has an excellent native-code compiler, with support for cross-platform compilation and single-file (DLL-free) executables; it can also run on .NET, with all these options available through the same IDE. It's competitive with C on benchmarks, often faster. Integration with databases and other external components is solid. Refactoring tools are included with the IDE, with lots of fun static analysis. Object Pascal itself was originally designed at Apple some years before Borland picked it up for their offering, and was then abandoned by Apple, but there seems to be something inherent in the language that enables highly optimized compilation. The Free Pascal implementation, for instance, comes out well ahead of every other language in the Computer Language Benchmarks Game when memory and speed are weighted equally. On the combined benchmarks, Free Pascal uses only half as much memory as C (gcc)!

The catch is, Object Pascal is a cheesy-looking language. On the same benchmark set, comparing the size of the code in gzipped bytes (emphasizing tokens instead of characters), Object Pascal comes in 24th out of 33 languages, just behind C. It beats Fortran, Java and C++, but not C#. I think I'd just buy more RAM rather than rewrite a Blub program in Object Pascal.

The benchmarks tend to be heavy mathematical algorithms rather than general-use applications, so certain things like I/O, libraries, and support for bottom-up programming and meta-programming are discounted. Regardless, Python, Perl and Ruby are the top 3 languages for code size on these benchmarks. I think Lisp was hurt more by this aspect of the benchmarks, since syntactic sugar isn't built in and there's no room for code reduction via a mini-language; Haskell was probably helped by the absence of I/O. In general, the benchmarks show that Blub languages perform well but are somewhat verbose, while scripting and Web-friendly languages are concise but perform poorly; Prolog ranks badly in every way, while OCaml and Haskell do well in every way. This fits reality fairly well for number-crunching, but not for the Web or AI. Let's acknowledge once again that benchmarks aren't perfect, and forge ahead.

Fact: A language and a compiler are not the same thing.

But the Object Pascal example should show that a language can baby the compiler to give better results. The three arguments go:
  1. Declaring static types and generally programming close to the metal gives the compiler the information it needs to generate an optimally efficient program. That's why C can be fast: it's a close fit to the hardware. The same goes for Java and the JVM.
  2. Using the right abstractions and strong type inference lets the compiler get a high-level view of what your algorithm is doing, allowing it to do more of the optimization itself. That's why OCaml and Haskell can be fast: they're a close fit to the pure algorithm.
  3. While the expressiveness of new languages like Ruby and Python is appealing, the race to incorporate imperative, object-oriented and functional programming styles into every major language is actually resulting in weaker languages. Borrowing features doesn't bring a language any closer to providing a new model of computation, and it certainly doesn't give a better angle of attack on the whole point of all of this: making the computer hardware do what we want.
The third argument was made by John Backus, creator of Fortran and of the Backus-Naur Form for defining programming language syntax, and later of the FP and FL programming languages.
"Programming languages appear to be in trouble. Each successive language incorporates, with a little cleaning up, all the features of its predecessors plus a few more. [...] Each new language claims new and fashionable features... but the plain fact is that few languages make programming sufficiently cheaper or more reliable to justify the cost of producing and learning to use them."
— John Backus

The talk that began with this argument (Backus's 1977 Turing Award lecture, "Can Programming Be Liberated from the von Neumann Style?") went on to introduce function-level programming. At the time, everyone thought Backus was talking about functional programming, so it unintentionally gave a boost to Lisp and later to the ML family, from which Haskell and OCaml are derived. But no: it was actually about a new language called FP, somewhat based on APL. FP begat FL, which went nowhere, but Morgan Stanley created an ASCII-friendly variant of APL called A+ (which is now free, GPL'd software), and the proprietary J and K have carried the torch since then. The use of these languages now seems to be mostly in the financial world. (Perhaps because they're really very well-suited to financial tasks, and perhaps because that's where APL made its splash; who knows, they may have even Blubbed their way in.)

The main idea is point-free programming: rather than pushing values around (as even functional programming languages do), compose functions together to create an algorithm that references only functions, not values. Then create a basic set of operators that can be composed to build higher-level functions. This is an excellent way to manipulate arrays and matrices. Haskell touches on this idea but doesn't emphasize it.
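
To make the idea concrete, here's an illustrative sketch in Python; Python is just a familiar stand-in, and K or J would express the same pipeline far more tersely:

    # Point-free style, illustrated: define the algorithm by composing
    # functions; the data is never named until the pipeline is applied.
    from functools import reduce

    def compose(*fns):
        return reduce(lambda f, g: lambda x: f(g(x)), fns)

    evens = lambda xs: [x for x in xs if x % 2 == 0]
    square = lambda xs: [x * x for x in xs]

    # "Sum of the squares of the even elements", with no value variables.
    sum_sq_evens = compose(sum, square, evens)

    print(sum_sq_evens(range(10)))  # 0 + 4 + 16 + 36 + 64 = 120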

Benchmarks for any of these languages are hard to find, but I see one cryptic statement about K here:
 [k is much faster than c on strings and memory management.]

And another startling statement on Wikipedia:
The performance of modern CPUs is improving at a much faster rate than their memory subsystems. The small size of the interpreter and compact syntax of the language makes it possible for K applications to fit entirely within the level 1 cache of the processor. Vector processing makes efficient use of the cache row fetching mechanism and posted writes without introducing bubbles into the pipeline by creating a dependency between consecutive instructions.

It looks pretty convincing to me. Finally, a fresh look at how programming languages make a machine do work.

This seems to be the argument missing from every language war: removing the non-orthogonal parts of a language makes it more powerful. K doesn't have objects or continuations, and it doesn't need them. Likewise, Haskell restricts the ability to modify state to monads, Erlang's flow-control constructs throw out traditional iteration entirely, and Lisp virtually strips out syntax itself.

Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary.
— Revised5 Report on the Algorithmic Language Scheme


The corollary to (and irony of) the Blub paradox is that since these optimized languages are missing constructs found in Blub — by design — a Blub programmer will always have plenty to pick on.