Monday, July 20, 2009

Faster string concatenation in Python

Nick Matzke pointed me to this discussion of string concatenation approaches in Python:

Efficient String Concatenation in Python

The issue here is whether adding strings together in a for loop is inefficient enough to be worth working around. Python strings are immutable, so this:

s = 'om'
for i in xrange(1000):
s += 'nom'

means doing this 1000 times:

  1. Translate the assignment to "s = s + 'nom'"

  2. Allocate another string, "buzz". (Or reuse a reference if it's already interred.)

  3. Call the __add__ method on s, with ' nom' as the argument

  4. Allocate the new string created by __add__

  5. Assign a reference to that string back to s


So, using the + operator 1000 times in a loop has to create 1000 ever-larger string objects, but only the last one gets used outside the loop. There are good reasons Python works this way, but still, there's a trap here in an operation that gets used a lot in ordinary Python code.

There are a few ways to cope:

  • Use a mutable object to build the string as a sequence of bytes (or whatever) and then convert it back to a Python string in one shot at the end. Reasonable intermediate objects are array and StringIO (preferably cStringIO).

  • Let the string object's join method do the dirty work -- strings are a basic Python type that's been optimized already, so this method probably drops down to a lower level (C/bytecode in the CPython interpreter, not sure about the details) where full allocation of each intermediate string isn't necessary.

  • Build a single format string and interpolate with the % operator (or the format method, if you're fancy) to fill it in, under the same rationale as with the join method. This fits real-world scenarios better — filling in a template of a plain-text table or paragraph with computed values, either all at once with % or incrementally with string addition. It could be a performance bottleneck, and it's not obvious which approach would be better.


The original article gives a nice analysis and comes out in favor of intermediate cStringIO objects, with a list comprehension inside the string join method as a strong alternative. But it was written in 2004, and Python has changed since then. Also, it doesn't include interpolation among the tested methods, and that was the one I was the most curious about.

Methods


I downloaded and updated the script included with that article, and ran it with Python 2.6 and 2.5 to get some new results. (Source code here.)

First, a changelog:

  • The method numbers are different, and there are a couple more. Method #2 is for the % operator, in which I build a gigantic format string and a gigantic tuple out of the number list, then smash them together. It trades memory for CPU time, basically. Method #8 uses map instead of a list comprehension or generator expression; no lambda is required and the necessary function (str()) is already available, so this is a good candidate.

  • I used the standard lib's time.clock() to measure CPU time around just the relevant loop for each string concatenation method.

  • Fetching the process memory size is similar but uses the subprocess module and different options.

  • Docstrings are (ab)used to identify the output.


For example, the string addition method now looks like this:

def method1():
"""1. string addition"""
start = clock()
out_str = ''
for num in NUMS:
out_str += str(num)
cpu = clock() - start
return (out_str, cpu, memsize())


Results


Each method concatenates the string representation of the numbers 0 through 999,999. The methods were run sequentially in separate processes, via a for loop in the shell, for Python versions 2.5 and 2.6. The best of three runs for each method are shown below.
Python 2.6:

1. string addition CPU (s): 1.99 Mem (K): 11.7
2. %-interpolation CPU (s): 2.42 Mem (K): 23.0
3. array object CPU (s): 3.42 Mem (K): 17.3
4. cStringIO object CPU (s): 3.24 Mem (K): 19.7
5. join + for loop CPU (s): 2.29 Mem (K): 48.0
6. join + list comp CPU (s): 1.93 Mem (K): 11.6
7. join + gen expr CPU (s): 2.08 Mem (K): 11.6
8. join + str map CPU (s): 1.47 Mem (K): 11.6

The winner is map, with string addition, the list comprehension, and the generator expression also doing well. String addition in a loop did much better than would be expected from reading the original article; the Python developers have put effort into making this less of a trap. Specifically, there's a flag on string objects internally that indicates whether the string is the result of an addition operation. This helps the interpreter identify when a string is being concatenated in a loop, and optimize that case by performing in-place concatenation. Nice. So really, there's no need to worry about the quadratic time behavior that we expected — at least in Python 2.6.

The array object, a sequence of packed bytes, is supposed to be a low-level but high-performance workhorse. It was embedded in the minds of performance-conscious Python programmers by this essay by Guido van Rossum:

Python Patterns — An Optimization Anecdote

At a glance, that problem looks similar to this one. However, converting ints to chars is a problem that can be described well in bytes. Converting integers to their string representation is not — we're not even using any features of the array object related to byte representation. Going low-level doesn't help us here; as Guido indicates in his conclusion, if you keep it short and simple, Python will reward you. The StringIO object in method 5 performs similar duties, and the shape of both functions is the same; the only difference in performance seems to be that cStringIO trades some memory space for CPU time.

The string join method is recommended by the Python standard library documentation for string concatenation with well-behaved performance characteristics. Conveniently, str.join() accepts any iterable object, including lists and generator expressions. Method 5 is the dump approach: build a list in a for loop, pass it to join. Method 6 pushes the looping operation deeper into the interpreter via list comprehension; it saves some bytecode, variable and function lookups, and a substantial number of memory allocations.

Using a generator expression in method 7 instead of a list comprehension should have been equivalent or faster, by avoiding the up-front creation of a list object. But memory usage is the same, and the list comprehension runs faster by a small but consistent amount. Maybe join isn't able to take advantage of lazy evaluation, or is helped by knowing the size of the list object early on... I'm not sure. Interesting, though. In Python 3, the list comprehension is equivalent to building a list object from a generator expression, so results would probably be different there.

Finally, in method 8, map allows the interpreter to look up the str constructor just once, rather than for each item in the given sequence. This is the only approach that gives an impressive speedup over string addition in a loop. So how portable is this result?

Python 2.5:

1. string addition CPU (s): 3.77 Mem (K): 10.8
2. %-interpolation CPU (s): 2.43 Mem (K): 22.0
3. array object CPU (s): 5.16 Mem (K): 16.4
4. cStringIO object CPU (s): 4.93 Mem (K): 18.7
5. join + for loop CPU (s): 3.98 Mem (K): 47.1
6. join + list comp CPU (s): 3.30 Mem (K): 10.5
7. join + gen expr CPU (s): 3.59 Mem (K): 10.5
8. join + str map CPU (s): 2.72 Mem (K): 10.5

Python 2.6.2 has had the benefit of additional development time, and in notably the Unladen Swallow project's first quarter of interpreter optimizations, with impressive improvements across the board. By comparison, Python 2.5 uses generally less memory and more CPU time. String interpolation, however, seems to already have been optimized to the max in Python 2.5, and actually wins the performance shootout here! String addition, on the other hand, is slightly less adept at optimizing in a loop. It still avoids the quadratic-time issue (that enhancement was added in Python 2.4), and memory usage is quite respectable.

Conclusion


The recommendations at the end of Guido's essay are still exactly right. In general, Python performs best with code that "looks right", with abstractions that fit the problem and a minimum of branching and explicit looping.

  • Adding strings together in simple expressions will be optimized properly in recent Pythons, but could bite you in older ones

  • Using string interpolation or templates plays well enough with more complex formatting

  • Going too low-level can deprive you of Python's own optimizations

  • If built-in functions can do what you need, use them, and basic Haskell-style functional expressions can make your code very concise


There's more discussion on Stack Overflow.

Sunday, March 8, 2009

Mnemosyne: Getting Things Memorized

It had been bothering me since I joined this lab that I couldn't confidently just read a protein sequence and understand what it meant — naming the residues, picturing the side chain structures, and understanding the significance of replacing one residue with another. I expected that I'd just pick it up naturally from working with sequences and structures, and that did happen somewhat. But I wanted it to be as easy as reading English, and that level of completeness doesn't happen without some rote memorization.

That brought to mind a Wired article about Piotr Wozniak and his spaced-repetition memorization program, SuperMemo. When I originally read the article I wasn't in grad school and didn't have an urge to memorize any particular list of things. Anyway, SuperMemo appeared to be Windows-only software, and an algorithm like this would be more fun to code from scratch anyway. Enough fun, really, that there had to be one or two open-source implementations floating around.

Mnemosyne popped up as the closest match in an Ubuntu package search, so I'm running with that. Putting the flash cards together was pretty simple; I was able to do it in a few minutes from inside the program and export it in the standard XML format. I zipped it up with a quick plain-text README and uploaded it to the project home page as the Amino Acids card set.

The content came from a slide in a lecture, and I did a quick sanity check on Wikipedia before uploading. The notation for the 20 standard amino acids is complete, and that was the main goal of this. The assignment of amino acid "groups" seems to be a little arbitrary, depending on the source (by structure, functional groups, chemical properties, etc.), and I tried to make the categories complete without too much overlap -- there's a small deviation from my slide here. I also added another category for "side chain properties", pH and polarity. Another enhancement might be the standard codons for each amino acid, though I'm not sure I want to deal with that yet.

Saturday, February 7, 2009

Carrots and sticks

What's old is new again:

Professor makes his mark, but it costs him his job

In Zen and the Art of Motorcycle Maintenance, Robert Pirsig mentions his own experiment in withholding grades at a university. He didn't just announce on the first day that everyone would get an A+ (that seems gimmicky), but since it was a class on rhetoric, he spent the course developing an argument for eliminating the grades-and-degrees system and discussing it with his students. Most students were unenthusiastic or opposed -- grades and degrees are what they came for.

He assigned, collected and graded papers but returned them to students with only the comments, not the grade.

At first:

  • A-students were annoyed, but did the work anyway
  • B-C students blew off some assignments
  • C-D students usually skipped class

He observed this and changed nothing. If students acted up, he let it slide.

Around 3-4 weeks into it:

  • A-students got nervous and pushed themselves harder, in class and in papers
  • B-C students saw what the try-hards were doing and returned to the usual level of effort
  • C-D students who had committed to ditching would occasionally show up out of curiousity

And finally:

  • A-students relaxed and began enjoying the class as active participants. In a final essay, still not knowing what their grades were, these students favored eliminating grades by 2-1.
  • B-C students saw this and freaked, putting an unusual amount of effort into their work. Eventually, they joined the A students in engaging class discussions. These were evenly divided over the issue of eliminating grades.
  • C-D students, or those who attended, also saw this and began trying to hand in reasonable work. Those who couldn't hack it freaked even more, and remained in a state of Kafkaesque terror until the quarter mercifully ended. Naturally, in the final essay these students were unanimously opposed to eliminating grades.

Interesting as this result was, he reverted to the regular grading system the next quarter because he couldn't provide any alternate goal for students -- those who can recognize quality in their own work don't need the university; those who can't need something to work toward, or they don't progress.

Saturday, October 25, 2008

Psychology research with Mechanical Turk

Elevator pitch: There's a missing gear in the machine of psychology research. Every significant human study requires weeks or months of data collection, and more time coding that data in a form that can be analyzed statistically. This makes it infeasible to do the sort of fast, iterative refinement of models that biology has seen in recent years.

Amazon's Mechanical Turk provides the missing piece. It provides an accessible interface for building a survey, interactive test, or other psychological measure, pushing it out to thousands of participants, quickly returning the results to the researcher in electronic form, and screening out unusable data. It's flexible enough to allow screening and debriefing, and gives access to a vastly larger pool of participants than Experimetrix. And it's cheap.

Background

First, take a look at this: Ten Thousand Cents.

When I first heard about bioinformatics I was under the impression that it was the exponentially increasing power of computers that made it irresistible to start using them for biological research. But actually, it was pretty much the reverse -- high-throughput experimental methods like gene sequencing, mass spectrometry and X-ray crystallography generated too much data for humans to process manually. Computers were only barely able to handle this workload in 1986, when the human genome project started -- scientists just did what was needed to move around the mountains of data coming out of their experiments. Similarly, new computational research is coming out of the Large Hadron Collider project now.

Psychology researchers (especially in social psychology) currently spend semesters at a time gathering data for their studies and converting it into data that can be quantitatively analyzed. High-throughput experimental methods are scarce and expensive, so there's no "data glut" driving the development of better information-management methods. Progress in the field is slow and lossy -- since there's not much demand for the raw data, conclusions are described qualitatively, which makes it hard to use prior results as a solid foundation for future work.

With Mechanical Turk, it's possible to do in one shot a study that would otherwise require a meta-analysis of several studies across particular locations or demographics. With more consistent data and larger populations, data can be reusable.

How it could work

If it fits behind a web interface, or can be described and completed with plain language or pictures, it can be done with Mechanical Turk. Necessarily, a form of consent can precede the main task, and a blurb of debriefing can finish.

To get a feel for how it's done, read this article: The Faces of Mechanical Turk

Naturally, the first study done this way should be something to determine how the population of Turkers corresponds to the general population and the student populations that have already been characterized in previous studies. A public-domain measure of the Big Five or something like the Narcissistic Personality Inventory would be good candidates. Then, let slip the hounds of statistics. Are Turkers as representative of the general population as psychology undergrads? More so?

Some research along these lines has already been blogged here: Mechanical Turk Demographics

Now, let's try some examples.

Surveys: You craft a survey, Turkers take it, and you retrieve and filter the results through the Mechanical Turk interface. Pretty straightforward, no?

Interactive tasks: This is what Mechanical Turk is designed for; only, the focus was expected to be the task, not the Turker. Anyway, the data's yours. An example of a task like this would be a simple, unbounded game (Flash or JavaScript) that the participant can quit any time (possibly paired with another stimulus). The returned data would be the play duration alongside any personal or demographic information requested.

Coding visual or audio data: Following the original intent of Mechanical Turk more closely, this application of the service distributes a repetitive task normally performed over several weeks or months by the researcher or a group of grad students. Rather collect new data about a participant, this simply boils down a vast quantity of data that's already been generated -- this is a problem we want to have. A two-step example: (1) run a Mechanical Turk task in which participants draw or assemble an arbitrary image; (2) run a second task with a different set of participants who look at these images and code (type or select) the relevant traits they see in the images.

Measure development: One of the more uncomfortable questions in social psychology research is the validity of personality measures. Devise a series of questions and a method for tabulating the results; run it on some participants; analyze the results to get some answers. But, what's really being tested here -- the population, or the measure? Tragically, there's no time to refine the measure very much; if the results are useful, you run with it. But! With Mechanical Turk, collecting survey results is cheap and quick; and since the general format of the survey isn't changing between revisions, the same set of statistical transformations can be applied programatically to each iteration of the survey.

This is a great way to build a psychological measure that you can be confident in: Push an initial draft of the measure out to Turkers, receive some results, perform a statistical analysis and save the operations as an R or SPSS script. Then, manually refine the measure, put it back on Turk, filter the new results through your analysis script, and repeat until it looks good. This can get as advanced as you'd like -- start with several times as many questions as you'd like to see in the final survey, then automatically dispatch random subsets of the question list to Turk, filter through your automatic analysis to get some scores indicating quality, and use a Bayesian classifier to narrow down the best possible subset of questions.


Update: Here's a conference paper on the same topic.
http://www-users.cs.umn.edu/~echi/papers/2008-CHI2008/2008-02-mech-turk-online-experiments-chi1049-kittur.pdf

Friday, March 7, 2008

Vimming your way to the top

Here's the Vim syntax file I use for highlighting my to-do list. It's based on the syntax file for YAML.
http://www.vim.org/scripts/script.php?script_id=2599

Benefits:
  • Different colors for lines ending in ':' or starting with '*' or '{'
  • Assign keywords to be automatically highlighted, like important locations, coworkers' names, customers, taquerias, etc.
  • Start sections with a line of underscores and a heading beginning with the '{' character. The heading stands out (red with GVim's "desert" color scheme), and you can jump between sections just like C blocks using ]] and [[ keystrokes.
  • Ordinary text (i.e. not specifically formatted for this syntax) looks sane.
Normally I have a line in my .vimrc assigning the filetype "todolist" to the file where I keep my permanent todolist, but another way to add this highlighting to a text file is to add vim: ft=todolist to the end of a file. It's harmless.

Update (4/2/09): I uploaded the script to vim.org, where it will be easier to track and update.