Monday, July 20, 2009

Faster string concatenation in Python

Nick Matzke pointed me to this discussion of string concatenation approaches in Python:

Efficient String Concatenation in Python

The issue here is whether adding strings together in a for loop is inefficient enough to be worth working around. Python strings are immutable, so this:
s = 'om'
for i in xrange(1000):
    s += 'nom'

means doing this 1000 times:

  1. Translate the assignment to "s = s + 'nom'"

  2. Allocate another string, 'nom'. (Or reuse a reference if it's already interned.)

  3. Call the __add__ method on s, with 'nom' as the argument

  4. Allocate the new string created by __add__

  5. Assign a reference to that string back to s

So, using the + operator 1000 times in a loop has to create 1000 ever-larger string objects, but only the last one gets used outside the loop. There are good reasons Python works this way, but still, there's a trap here in an operation that gets used a lot in ordinary Python code.
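
A quick way to see the rebinding (an illustrative snippet, not from the original discussion):

s = 'om'
t = s        # keep a second reference to the original string
s += 'nom'
print s      # omnom -- s now names a brand-new string object
print t      # om    -- the original object was never modified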

There are a few ways to cope:

  • Use a mutable object to build the string as a sequence of bytes (or whatever) and then convert it back to a Python string in one shot at the end. Reasonable intermediate objects are array and StringIO (preferably cStringIO).

  • Let the string object's join method do the dirty work -- strings are a basic Python type that's been optimized already, and join is implemented at a lower level (C, in the CPython interpreter) where it can total up the piece lengths, allocate the final string once, and copy everything in, with no intermediate strings at all.

  • Build a single format string and interpolate with the % operator (or the format method, if you're fancy) to fill it in, on the same rationale as the join method. This also fits a common real-world scenario: filling in a template for a plain-text table or paragraph with computed values, either all at once with % or incrementally with string addition. If that's ever a performance bottleneck, it's not obvious up front which approach wins. (All three workarounds are sketched just below.)
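
Here's a rough sketch of all three workarounds, assuming Python 2.x (cStringIO, xrange); the build_* function names are mine, purely for illustration:

import cStringIO

def build_with_buffer(n):
    buf = cStringIO.StringIO()          # mutable intermediate object
    for i in xrange(n):
        buf.write('nom')
    return 'om' + buf.getvalue()        # one final immutable string

def build_with_join(n):
    return 'om' + ''.join(['nom' for i in xrange(n)])

def build_with_interpolation(n):
    return ('om' + '%s' * n) % (('nom',) * n)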

The original article gives a nice analysis and comes out in favor of intermediate cStringIO objects, with a list comprehension inside the string join method as a strong alternative. But it was written in 2004, and Python has changed since then. Also, it doesn't include interpolation among the tested methods, and that was the one I was most curious about.

Methods


I downloaded and updated the script included with that article, and ran it with Python 2.6 and 2.5 to get some new results. (Source code here.)

First, a changelog:

  • The method numbers differ from the original article's, and there are a couple of new ones. Method #2 is for the % operator: I build a gigantic format string and a gigantic tuple out of the number list, then smash them together. It trades memory for CPU time, basically. Method #8 uses map instead of a list comprehension or generator expression; no lambda is required and the necessary function (str) already exists, so it's a good candidate.

  • I used the standard lib's time.clock() to measure CPU time around just the relevant loop for each string concatenation method.

  • Fetching the process memory size works the same way, but through the subprocess module and with different options.

  • Docstrings are (ab)used to identify the output.

For example, the string addition method now looks like this:
def method1():
    """1. string addition"""
    start = clock()                   # time.clock, imported at module level
    out_str = ''
    for num in NUMS:                  # NUMS: the module-level list of test integers
        out_str += str(num)
    cpu = clock() - start
    return (out_str, cpu, memsize())  # memsize() reports the process size
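
For comparison, the %-interpolation method is along these lines (a sketch; the exact code is in the linked script):

def method2():
    """2. %-interpolation"""
    start = clock()
    out_str = ('%s' * len(NUMS)) % tuple(NUMS)   # giant format string meets giant tuple
    cpu = clock() - start
    return (out_str, cpu, memsize())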


Results


Each method concatenates the string representations of the numbers 0 through 999,999. The methods were run sequentially in separate processes, via a for loop in the shell, for Python versions 2.5 and 2.6. For each method, the best of three runs is shown below.
Python 2.6:
1. string addition   CPU (s): 1.99   Mem (K): 11.7
2. %-interpolation   CPU (s): 2.42   Mem (K): 23.0
3. array object      CPU (s): 3.42   Mem (K): 17.3
4. cStringIO object  CPU (s): 3.24   Mem (K): 19.7
5. join + for loop   CPU (s): 2.29   Mem (K): 48.0
6. join + list comp  CPU (s): 1.93   Mem (K): 11.6
7. join + gen expr   CPU (s): 2.08   Mem (K): 11.6
8. join + str map    CPU (s): 1.47   Mem (K): 11.6

The winner is map, with string addition, the list comprehension, and the generator expression also doing well. String addition in a loop did much better than the original article would lead you to expect; the Python developers have put effort into making this less of a trap. Specifically, when the interpreter sees an addition whose result is about to be stored back into the same name, and the target string holds no other references, it resizes that string in place rather than allocating a new object. That catches exactly the concatenate-in-a-loop case. Nice. So really, there's no need to worry about the quadratic-time behavior we expected, at least in Python 2.6.
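
You can see the optimization, and its limits, with a small experiment (illustrative code, assuming CPython, where the in-place resize only fires when the target string has no other references):

from time import clock

def concat(n, keep_alias=False):
    s = ''
    alias = None
    for i in xrange(n):
        if keep_alias:
            alias = s          # a second reference defeats the in-place resize
        s += 'nom'
    return s

for flag in (False, True):
    start = clock()
    concat(20000, flag)
    print flag, clock() - start    # the aliased version is dramatically slower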

The array object, a sequence of packed bytes, is supposed to be a low-level but high-performance workhorse. It was embedded in the minds of performance-conscious Python programmers by this essay by Guido van Rossum:

Python Patterns — An Optimization Anecdote

At a glance, that problem looks similar to this one. However, converting ints to chars is a problem that maps naturally onto bytes; converting integers to their string representations is not, and we aren't even using any of the array object's byte-level features here. Going low-level doesn't help us; as Guido indicates in his conclusion, if you keep it short and simple, Python will reward you. The cStringIO object in method 4 performs similar duties, and the shape of both functions is the same; the only real difference in performance seems to be that cStringIO trades some memory space for CPU time.
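
Stripped of the timing scaffolding, the two buffer-based methods are shaped roughly like this:

from array import array
import cStringIO

def concat_array(nums):
    chars = array('c')                  # mutable array of single bytes
    for num in nums:
        chars.fromstring(str(num))
    return chars.tostring()

def concat_cstringio(nums):
    buf = cStringIO.StringIO()          # file-like in-memory buffer
    for num in nums:
        buf.write(str(num))
    return buf.getvalue()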

The string join method is recommended by the Python standard library documentation for string concatenation with well-behaved performance characteristics. Conveniently, str.join() accepts any iterable object, including lists and generator expressions. Method 5 is the dumb approach: build a list in a for loop, then pass it to join. Method 6 pushes the looping deeper into the interpreter via a list comprehension; that saves some bytecode, some variable and function lookups, and a substantial number of memory allocations.
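
In the same sketch form:

def join_with_loop(nums):               # method 5
    pieces = []
    for num in nums:
        pieces.append(str(num))
    return ''.join(pieces)

def join_with_listcomp(nums):           # method 6
    return ''.join([str(num) for num in nums])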

Using a generator expression in method 7 instead of a list comprehension should have been equivalent or faster, by avoiding the up-front creation of a list object. But memory usage is the same, and the list comprehension runs faster by a small but consistent amount. The likely explanation is that join has to know the total size of the result before it can allocate it, so it converts a non-list iterable into a list internally anyway; the generator expression then saves no memory and adds per-item overhead. In Python 3, a list comprehension is essentially a list built from a generator expression, so the results would probably be different there.
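
Method 7 differs from method 6 only in dropping the square brackets:

def join_with_genexpr(nums):            # method 7
    return ''.join(str(num) for num in nums)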

Finally, in method 8, map allows the interpreter to look up the str constructor just once, rather than once for each item in the sequence. This is the only approach that gives an impressive speedup over string addition in a loop. In sketch form:
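
def join_with_map(nums):                # method 8
    return ''.join(map(str, nums))      # str is looked up once, applied across nums

So how portable is this result?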

Python 2.5:
1. string addition   CPU (s): 3.77   Mem (K): 10.8
2. %-interpolation   CPU (s): 2.43   Mem (K): 22.0
3. array object      CPU (s): 5.16   Mem (K): 16.4
4. cStringIO object  CPU (s): 4.93   Mem (K): 18.7
5. join + for loop   CPU (s): 3.98   Mem (K): 47.1
6. join + list comp  CPU (s): 3.30   Mem (K): 10.5
7. join + gen expr   CPU (s): 3.59   Mem (K): 10.5
8. join + str map    CPU (s): 2.72   Mem (K): 10.5

Python 2.6.2 has had the benefit of additional development time, notably the Unladen Swallow project's first quarter of interpreter optimizations, and it shows impressive improvements across the board. By comparison, Python 2.5 generally uses less memory and more CPU time. String interpolation, however, seems to have already been optimized to the max in Python 2.5, and actually wins the performance shootout here! String addition, on the other hand, is optimized slightly less well in a loop. It still avoids the quadratic-time issue (that enhancement was added in Python 2.4), and memory usage is quite respectable.

Conclusion


The recommendations at the end of Guido's essay are still exactly right. In general, Python performs best with code that "looks right", with abstractions that fit the problem and a minimum of branching and explicit looping.

  • Adding strings together in simple expressions will be optimized properly in recent Pythons, but could bite you in older ones

  • Using string interpolation or templates plays well enough with more complex formatting

  • Going too low-level can deprive you of Python's own optimizations

  • If built-in functions can do what you need, use them, and basic Haskell-style functional expressions can make your code very concise

There's more discussion on Stack Overflow.

Sunday, March 8, 2009

Mnemosyne: Getting Things Memorized

It had been bothering me since I joined this lab that I couldn't confidently just read a protein sequence and understand what it meant — naming the residues, picturing the side chain structures, and understanding the significance of replacing one residue with another. I expected that I'd just pick it up naturally from working with sequences and structures, and that did happen somewhat. But I wanted it to be as easy as reading English, and that level of completeness doesn't happen without some rote memorization.

That brought to mind a Wired article about Piotr Wozniak and his spaced-repetition memorization program, SuperMemo. When I originally read the article I wasn't in grad school and didn't have an urge to memorize any particular list of things. Besides, SuperMemo appeared to be Windows-only software, and an algorithm like this would be more fun to code from scratch anyway. Enough fun, really, that there had to be one or two open-source implementations floating around.

Mnemosyne popped up as the closest match in an Ubuntu package search, so I'm running with that. Putting the flash cards together was pretty simple; I was able to do it in a few minutes from inside the program and export it in the standard XML format. I zipped it up with a quick plain-text README and uploaded it to the project home page as the Amino Acids card set.

The content came from a slide in a lecture, and I did a quick sanity check on Wikipedia before uploading. The notation for the 20 standard amino acids is complete, and that was the main goal of this. The assignment of amino acid "groups" seems to be a little arbitrary, depending on the source (by structure, functional groups, chemical properties, etc.), and I tried to make the categories complete without too much overlap -- there's a small deviation from my slide here. I also added another category for "side chain properties", pH and polarity. Another enhancement might be the standard codons for each amino acid, though I'm not sure I want to deal with that yet.

Saturday, February 7, 2009

Carrots and sticks

What's old is new again:
Professor makes his mark, but it costs him his job
In Zen and the Art of Motorcycle Maintenance, Robert Pirsig mentions his own experiment in withholding grades at a university. He didn’t just announce on the first day that everyone would get an A+ — that seems gimmicky. Instead, since it was a class on rhetoric, he spent the course developing an argument for eliminating the grades-and-degrees system and discussing it with his students.
Initially, most students were unenthusiastic or opposed — grades and degrees are what they came for. Nonetheless, Prof. Pirsig assigned, collected and graded papers, but returned them to students with only the comments, not the grade.
At first:
  • A-students felt annoyed by the uncertainty of the situation, but did the work anyway;
  • B-C students blew off some assignments; and
  • C-D students usually skipped class.
He observed this and changed nothing. If students acted up, he let it slide.
Around 3–4 weeks into it:
  • A-students got nervous and pushed themselves harder, in class and in papers;
  • B-C students saw what the A-students were doing and returned to the usual level of effort; and
  • C-D students who had made a routine of skipping class would occasionally show up out of curiosity.
And finally:
  • A-students relaxed and began enjoying the class as active participants. In a final essay, still not knowing what their grades were, these students favored eliminating grades by 2–1.
  • B-C students saw this, panicked, and began putting an unusual amount of effort into their work. Eventually, they joined the A students in engaging class discussions. Ultimately, these were evenly divided over the issue of eliminating grades.
  • C-D students — or those who attended — also saw this and began trying to hand in reasonable work. Those who couldn’t hack it freaked even more, and remained in a state of Kafkaesque terror until the quarter mercifully ended. Naturally, in the final essay these students were unanimously opposed to eliminating grades.
Interesting as this result was, Pirsig reverted to the regular grading system the next quarter because he couldn’t provide any alternate goal for students — those who can recognize quality in their own work don’t need the university; those who can’t need something to work toward, or they don’t progress.