Monday, July 20, 2009

Faster string concatenation in Python

Nick Matzke pointed me to this discussion of string concatenation approaches in Python:

Efficient String Concatenation in Python

The issue here is whether adding strings together in a for loop is inefficient enough to be worth working around. Python strings are immutable, so this:
s = 'om'
for i in xrange(1000):
    s += 'nom'

means doing this 1000 times:

  1. Translate the assignment to "s = s + 'nom'"

  2. Allocate another string, 'nom'. (Or reuse a reference if it's already interned.)

  3. Call the __add__ method on s, with 'nom' as the argument

  4. Allocate the new string created by __add__

  5. Assign a reference to that string back to s

So, using the + operator 1000 times in a loop has to create 1000 ever-larger string objects, but only the last one gets used outside the loop. There are good reasons Python works this way, but still, there's a trap here in an operation that gets used a lot in ordinary Python code.
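The immutability is easy to see in action (Python 3 syntax here; the post's examples are Python 2, but the behavior is the same):

```python
s = 'om'
t = s        # hold a second reference to the original string
s += 'nom'   # builds a brand-new string object and rebinds s to it

assert s == 'omnom'
assert t == 'om'     # the original object is untouched: strings are immutable
assert s is not t    # s now names a different object entirely
```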

There are a few ways to cope:

  • Use a mutable object to build the string as a sequence of bytes (or whatever) and then convert it back to a Python string in one shot at the end. Reasonable intermediate objects are array and StringIO (preferably cStringIO).

  • Let the string object's join method do the dirty work -- strings are a basic Python type that's been optimized already, so this method probably drops down to a lower level (C/bytecode in the CPython interpreter, not sure about the details) where full allocation of each intermediate string isn't necessary.

  • Build a single format string and interpolate with the % operator (or the format method, if you're fancy) to fill it in, under the same rationale as with the join method. This fits real-world scenarios better: filling in a template of a plain-text table or paragraph with computed values, either all at once with % or incrementally with string addition. Either one could be a performance bottleneck, and it's not obvious up front which approach wins.
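As a rough sketch, the three workarounds look like this (Python 3 shown, with io.StringIO standing in for Python 2's cStringIO):

```python
from io import StringIO  # Python 2 code would use cStringIO.StringIO

nums = range(5)

# 1. Mutable intermediate object, converted back to a string at the end
buf = StringIO()
for n in nums:
    buf.write(str(n))
via_buffer = buf.getvalue()

# 2. Let str.join do the dirty work over an iterable of pieces
via_join = ''.join(str(n) for n in nums)

# 3. One big format string, filled in with the % operator in one shot
via_interp = ('%s' * len(nums)) % tuple(nums)

assert via_buffer == via_join == via_interp == '01234'
```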

The original article gives a nice analysis and comes out in favor of intermediate cStringIO objects, with a list comprehension inside the string join method as a strong alternative. But it was written in 2004, and Python has changed since then. Also, it doesn't include interpolation among the tested methods, and that was the one I was the most curious about.

Methods


I downloaded and updated the script included with that article, and ran it with Python 2.6 and 2.5 to get some new results. (Source code here.)

First, a changelog:

  • The method numbers are different, and there are a couple more. Method #2 is for the % operator, in which I build a gigantic format string and a gigantic tuple out of the number list, then smash them together. It trades memory for CPU time, basically. Method #8 uses map instead of a list comprehension or generator expression; no lambda is required and the necessary function (str()) is already available, so this is a good candidate.

  • I used the standard lib's time.clock() to measure CPU time around just the relevant loop for each string concatenation method.

  • Fetching the process memory size is similar but uses the subprocess module and different options.

  • Docstrings are (ab)used to identify the output.

For example, the string addition method now looks like this:
def method1():
    """1. string addition"""
    start = clock()
    out_str = ''
    for num in NUMS:
        out_str += str(num)
    cpu = clock() - start
    return (out_str, cpu, memsize())
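The memsize() helper isn't shown in this excerpt. Going by the changelog's note about the subprocess module, a minimal sketch might look like the following (the name, the ps options, and the Unix-only assumption are all guesses at how the script does it):

```python
import os
import subprocess

def memsize():
    """Resident set size of the current process, in kilobytes, via ps(1)."""
    rss = subprocess.check_output(
        ['ps', '-o', 'rss=', '-p', str(os.getpid())])
    return int(rss.decode())
```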


Results


Each method concatenates the string representation of the numbers 0 through 999,999. The methods were run sequentially in separate processes, via a for loop in the shell, for Python versions 2.5 and 2.6. The best of three runs for each method is shown below.
Python 2.6:
1. string addition   CPU (s): 1.99   Mem (K): 11.7
2. %-interpolation   CPU (s): 2.42   Mem (K): 23.0
3. array object      CPU (s): 3.42   Mem (K): 17.3
4. cStringIO object  CPU (s): 3.24   Mem (K): 19.7
5. join + for loop   CPU (s): 2.29   Mem (K): 48.0
6. join + list comp  CPU (s): 1.93   Mem (K): 11.6
7. join + gen expr   CPU (s): 2.08   Mem (K): 11.6
8. join + str map    CPU (s): 1.47   Mem (K): 11.6

The winner is map, with string addition, the list comprehension, and the generator expression also doing well. String addition in a loop did much better than would be expected from reading the original article; the Python developers have put effort into making this less of a trap. Specifically, when the interpreter evaluates a string concatenation and the left-hand string has no other references, it can resize that string in place rather than allocating a new object. That catches exactly the case of a string being built up in a loop. Nice. So really, there's no need to worry about the quadratic-time behavior we expected, at least in Python 2.6.
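That optimization is easy to defeat, which also makes it easy to demonstrate: holding an extra reference to each intermediate string forces a full copy on every +=. A rough timing sketch (absolute numbers will vary by machine):

```python
from timeit import timeit

# Single reference: CPython can usually grow the string buffer in place,
# so the loop is roughly linear in the final string length.
fast = timeit("s = ''\nfor _ in range(10000): s += 'x'", number=5)

# Extra references to each intermediate string defeat the in-place trick:
# every += must copy everything accumulated so far, giving quadratic time.
slow = timeit(
    "s = ''\nkeep = []\nfor _ in range(10000):\n"
    "    keep.append(s)\n    s += 'x'",
    number=5)
```

On any recent CPython, slow should come out far larger than fast.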

The array object, a sequence of packed bytes, is supposed to be a low-level but high-performance workhorse. It was embedded in the minds of performance-conscious Python programmers by this essay by Guido van Rossum:

Python Patterns — An Optimization Anecdote

At a glance, that problem looks similar to this one. However, converting ints to chars is a problem that can be described well in bytes. Converting integers to their string representation is not; we're not even using any features of the array object related to byte representation. Going low-level doesn't help us here; as Guido indicates in his conclusion, if you keep it short and simple, Python will reward you. The cStringIO object in method 4 performs similar duties, and the shape of both functions is the same; the only difference in performance seems to be that cStringIO trades some memory space for CPU time.
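For reference, the shape of the array method in modern terms (Python 2's array('c') is gone in Python 3; a bytearray plays the same mutable-buffer role in this sketch):

```python
# Grow a mutable byte buffer piecewise, then decode once at the end.
buf = bytearray()
for num in range(5):
    buf.extend(str(num).encode('ascii'))
result = buf.decode('ascii')
```

All the real work happens in str(num) and the encode/decode round-trip; the byte buffer itself buys us nothing here, which is the point.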

The string join method is recommended by the Python standard library documentation for string concatenation with well-behaved performance characteristics. Conveniently, str.join() accepts any iterable object, including lists and generator expressions. Method 5 is the dumb approach: build a list in a for loop, then pass it to join. Method 6 pushes the looping operation deeper into the interpreter via a list comprehension; it saves some bytecode, variable and function lookups, and a substantial number of memory allocations.

Using a generator expression in method 7 instead of a list comprehension should have been equivalent or faster, by avoiding the up-front creation of a list object. But memory usage is the same, and the list comprehension runs faster by a small but consistent amount. Maybe join isn't able to take advantage of lazy evaluation, or is helped by knowing the size of the list object early on... I'm not sure. Interesting, though. In Python 3, the list comprehension is equivalent to building a list object from a generator expression, so results would probably be different there.
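The two spellings are interchangeable as far as the result goes. One plausible explanation for the list comprehension's edge, though this is a CPython implementation detail rather than a language guarantee, is that str.join materializes a non-sequence argument into a list internally before concatenating, so the generator's laziness buys nothing:

```python
nums = range(1000)

via_listcomp = ''.join([str(n) for n in nums])  # list built up front
via_genexpr = ''.join(str(n) for n in nums)     # lazily produced pieces

assert via_listcomp == via_genexpr
```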

Finally, in method 8, map allows the interpreter to look up the str constructor just once, rather than for each item in the given sequence. This is the only approach that gives an impressive speedup over string addition in a loop. So how portable is this result?
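The map version in miniature (in Python 2 map returns a list; in Python 3 it returns an iterator, which join accepts just as happily):

```python
# str is looked up once, at the map call, rather than once per loop iteration
joined = ''.join(map(str, range(5)))
assert joined == '01234'
```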

Python 2.5:
1. string addition   CPU (s): 3.77   Mem (K): 10.8
2. %-interpolation   CPU (s): 2.43   Mem (K): 22.0
3. array object      CPU (s): 5.16   Mem (K): 16.4
4. cStringIO object  CPU (s): 4.93   Mem (K): 18.7
5. join + for loop   CPU (s): 3.98   Mem (K): 47.1
6. join + list comp  CPU (s): 3.30   Mem (K): 10.5
7. join + gen expr   CPU (s): 3.59   Mem (K): 10.5
8. join + str map    CPU (s): 2.72   Mem (K): 10.5

Python 2.6.2 has had the benefit of additional development time, notably including the Unladen Swallow project's first quarter of interpreter optimizations, and shows improvements across the board. By comparison, Python 2.5 generally uses less memory and more CPU time. String interpolation, however, seems to have been optimized to the max already in Python 2.5, and actually wins the performance shootout here! The 2.5 interpreter, on the other hand, is slightly less adept at optimizing string addition in a loop. It still avoids the quadratic-time issue (that enhancement was added in Python 2.4), and memory usage is quite respectable.

Conclusion


The recommendations at the end of Guido's essay are still exactly right. In general, Python performs best with code that "looks right", with abstractions that fit the problem and a minimum of branching and explicit looping.

  • Adding strings together in simple expressions will be optimized properly in recent Pythons, but could bite you in older ones

  • Using string interpolation or templates plays well enough with more complex formatting

  • Going too low-level can deprive you of Python's own optimizations

  • If built-in functions can do what you need, use them, and basic Haskell-style functional expressions can make your code very concise

There's more discussion on Stack Overflow.