Python: string concatenation VS list join

Posted on 6 November 2012 by Paolo Bernardi

Probably everyone who has used Python for a while came across this classic performance hint: “instead of concatenating multiple strings one after the other, create a sequence of such strings and then join them”.

That’s it, instead of this…

s = ""
for x in items:
    s += x

… it’s usually better to write something like this:

s = "".join(items)

But why? Let’s dig a bit inside Python’s source code to unveil the mystery.

Inside the source code

The Python implementation I’m referring to is CPython, the reference Python interpreter developed by Guido & co., written mainly in C, and hosted on a public Mercurial repository. I will specifically target the version 3.3.0: you can browse its source code online.

We need to look at the implementation of Python’s Unicode strings. The unicodectype.c file: bingo!

The function that implements the strings’ concatenation with the plus operator is at row 10619:

PyObject *
PyUnicode_Concat(PyObject *left, PyObject *right)

Inside the function, after skipping a lot of (important!) checks and setup code, at row 10653, there is the core of Python’s string concatenation:

maxchar = PyUnicode_MAX_CHAR_VALUE(u);
maxchar2 = PyUnicode_MAX_CHAR_VALUE(v);
maxchar = MAX_MAXCHAR(maxchar, maxchar2);

/* Concat the two Unicode strings */
w = PyUnicode_New(new_len, maxchar);

That’s it: every time you concatenate two strings with the plus operator, the Python’s interpreter allocates some memory and copies the two strings into it, one after the other. If you concatenate a lot of big strings the interpreter must allocate memory a lot of times, wrestling around with fragmentation and such. This behaviour should not surprise you: Python’s string objects are immutable, so each concatenation generates a new string instead of modifying the existing one in place. It’s also no surprise that multiple string concatenations are considered slow.

What about the join method instead? Join is a strings’ method, so we can find it in this very same file, at row 9478:

PyObject *
PyUnicode_Join(PyObject *separator, PyObject *seq)

The function iterates twice over the sequence of strings. The first iteration computes the total length of the joined string (row 9568):

for (i = 0; i < seqlen; i++)
    [..
    sz += PyUnicode_GET_LENGTH(item);   // At row 9580

Then, at row 9597, the function allocates the space for the joined string:

res = PyUnicode_New(sz, maxchar);

Finally there is a second iteration that copies every string in the sequence inside the newly allocated string (see at row 9612):

for (i = 0, res_offset = 0; i < seqlen; ++i) {
    [...]
    Py_MEMCPY(res_data,
        sep_data,
        kind * seplen);  // At **row 9618**

Conclusion

When using the join method, Python allocates memory for the joined string only once; if you concatenate strings in succession, instead, Python has to allocate new memory for each concatenation. Guess what’s faster? 😉

Posted in Software Engineering,
Tagged in python

Get in touch

Thank you for contacting me, I will be in touch with you as soon as possible.
There was an error while trying to send the comment, please try again later.