Probably everyone who has used Python for a while came across this classic performance hint: “instead of concatenating multiple strings one after the other, create a sequence of such strings and then join them”.
That’s it, instead of this…
s = ""
for x in items:
s += x
… it’s usually better to write something like this:
s = "".join(items)
But why? Let’s dig a bit inside Python’s source code to unveil the mystery.
Inside the source code
The Python implementation I’m referring to is CPython, the reference Python interpreter developed by Guido & co., written mainly in C, and hosted on a public Mercurial repository. I will specifically target the version 3.3.0: you can browse its source code online.
We need to look at the implementation of Python’s Unicode strings. The unicodectype.c file: bingo!
The function that implements the strings’ concatenation with the plus operator is at row 10619:
PyObject *
PyUnicode_Concat(PyObject *left, PyObject *right)
Inside the function, after skipping a lot of (important!) checks and setup code, at row 10653, there is the core of Python’s string concatenation:
maxchar = PyUnicode_MAX_CHAR_VALUE(u);
maxchar2 = PyUnicode_MAX_CHAR_VALUE(v);
maxchar = MAX_MAXCHAR(maxchar, maxchar2);
/* Concat the two Unicode strings */
w = PyUnicode_New(new_len, maxchar);
That’s it: every time you concatenate two strings with the plus operator, the Python’s interpreter allocates some memory and copies the two strings into it, one after the other. If you concatenate a lot of big strings the interpreter must allocate memory a lot of times, wrestling around with fragmentation and such. This behaviour should not surprise you: Python’s string objects are immutable, so each concatenation generates a new string instead of modifying the existing one in place. It’s also no surprise that multiple string concatenations are considered slow.
What about the join method instead? Join is a strings’ method, so we can find it in this very same file, at row 9478:
PyObject *
PyUnicode_Join(PyObject *separator, PyObject *seq)
The function iterates twice over the sequence of strings. The first iteration computes the total length of the joined string (row 9568):
for (i = 0; i < seqlen; i++)
[..
sz += PyUnicode_GET_LENGTH(item); // At row 9580
Then, at row 9597, the function allocates the space for the joined string:
res = PyUnicode_New(sz, maxchar);
Finally there is a second iteration that copies every string in the sequence inside the newly allocated string (see at row 9612):
for (i = 0, res_offset = 0; i < seqlen; ++i) {
[...]
Py_MEMCPY(res_data,
sep_data,
kind * seplen); // At **row 9618**
Conclusion
When using the join method, Python allocates memory for the joined string only once; if you concatenate strings in succession, instead, Python has to allocate new memory for each concatenation. Guess what’s faster? 😉