Fast deserialization in Python

March 22nd, 2009

All standard YMMV disclaimers apply.

Update (20090324-2): According to John Millikin, the author of jsonlib, cjson is buggy and unmaintained. I will evaluate further and post a followup blog entry. My discussion with Dan Pascu, the author of cjson, corroborates these claims. I urge readers to read John Millikin’s comment.

Summary:

For quickly deserializing data in Python, use cjson.
simplejson is mysteriously slow on certain installations.

Update (20090324): According to Extra Cheese, cjson 1.0.5 has an incompatibility with simplejson in processing slashes. A fix is available from Matt Billenstein. However, Dan Pascu, the author of cjson, deprecates Matt Billenstein’s cjson 1.0.6 because Matt’s patch parses the JSON twice, which makes it twice as slow. This will still be faster than all alternatives in certain circumstances. You will not find Matt’s cjson on the cheeseshop, only on Matt’s site.

Abstract:

We were initially using simplejson for our work, because the JSON format is human-readable and because anecdotal evidence from the blogosphere touted simplejson’s new C speedups. We observed that simplejson was actually quite slow on one of our installation environments. This observation prompted us to do this study. We found that cjson consistently achieves the fastest deserialization performance. We still do not understand why simplejson is slow in certain installation environments.

Approach:

We compared the following serialization approaches:

  • simplejson 2.0.9, with C speedups
  • jsonlib 1.3.10
  • cjson 1.0.5
  • PyYAML 3.05 with libyaml 0.1.1/0.1.2 C bindings. (We used 0.1.1 on dormeur and 0.1.2 on mammouth.)
  • PySyck 0.61.2 with syck 0.55 C bindings. Note that PySyck did not compile until we followed the advice in this ticket.
  • Google protobuf 2.0.3
  • Python pickle, protocol=-1 (binary)
  • Python pickle, protocol=0 (text)
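
The two pickle variants differ only in the protocol argument. The following is a minimal sketch, not the benchmark code itself: the `term` record is a shortened, hypothetical copy of a vocabulary term, and the code targets Python 3's `pickle` module (on the Python 2.5/2.6 installs tested here, the C implementation lived in the separate `cPickle` module).

```python
import pickle

# Hypothetical record, modelled on the vocabulary term format.
term = {
    "term class": "the propos delet",
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [
        {"form": "the proposed deletion", "count": 7153.333333333333},
        {"form": "the proposed deletions", "count": 13.666666666666666},
    ],
}

# protocol=-1 selects the highest (binary) protocol the interpreter supports.
binary_blob = pickle.dumps(term, protocol=-1)

# protocol=0 is the original text-oriented protocol.
text_blob = pickle.dumps(term, protocol=0)

# Both blobs decode back to an object equal to the original.
assert pickle.loads(binary_blob) == term
assert pickle.loads(text_blob) == term

print(len(binary_blob), len(text_blob))
```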

We have not tried the following serialization approaches:

  • Python marshal, which is supposedly much faster than Python pickle. On the downside, the marshal format may change between Python versions.
  • Native Python, i.e. reading the repr() of the data as a module
  • XML implementations
  • Facebook thrift
  • Hand-coding C serialization
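
For reference, the first two untested approaches look roughly like the sketch below. It is illustrative only and was not part of our experiments: the sample `term` is hypothetical, and `ast.literal_eval` is used as a safe stand-in for importing the `repr()` output as a module.

```python
import ast
import marshal

# Hypothetical record in the vocabulary term format.
term = {"canonical form": "the proposed deletion", "rank": 3590, "count": 7180.0}

# marshal: the byte format is tied to the Python version,
# so it is a poor fit for long-lived files.
blob = marshal.dumps(term)
assert marshal.loads(blob) == term
print("marshal format version:", marshal.version)

# "Native Python": write repr() of the data, then parse it back.
# ast.literal_eval only accepts literals, so it cannot run arbitrary code.
text = repr(term)
assert ast.literal_eval(text) == term
```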

Experiments:

Data:

We were working with a data structure we call the “vocabulary”. The vocabulary is a list of vocabulary terms. Each vocabulary term in turn contains a list of term forms. An example vocabulary term is as follows:

{
    "term class": "the propos delet",
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [
        { "form": "the proposed deletion", "count": 7153.333333333333 },
        { "form": "the proposed deletions", "count": 13.666666666666666 },
        { "form": "The proposed deletion", "count": 12.0 },
        { "form": "the proposed deletes", "count": 1.0 }
    ]
}

We performed all our deserialization experiments on a vocabulary file that contained 502K fields, as computed using:

zcat vocabulary.json.gz | grep ':' | wc -l
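
The same count can be taken in Python. This is only a sketch: `count_field_lines` is a name introduced here for illustration, and like the shell pipeline it counts lines that contain a colon rather than parsing the JSON.

```python
import gzip

def count_field_lines(path):
    """Count lines containing ':' in a gzip'ed text file (like grep ':' | wc -l)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if ":" in line)
```

Calling `count_field_lines("vocabulary.json.gz")` would then give the line count that the pipeline above prints.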

We use gzip on all serialized files, both when writing them and when reading them. The size of the vocabulary in different serialization formats was as follows:

Format                 gzip’ed size
protobuf               1.7 MB
JSON                   1.9 MB
pickle (protocol -1)   4.0 MB
pickle (protocol 0)    4.3 MB

gzip’ed JSON uses only about 10% more disk space than gzip’ed protobuf, which is the most compact serialization format we tested. JSON has the advantage of being human-readable, unlike Protocol Buffers.
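
As a rough illustration of how such sizes can be compared, the sketch below writes the same data as gzip’ed JSON and gzip’ed pickle and reports the file sizes. It is not the code we used: it relies on the standard-library `json` module rather than the libraries benchmarked here, the `vocabulary` list is synthetic, and the helper names are made up for this example.

```python
import gzip
import json
import os
import pickle
import tempfile

# Synthetic stand-in for the vocabulary: a list of term records.
vocabulary = [
    {"canonical form": "term %d" % i, "rank": i, "count": float(i),
     "term forms": [{"form": "term %d" % i, "count": float(i)}]}
    for i in range(1000)
]

def write_gzipped(path, payload):
    """Write bytes to a gzip'ed file and return the compressed size in bytes."""
    with gzip.open(path, "wb") as handle:
        handle.write(payload)
    return os.path.getsize(path)

def read_gzipped(path):
    """Read a gzip'ed file back into memory as bytes."""
    with gzip.open(path, "rb") as handle:
        return handle.read()

workdir = tempfile.mkdtemp()
encoders = {
    "JSON": lambda data: json.dumps(data).encode("utf-8"),
    "pickle (protocol -1)": lambda data: pickle.dumps(data, protocol=-1),
    "pickle (protocol 0)": lambda data: pickle.dumps(data, protocol=0),
}
sizes = {}
for index, (name, encode) in enumerate(encoders.items()):
    path = os.path.join(workdir, "vocabulary.%d.gz" % index)
    sizes[name] = write_gzipped(path, encode(vocabulary))
    print("%-22s %8d bytes" % (name, sizes[name]))

# Round-trip check for the JSON file (written first, so index 0).
json_path = os.path.join(workdir, "vocabulary.0.gz")
assert json.loads(read_gzipped(json_path).decode("utf-8")) == vocabulary
```

The sizes printed for synthetic data say nothing about the figures in the table above; the point is only the mechanics of writing, reading, and measuring.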

Setup:

We tested on two different eight-core x86-64 Linux installation environments.

Name      Python version  CPU model name                              OS version
dormeur   2.5             Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz  2.6.23.17-88.fc7
mammouth  2.6.1           Intel(R) Xeon(R) CPU E5462 @ 2.80GHz        2.6.18-92.1.10.el5_lustre.1.6.6smp

Results:

For each experiment, we read in the vocabulary using a particular deserialization approach and measured real time, as well as the combined user time and system time, using the Unix ‘time’ command. We ran each deserialization three times and averaged the times over these three runs. Variance appeared to be low, but we did not compute it. We present all times in seconds. Some experiments were not performed on mammouth.

The first result line in the table, ‘read’, is the time to read the vocabulary json.gz file into memory without deserializing it. It gives a lower bound on the time any deserializer can achieve, and hence an upper bound on deserializer performance.
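
We took our timings with the external ‘time’ command. An in-process equivalent might look like the sketch below, which is an illustration under stated assumptions rather than our harness: `time_deserializer` is a hypothetical helper, the payload is synthetic, and the standard-library `json.loads` stands in for the deserializers we tested. Unlike ‘time’, it excludes interpreter startup and module import.

```python
import json
import os
import time

def time_deserializer(load, payload, runs=3):
    """Return (mean real seconds, mean user+sys seconds) over `runs` calls."""
    real_total = 0.0
    cpu_total = 0.0
    for _ in range(runs):
        cpu_before = os.times()
        real_before = time.perf_counter()
        load(payload)
        real_total += time.perf_counter() - real_before
        cpu_after = os.times()
        # .user and .system are this process's user and system CPU seconds.
        cpu_total += (cpu_after.user - cpu_before.user)
        cpu_total += (cpu_after.system - cpu_before.system)
    return real_total / runs, cpu_total / runs

# Synthetic payload; the real experiments read vocabulary.json.gz.
payload = json.dumps([{"rank": i, "count": float(i)} for i in range(10000)])
real, cpu = time_deserializer(json.loads, payload)
print("real %.3fs  user+sys %.3fs" % (real, cpu))
```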

The following table presents the results, sorted by real time on dormeur.

deserializer            dormeur             mammouth
                        real    user+sys    real    user+sys
read                    0.76    0.24        0.18    0.18
cjson                   2.17    1.04        0.93    0.91
jsonlib                 7.88    6.59        3.77    3.77
cPickle (protocol -1)   13.3    9.9         10.2    10.2
PySyck                  19.1    18.2        n/a     n/a
simplejson              24.7    16.2        1.10    1.04
cPickle (protocol 0)    25.1    20.4        20.7    20.7
protobuf                42.3    32.4        n/a     n/a
PyYAML                  89.3    80.5        319     318

Observe that simplejson is more than an order of magnitude slower on dormeur than on mammouth (24.7 versus 1.10 seconds of real time).

Conclusions:

gzip’ed JSON uses only about 10% more disk space than the most compact serialization format we tested (gzip’ed Protocol Buffers). JSON has the advantage of being human-readable, unlike Protocol Buffers.

cjson has the fastest deserialization time of all packages we tested. We have not measured serialization time in the experiments above, but we do so in the next section.

We did not realize that simplejson was far slower on one of our installs until we did speed tests. simplejson should be avoided unless you specifically determine that it is comparable in speed to cjson. On certain installs, simplejson deserialization is as fast as cjson. On other installs, simplejson deserialization is an order of magnitude slower than cjson. On “slow” installs, the user is led to believe that C speedups have been compiled into simplejson. Indeed, evidence indicates that our “slow” simplejson installation was, nonetheless, using C speedups:

>>> import simplejson
>>> simplejson.decoder.make_scanner
<type 'simplejson._speedups.Scanner'>
>>> simplejson.decoder.scanstring is simplejson.decoder.c_scanstring
True

The user might not detect that simplejson is slow without a direct speed comparison to cjson.
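
A direct comparison does not need much code. The sketch below is one hypothetical way to do it, not the script we ran: `compare_decoders` is an illustrative helper, the payload is synthetic, and third-party decoders are included only if they happen to be installed. It assumes the usual entry points `simplejson.loads` and `cjson.decode`, and it reports the best of three runs rather than the average.

```python
import json
import timeit

def compare_decoders(decoders, payload, repeat=3):
    """Return {name: best seconds per call} for each decoder callable."""
    results = {}
    for name, decode in decoders.items():
        timings = timeit.repeat(lambda: decode(payload), number=1, repeat=repeat)
        results[name] = min(timings)
    return results

# The standard-library decoder is always available as a baseline.
decoders = {"json (stdlib)": json.loads}
try:
    import simplejson
    decoders["simplejson"] = simplejson.loads
except ImportError:
    pass  # not installed; skip it
try:
    import cjson
    decoders["cjson"] = cjson.decode
except ImportError:
    pass  # not installed; skip it

payload = json.dumps(
    [{"form": "term %d" % i, "count": float(i)} for i in range(10000)]
)
results = compare_decoders(decoders, payload)
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print("%-15s %.4fs" % (name, seconds))
```

If one decoder comes out an order of magnitude behind the others on the same payload, that install deserves a closer look.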

protobuf is interesting because it requires one to declare the protocol schema. This is useful for documenting your data format. Unfortunately, the Python implementation of Google’s Protocol Buffers is very slow because it is pure Python.

Generating C++ Protocol Buffers and wrapping them with swig, as suggested by this commentator, might be faster than cjson. Hand-coding C serialization routines is another option if one must eke out every last bit of speed.

Related work:

This study and this followup provide supporting evidence that cjson is faster than alternatives. Neither of these studies experienced any simplejson slowness.

We used bouncybouncy’s sertest2 code, and modified it to use CDumper and CLoader (the C libyaml bindings) in PyYAML. We also modified their code to create 100K records.

Here is the output of sertest2 running on dormeur, which we have modified slightly for improved readability:

100000 total records        (0.830s)

get_thrift                  (0.300s)
get_protobuf                (5.010s)

Serialize:
ser_cjson                   (0.270s) 6807019 bytes
ser_simplejson              (2.210s) 6807019 bytes
ser_yaml                    (31.590s) 6107019 bytes
ser_protobuf                (19.760s) 1716519 bytes

Serialize to a gzip'ed file:
ser_cjson_compressed        (0.520s) 1245257 bytes
ser_simplejson_compressed   (2.440s) 1245257 bytes
ser_protobuf_compressed     (19.920s) 980508 bytes
ser_yaml_compressed         (31.610s) 1205509 bytes

Deserialize:
serde_cjson                 (0.510s)
serde_simplejson            (12.370s)
serde_protobuf              (36.740s)
serde_yaml                  [slow, got tired of waiting for it]

bouncybouncy’s related study also compares with thrift, which we do not use. bouncybouncy finds that thrift is faster than protobuf but slower than cjson. When we installed thrift (SVN revision 757299) on dormeur, sertest2 thrift routines crashed with the following traceback:

Traceback (most recent call last):
  File "./test_speed.py", line 169, in <module>
    print 'serde_thrift        (%0.3fs)' % t(serde_thrift)[0]
  File "./test_speed.py", line 138, in t
    ret = f()
  File "./test_speed.py", line 108, in serde_thrift
    s = _ser_thrift()
  File "./test_speed.py", line 73, in _ser_thrift
    return thrift_to_bytes(ret)
  File "./test_speed.py", line 59, in thrift_to_bytes
    var.write(protocolOut)
  File "gen-py/passivedns/ttypes.py", line 146, in write
    iter6.write(oprot)
AttributeError: 'str' object has no attribute 'write'

The results presented in this section, as well as the results of the related studies, match the relative performance of these libraries on mammouth in our earlier experiments.

  1. March 22nd, 2009 at 22:06

    This reddit thread has some good discussion of an earlier study.

    This author points out that thrift as a network protocol is much faster than JSON over HTTP.

    haberman points out that he is writing C bindings for Python protobuf.

  2. March 22nd, 2009 at 23:10

    An older benchmark, showing that marshal might be the fastest.

  3. March 22nd, 2009 at 23:58

    According to Extra Cheese, cjson has an incompatibility with simplejson in processing slashes. A fix is available from vazor.

  4. March 23rd, 2009 at 04:06

    Check if the slower simplejson install does something with locales? I've seen grep go really slow when trying to do utf-8 stuff, which disappeared after setting LANG=C / LC_ALL=C…

  5. March 23rd, 2009 at 07:11

    Nice writeup :-) Good to see that you get the same results on a more complicated data structure.

    I still have high hopes for protobuf: it can get faster, but json can't get any smaller. At some point protobuf will be both the fastest and most compact method.

  6. March 23rd, 2009 at 11:50

    I am excited for a faster protobuf. In particular, haberman's C extensions look promising.

    Compactness is very important for transferring data over a network.
    However, during the development cycle, human readability is important and often overlooked. If all you need to do to read your data is type 'zcat', you are much more likely to be looking at your data, and hence more likely to catch bugs.

  7. John Millikin
    March 24th, 2009 at 08:59

    (reposting a comment from Hacker News, at Joseph Turian's request)

    I'm the author of jsonlib, and I registered specifically to post this message. Please, please, please do not use cjson!

    First, it is unmaintained. The latest version available was posted on August 24, 2007. When you encounter one of its myriad bugs, you'll either have to patch it yourself or pick another JSON library. Just skip the intermediate step and use another library to begin with.

    Second, it is buggy. In some cases, parsing text it just generated will return a different value from what you passed in! It's almost entirely ignorant of Unicode, and what little it tries to parse it gets wrong.

    Third, it's exceedingly non-compliant. The text it parses and generates bears only a passing resemblance to JSON. There are varying degrees of conformance to the spec between libraries, based on personal preference of the authors — I prefer strict conformance, others less strict — but cjson is so different as to be simply unusable.

    Yes, it's fast. I know. I wrote jsonlib partly because I was unsatisfied with simplejson's performance, and one goal (never truly achieved) was always to surpass cjson. However, speed isn't everything. As the saying goes, “if I want my math performed fast and wrong I'll ask my cat”.

    In my opinion, the only Python JSON libraries worth considering are:

    * simplejson — it's in the standard library, and should therefore be considered first and most thoroughly.

    * jsonlib — it's fast, well-tested, and standards-compliant.

    * demjson — has several options for reliable parsing of invalid input.

    Last time I checked, jsonlib and simplejson's C extensions are neck-and-neck performance-wise. In some quick, unscientific tests, jsonlib reads faster and simplejson writes faster. However, simplejson's extensions are only used for certain subsets of input — if you want to use an uncommon feature, performance will degrade. jsonlib has an implementation in pure C, which avoids this problem at the cost of complexity.

    Apologies for the brain-dump, but even if you skip right over it, please remember: don't use cjson.

  8. Nir
    November 10th, 2009 at 03:41

    Seems that Bob Ippolito fixed simplejson slowness.
    Retry with latest version.
