CouchDB JSON Parser Timings
===========================
I've put together some work on integrating eep0018 [1] into CouchDB as
well as adding support for Spidermonkey 1.8.1 [2]. This is all still very
experimental. I have the test suite passing for both branches except where
Spidermonkey's JSON serialization differs from the JavaScript function
previously used ([undefined] is serialized as [null]).
So, after getting those branches together I spent a bit of time and ran some
tests to see what kind of speed differences I could get. Turns out it's
dependent on the amount of data we give it, but there is a noticeable impact.
Branches
--------
* Trunk [3] - As of when I ran the tests today.
* eep0018 [1] - Includes the JSON parser in Erlang
* Spidermonkey 1.8.1 [2] - Includes eep0018 and Spidermonkey trunk as of today
(or yesterday...). Configured with --enable-optimized=-O3, which is the only
flag touched (i.e., no JIT enabled).
Caveats
-------
Notice that I'm inserting 4KiB documents. If we shrank those down to a couple of
bytes, these numbers would tend to even out. I know that the eep0018 code is
hampered by moving across the VM boundary, so when we're dealing with \_bulk_docs
the key is that eep0018 allows us to post more docs in a single request.
Other things people might want to play with are the numbers in
couch\_utils:should_flush/0 to see if we can tune how much data gets sent to the
view server in one go.
So, there are a lot of permutations for different speed tests, not to mention just
making sure that these branches aren't screwed beyond recognition in terms of
actually working.
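To make those permutations a bit more concrete, here's a rough sketch of how the document size and batch size could be varied. None of this is in the benchmark script: `make_docs`, `batches`, `scenarios` and the sweep values are made-up names and numbers for illustration, and the document shape is simply copied from the script.

```python
import itertools

def make_docs(count, text_size):
    """Build `count` documents shaped like the ones in the benchmark script."""
    return [
        {"_id": "%.10d" % docid, "integer": docid, "text": "a" * text_size}
        for docid in range(count)
    ]

def batches(docs, batch_size):
    """Yield successive slices of at most `batch_size` documents."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]

# Hypothetical sweep values; pick whatever range is interesting to you.
TEXT_SIZES = [16, 512, 4096]
BATCH_SIZES = [100, 1000, 5000]

def scenarios(count=10000):
    """Enumerate (text_size, batch_size, number_of_requests) combinations."""
    for text_size, batch_size in itertools.product(TEXT_SIZES, BATCH_SIZES):
        requests = -(-count // batch_size)  # ceiling division
        yield text_size, batch_size, requests
```

Each batch would then go through `db.update(batch)` the same way the script does it, with the timing wrapped around the whole loop.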
If you're bored and looking for something to do instead of clicking through
Twitter or the current trendy social news site on a Monday morning, I invite you
to grab one or two or all of the branches and run your own tests.
Benchmark Script
----------------
I realize this isn't the most sound measurement system, but I'm tired and didn't
feel like being thorough. You can grab it at [4].
    #! /usr/bin/env python
    # Python 2 script; uses the couchdb-python package.
    import time
    import couchdb

    server = couchdb.Server("http://127.0.0.1:5984/")
    # Start from an empty database on every run.
    if "eep0018" in server:
        del server["eep0018"]
    db = server.create("eep0018")

    # Insert 10,000 documents of roughly 4KiB each, 1,000 per bulk request.
    start = time.time()
    updates = []
    for docid in xrange(10000):
        doc = {"_id": "%.10d" % docid, "integer": docid, "text": "a" * 4096}
        updates.append(doc)
        if len(updates) >= 1000:
            db.update(updates)
            updates = []
    # Flush whatever is left over from the last partial batch.
    if updates:
        db.update(updates)
    end = time.time()
    print "Inserting: %f" % (end - start)

    # Temporary view, map function only.
    start = time.time()
    for row in db.query("function(doc) {emit(doc._id, doc.integer % 100);}"):
        pass
    end = time.time()
    print "Map only: %f" % (end - start)

    # Temporary view with a JavaScript reduce function.
    start = time.time()
    for row in db.query("function(doc) {emit(doc._id, doc.integer / 100);}",
                        reduce_fun="function(keys, vals) {return sum(vals);}"):
        pass
    end = time.time()
    print "With reduce: %f" % (end - start)

    # Temporary view with the Erlang "_sum" reduce.
    start = time.time()
    for row in db.query("function(doc) {emit(doc._id, doc.integer * 2);}",
                        reduce_fun="_sum"):
        pass
    end = time.time()
    print "With erlang reduce: %f" % (end - start)
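The start/end bookkeeping above repeats four times. If you end up adding more scenarios, a small helper along these lines might cut down the noise. This is just a sketch, not part of the script at [4]: `timed` is a made-up name, and the `time.sleep` call is a stand-in for the real `db.update(...)` or `db.query(...)` work.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.time()
    try:
        yield
    finally:
        results[label] = time.time() - start

results = {}
with timed("Sleep", results):
    time.sleep(0.01)  # stand-in for db.update(...) or a db.query(...) loop
```

Collecting the numbers in a dict instead of printing them right away also makes it easier to compare several runs afterwards.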
Results
-------
This is the data that I got from running that test script against each of the
three branches three times. The error bars are simple min/max notations. No
fancy standard deviation shit going on here [5].
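For anyone who wants to reproduce the min/max bars from their own runs, something like the following would do. It's a sketch under assumptions: `summarize` and `error_bars` are made-up names, and I'm assuming the bar height is the mean of the runs, which is one reasonable choice and nothing more. Feed it your own timings; there are deliberately no numbers baked in here.

```python
def summarize(runs):
    """Collapse repeated timings into (min, mean, max) per label.

    `runs` is a list of dicts, one per benchmark run, mapping a label such as
    "Inserting" to a duration in seconds.
    """
    summary = {}
    for label in runs[0]:
        values = [run[label] for run in runs]
        mean = sum(values) / float(len(values))
        summary[label] = (min(values), mean, max(values))
    return summary

def error_bars(summary):
    """Convert (min, mean, max) into (mean, below, above) offsets for plotting."""
    return {
        label: (mean, mean - low, high - mean)
        for label, (low, mean, high) in summary.items()
    }
```

The `(mean, below, above)` shape is what most plotting libraries want for asymmetric error bars.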
Hand Waving
-----------
The results generally make sense. We get a speed boost during insertion when we
switch to the eep0018 branch. The views are faster too. When we add the
Spidermonkey 1.8.1 updates we get the same insert speed (because we don't touch
the view server) and faster view computation.
For the more motivated timing people out there, if someone wants to play around
with data sizes and look at timings for different scenarios that'd be pretty
awesome. And more fancy number math probably wouldn't hurt.
References
----------
[1]: http://github.com/davisp/couchdb/tree/eep0018
[2]: http://github.com/davisp/couchdb/tree/spidermonkey181
[3]: http://github.com/davisp/couchdb/tree/master
[4]: /res/2009-05-18-couchdb-timings.py
[5]: /res/2009-05-18-couchdb-timings.jpg
Copyright Notice
----------------
Copyright 2008-2010 Paul Joseph Davis
License
-------
http://creativecommons.org/licenses/by/3.0/