CouchDB Joins
=============
There has been a fairly active thread [1] on the couchdb-user about how to join
couchdb documents. The authoritative documentation on such things so far has
been a blog post [2] by Christopher Lenz [3]. Its a very good introduction into
the power of collation. Unfortunately a couple issues keep cropping up on the
list generally related to avoiding denormalization in some form or another.
After thinking about the most recent thread I think I've managed to sit down and
create a couple views that would fulfill the offered requirements. For
reference, a short description of the setup:
The Documents
-------------
1. User documents: A unique non-changing document_id with a cusotmizable username.
2. Comment documents: Contain a reference to the user document id.
Goals
-----
1. Get a list of users that have comments
2. Get a list of users with 5 comments (sorted on some criteria).
General Outline
---------------
Briefly, the idea here is to generate a Map/Reduce view that will list the users
with comments and the number of comments per user. This is pretty easy but
necessary to allow us to get the list of users with their top five comments
while ignoring users with no comments.
There are two Map/Reduce views we'll be using, "with-comments" and
"max-comments". "with-comments" will be responsible for listing the users with
the number of their comments. Users with zero comments will not appear in this
view. By paging through this view will allow us to walk through the
"max-comments" view getting the data we need to display a listing of users with
their top N comments. (Sorry, N not configurable at query time.)
I'm using couchview from couchrest [4] to manage my views. Hence the
'-map' and '-reduce' suffices on each view name. Both Map/Reduce views are
assumed to be placed in a "users" design doc.
Generating data
---------------
First things first, here's a script to load some data into a db named "test" on
the local host:
#! /usr/bin/env python
import random
import uuid
import couchdb
server = couchdb.Server('http://localhost:5984')
if 'test' not in server:
server.create('test')
db = server['test']
def new_uuid():
return uuid.uuid4().hex.upper()
users = 1000
docs = []
for i in range(users):
uid = new_uuid()
docs.append({'_id': uid, 'type': 'user', 'name': 'user-%06d' % i})
for c in range(random.randint(0,25)):
cid = new_uuid()
docs.append({'_id': cid, 'type': 'comment', 'user_id': uid,
'position': c, 'text': "Comment %d %d" % (i, c)})
db.update(docs)
So, here we load 1000 users and a random number of comments (0 to 25). Its
pretty straight forward. You'll notice it requires the couchdb-python [5]
module.
Use the group_level, Luke
-------------------------
Remember, the default behavior of CouchDB's reduce is to produce a single value.
Reduces work by creating a 'shadow' tree on top of the view b+tree. A good
description (with pretty pictures) is this article [6] by Ricky Ho
from Adobe. So in order to produce a reduce result that has multiple rows we
must include a group=true or group\_level=N query parameter.
The group=true parameter means that for every unique key, we will get a single
reduce value. If we have a view that has keys that are JSON arrays, we can use
group_level=N to reduce arrays that have N identical prefix elements. Ie:
for i in range(N):
key1[i] == key2[i]
Since each of the reduces below is going to want multiple rows from the reduce
functions each will use a group_level=1 to group keys for reduction.
Users with Number of Comments
-----------------------------
This view is straight forward. Any comments are emitted, and then reduced to a
sum. Notice users with zero comments will not show up in the reduce because only
rows for each comment are emit()'ed.
with-comments-map.js
--------------------
function(doc)
{
if(doc.type == "comment")
{
emit([doc.user_id, doc._id], 1);
}
}
with-comments-reduce.js
-----------------------
function(keys, values)
{
return sum(values) ;
}
Users with Top N Comments
-------------------------
This is where things get a bit complicated.
The Map portion is fairly straight forward. For each comment we emit a user id,
-1 pair. The -1 is to indicate that this is a user document. Thinking about it
now, this is an artifact of an earlier iteration, but any value that is outside
the range of comment id's would be acceptable. (Assuming you change the reduce
appropriately)
The reduce function is a bit of wild one. The basic idea is that we're going to
take a set of values and condense them into an object that lists the user_id and
top N comments. We have to be careful on re-reduce steps that we can merge these
summary objects correctly. A more detailed discussion follows the code.
max-comments-map.js:
function(doc)
{
if(doc.type == "user")
{
emit([doc._id, -1], doc.name);
}
else if(doc.type == "comment")
{
emit([doc.user_id, doc._id], [doc.position, doc.text]);
}
}
max-comments-reduce.js:
function(keys, values, rereduce)
{
var MAX = 5 ;
var obj = null ;
function sort_comments(a, b)
{
return a[0] - b[0] ;
}
function add_comment(current, to_add)
{
current[current.length] = to_add ;
current.sort(sort_comments) ;
log("Current: " + current.toSource()) ;
return current.slice(0,MAX) ;
}
try
{
if(rereduce)
{
for(var i = 0 ; i < values.length ; i++)
{
if(typeof(values[i]) == 'object' && values[i].type == 'reduction')
{
if(obj == null)
{
obj = values[i] ;
}
else
{
if(obj.user_id == null)
{
obj.user_id = values[i].user_id ;
}
for(var j = 0 ; j < values[i].comments.length ; j++)
{
obj.comments = add_comment(obj.comments, values[i].comments[j]) ;
}
}
}
}
}
if(obj == null)
{
obj = {"comments": [], 'type': 'reduction', 'user_id': null} ;
}
if(keys == null)
{
return obj ;
}
for(var i = 0 ; i < keys.length ; i++)
{
if(typeof(values[i]) == 'object' && values[i].type == 'reduction')
{
continue ;
}
if(keys[i][0][1] == -1)
{
obj.user_id = values[i] ;
}
else
{
obj.comments = add_comment(obj.comments, values[i]) ;
}
}
}
catch(e)
{
log("Error: " + obj + " Rereduce: " + rereduce) ;
log(e.stack) ;
}
return obj ;
}
First we define a couple functions for merging lists of comments. This is a
fairly constrained function. The returned comments array must not grow quickly.
And quickly includes linear relative to the number of documents. Ie, if you just
append every comment to this array it most likely will not work. By keeping just
a maximum number of rows we're alleviating that concern.
The second major section (lines 19-45) deals with when we're running a rereduce.
Rereduce is important because we may be consuming the output of this function.
So here we go through and merge all reduction objects we returned earlier. The
biggest gotchya in this code is that user_id only exists on one reduction object
initially (because we only emit()'ed it once). So we need to be sure we move it
up as things are rereduced.
In the third section (lines 57-73) we are dealing with the original Map output.
Ignoring a couple checks we're basically testing if the row was emit()'ed, then
depending on the result adding the user_id or merging a new comment.
Hopefully that's all clear.
Using the Views
---------------
So the general idea here is to page through the with-comments reduce view and
use the first and last user_id to get the data we need to page through the
max-comments view.
Paging through the with-comments view as per normal suggestion, we would start
like such:
curl 'http://localhost:5984/test/_view/users/with-comments-reduce?group_level=1&count=10'
Using the first and last row from this query, we have enough information to get
the appropriate range in our max-comments view using the `startkey` and `endkey`
query parameters with a GET request like:
curl 'http://localhost:5984/test/_view/users/max-comments-reduce?group_level=1&startkey=`blah`&endkey=`foo`'
Keeping in mind that we would have to filter any returned results that had zero
comments.
Feedback
--------
Hopefully this works for people other than me. If you try it out and have
questions or comments feel free to email me [7].
References
----------
[1]: http://mail-archives.apache.org/mod_mbox/incubator-couchdb-user/200810.mbox/%3C2098F155-ECB7-468E-8CA7-8E54F18EE606@groovie.org%3E
[2]: http://www.cmlenz.net/archives/2007/10/couchdb-joins
[3]: http://www.cmlenz.net/
[4]: http://github.com/jchris/couchrest/tree/master
[5]: http://code.google.com/p/couchdb-python/
[6]: http://horicky.blogspot.com/2008/10/couchdb-implementation.html
[7]: mailto:paul.joseph.davis@gmail.com
Copyright Notice
----------------
Copyright 2008-2010 Paul Joseph Davis
License
-------
http://creativecommons.org/licenses/by/3.0/