CouchDB Expectations
====================
I just finished reading this post [1] by Jonathan Ellis that brought up
some interesting ideas about what CouchDB is and, more specifically, what it
isn't. I started to write a comment explaining some of my disagreement
and realized I was writing an entire post. So instead of spamming his comments,
I figure I'll just post my thoughts in my own little corner.
I distribute, therefore I am... distributed?
--------------------------------------------
Jonathan's main beef appears to be that CouchDB considers itself a distributed
database and he disagrees with that billing. So before I get too far ahead of
myself I thought I'd pull up Wikipedia's [2] definition:
"A distributed database is a database that is under the control of a central
database management system (DBMS) in which storage devices are not all
attached to a common CPU. It may be stored in multiple computers located in
the same physical location, or may be dispersed over a network of
interconnected computers."
As with all things in life, the true meaning of that definition depends on how
you parse it. As a proponent of the "CouchDB is distributed" idea, I would say
that we meet the definition in full. Someone arguing against CouchDB being
distributed would point to the phrase "under the control of a central database
management system". Or to put it another way, with no apparent
coordination of a set of CouchDB nodes, is it really a distributed database?
The Interwebs?
--------------
So, to answer whether CouchDB is a distributed database, I'll merely ask, "Is
the web a distributed system?" Because really, the answers are one and the same.
Some would argue that the web is merely the collection of emergent properties of
the underlying systems. I would argue that the web, while having no central
coordinating authority, is a distributed system. So, in reality, it's merely,
"You say tomato, I say tomato."
Less Hand Waving
----------------
Theoretical hand-wavy arguments aside, I think I understand
Jonathan's concern. CouchDB does not provide users with a method of
automatically spreading load amongst a set of physically distinct nodes. No
automatic document sharding or re-balancing as nodes enter and leave the system.
Yet. This sort of work is planned, it's been discussed, and general methods and
algorithms have been proposed on IRC and the mailing lists. The thing is, such
features haven't hit the top of the priority queue in terms of their
cost/benefit ratio. I think one of Damien Katz's less appreciated traits is that
he appears to be extremely focused on developing features in order of the most
benefit to the community, instead of in the order of neato-ness. That or we have
quite different definitions of neato.
CouchDB != RDBMS
----------------
Lots of people seem to mistake CouchDB for a replacement for an RDBMS. They may
try to convince you otherwise by saying things like, "Now, I know CouchDB isn't
trying to replace the RDBMS's out there, but..." and then launch into a laundry
list of things CouchDB doesn't do. I don't think it's a conscious decision at
all. I spent my first three or four months with CouchDB trying to figure out how
to bend it to my preconceived notions. It turns out I was bending the wrong
thing. It was how I thought of CouchDB that needed to change.
I have spent a fair amount of time with PostgreSQL. It's awesome. I very much
dislike MySQL. It's not awesome. The reasons I like PostgreSQL are all the
reasons that I imagine Jonathan is alluding to when he says that you should ask
your favorite non-MySQL DBA why those fancy features exist. The thing is, he's
also disregarding that a huge part of the market using RDBMS's isn't using
these features. He also doesn't mention the fact that all the talk about
denormalizing to improve scalability is spitting in the face of these features.
The most important part of this entire post is the following statement: The only
time when CouchDB should be considered as a replacement for an RDBMS is when an
RDBMS was the wrong choice in the first place.
Just as CouchDB is not always the right tool for the job, RDBMS's are also not
always the right answer. Now, to be clear, my financial and medical institutions
better damn well be using some sort of RDBMS that has all of those fancy
features. My blog on the other hand (if it weren't static) does not require
materialized views or pivot tables.
Other Responses
---------------
* Yes, writes to disk are serialized for a given DB on a given CouchDB node.
Though it's embarrassingly parallel once sharding is implemented.
* CouchDB databases require compaction to reclaim unused space. Similar in
spirit to PostgreSQL's vacuum. This isn't at all a consequence of MVCC
though. It has to do with CouchDB's append only file structure. The append
only file operations make things like reader snapshots trivial in
implementation. Could compaction be better? Certainly! And it will be
eventually. Patches welcome :D
* The argument that Map/Reduce is hard could be valid. But I don't think it's
any harder than SQL. Doing complicated things is generally hard regardless
of the paradigm/language.
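
To make the Map/Reduce versus SQL comparison a little more concrete, here is a
rough sketch of the same question ("how many documents of each type?") asked
both ways. This is plain Python, not CouchDB's view API (CouchDB views are
normally written in JavaScript), and the documents, the `type` field, and the
function names are all made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical documents, invented for this example.
docs = [
    {"_id": "a", "type": "post"},
    {"_id": "b", "type": "comment"},
    {"_id": "c", "type": "post"},
]


def map_doc(doc):
    """Map step: emit one (key, value) pair per document."""
    yield doc["type"], 1


def reduce_values(values):
    """Reduce step: combine every value emitted for a single key."""
    return sum(values)


def run_view(documents):
    """Group the emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            groups[key].append(value)
    return {key: reduce_values(values) for key, values in groups.items()}


def run_sql(documents):
    """The same count using GROUP BY on an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, type TEXT)")
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?)",
        [(d["_id"], d["type"]) for d in documents],
    )
    rows = conn.execute(
        "SELECT type, COUNT(*) FROM docs GROUP BY type"
    ).fetchall()
    conn.close()
    return dict(rows)


print(run_view(docs))
print(run_sql(docs))
```

Neither version strikes me as obviously harder than the other for something
this small. The difference is mostly in where the grouping logic lives: SQL
hides it behind `GROUP BY`, while the map/reduce version spells it out.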
Updatifying
-----------
I wrote this pretty quick so it's probably got errors and what not. I'll be
re-reading it and updating over the next day or so.
[1]: http://spyced.blogspot.com/2008/12/couchdb-not-drinking-kool-aid.html
[2]: http://en.wikipedia.org/wiki/Distributed_database
Copyright Notice
----------------
Copyright 2008-2010 Paul Joseph Davis
License
-------
http://creativecommons.org/licenses/by/3.0/