Paul Joseph Davis

Fighting Hype with Hype - RDBMS FTW!!!1!
========================================

Yay Google Alerts
-----------------

I woke up this morning to an amusing blog post [1] by Ryan Park [2] in my Google
alerts. I tend to read a lot of the "RDBMS's are awesome! No, NOSQL is moar
awesome!" posts with great bemusement. Even though it's a year and a half old,
it was amusing enough to motivate me to write out some of the thoughts I had
while reading it.

1. Data integrity is not guaranteed.
------------------------------------

Ryan spends a couple paragraphs talking about how horrible it is that non-RDBMS
systems don't provide data constraints. In general he's pretty spot on here. One
of the first things that tends to get left out of non-RDBMS systems is
constraint enforcement.

There are two and a half points I'd like to make. First, constraints are usually
the first to go because they're costly. Costly to implement and costly at
runtime. Especially when the system is being designed with the ability to run on
multiple machines.

Secondly, there are plenty of people that don't use constraints. Ryan falls
pretty squarely into the "RDBMS's work for me, so they should work for you too"
camp. He appears to know his stuff but what people like Ryan forget is that
there are a lot of people that don't. They use an RDBMS because that's what the
internet says to do. And then they plop an ORM on top of it and never actually
use any of the RDBMS features that are so lauded. As it turns out, lots of these
developers are super happy using a database that doesn't provide constraint
enforcement.

The last half a point I'll make later as it was suggested in a comment on Ryan's
post and applies later on in the conversation.

2. Inconsistency will provide a terrible user experience.
---------------------------------------------------------

Just reading the bullet point on this one, I knew I was in for some fun. I mean
seriously, if that's not proof by assertion then I don't know what is.

Ryan is quite right that developers need to make some things appear consistent
so as to not confuse users. I can't speak to the specifics of SimpleDB, but
obviously someone's using it successfully so I'll assume that it's possible.

The first of two things I'd point out, though, is that consistency problems are
not limited to those crazy non-RDBMS people. Even in a traditional three-tier
web architecture, there's the issue with sessions. Basically the issue is that a
client needs to be repeatedly routed to the same application server handling
their session.
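
To make the session issue concrete, here's a minimal sketch of "sticky" routing:
hash the session id and use it to pick an application server, so the same client
always lands on the same box. The server names and the `route` function are made
up for illustration; real load balancers have their own mechanisms for this.

```python
import hashlib

# Hypothetical pool of application servers; the names are invented.
APP_SERVERS = ["app-1", "app-2", "app-3"]


def route(session_id, servers=APP_SERVERS):
    """Pick a server deterministically from the session id."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then map it onto the pool.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

Note that this simple modulo scheme remaps most sessions when the pool size
changes, which is its own little consistency problem.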

The other thing I'll point out is an interesting Facebook blog post [3] I read a
couple months ago. It's a look at how Facebook added a second datacenter on the
east coast. A datacenter based on MySQL no less. I'll draw your attention to the
"Cache Consistency" section. And all I'm going to point out is that their
solution required modifying MySQL's query parser. Seriously.

3. Aggregate operations will require more coding.
-------------------------------------------------

For the bullet point, yes, that's more or less true. The argument in support of
this is pretty much non-existent. If Ryan really wanted to make an argument
about aggregates, the best thing would be to go on about how a non-RDBMS
requires you to know what type of aggregates you'll want up front and then do
insert-time calculations for these values. While that will work just fine, it
makes ad-hoc queries harder. The ad-hoc issue is the next bullet point, but for
some reason the connection wasn't made.
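
As a rough sketch of what insert-time aggregates look like, here's a toy
in-memory store. The `OrderStore` class and its fields are hypothetical and
don't correspond to SimpleDB or any other real system. The aggregate that was
planned for is cheap to read; the one that wasn't falls back to a full scan.

```python
from collections import defaultdict


class OrderStore:
    """Toy store that keeps per-customer totals current on every insert."""

    def __init__(self):
        self.orders = []
        self.total_by_customer = defaultdict(int)

    def insert(self, customer, amount):
        self.orders.append((customer, amount))
        # The aggregate we knew we'd want is updated at insert time.
        self.total_by_customer[customer] += amount

    def total_for(self, customer):
        # Planned aggregate: a single lookup, no scan.
        return self.total_by_customer.get(customer, 0)

    def largest_order(self):
        # Ad-hoc question nobody planned for: scan every order.
        return max(self.orders, key=lambda order: order[1])
```

The trade-off is the one described above: reads of the planned aggregate are
fast, every write does a bit more work, and anything unplanned means touching
all the data.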

4. Complicated reports, and ad hoc queries, will require a lot more coding.
---------------------------------------------------------------------------

This was one of my favorite bullet points in the whole article. And by favorite
I mean that it produced the most WTF's per word.

Firstly, Ryan points out that there are three general work loads for databases.
(1) General queries that are used by the application, (2) More complicated
queries run by staff for reporting, (3) ad-hoc queries for debugging. I would
pretty much agree with him there. But then he goes on to make the assertion that
points 2 and 3 are better served by SQL.

The entire second paragraph is some sort of weird twisted logic to bolster the
argument that SQL makes reporting super easy. My favorite part of the whole
thing is the quote right in the middle:

    In my previous jobs, our reports often required hundreds of lines of SQL to
    get the right information out of the database. This is a lot of code, but it
    was required to generate the data for our customers.

As far as I can tell, the argument is that SQL makes complex reports easy even
though it still might take hundreds of lines to get the data required. And the
other thing that's not mentioned: these reports can still take a substantial
amount of time to generate. But obviously this is still better than the
non-RDBMS systems where they don't even have SQL! Because obviously it's
impossible that any imperative language could be as good as SQL... because...
because... well I never figured that part out either.

5. Aggregate operations will be much slower if you don't use an RDBMS.
----------------------------------------------------------------------

Yeah. This one is special.

    RDBMSes are highly optimized for performing aggregate operations across huge
    volumes of data. Fast algorithms like the hash join, merge join, and indexed
    binary search have been around for 20 years or more.

Ok. Breathe. I'm assuming that he forgot that joins are not aggregates. And a
binary search, well, shit... I guess the non-RDBMS people really are screwed.
And the second paragraph talks about how the client is going to need to scan the
entire database thus incurring the huge network transfer to even try and compute
an aggregate.

I'll just point out that there are non-RDBMS systems that provide aggregate
functionality and anything that uses a b+tree probably uses binary search.
Remember people, just because you can't think of a different solution to your
problem doesn't mean it can't exist.
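
To show that none of this is RDBMS magic, here's a small sketch using Python's
standard `bisect` module over a sorted list of keys, which is roughly what
happens inside a single sorted tree node. The function names are mine, and this
is an illustration of the idea, not any particular database's implementation.

```python
from bisect import bisect_left, bisect_right


def find_key(sorted_keys, key):
    """Binary search: return the index of key, or -1 if it is absent."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i
    return -1


def count_in_range(sorted_keys, low, high):
    """Count keys with low <= key <= high using two binary searches."""
    return bisect_right(sorted_keys, high) - bisect_left(sorted_keys, low)
```

The second function is a tiny aggregate computed from sorted keys without
scanning them all, and it runs wherever the keys live, so nothing has to be
shipped across the network to the client.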

6. Data import, export, and backup will be slow and difficult.
--------------------------------------------------------------

Now this is just FUD. Sorry, but there's no better way to say it. If you've ever
had to fit some random piece of data into your existing relational schema you'll
probably agree that this is crap. Munging random data is hard. And if it's not
random data then it's not really that important. And getting data out? Perhaps
Ryan was being satirical?

7. SimpleDB isn't that fast.
----------------------------

Jan Lehnardt has a couple of thought-provoking arguments [4] [5] and Volker
Mische [6] provides some interesting fodder as well. Basically, to say something
isn't fast requires you to define what fast is and, generally, no two people
will ever agree on the same definition.

That said, Ryan does make an allusion to this situation when he mentions that
SimpleDB probably needs a larger DB to be measured on. And he also points out
that lots of databases probably fit into RAM.

There's an interesting article [7] by one of the 37signals guys about buying
more RAM instead of sharding. While definitely a valid approach, not everyone
can go out and buy a single machine with 32 GiB of RAM (though obviously that's
getting closer). Though now I'm curious what type of disks they have to keep up
with the write load they might have.

8. Relational databases are scalable, even with massive data sets.
------------------------------------------------------------------

I don't have a better response than the commenter jackson [8] on the original
blog post. Once an RDBMS is scaled to multiple machines, lots of the benefits
are nullified and you're dealing with the same issues that the non-RDBMS folks
are.

9. Super-scalability is overrated. Slowing the pace of your product development is even worse.
----------------------------------------------------------------------------------------------

There is definitely a lot of noise in the echo chamber about scalability.
Developers like to talk about needing hundreds of nodes to support their work
load because that's just cool. But in reality, the issue isn't adding the
hundredth node to a system, it's adding the second. Regardless of the database
being used, if that second node isn't planned for it'll be painful. Non-RDBMS
systems generally reduce that pain point by discouraging designs that exacerbate
the problems when adding a second node.

10. SimpleDB is useful, but only in certain contexts.
-----------------------------------------------------

I'll file this under the "No shit?" category. There are plenty of places that an
RDBMS might be a better fit than any given non-RDBMS. And vice versa. The
underlying issue that people seem to miss is being able to describe situations
where one might be better than the other.

The bottom line to this whole "My database is better than your database!"
argument is that "You're both right, so STFU!" Eventually people will calm down
and start to realize that there are multiple solutions and the right one will
depend as much on the problem domain as the developer coding the solution. A
better use of time would be finding personal projects and drawing up the
arguments for and against the coded solution so that others might learn from
past experience.

References
----------

[1]:  http://www.ryanpark.org/2008/04/top-10-avoid-the-simpledb-hype.html
[2]:  http://www.ryanpark.org/about-2
[3]:  http://www.facebook.com/note.php?note_id=23844338919
[4]:  http://jan.prima.de/plok/archives/175-Benchmarks-You-are-Doing-it-Wrong.html
[5]:  http://jan.prima.de/plok/archives/176-Caveats-of-Evaluating-Databases.html
[6]:  http://vmx.cx/cgi-bin/blog/index.cgi/benchmarking-is-not-easy:2009-09-23:en,CouchDB,Python,TileCache,geo
[7]:  http://37signals.com/svn/posts/1509-mr-moore-gets-to-punt-on-sharding
[8]:  http://www.ryanpark.org/2008/04/top-10-avoid-the-simpledb-hype.html#comment-4308692


Copyright Notice
----------------

Copyright 2008-2010 Paul Joseph Davis

License
-------

http://creativecommons.org/licenses/by/3.0/