Paul Joseph Davis

CouchDB Lucene Indexing
=======================

    *NOTE* I rewrote couchdb-lucene [1] pretty thoroughly last night. I decided
    that it was quick enough that instead of spamming my own blog I'll just spam
    your feed reader.

I updated couchdb-lucene [1] today to work with trunk. I changed the behavior of
the indexer to index views generated as per normal CouchDB [2] semantics with a
few minor constraints.

Indexing Strategy
-----------------

The basics now revolve around a `_design/lucene` document in your database. Any
view defined in this document will be indexed by Lucene [3]. For a view to be
indexed properly, make sure that every `emit(key, value)` call in it specifies
`doc._id` as the key.
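
To make that concrete, here is a minimal sketch of what such a design document
might look like. The view names `foo` and `bar` match the example used later in
this post, but the document fields `title` and `body` are assumptions made up
for illustration, and `runMap` is only a local test harness, not part of
CouchDB or couchdb-lucene.

```javascript
// Hypothetical _design/lucene document. CouchDB stores map functions as
// strings; the fields `title` and `body` are assumed for this example.
const designDoc = {
  _id: "_design/lucene",
  views: {
    foo: {
      map: "function(doc) { if (doc.title) emit(doc._id, doc.title); }"
    },
    bar: {
      map: "function(doc) { if (doc.body) emit(doc._id, doc.body); }"
    }
  }
};

// Local harness (an assumption, not a CouchDB API): evaluates a map function
// source against one document and collects whatever it emits.
function runMap(mapSource, doc) {
  const rows = [];
  const emit = (key, value) => rows.push({ key: key, value: value });
  const fn = new Function("emit", "return (" + mapSource + ");")(emit);
  fn(doc);
  return rows;
}

// Every emitted row should use the document id as its key.
const sample = { _id: "doc1", title: "plankton soup" };
console.log(runMap(designDoc.views.foo.map, sample));
```

The only constraint taken from this post is that the key is always `doc._id`;
what goes in the value is up to you.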

In the future I plan on adding more configuration options to this document so
that indexing can be controlled from Futon etc. At the moment the interaction is
limited to just specifying views to be indexed.

The reasons for switching from indexing any view in any design document to
indexing all views in a single document are twofold. First, it makes the index
reset semantics a lot easier to think about. Second, it allows us to extend the
awesomeness of Lucene querying a crapload further.

For instance, if your `_design/lucene` document specifies two views `foo` and
`bar` you can use the standard Lucene syntax like `foo:plankton AND bar:goat` to
get back the intersection of those two views. In the future I can see adding
support for numbers to enable numeric range queries. Technically, if you emit
string-sortable dates, you can already do date ranges.
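
As a rough sketch of the date trick: if dates are emitted as fixed-width
`YYYYMMDD` strings, comparing them as strings gives the same order as comparing
them as dates, so Lucene's `[a TO b]` range syntax can be applied to them. The
view name `created` and both helper functions below are assumptions for
illustration, and whether a range behaves as expected also depends on how the
field is analyzed at index time.

```javascript
// Format a Date as YYYYMMDD (UTC) so that string order matches date order.
function sortableDate(date) {
  return date.toISOString().slice(0, 10).replace(/-/g, "");
}

// Build a Lucene range clause. `field` would be the name of a view in
// _design/lucene; `created` below is a hypothetical view name.
function dateRange(field, from, to) {
  return field + ":[" + sortableDate(from) + " TO " + sortableDate(to) + "]";
}

// Combine a plain term query on one view with a date range on another.
const query = "foo:plankton AND " + dateRange(
  "created",
  new Date(Date.UTC(2008, 0, 1)),
  new Date(Date.UTC(2008, 11, 31))
);
console.log(query);
```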

Caveats
-------

At the moment, Lucene indexes are reset every time the `_design/lucene` revision
changes. Obviously this is sub-par and will change eventually, but I didn't have
the brain power to work out everything that would need to be tracked when I was
coding that bit.
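
The reset rule itself is simple enough to sketch. The class below is a
hypothetical illustration of "rebuild whenever the design document's `_rev`
changes", not the code couchdb-lucene actually runs; the class name, method
name, and counter are all made up for the example.

```javascript
// Hypothetical tracker that remembers the last seen _rev of _design/lucene
// for each database and reports when an index rebuild would be needed.
class IndexState {
  constructor() {
    this.revs = new Map(); // database name -> last seen design doc _rev
    this.resets = 0;       // counts rebuilds; stands in for the real work
  }

  // Returns true when the index would be (re)built, which includes the first
  // time a database is seen.
  observe(dbName, designRev) {
    if (this.revs.get(dbName) === designRev) {
      return false; // same revision, existing index is kept
    }
    this.revs.set(dbName, designRev);
    this.resets += 1;
    return true;
  }
}

const state = new IndexState();
state.observe("mydb", "1-abc"); // first sighting, index is built
state.observe("mydb", "1-abc"); // unchanged, nothing happens
state.observe("mydb", "2-def"); // any edit to the document forces a rebuild
```

This also shows why the behavior is wasteful: any edit to the design document
changes its revision, even one that leaves the existing views untouched.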

Querying
--------

After the rewrite, querying should be a crapload more efficient. I'm caching all
of the Lucene objects in an LRU cache so things should stay pretty quick.
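
For anyone unfamiliar with the idea, here is a minimal LRU cache sketch. It is
written in JavaScript for illustration only; the real code is Java, and the
class name, capacity, and keys below are assumptions rather than anything taken
from couchdb-lucene.

```javascript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = new Map();
  }

  // Returns the cached value and marks it as most recently used.
  get(key) {
    if (!this.entries.has(key)) {
      return undefined;
    }
    const value = this.entries.get(key);
    this.entries.delete(key);
    this.entries.set(key, value); // re-insert to move it to the newest slot
    return value;
  }

  // Stores a value, evicting the least recently used entry when full.
  set(key, value) {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    }
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}

// Hypothetical usage: cache one expensive-to-open object per database name.
const cache = new LruCache(2);
cache.set("db_a", { name: "searcher for db_a" });
cache.set("db_b", { name: "searcher for db_b" });
cache.get("db_a");                                // db_a is now most recent
cache.set("db_c", { name: "searcher for db_c" }); // evicts db_b
```

The point of the eviction policy is that databases queried often keep their
objects in memory, while rarely used ones get dropped first.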

For Lucene People
-----------------

Right now I'm not using any of the extra fancy features in Lucene and the few
conversations I've had about Lucene internals make me realize that I'm probably
doing some fairly nasty things. If you know Lucene and have some time please
take a look at `org.apache.couchdb.lucene.Index` and send me any comments about
things I should be doing to make stuff suck less.

Java People
-----------

My Java is probably less than awesome. If any of you out there feel like
helping out with this project, please for the love of god start sending me pull
requests on GitHub. That is all.

References
----------

[1]:  http://github.com/davisp/couchdb-lucene
[2]:  http://couchdb.apache.org/
[3]:  http://lucene.apache.org/

Copyright Notice
----------------

Copyright 2008-2010 Paul Joseph Davis

License
-------

http://creativecommons.org/licenses/by/3.0/