Paul Joseph Davis


I really don't like Java. Not one bit. I spent entirely too long fighting with
it to get a Lucene [1] full text indexer for CouchDB [2] running.
Somewhere during that work I stumbled across some Python [3] bindings for Hyper
Estraier [4] that looked to be pretty awesome. In about three hours I managed to
duplicate about forty hours of Java work (estimates adjusted for personal bias).
Now I present you with HyperCouch [5].

Yay Features!

Just to start you off with some indication of what's currently available in
HyperCouch:

  1. Full text searching - Too Obvious?
  2. Skip/limit parameters for paging
  3. Indexing via custom JavaScript [6] methods - No wasted view
     generation here.
  4. Indexing supports full-text and property searching.
  5. Searching arbitrary document properties
  6. Custom sorting based on document properties
  7. HTML text snippets from the document based on the search
  8. Did I mention indexing via custom JavaScript methods?


The basic steps to get HyperCouch up and running are covered on the GitHub [7]
page. Hopefully those directions are sufficient for you to get the
necessary bits installed. I feel a bit bad about requiring so many projects to
get things working but hopefully the install process doesn't cause too many
issues. Feel more than free to email me [8] directly if you're having issues.

After it appears to be up and running the only thing you should need to do is
add an ft_index function to one or many of your _design/documents in CouchDB. An
ft_index function acts very much like a normal CouchDB view function in that it
takes a single document as input and produces some output. Unlike view functions,
ft_index functions don't use an emit(key, value) function to communicate
results. Instead they use index(data) and property(name, value) functions to
specify data that should be indexed.

    index(data) calls should specify textual data that is intended for searching
    via full text queries like 'foo AND bar'.

    property(name, value) calls specify properties of the document that should
    be indexed for operations such as filtering and sorting of full text search
    results.

Design Document Foo

    {
        "_id": "_design/my_awesome_doc",
        "_rev": "024244112",
        "ft_index": "function(doc) {if(doc.body) index(doc.body); if(doc.baz) property(\"baz_prop\", doc.baz)}"
    }

This stuff is pretty straightforward. You can specify an ft_index on as many
_design/documents as you want. The functions are additive: index(data) and
property(name, value) calls from every function attach data to the same document
object. There will probably be weirdness if your function doesn't compile in
JavaScript right now. Proper error reporting is on the agenda.
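
To make the additive behavior concrete, here is a rough sketch that runs two ft_index functions against one document. The index and property functions below are stand-ins that only record their arguments (the real ones are supplied by HyperCouch), and indexTitle, runIndexers, and the sample document are made up for illustration.

```javascript
// Stand-ins for the index(data) and property(name, value) helpers that
// ft_index functions call. Here they only record what they are given.
let current = null;
function index(data) { current.text.push(String(data)); }
function property(name, value) { current.properties[name] = value; }

// The ft_index function from the design document above, written out in full.
function indexBody(doc) {
  if (doc.body) index(doc.body);
  if (doc.baz) property("baz_prop", doc.baz);
}

// A second, hypothetical ft_index function from another design document.
function indexTitle(doc) {
  if (doc.title) {
    index(doc.title);
    property("title", doc.title);
  }
}

// Run every ft_index function against one document; all calls land on
// the same record, which is the "additive" behavior described above.
function runIndexers(doc, indexers) {
  current = { id: doc._id, text: [], properties: {} };
  for (const fn of indexers) fn(doc);
  const result = current;
  current = null;
  return result;
}

const doc = { _id: "doc1", title: "Hello", body: "foo bar", baz: 3 };
const indexed = runIndexers(doc, [indexBody, indexTitle]);
console.log(JSON.stringify(indexed));
// {"id":"doc1","text":["foo bar","Hello"],"properties":{"baz_prop":3,"title":"Hello"}}
```

Both functions contribute text and properties to the same record, so splitting indexing logic across design documents should not lose anything, assuming property names don't collide.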

Query String Parameters

Without further ado, the list of URL parameters for querying HyperCouch is as
follows:
   1. q - The full text query (can use AND, OR, etc. See the Hyper Estraier
      Search [9] documentation.)
   2. matching - Specifies different HyperEstraier query processing types.
      (Default is most applicable)
   3. limit - Limit the number of returned documents.
   4. skip - Skip a number of documents in the result set.
   5. order - Specify ordering of results on arbitrary parameters. See the
      Hyper Estraier [9] docs.
   6. highlight - Receive HTML highlights from the documents returned.
      (Currently only supports highlight=html)
   7. Arbitrary property operations. See the Hyper
      Estraier [9] attribute search conditions section.
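
As a minimal sketch of how these parameters fit together, the helper below assembles a query string from a plain object. Only the parameter names come from the list above; buildQuery is a hypothetical helper, and the host and path you append the string to depend on your own install.

```javascript
// Build a query string from a plain object of parameters.
// Values are percent-encoded so spaces in queries like "foo AND bar"
// survive the trip through the URL.
function buildQuery(params) {
  return Object.entries(params)
    .filter(([, value]) => value !== undefined && value !== null)
    .map(([name, value]) =>
      encodeURIComponent(name) + "=" + encodeURIComponent(String(value)))
    .join("&");
}

const query = buildQuery({
  q: "foo AND bar",
  limit: 10,
  skip: 20,
  highlight: "html",
});
console.log(query);
// q=foo%20AND%20bar&limit=10&skip=20&highlight=html
```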


I'm not sure exactly what all the applications of the different matching types
are, but I've included support for them, mostly because I was curious what they
did. I still can't really figure out the difference.

Supported matching types:

  * simple - Default
  * rough - Different NOT syntax (-term instead of !term) (I think)
  * union - Default OR
  * isect - Default AND

Limit - Skip

I ran into issues testing this bit based on ordering. There were a few oddities
with results being returned in different orders etc. I know that Hyper Estraier
uses a call to quicksort in the internals, which isn't a stable sort, so I guess
technically this could be part of the issue. Let it be said, I only test
limit/skip for a specified ordering.


You can specify an 'order=[STRA|STRD|NUMA|NUMD] prop_name' to receive results in
'string ascending', 'string descending', 'number ascending', or 'number
descending' order. From testing, it appears that Hyper Estraier does an internal
conversion, so you shouldn't have to worry about proper typing, though you might
get unexpected results if you number sort something that can't be converted to a
number.
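
A small sketch of building the order parameter follows. The four order types are the ones listed above; orderParam is a hypothetical helper, baz_prop is borrowed from the design document example, and percent-encoding the space assumes the server decodes it in the usual way.

```javascript
// The four ordering types accepted in the order parameter.
const ORDER_TYPES = ["STRA", "STRD", "NUMA", "NUMD"];

// Produce "order=<TYPE> <prop_name>" in URL-encoded form, rejecting
// order types that are not in the list above.
function orderParam(type, propName) {
  if (!ORDER_TYPES.includes(type)) {
    throw new Error("unknown order type: " + type);
  }
  return "order=" + encodeURIComponent(type + " " + propName);
}

console.log(orderParam("NUMD", "baz_prop"));
// order=NUMD%20baz_prop
```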


If you specify 'highlight=html' in the query string each returned document will
contain a `highlight` member that is an HTML snippet of the indexed document. It
was easy to implement and isn't thoroughly tested, but it's there.

Arbitrary Properties

You can specify arbitrary property limiting using the operators specified in the
Hyper Estraier Search [9] docs. Each doc can have an arbitrary number of
properties associated with it, and you can combine any number of property limits
in one query. For those of you reading ahead in the Hyper Estraier docs, the
proper format for the query string is 'property_name=operator argument'. For
example, if you called property("foo", doc.foo_value) in your ft_index method,
you can specify 'foo=NUMLT 3' in the URL to receive only documents whose foo
value is less than three. There are sixteen operators you can use for limiting
string and numeric properties.

Operator types for property matching taken from the Hyper Estraier docs:

    * STREQ - is equal to the string
    * STRNE - is not equal to the string
    * STRINC - includes the string
    * STRBW - begins with the string
    * STREW - ends with the string
    * STRAND - includes all tokens in the string
    * STROR - includes at least one token in the string
    * STROREQ - is equal to at least one token in the string
    * STRRX - matches regular expressions of the string
    * NUMEQ - is equal to the number or date
    * NUMNE - is not equal to the number or date
    * NUMGT - is greater than the number or date
    * NUMGE - is greater than or equal to the number or date
    * NUMLT - is less than the number or date
    * NUMLE - is less than or equal to the number or date
    * NUMBT - is between the two numbers or dates


Returned Structure

The returned data should look something like this:

    {
        "total_rows": 2,
        "rows": [
            {"id": "doc_id", "prop1": "val1", "prop2": "val2"},
            {"id": "doc_id", "prop3": "val_schrodinger"}
        ]
    }

The structure is probably going to have some refinements and there are a few
caveats in property names for indexing, but all in all it should be fairly easy
to figure out.

Remember that total_rows is the number of documents matching the query. At the
moment there is no way to get the total number of indexed documents for a given

That is All

Hopefully that's enough of a description to whet your appetite. I'll be adding
more features and better error messages as I go along. Hopefully I can trick a
few people into using it and sending me feedback to make it better. Like I said,
feel free to email me [8] with questions or suggestions.



Copyright Notice

Copyright 2008-2010 Paul Joseph Davis