Introducing EFTI - Full Text Indexing in Erlang
===============================================
I've just posted the first version of EFTI [1] (Full Text Indexing in
Erlang [2]) project on GitHub [3]. Its a first stab/proof of concept
project, so don't laugh too hard if you look at it. As per suggestion of Chris
Anderson [4] I'm sitting down to write a hopefully explanatory post on the
design decisions etc.
Indexing
--------
The indexing is based on an inverted index [5] using CouchDB's [6] b-tree [7]
implementation. I chose using the b-tree code instead of Mnesia [8] due to the
fact that I'm aiming to integrate this into CouchDB as a plugin indexer (The
actual plugin framework doesn't exist yet, but Damien Katz [10] is working on
the necessary refactoring to make this a breeze).
Index Steps
-----------
1. Tokenize the input document
2. Apply processor list to each token in the document
3. Store each processed token with the list of positions it occurs at in the
inverted index.
The tokenizer and list of processors applied to the input document are both
configurable. Currently I have two tokenizers that support either splitting
tokens on any whitespace element or splitting tokens on any character not
provided in a custom alphabet.
The processors that I've been using include a length threshold, ignore words
list, and Porter stemming [11]. I borrowed the Erlang Porter stemmer
implementation from Alden Dima [12] which can be found at [13].
An example index configuration [14] is available.
Querying
--------
Querying is a bit more complex. First, an example configuration so we have
some idea what we're talking about:
{config, reader, nil}.
{runner, plain, nil}.
{tokenizer, whitespace, nil}.
{processor, word_length, 3}.
{processor, stemmer, {en_us, nil}}.
{matcher, trigram, {0.5, 1.0}}.
{rank, hit_count, nil}.
Query Steps
-----------
1. Parse the input query into clauses
2. For each clause:
1. Tokenize
2. Apply list of processors
3. Match processed tokens to database
3. Combine matches to the database
4. Rank results
Referencing our example config above we can pretty much see how these things
might be configured. The only thing that's a bit tricky is the fact that the
runner is responsible for both the clause and the combine steps. This is to
eventually support boolean logic in queries. The code that split the query
should be associated with the code that combines the results etc.
Notes/Caveats
-------------
Obviously there's a lot of work left. Adding support for more powerful
processors, tokenizers, runners, matchers, and rankers. As well as optimizing
the common internals.
Also, I need to work on adding things in like OTP gen_server stuffs. Right now
its serial execution. Yes. Serial in Erlang. I don't have much experience with
the OTP stuff and what not, but that's on the agenda after a few more tweaks.
Also, the build system blows. Deal with it. I'll get around to making actual
driver programs before too long.
Request for Comment (Plea for Help)
-----------------------------------
If you look at efti_query.erl [15] you'll notice that it would be
prohibitively expensive for a large dataset. As it is now all results are stored
into a dict keyed by {document id, query_word, index_word} with position
information as a value. This will eventually blow up trying to store an entire
result set in memory I imagine.
Thing is, I have no idea on how people stream and/or rank results without doing
something similar. Obviously, I could just hold the document id's in memory and
go back to disk for position information if/when needed, but that just delays
the inevitable of getting to the point where too many document id's are stored
in memory.
If anyone out there has any idea on how to design this part of the system to Not
Suck ™ I would be much obliged. Keep in mind that ideally we would be able
to stream and seek through the result set to make things down the road more
efficient. (Think index set math in CouchDB)
References
----------
[1]: http://github.com/davisp/efti
[2]: http://erlang.org
[3]: http://github.com
[4]: http://jchris.mfdz.com/
[5]: http://en.wikipedia.org/wiki/Inverted_index
[6]: http://incubator.apache.org/couchdb/
[7]: http://en.wikipedia.org/wiki/B-tree
[8]: http://www.erlang.org/doc/apps/mnesia/index.html
[9]: http://damienkatz.net/
[10]: http://en.wikipedia.org/wiki/Stemming
[11]: mailto:alden.dima@nist.gov
[12]: http://tartarus.org/~martin/PorterStemmer/
[13]: http://github.com/davisp/efti/tree/master/index.conf
[14]: http://github.com/davisp/efti/tree/master/src/efti/efti_query.erl
Copyright Notice
----------------
Copyright 2008-2010 Paul Joseph Davis
License
-------
http://creativecommons.org/licenses/by/3.0/