How does Lucene work


Translate

I would like to find out how lucene search works so fast. I can't find any useful docs on the web. If you have anything (short of lucene source code) to read, let me know.

A text search query using mysql5 text search with index takes about 18 minutes in my case. A lucene search for the same query takes less than a second.


All Answers
  • Translate

    Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index for each word. Since the index is an exact string-match, unordered, it can be extremely fast. Hypothetically, an SQL unordered index on a varchar field could be just as fast, and in fact I think you'll find the big databases can do a simple string-equality query very quickly in that case.

    Lucene does not have to optimize for transaction processing. When you add a document, it need not ensure that queries see it instantly. And it need not optimize for updates to existing documents.

    However, at the end of the day, if you really want to know, you need to read the source. Both things you reference are open source, after all.


  • Translate

    Lucene creates a big index. The index contains word id, number of docs where the word is present, and the position of the word in those documents. So when you give a single word query it just searches the index (O(1) time complexity). Then the result is ranked using different algorithms. For multi-word query just take the intersection of the set of files where the words are present. Thus Lucene is very very fast.

    For more info read this article by Google developers- http://infolab.stanford.edu/~backrub/google.html


  • Translate

    In a word: indexing.

    Lucene creates an index of your document that allows it to search much more quickly.

    It's the same difference between a list O(N) data structure and a hash table O(1) data structure. The list has to walk through the entire collection to find what you want. The hash table has an index that lets it figure out exactly where the desired item is and simply fetch it.

    Update:

    I'm not certain what you mean by "Lucene index searches are a lot faster than mysql index searches."

    My guess is that you're using MySQL "WHERE document LIKE '%phrase%'" to search for a document. If that's true, then MySQL has to do a table scan on every row, which will be O(N).

    Lucene gets to parse the document into tokens, group them into n-grams at your direction, and calculate indexes for each one of those. It's O(1) to find a word in an indexed Lucene document.


  • Translate

    Lucene works with Term frequency and Inverse document frequency. It creates an index mapping each word with the document and it's frequency count which is nothing but inverse index on the document.

    Example :

    File 1 : Random Access Memory is the main memory.

    File 2 : Hard disks are secondary memory.

    Lucene creates a reverse index something like

    File 1 :

    Term : Random

    Frequency : 1

    Position : 0

    Term : Memory

    Frequency : 2

    Position : 3

    Position : 6

    So it is able to search and retrieve the searched content quickly. When there is too many matches for the search query it outputs the result based on the weight. Consider the search query "Main Memory" it searches for all 4 words individually and the result would be like,

    Main

    File 1 : Frequency - 1

    Memory

    File 1 : Frequency - 2

    File 2 : Frequency - 1

    The result would be File1 followed by File2. To stop getting carried away by weights on most common words like 'and', 'or', 'the' it considers the inverse document frequency (ie' it decreases the weight of the word which is most popular among the document set).