Reading through the Lucene wiki, I came across a nice list of things to try for improving indexing performance. I am listing some of the most striking ones from the page:

  • Flush by RAM usage instead of document count.
    Call writer.ramSizeInBytes() after every added doc, then call flush() when it’s using too much RAM. This is especially good if you have small docs or highly variable doc sizes. You need to first set maxBufferedDocs large enough to prevent the writer from flushing based on document count. However, don’t set it too large either (the wiki warns this can cause problems of its own); somewhere around 2-3X your “typical” flush count should be OK. A short sketch of this loop appears after the list.
  • Turn off compound file format.
    Call setUseCompoundFile(false) (this call also appears in the first sketch after the list). Building the compound file format takes time during indexing (7-33% in testing). However, note that doing this will greatly increase the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large.
  • Re-use Document and Field instances
    As of Lucene 2.3 (not yet released) there are new setValue(…) methods that allow you to change the value of a Field. This allows you to re-use a single Field instance across many added documents, which can save substantial GC cost.

    It’s best to create a single Document instance, then add multiple Field instances to it, but hold onto these Field instances and re-use them by changing their values for each added document. For example you might have an idField, bodyField, nameField, storedField1, etc. After the document is added, you then directly change the Field values (idField.setValue(…), etc), and then re-add your Document instance.

    Note that you cannot re-use a single Field instance within a Document, and you should not change a Field’s value until the Document containing that Field has been added to the index. See Field for details. A sketch of this re-use pattern appears after the list.

  • Re-use a single Token instance in your analyzer
    Analyzers often create a new Token for each term in sequence that needs to be indexed from a Field. You can save substantial GC cost by re-using a single Token instance instead.
  • Use the char[] API in Token instead of the String API to represent token text
    As of Lucene 2.3 (not yet released), a Token can represent its text as a slice into a char array, which saves the GC cost of new’ing and then reclaiming String instances. By re-using a single Token instance and using the char[] API you can avoid new’ing any objects for each term. See Token for details. A sketch combining this with Token re-use appears after the list.
  • Shamelessly plugged from here
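
To make the first two tips concrete, here is a rough sketch of the flush-by-RAM loop, with the compound-file setting thrown in. It targets the Lucene 2.3-era IndexWriter API described above; the class name, the 48 MB budget and the maxBufferedDocs value are illustrative choices of mine, not from the wiki.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class RamFlushIndexer {

    // Assumed RAM budget for buffered documents; tune this for your heap.
    private static final long RAM_BUDGET_BYTES = 48 * 1024 * 1024;

    public static void indexAll(IndexWriter writer, Iterable<Document> docs) throws IOException {
        // Large enough that the RAM check below fires first, but only about
        // 2-3X a typical flush count, per the tip above.
        writer.setMaxBufferedDocs(30000);
        // Second tip: skip building the compound file during indexing.
        writer.setUseCompoundFile(false);

        for (Document doc : docs) {
            writer.addDocument(doc);
            // Flush manually once the buffered documents use too much RAM.
            if (writer.ramSizeInBytes() > RAM_BUDGET_BYTES) {
                writer.flush();
            }
        }
    }
}
```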
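
The Field re-use tip might look roughly like this. The field names, the (id, body) input pairs and the class are hypothetical; setValue(…) and the Field constructor are the Lucene 2.3-era API mentioned above.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ReusingFieldIndexer {

    /** Indexes (id, body) pairs while re-using one Document and its Fields. */
    public static void indexAll(IndexWriter writer, String[][] idBodyPairs) throws IOException {
        // Create the Document and its Field instances once, up front.
        Document doc = new Document();
        Field idField = new Field("id", "", Field.Store.YES, Field.Index.UN_TOKENIZED);
        Field bodyField = new Field("body", "", Field.Store.NO, Field.Index.TOKENIZED);
        doc.add(idField);
        doc.add(bodyField);

        for (String[] pair : idBodyPairs) {
            // Only change values between addDocument() calls, never mid-add.
            idField.setValue(pair[0]);   // setValue(...) is new in Lucene 2.3
            bodyField.setValue(pair[1]);
            writer.addDocument(doc);     // re-add the same Document instance
        }
    }
}
```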
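
Finally, a sketch combining the last two tips: a toy whitespace tokenizer that re-uses the Token handed to next(Token) (the reusable API added in Lucene 2.3) and fills it through the char[] API (setTermBuffer) instead of creating a String per term. The class name and the fixed 256-char term buffer are assumptions for illustration.

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

/** Toy whitespace tokenizer that re-uses the caller-supplied Token. */
public class ReusingWhitespaceTokenizer extends Tokenizer {

    private final char[] buffer = new char[256]; // assumed max term length
    private int offset = 0;                      // chars read from the input

    public ReusingWhitespaceTokenizer(Reader input) {
        super(input);
    }

    /** Reusable-Token API added in Lucene 2.3: fill and return the supplied Token. */
    public Token next(Token result) throws IOException {
        int length = 0;
        int start = offset;
        while (true) {
            int c = input.read();
            if (c == -1) {
                if (length == 0) {
                    return null;        // end of stream, no pending term
                }
                break;
            }
            offset++;
            if (Character.isWhitespace((char) c)) {
                if (length == 0) {
                    start = offset;     // skip leading whitespace
                    continue;
                }
                break;                  // end of the current term
            }
            if (length < buffer.length) {
                buffer[length++] = (char) c;
            }
        }
        // Hand the term to the Token as a char[] slice; no String is created.
        result.clear();
        result.setTermBuffer(buffer, 0, length);
        result.setStartOffset(start);
        result.setEndOffset(start + length);
        return result;
    }
}
```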
