There are a hundred different ways of distributing search traffic over a farm of search engines, from hard-coded configuration to a multicast message bus. Even when they are not hard to implement, they are hard to understand for someone starting out with search engines. Not so with Bonjour, which is a P2P service announcement of sorts. Assuming the Bonjour server/client you are using is Avahi, just run avahi-browse to find what services are running on the current network. Of course, this is not great if the farm spans more than one network. But it's so easy. I am surprised how easy it is to announce and discover.
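To make the discovery step concrete, here is a minimal sketch of how a dispatcher might turn avahi-browse output into a list of search nodes. It assumes the semicolon-separated format printed by the --parsable option, where resolved records start with "=" and carry interface, protocol, name, type, domain, host, address, port and TXT fields in that order. The function names and the sample service are made up for illustration.

```python
import subprocess


def parse_avahi_browse(output):
    """Parse `avahi-browse --parsable --resolve` output into a list of dicts.

    Assumed record layout (one per line, fields separated by ';'):
    =;interface;protocol;name;type;domain;host;address;port;txt
    """
    services = []
    for line in output.splitlines():
        fields = line.split(";")
        # Only resolved records ("=") carry host, address and port.
        if len(fields) < 9 or fields[0] != "=":
            continue
        services.append({
            "name": fields[3],
            "type": fields[4],
            "domain": fields[5],
            "host": fields[6],
            "address": fields[7],
            "port": int(fields[8]),
        })
    return services


def discover_services():
    """Run avahi-browse once and return whatever it resolved.

    Needs the avahi-browse binary and a running avahi daemon.
    """
    result = subprocess.run(
        ["avahi-browse", "--all", "--resolve", "--parsable", "--terminate"],
        capture_output=True, text=True, check=True,
    )
    return parse_avahi_browse(result.stdout)


if __name__ == "__main__":
    for service in discover_services():
        print(service["name"], service["address"], service["port"])
```

A dispatcher could then filter the list by service type and spread queries over the matching addresses. As noted above, this only sees nodes on the local network.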

It's easy to think something up and keep stroking the thought to a conclusion that looks aesthetically proper to you. It's an entirely different game to actually convince someone else of your argument. Public speaking is like a barometer of what you think and believe. These thoughts relate to the fact that I gave a talk at OSIW about search engines.

The talk was controversially called ‘Who is scared of google?’, with a sincere belief that search engines are going to evolve and not stop where they are right now, and that one could come up with specialized search engines using FOSS tools. The presentation is available here.

The verdict is in. People want a ‘normal’-looking search engine. A search engine invokes a mental map which keeps getting reinforced in our minds. Even separating out certain search results in a box entails a risk of users overlooking those links, assuming them to be ads.

In his article, Jakob Nielsen warns against trying to change the search user interface. He argues that search engines should not try to distinguish themselves with fancy front ends.
The article is available here.

Grapeshot blitz
Grapeshot is an SDK providing advanced concept-based Bayesian search methods for developers to insert “implicit search” capabilities inside an application. In plain English, a promising search engine library for developers.

The technology section summarizes various aspects of the library which set it apart from other similar projects. Some interesting features are:

  • Document clustering
  • Sentences or paragraphs can be used as queries
  • Word ranking

One feature that has been highlighted is its small footprint. Grapeshot claims to be a 300K binary.
small footprint
The bar graph shows what Grapeshot claims to be the sizes of binaries for various similar software libraries. The footprint of lucene is of particular interest. Contrary to the 11+MB claimed by the site, the lucene core jar file as of version 2.2.0 is only about 526K, which could be reduced further depending on the user's requirements.

Reducing binary footprint of lucene
Although 526K doesn’t seem like a large footprint, as an exercise one can reduce it for the embedded or mobile devices that Grapeshot targets. To reduce the binary size:

  • Run the Java application of interest with the -verbose:class flag. This prints class-loading details to stdout.
  • Run the saved output through
    cat * |grep lucene-core|cut -f2 -d' '|sort -u|tr '.' '/'| awk '{printf "%s.class\n", $1}'
    This lists all the classes from the lucene library that were loaded at runtime, one .class path per line.
  • Create a custom jar file by deleting all .class files which are not in the list.

Following this procedure for the demo application bundled with the lucene core binary, the custom jar was reduced by half, to 262K. That is less than the Grapeshot binary.

As a side note, this python script can be used to delete the files from the extracted jar.
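The linked script is not reproduced here. As a rough sketch of what such a script might look like, the following builds the keep-list from the -verbose:class output and deletes every other .class file under an extracted jar directory. It assumes log lines of the form "[Loaded org.apache.lucene.Foo from file:/.../lucene-core-2.2.0.jar]"; the function names and the jar_marker parameter are made up for illustration.

```python
import os


def classes_to_keep(verbose_log_lines, jar_marker="lucene-core"):
    """Collect .class paths for classes loaded from the jar of interest.

    Assumes each relevant line looks like:
    [Loaded org.apache.lucene.Foo from file:/path/lucene-core-2.2.0.jar]
    """
    keep = set()
    for line in verbose_log_lines:
        if jar_marker not in line:
            continue
        parts = line.split(" ")
        if len(parts) < 2:
            continue
        # Second field is the class name; turn dots into path separators.
        keep.add(parts[1].replace(".", "/") + ".class")
    return keep


def prune_extracted_jar(root, keep):
    """Delete .class files under root that are not in keep.

    Non-class files (manifest, resources) are left alone.
    Returns the sorted list of removed relative paths.
    """
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".class"):
                continue
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            if rel_path not in keep:
                os.remove(full_path)
                removed.append(rel_path)
    return sorted(removed)
```

The pruned directory can then be repacked with the jar tool. Classes that are only loaded on code paths the test run never exercised will be missing, so the run that produces the log should cover everything the application does.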

Jython 2.2 released!! Woohoo!!

Jython is a great tool for introspecting lucene indices, with a full-fledged programming language backing it.

Reading through the lucene wiki, I came across a nice list of things to try for improving indexing performance. I am listing some of the most striking ones from the page:

  • Flush by RAM usage instead of document count.
    Call writer.ramSizeInBytes() after every added doc then call flush() when it’s using too much RAM. This is especially good if you have small docs or highly variable doc sizes. You need to first set maxBufferedDocs large enough to prevent the writer from flushing based on document count. However, don’t set it too large otherwise you may hit. Somewhere around 2-3X your “typical” flush count should be OK.
  • Turn off compound file format.
    Call setUseCompoundFile(false). Building the compound file format takes time during indexing (7-33% in testing). However, note that doing this will greatly increase the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large.
  • Re-use Document and Field instances
    As of Lucene 2.3 (not yet released) there are new setValue(…) methods that allow you to change the value of a Field. This allows you to re-use a single Field instance across many added documents, which can save substantial GC cost.

    It’s best to create a single Document instance, then add multiple Field instances to it, but hold onto these Field instances and re-use them by changing their values for each added document. For example you might have an idField, bodyField, nameField, storedField1, etc. After the document is added, you then directly change the Field values (idField.setValue(…), etc), and then re-add your Document instance.

    Note that you cannot re-use a single Field instance within a Document, and, you should not change a Field’s value until the Document containing that Field has been added to the index. See Field for details.

  • Re-use a single Token instance in your analyzer
    Analyzers often create a new Token for each term in sequence that needs to be indexed from a Field. You can save substantial GC cost by re-using a single Token instance instead.
  • Use the char[] API in Token instead of the String API to represent token Text
    As of Lucene 2.3 (not yet released), a Token can represent its text as a slice into a char array, which saves the GC cost of new’ing and then reclaiming String instances. By re-using a single Token instance and using the char[] API you can avoid new’ing any objects for each term. See Token for details.

Shamelessly plugged from here.

A simple keyword search for “looking for a job as a fashion designer for an import/export company” on the big three job search engines in India gives interesting results:

  • Naukri, which claims to be the number one jobs site, provides no results for this query.
  • Timesjobs takes eons to provide results, and they are way off from the theme of the query.
  • Monster India barely provides decent results for the query.

Going into the reasons why this query fails so abjectly on such premier jobs sites requires breaking the query down a bit.

  • We have a well-formed sentence with lots of what are called stopwords. After the query parsing phase, the query should ideally be left with job, fashion designer, import/export and company. Only these keywords are relevant to the query. This is where Timesjobs fails.
  • Most search engines give every field equal priority. Monster India sets itself apart by giving higher priority to the job title.
  • Detecting the domain and job type would be a great way of enhancing keyword search. None of the engines does that so far.
  • import/export has a special character ‘/’ which is not handled well by search engines.

A good way to get these things sorted would be to pre-process queries with an appropriate analyzer.
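As an illustration of that kind of pre-processing, here is a minimal sketch that lowercases the query, splits it on anything that is not a letter or digit (which also takes care of the ‘/’ in import/export) and drops stopwords. The stopword list is a small hypothetical one, extended with the job-search filler word "looking"; a real analyzer would use a fuller list and would probably keep phrases such as "fashion designer" together.

```python
import re

# Hypothetical stopword list: a few English function words plus
# job-search filler ("looking"). Not taken from any real analyzer.
STOPWORDS = {"a", "an", "as", "for", "the", "of", "in", "looking"}


def preprocess_query(query):
    """Return the query's keywords in order, without stopwords."""
    # Splitting on non-alphanumeric runs turns "import/export"
    # into the two tokens "import" and "export".
    tokens = re.split(r"[^a-z0-9]+", query.lower())
    return [token for token in tokens if token and token not in STOPWORDS]


if __name__ == "__main__":
    query = "looking for a job as a fashion designer for an import/export company"
    print(preprocess_query(query))
    # prints ['job', 'fashion', 'designer', 'import', 'export', 'company']
```

Splitting import/export into two tokens is one possible choice; another would be to keep the compound as a single token alongside its parts.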