Time is a son of a bitch. The more you think about it, the more you realize that time is a constraint, and that is especially true for search engines. Time is often used to restrict query bounds, and just as often the way time is stored in indices is botched.

Frequently used way of storing date and time
Date: 12-03-2007
Time: 12:40:10
It’s great from a viewing point of view, but from a search engine’s perspective it’s plain old stupid. The search engine has to do a full string match throughout the index to find a particular date and time. Let’s take a case of three dates.

  • 12-04-2007 22:00
  • 12-03-2006 10:00
  • 12-03-2007 22:00

Now if the search query is looking for 12-03-2007 22:00, it will walk through all the fields to reach the last row. Something along the lines of:

  • 12-04-2007 22:00 not a match
  • 12-03-2006 10:00 not a match
  • 12-03-2007 22:00 a match

The search engine walked about 31 characters of the index (5 + 10 + 16, counting up to and including the first mismatching character in each row) to conclude that the third row is a match.

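Magic of lexicographic ordering
By changing the date and time a little, to something like YYYYMMDDHHMMSS, we can get a fair bit of speed advantage: the most significant part comes first, so plain string order matches chronological order. The dates and times above (shown without seconds, as YYYYMMDDHHMM) would look like: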

  • 200704122200
  • 200603121000
  • 200703122200

Looking at the number of operations for the same query:

  • 200704122200 not a match
  • 200603121000 not a match
  • 200703122200 a match

The search engine walked about 22 characters of the index (6 + 4 + 12) to conclude that the third row is a match. Notice that for the second row it took only 4 characters to conclude a mismatch.
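The character walk can be reproduced with a few lines of Java. This is only a sketch of the counting argument, not of how a real index is scanned: the class and method names are made up for illustration, and it assumes every character is counted up to and including the first mismatch. Under that rule the display-format rows cost 5 + 10 + 16 characters and the reordered rows cost 6 + 4 + 12.

```java
final class CharWalk {
    /** Characters examined when comparing a stored value to the target,
     *  counting the first mismatching character as examined. */
    static int charsCompared(String stored, String target) {
        int limit = Math.min(stored.length(), target.length());
        for (int i = 0; i < limit; i++) {
            if (stored.charAt(i) != target.charAt(i)) {
                return i + 1; // the mismatching character was looked at too
            }
        }
        return limit; // every character matched
    }

    /** Total characters examined across all rows for one target. */
    static int totalWalk(String[] rows, String target) {
        int total = 0;
        for (String row : rows) {
            total += charsCompared(row, target);
        }
        return total;
    }

    public static void main(String[] args) {
        String[] display = {"12-04-2007 22:00", "12-03-2006 10:00", "12-03-2007 22:00"};
        String[] sortable = {"200704122200", "200603121000", "200703122200"};
        System.out.println("display format:  " + totalWalk(display, "12-03-2007 22:00"));
        System.out.println("sortable format: " + totalWalk(sortable, "200703122200"));
    }
}
```

The saving comes from where the rows differ: with the year first, a row from the wrong year is rejected after a handful of characters.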

Range Query
A range query is a search query whose values are constrained by bounds. Let’s assume we need something between 12-03-2007 and 12-04-2007. With lexicographically ordered date/time values we can treat the values in the index as integers and check whether a row falls between 20070312000000 and 20070412235959. This is far simpler than parsing and comparing display-formatted date strings.
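Here is a minimal sketch of that range check in plain Java, using java.time instead of any Lucene class. The class name, method names and the dd-MM-yyyy HH:mm:ss input format are assumptions made for this example. It shows both the integer comparison and the equivalent string comparison, which give the same answer because all keys have the same width.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class SortableDate {
    // Display format from the examples, extended with seconds (an assumption).
    private static final DateTimeFormatter DISPLAY =
            DateTimeFormatter.ofPattern("dd-MM-uuuu HH:mm:ss");
    // Most significant unit first, so string order equals chronological order.
    private static final DateTimeFormatter SORTABLE =
            DateTimeFormatter.ofPattern("uuuuMMddHHmmss");

    /** Converts a display-formatted date/time into a 14-digit sortable key. */
    static String toKey(String display) {
        return LocalDateTime.parse(display, DISPLAY).format(SORTABLE);
    }

    /** Range check on the keys read as integers (14 digits fit in a long). */
    static boolean inRange(String key, String lower, String upper) {
        long value = Long.parseLong(key);
        return value >= Long.parseLong(lower) && value <= Long.parseLong(upper);
    }

    /** Same check using plain string order; valid only for equal-width keys. */
    static boolean inRangeLexical(String key, String lower, String upper) {
        return key.compareTo(lower) >= 0 && key.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String lower = "20070312000000";
        String upper = "20070412235959";
        String[] rows = {"12-04-2007 22:00:00", "12-03-2006 10:00:00", "12-03-2007 22:00:00"};
        for (String row : rows) {
            String key = toKey(row);
            System.out.println(key + " in range: " + inRange(key, lower, upper));
        }
    }
}
```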

A design problem many sites deploying a search engine will face is choosing between IndexSearcher and MultiSearcher. Lucene gives access to search capability through a Searcher class. A Searcher accepts a query and returns a list of Hits, sorted by relevance by default. Searcher is an abstract class, so it is possible to write a customized concrete Searcher. Two Searcher classes are already available: IndexSearcher, which loads a Lucene index from disk, and MultiSearcher, which searches a list of Lucene indices. MultiSearcher does the additional step of running a merge sort after the indices return their results.
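To get a feel for that extra merge step, here is a rough sketch in plain Java. It is not Lucene’s code: the Hit class and the merge method are hypothetical stand-ins, and the sketch assumes each index returns its hits already sorted by descending score and that scores are comparable across indices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class MergeHits {
    /** Hypothetical stand-in for a search hit: a document id and a score. */
    static final class Hit {
        final String docId;
        final double score;
        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Read position inside one index's result list. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;
        Cursor(List<Hit> hits) { this.hits = hits; }
        Hit current() { return hits.get(pos); }
    }

    /** Merges per-index result lists (each sorted by descending score)
     *  into one list sorted by descending score. */
    static List<Hit> merge(List<List<Hit>> perIndex) {
        // The queue always exposes the cursor whose current hit scores highest.
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                (a, b) -> Double.compare(b.current().score, a.current().score));
        for (List<Hit> hits : perIndex) {
            if (!hits.isEmpty()) {
                queue.add(new Cursor(hits));
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor best = queue.poll();
            merged.add(best.current());
            best.pos++;
            if (best.pos < best.hits.size()) {
                queue.add(best); // re-queue with its next hit
            }
        }
        return merged;
    }
}
```

With a single index there is nothing to merge, which is the extra cost a multi-index setup pays on every query.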

Why the question of IndexSearcher vs. MultiSearcher
While pondering in a meeting room with nothing but an empty drawing board, it doesn’t take a design team long to conclude that certain search criteria will be used more than others. The simple thing would then be to make small, manageable indices for those specific criteria and a separate index for general search.

Why not to take this decision at the outset

  • Lucene in its default configuration is fast enough for most search requirements. Don’t split indices as a premature optimization.
  • It is not a good option for distributing indices over many disks. It’s easier to put the disks in a RAID 0 configuration.
  • It’s simpler to maintain a single-index configuration.
  • It involves the extra cost of running a merge sort.

In some situations it makes sense to distribute indices because the frequency of a particular search criterion is heavily skewed. Even then, using many indices behind a load balancer would be better. MultiSearcher does fulfill a certain niche, but it’s a premature optimization for most.

Most of you are probably familiar with the 80/20 rule. The rule states that 80% of results come from 20% of causes. In job search this rule is even more extreme. A great search engine can quickly become addictive for a head-hunter.

A smashing search engine for the portal can help the site grow rapidly, so it’s important to do everything to take search from good enough to great. If you are starting out, you will need to do more to make an impact.

What makes a good job search engine
Job search engines come in all shapes and sizes, but they share important qualities.

  • Simple: The search engine needs to be simple to use. Complex forms are off-putting. Extra complexity can be revealed only when required: instead of bringing up 40 inputs in one go, logical sets of related fields can be hidden or shown according to user input.
  • Fast: The search data can become large, yet the engine must be able to sail through it to provide the relevant results. Faster search allows the user to run more searches and refine them better.
  • Saved Search: Being able to define a query and run it frequently is a great option. Many individuals look for the same kind of profile over and over, looking up the most relevant resumes.
  • Sub-query: Being able to refine a query and search through the set produced by a previous search. For example, an individual searches for Java and then, from the result set of that query, finds the people who also happen to be well versed in C++.
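The sub-query idea boils down to searching only inside the previous result set. Below is a tiny sketch of that in plain Java; the resume ids, the skills map and the refine method are all invented for illustration, and a real engine would query its index rather than loop over a map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SubQuery {
    /** Keeps only those documents from the previous result set
     *  that also list the given skill. */
    static Set<String> refine(Set<String> previous,
                              Map<String, List<String>> skillsByDoc,
                              String skill) {
        Set<String> refined = new LinkedHashSet<>();
        for (String doc : previous) {
            List<String> skills = skillsByDoc.get(doc);
            if (skills != null && skills.contains(skill)) {
                refined.add(doc);
            }
        }
        return refined;
    }

    public static void main(String[] args) {
        Map<String, List<String>> skillsByDoc = new HashMap<>();
        skillsByDoc.put("resume1", Arrays.asList("Java", "C++"));
        skillsByDoc.put("resume2", Arrays.asList("Java"));
        skillsByDoc.put("resume3", Arrays.asList("C++"));

        // Result of an earlier search for Java.
        Set<String> javaResults = new LinkedHashSet<>(Arrays.asList("resume1", "resume2"));
        // Refine within those results only; resume3 is never considered.
        System.out.println(refine(javaResults, skillsByDoc, "C++"));
    }
}
```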

Using Lucene
Lucene is an open source search engine library. It can be used for indexing gigabytes of data.

Lucene Indexes
Lucene stores data in a search index. A Lucene index is very similar to the ‘Index’ section of a book. Let’s assume 4 documents, each containing a set of words.

Normal index
Doc1 – Software Engineer, Java, C++
Doc2 – Sales, Tele-Sales
Doc3 – HR, Headhunting
Doc4 – Sales, Manager

Inverted Index
C++ – Doc1
Headhunting – Doc3
HR – Doc3
Java – Doc1
Manager – Doc4
Sales – Doc2, Doc4
Software Engineer – Doc1
Tele-Sales – Doc2

Lucene uses an inverted index, which, as you can see, makes it easy to look up a term such as ‘Tele-Sales’. We can quickly work out that Doc2 contains it. With a normal index, all the documents would need to be read to reach the same conclusion. Lucene indexes are FAST.
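The inversion itself is easy to sketch in plain Java. This is a toy illustration of the data structure using the four documents above, not how Lucene stores its index on disk; the class and method names are made up. A case-insensitive sorted map is used so the terms come out in the same order as the listing above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexSketch {
    /** The "normal" index from the example: document -> terms. */
    static Map<String, List<String>> sampleDocs() {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("Doc1", Arrays.asList("Software Engineer", "Java", "C++"));
        docs.put("Doc2", Arrays.asList("Sales", "Tele-Sales"));
        docs.put("Doc3", Arrays.asList("HR", "Headhunting"));
        docs.put("Doc4", Arrays.asList("Sales", "Manager"));
        return docs;
    }

    /** Inverts document -> terms into term -> documents. */
    static SortedMap<String, SortedSet<String>> invert(Map<String, List<String>> termsByDoc) {
        // Case-insensitive ordering; terms differing only by case share one entry.
        SortedMap<String, SortedSet<String>> index =
                new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (Map.Entry<String, List<String>> entry : termsByDoc.entrySet()) {
            for (String term : entry.getValue()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        SortedMap<String, SortedSet<String>> index = invert(sampleDocs());
        // One map lookup instead of reading every document.
        System.out.println("Tele-Sales -> " + index.get("Tele-Sales"));
        System.out.println("Sales -> " + index.get("Sales"));
    }
}
```

Note that the lookup works on whole terms: ‘Tele-Sales’ is found, while a fragment such as ‘Tele’ is not, unless the analysis step splits the term when indexing.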

Storing data in indexes
While fast, indexes can get bogged down if they are not used correctly. Lucene gives five options for the field type used to store the search data:

  • String field type is used for keyword identifiers. The most pertinent usage is for proper nouns which independently identify a context: someone’s name, location, job profile.
  • Numeric field is a bunch of field types. One could store numbers as their plain text version, but the best option is to convert them into a sortable string form. Doing this means Lucene changes the number into lexicographically ordered text, making querying fast.
  • Date field should be stored using Lucene’s date helper classes (DateField, or DateTools in later releases), which convert a date/time into a string that sorts chronologically, such as the YYYYMMDDHHMMSS form. This speeds up lexicographic search and range queries.
  • SortField is a tricky business. A good example of using SortField is when the search requires sorting by something other than relevance, such as the date a resume was posted.
  • Text field is where the heart and soul of Lucene rests. Text fields are just large unstructured text which can be analyzed using various analysis sequences in Lucene and then indexed. This allows you to run full-text queries on these fields. What is of vital importance is to find the analysis sequence which best suits your domain. If minimal analysis is used, the index can become large and full of irrelevant terms; if it’s made too aggressive, it can leave blind spots on important search terms.
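Why numbers need converting before they are stored as text is easiest to see with a sort. The sketch below uses zero-padding to a fixed width as one simple way to get lexicographically ordered numbers; that is an assumption for illustration and not necessarily the encoding Lucene uses. The width of 10 digits and the class and method names are also made up.

```java
import java.util.Arrays;
import java.util.Locale;

final class PaddedNumbers {
    /** Left-pads a non-negative number with zeros to 10 digits. */
    static String pad(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        return String.format(Locale.ROOT, "%010d", value);
    }

    /** Returns a copy of the values sorted as plain text. */
    static String[] sortedAsText(String... values) {
        String[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Plain text versions sort in the wrong numeric order.
        System.out.println(Arrays.toString(sortedAsText("9", "10", "200")));
        // Padded versions sort in numeric order.
        System.out.println(Arrays.toString(sortedAsText(pad(9), pad(10), pad(200))));
    }
}
```

Sorted as plain text, 9 lands after 200; once padded, text order and numeric order agree, which is what a text-based range query relies on.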

You can also set flags on fields, which tell Lucene how to treat the field.

  • Stored should be set to True if a field needs to be displayed.
  • Indexed should be set to True if a field needs to be searchable.
  • Tokenized should be set to True if a field needs to go through the analysis process before indexing.
  • Compressed should be set to True if the field needs to be compressed on disk. Lucene can still search through compressed fields.
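What the analysis process does to a tokenized field, and how an aggressive sequence can create a blind spot, can be sketched without Lucene at all. The two methods below are hypothetical toy analyzers, not Lucene analyzers: one only splits on whitespace, the other lowercases, splits on anything that is not a letter or digit, and drops a few stop words chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalysisSketch {
    // Stop-word list chosen arbitrarily for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "the", "with", "in"));

    /** Minimal analysis: split on whitespace, keep everything else as is. */
    static List<String> minimal(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Aggressive analysis: lowercase, split on non-alphanumerics, drop stop words. */
    static List<String> aggressive(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Software Engineer with Java and C++";
        System.out.println(minimal(text));
        // Here "C++" is reduced to "c", so a search for C++ has nothing to match.
        System.out.println(aggressive(text));
    }
}
```

For a job portal, terms like C++ are exactly the ones recruiters search for, so the analysis sequence needs to be checked against the domain vocabulary before it goes anywhere near production.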

Although it does not cover every area, Lucene provides a great starting point for a smashingly great search engine component for job search.