calendar.jpg
Time is son of a bitch. More you think, more you realize, time is a constraint. Ever so true for search engines. Time is used to restrict query bounds. It is used often and frequently the way time is stored in indices is botched up.

Frequently used way of storing date and time
Date: 12-03-2007
Time: 12:40:10
Its great from viewing point of view, but from search engine perspective its plain old stupid. Search engine would need to do a full identifier match through out the index to find a particular date and time. Lets assume a case of three dates.

  • 12-04-2007 22:00
  • 12-03-2006 10:00
  • 12-03-2007 22:00

Now if search query is looking for 12-03-2007 22:00 it will talk through all the fields to reach last row. Something on lines of:

  • 12-04-2007 22:00 not a match
  • 12-03-2006 10:00not a match
  • 12-03-2007 22:00 a match

Search engine walked about 33 characters on index to reach a conclusion that third row is a match.

Magic of morphological ordering
By changing the date and time a little to something like YYYYMMDDHHMMSS we can get a fair bit of speed advantage. So above date and time would look like:

  • 200704122200
  • 200603121000
  • 200703122200

Looking at number of operations for same query

  • 200704122200 not a match
  • 200603121000 not a match
  • 200703122200 a match

Search engine walked about 24 characters on index to reach a conclusion that third row is a match. If you notice, in case of second row it took 4 characters for search engine to conclude a mismatch.

Range Query
Range query is a search query with constraint value bounds. Lets assume we need something between 12-03-2007 to 12-04-2007. With morphologically ordered date/time we convert the values in the index into integers and calculate if a row is between 20070312000000 and 20070412240000. This operation is by many orders simpler than doing a string match.

Advertisements