Simple Ruby on Rails Full Text Search Using Xapian

In my previous blog post, I wrote about the several full-text search engines that are available to us as Rails programmers and how we ultimately decided to go with Xapian.

Background

I am by no means a Xapian expert. I first heard about it less than a week ago. However, we have been very impressed with its combination of ease of installation and robust features.

Xapian is written in C++ and has been around in its current form since 2001. Its code base is derived from a previous search tool that went closed source.

  • Written in C++
  • Feature rich including relevance feedback, boolean search operators, stemming, wildcard searching, and similar results

Its user list includes del.icio.us and GMane, both of which index tens of millions of records, so scalability should not be an issue. Here is a quote about scalability from the site:

People often want to know how Xapian will scale. The short answer is "very well" - a previous version of the software
powered BrightStation's Webtop search engine, which offered a search over around 500 million web pages (around 1.5 terabytes of database files). 
Searches took less than a second.

The largest recent installation we're aware of is probably gmane, which currently indexes over 50 million mail messages.

On the other hand, indexing can take a while if you have a huge set of documents. According to this article, it took about 5 hours to index the entire Wikipedia database using Xapian for offline Wikipedia access and searching.

Acts_As_Xapian is the Rails plugin for Xapian. It was written in May 2008 by Francis Irving after he experienced some frustrations with Solr. Acts_As_Xapian has a Google group at http://groups.google.com/group/acts_as_xapian.

Xapian Installation:

For Ubuntu Users:

sudo apt-get install libxapian15 libxapian-ruby1.8

For Mac / Other: (taken from the xapian.org install docs)

(version = 1.0.6 at the time of this writing)

wget http://oligarchy.co.uk/xapian/1.0.6/xapian-core-1.0.6.tar.gz
wget http://oligarchy.co.uk/xapian/1.0.6/xapian-bindings-1.0.6.tar.gz

Alternatively, if you haven’t installed wget on the Mac:

curl -O http://oligarchy.co.uk/xapian/1.0.6/xapian-core-1.0.6.tar.gz
curl -O http://oligarchy.co.uk/xapian/1.0.6/xapian-bindings-1.0.6.tar.gz

tar zxvf xapian-core-versionnumber.tar.gz
tar zxvf xapian-bindings-versionnumber.tar.gz

cd xapian-core-versionnumber
./configure --prefix=/opt
make
sudo make install

If you don’t have root access to install Xapian, you can specify a prefix in your home directory, for example:
./configure --prefix=/home/jenny/xapian-install

cd ../xapian-bindings-versionnumber
./configure XAPIAN_CONFIG=/opt/bin/xapian-config
make
sudo make install

Acts_As_Xapian Installation:

Install the plugin:

script/plugin install git://github.com/frabcus/acts_as_xapian.git

Generate and execute the acts_as_xapian migration:

script/generate acts_as_xapian
rake db:migrate

Basic Model Code

To add fields to your index, add an “acts_as_xapian” call to the model you want to index.

class Lesson < ActiveRecord::Base
  acts_as_xapian :texts => [:name, :description]
end

In this case, only the data found in the “name” and “description” attributes of our Lesson model will be searched.

A Quick Test Of Your Index

At this point, you can quickly test that your Xapian install, plugin, and model code are all working together via the following rake commands:

To build the index:

rake xapian:rebuild_index models="Lesson" RAILS_ENV=development

(You can add more models, separated by spaces: models="Lesson User Tag")

To update the index:
rake xapian:update_index RAILS_ENV=development

To test the index:
rake xapian:query models="Lesson" query="golf" RAILS_ENV=development

The Basic Search

Adding this line of code

@search = ActsAsXapian::Search.new([Lesson], @lesson_search_params)

will return an ActsAsXapian::Search object with the following (from the acts_as_xapian README):

  • description – a techy one, to check how the query has been parsed
  • matches_estimated – a guesstimate at the total number of hits
  • spelling_correction – the corrected query string if there is a correction, otherwise nil
  • words_to_highlight – list of words for you to highlight, perhaps with TextHelper::highlight
  • results – an array of hashes each containing:

    • :model – your Rails model, this is what you most want!
    • :weight – relevancy measure
    • :percent – the weight as a %, 0 meaning the item did not match the query at all
    • :collapse_count – number of results with the same prefix, if you specified collapse_by_prefix

So, if we want a list of ActiveRecord model objects, we could do something like this:

@lessons = @search.results.collect {|r| r[:model]}

To get the recommended spelling for misspelled words in a search, we would do:

@corrections = @search.spelling_correction
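
Since spelling_correction is nil when there is nothing to correct, the view needs to handle both cases. Below is a minimal sketch of one way to do that; the did_you_mean helper is a hypothetical name of my own, not part of the plugin, and it only assumes the String-or-nil behaviour described in the README list above.

```ruby
# Builds a "Did you mean ...?" message from a spelling correction.
# `correction` is whatever spelling_correction returned: a String or nil.
# Returns nil when there is nothing useful to suggest.
def did_you_mean(original_query, correction)
  return nil if correction.nil? || correction == original_query
  "Did you mean \"#{correction}\"?"
end

puts did_you_mean("gofl", "golf")
```

In a controller you might call it as did_you_mean(@lesson_search_params, @search.spelling_correction) and only render the message when the result is not nil.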

Similar Results

If you have a requirement to show similar or “You may also like” results based on a search result set or a single result, this can be done with a single line of code:

@similar_lessons = ActsAsXapian::Similar.new([Lesson], @lessons).results.collect {|r| r[:model]}
where @lessons in this case would be the list of lessons returned in the original search.

Updating The Index

Unfortunately, one drawback of Acts_As_Xapian is that the index is not updated automatically when data in an indexed model is added, modified, or deleted. Instead, when model data changes, the plugin puts a record in the ‘acts_as_xapian_jobs’ table in your database to tell the update task what needs reindexing. You will need a cron job or something similar to call the following rake task periodically to keep the index up to date:

rake xapian:update_index
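
The practical consequence is a delay between saving a record and seeing it in search results. The toy, in-memory sketch below illustrates that queue-then-process pattern in plain Ruby. It is not the plugin's code: the ToyIndexQueue class, its method names, and the job hash layout are all hypothetical, with an array standing in for the jobs table and a hash standing in for the Xapian database.

```ruby
# Toy model of deferred (offline) indexing. NOT the plugin's implementation;
# every name here is made up for illustration.
class ToyIndexQueue
  attr_reader :jobs, :index

  def initialize
    @jobs  = []   # plays the role of the acts_as_xapian_jobs table
    @index = {}   # plays the role of the search index: id => text
  end

  # Called when a record changes; only records the work to do.
  def record_change(id, action, text = nil)
    @jobs << { :id => id, :action => action, :text => text }
  end

  # Called periodically (the cron job); applies queued changes in order
  # and returns how many jobs were processed.
  def update_index
    processed = 0
    until @jobs.empty?
      job = @jobs.shift
      if job[:action] == :destroy
        @index.delete(job[:id])
      else
        @index[job[:id]] = job[:text]
      end
      processed += 1
    end
    processed
  end
end

queue = ToyIndexQueue.new
queue.record_change(1, :update, "Golf basics")
queue.record_change(2, :update, "Tennis serve")
queue.record_change(2, :destroy)

puts queue.index.size          # 0: nothing is searchable until the update runs
puts queue.update_index        # 3: all queued jobs processed
puts queue.index.keys.inspect  # [1]: record 2 was added and then removed
```

How often you run the real rake task decides how stale the search results can get, so pick the cron interval with that in mind.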

Francis, the acts_as_xapian creator, commented on this on my original Mulling Over Our Ruby On Rails Full Text Search Options blog post.

with Xapian you can update and search simultaneously, and updates are immediate. However, only one thread can update a Xapian database at the same time. Since I wanted offline indexing anyway (as my index operation is risky, complex and slow, involving parsing Word documents, PDFs etc.), I didn’t try to find a solution that causes a second thread in the web application to, say, wait for the database lock. So acts_as_xapian currently only supports offline indexing.

Conclusion

This is a very basic tutorial meant to highlight the steps needed to get up and running with simple Xapian searches quickly. There are quite a few topics that I did not touch on such as:


  • Sorting and grouping with the :values option in the model

  • Advanced searching with the :terms option in the model

  • Adding extended attributes from other models into a searched index. (Such as adding lesson comments to the Lesson index)

  • Filtering searched data (in our case, we had to filter our results so only “active” lessons were returned)

  • Highlighting keywords in the search results