Simple Ruby On Rails Full Text Search Using Xapian
by Jim Mulholland, Wed, 23 Jul 2008 12:49:00 GMT. 14 comments so far.
In my previous blog post, I wrote about the several full-text search engines that are available to us as Rails programmers and how we ultimately decided to go with Xapian.
Background
I am by no means a Xapian expert. I first heard about it less than a week ago. However, we have been very impressed with its combination of ease of installation and robust features.
Xapian is written in C++ and has been around in its current form since 2001. Its code base is derived from a previous search tool that went closed source.
- Written in C++
- Feature rich including relevance feedback, boolean search operators, stemming, wildcard searching, and similar results
Its user list includes del.icio.us and GMane, both of which index tens of millions of records, so scalability should not be an issue. Here is a quote about scalability from the site:
People often want to know how Xapian will scale. The short answer is "very well" - a previous version of the software powered BrightStation's Webtop search engine, which offered a search over around 500 million web pages (around 1.5 terabytes of database files). Searches took less than a second. The largest recent installation we're aware of is probably gmane, which currently indexes over 50 million mail messages.
On the other hand, indexing time could take a while if you have a huge set of documents. According to this article, it took about 5 hours to index the entire Wikipedia database using Xapian for offline Wikipedia access and searching.
Acts_As_Xapian is the Rails plugin for Xapian. It was written in May 2008 by Francis Irving after he experienced some frustrations with Solr. Acts_As_Xapian has a Google group at http://groups.google.com/group/acts_as_xapian.
Xapian Installation:
For Ubuntu Users:
sudo apt-get install libxapian15 libxapian-ruby1.8
For Mac / Other (taken from the xapian.org install docs; version = 1.0.6 at the time of this writing):
wget http://oligarchy.co.uk/xapian/1.0.6/xapian-core-1.0.6.tar.gz
wget http://oligarchy.co.uk/xapian/1.0.6/xapian-bindings-1.0.6.tar.gz
Alternatively, if you haven’t installed wget on the Mac:
curl -O http://oligarchy.co.uk/xapian/1.0.6/xapian-core-1.0.6.tar.gz
curl -O http://oligarchy.co.uk/xapian/1.0.6/xapian-bindings-1.0.6.tar.gz
tar zxvf xapian-core-versionnumber.tar.gz
tar zxvf xapian-bindings-versionnumber.tar.gz
cd xapian-core-versionnumber
./configure --prefix=/opt
make
sudo make install
If you don’t have root access to install Xapian, you can specify a prefix in your home directory, for example: ./configure --prefix=/home/jenny/xapian-install
cd xapian-bindings-<version>
./configure XAPIAN_CONFIG=/opt/bin/xapian-config
make
sudo make install
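Once both packages are installed, a quick sanity check is to see whether Ruby can load the bindings. The sketch below assumes the bindings are exposed under the library name xapian, which is how the Xapian Ruby bindings are normally required. The helper name is made up for illustration.

```ruby
# Returns true if the Xapian Ruby bindings can be loaded, false otherwise.
# The method name is hypothetical; only `require "xapian"` touches Xapian.
def xapian_bindings_available?
  require "xapian"
  true
rescue LoadError
  false
end

if xapian_bindings_available?
  puts "Xapian bindings loaded"
else
  puts "Xapian bindings not found; check the install steps"
end
```

If the second message appears, the bindings step is the most likely place to look first.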
Acts_As_Xapian Installation:
Install the plugin:
script/plugin install git://github.com/frabcus/acts_as_xapian.git
Generate and execute the acts_as_xapian migration:
script/generate acts_as_xapian
rake db:migrate
Basic Model Code
To add fields to your index, add an “acts_as_xapian” call to the model you want to index.
class Lesson < ActiveRecord::Base
  acts_as_xapian :texts => [:name, :description]
end
In this case, only the data found in the “name” and “description” attributes of our Lesson model will be searched.
A Quick Test Of Your Index
At this point, you can quickly test that your Xapian install, plugin, and model code are all working together via the following rake commands:
To build the index:
rake xapian:rebuild_index models="Lesson" RAILS_ENV=development
(You can add more models with a space delimiter, for example models="Lesson User Tag".)
To update the index:
rake xapian:update_index RAILS_ENV=development
To test the index:
rake xapian:query models="Lesson" query="golf" RAILS_ENV=development
The Basic Search
Adding this line of code:
@search = ActsAsXapian::Search.new([Lesson], @lesson_search_params)
will return an ActsAsXapian::Search object with the following (from the acts_as_xapian README):
- description – a techy one, to check how the query has been parsed
- matches_estimated – a guesstimate at the total number of hits
- spelling_correction – the corrected query string if there is a correction, otherwise nil
- words_to_highlight – list of words for you to highlight, perhaps with TextHelper::highlight
- results – an array of hashes each containing:
  - :model – your Rails model, this is what you most want!
  - :weight – relevancy measure
  - :percent – the weight as a %, 0 meaning the item did not match the query at all
  - :collapse_count – number of results with the same prefix, if you specified collapse_by_prefix
To pull just the model objects out of the results:
@lessons = @search.results.collect {|r| r[:model]}
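To make the shape of the results array concrete, here is a minimal sketch in plain Ruby. It does not call Xapian at all: FakeLesson, SAMPLE_RESULTS, and both helper methods are made-up stand-ins that mimic the :model, :weight, and :percent keys listed above, and the 50 percent cutoff is an arbitrary assumption.

```ruby
# Stand-in for an ActiveRecord model; hypothetical, for illustration only.
FakeLesson = Struct.new(:id, :name)

# Hand-built data shaped like the result hashes described in the README.
# Real values would come from @search.results.
SAMPLE_RESULTS = [
  { :model => FakeLesson.new(1, "Golf basics"),    :weight => 2.4, :percent => 92 },
  { :model => FakeLesson.new(2, "Putting drills"), :weight => 1.1, :percent => 41 },
  { :model => FakeLesson.new(3, "Golf etiquette"), :weight => 1.9, :percent => 73 }
]

# Same extraction as in the post: keep just the model objects.
def models_from(results)
  results.collect { |r| r[:model] }
end

# Keep only results at or above a minimum :percent, then extract the models.
def strong_models_from(results, min_percent)
  models_from(results.select { |r| r[:percent] >= min_percent })
end

puts models_from(SAMPLE_RESULTS).map(&:name).inspect
puts strong_models_from(SAMPLE_RESULTS, 50).map(&:name).inspect
```

Filtering on :percent after the search is only one option; whether it suits your data depends on how the relevancy scores are distributed.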
To get the suggested spelling for misspelled words in a search, we would do:
@corrections = @search.spelling_correction
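Since the README describes spelling_correction as either a corrected query string or nil, a view helper can turn it into a "Did you mean" hint. The following is a sketch in plain Ruby; the method name and message format are assumptions, not part of the plugin.

```ruby
# Builds a "Did you mean" hint from a spelling correction.
# `correction` is expected to be a String or nil, matching the README's
# description of spelling_correction. The method name is hypothetical.
def did_you_mean(query, correction)
  return nil if correction.nil? || correction == query
  "Did you mean: #{correction}?"
end

puts did_you_mean("gofl lessons", "golf lessons")
puts did_you_mean("golf lessons", nil).inspect
```

Returning nil when there is nothing to suggest lets the view skip the hint with a simple conditional.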
Similar Results
If you have a requirement to show similar or “You may also like” results based on a search result set or a single result, this can be done with a single line of code:
@similar_lessons = ActsAsXapian::Similar.new([Lesson], @lessons).results.collect {|r| r[:model]}
where @lessons in this case would be the list of lessons returned in the original search.
Updating The Index
Unfortunately, one drawback of Acts_As_Xapian is that the index is not updated automatically when data in an indexed model is added, modified, or deleted. When model data changes, the plugin puts a record in the ‘acts_as_xapian_jobs’ table in your database to tell the update task what needs reindexing. A cron job or something similar is needed to call the following rake task periodically to keep the index up to date:
rake xapian:update_index
Francis, the acts_as_xapian creator, commented on this on my original “Mulling Over Our Ruby On Rails Full Text Search Options” blog post:
with Xapian you can update and search simultaneously, and updates are immediate. However, only one thread can update a Xapian database at the same time. Since I wanted offline indexing anyway (as my index operation is risky, complex and slow, involving parsing Word documents, PDFs etc.), I didn’t try to find a solution that causes a second thread in the web application to, say, wait for the database lock. So acts_as_xapian currently only supports offline indexing.
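The queue-then-batch pattern described in this section can be sketched in plain Ruby. This is a toy model, not the plugin's code: ToyIndexer and its methods are hypothetical names, an array plays the role of the acts_as_xapian_jobs table, and a hash plays the role of the index.

```ruby
# Toy model of offline indexing: changes are queued, then applied in a batch.
# All names here (ToyIndexer, enqueue, update_index) are hypothetical.
class ToyIndexer
  attr_reader :jobs, :index

  def initialize
    @jobs  = []   # plays the role of the acts_as_xapian_jobs table
    @index = {}   # plays the role of the search index: id => text
  end

  # Called when a record changes; only records the work to do later.
  def enqueue(action, id, text = nil)
    @jobs << { :action => action, :id => id, :text => text }
  end

  # Called periodically (for example from cron); drains the queue in order
  # and returns the number of jobs processed.
  def update_index
    processed = 0
    until @jobs.empty?
      job = @jobs.shift
      if job[:action] == :destroy
        @index.delete(job[:id])
      else
        @index[job[:id]] = job[:text]
      end
      processed += 1
    end
    processed
  end
end

indexer = ToyIndexer.new
indexer.enqueue(:update, 1, "Golf basics")
indexer.enqueue(:destroy, 1)
puts indexer.index.inspect   # nothing is indexed until the batch runs
puts indexer.update_index    # number of jobs applied by this run
```

The point of the sketch is the gap between enqueue and update_index: searches made in that window see the old index, which is why the interval of the periodic task matters.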
Conclusion
This is a very basic tutorial meant to highlight the steps needed to get up and running with simple Xapian searches quickly. There are quite a few topics that I did not touch on such as:
- Sorting and grouping with the :values option in the model
- Advanced searching with the :terms option in the model
- Adding extended attributes from other models into a searched index. (Such as adding lesson comments to the Lesson index)
- Filtering searched data (in our case, we had to filter our results so only “active” lessons were returned)
- Highlighting keywords in the search results
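As a small taste of the last item, here is a sketch of keyword highlighting in plain Ruby. It assumes you already have a list of words, such as the one words_to_highlight is described as returning. The helper name and the choice of strong tags are assumptions, and it is not the Rails TextHelper implementation.

```ruby
# Wraps each occurrence of the given words in <strong> tags, ignoring case.
# Matches substrings too (e.g. "golf" inside "golfers") and does no HTML
# escaping, so treat it as an illustration only. The name is hypothetical.
def highlight_words(text, words)
  return text if words.empty?
  pattern = Regexp.new(words.map { |w| Regexp.escape(w) }.join("|"),
                       Regexp::IGNORECASE)
  text.gsub(pattern) { |match| "<strong>#{match}</strong>" }
end

puts highlight_words("Golf for golfers", ["golf"])
```

In a real view you would want to escape the text before adding markup, or use the Rails helper mentioned earlier.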
Comments
about 13 hours later:
you should try sphinx
about 14 hours later:
@holyts – We tried sphinx. That is what drove us to Xapian. ;-)
Though in Sphinx’s defense, we did not try ThinkingSphinx, which appears to be more in favor than UltraSphinx nowadays.
1 day later:
Can you elaborate? I am prototyping with Sphinx and ThinkingSphinx with an eye to migrating away from Ferret. Xapian sounds interesting, but I haven’t yet run across much information on why it would be better and Sphinx seems to have more development and adoption momentum.
1 day later:
@Kevin – As I mentioned in my previous post we decided against Ultrasphinx due to its configuration system and relative complexity to get it up and running.
Xapian didn’t have any config files to mess with and came with a lot out of the box like spell checking and similar result queries that we were wanting to implement.
With that being said, it is still relatively unknown in Ruby / Rails circles, especially when compared to Sphinx, Ferret, and Solr. Francis did recently start an acts_as_xapian Google group to hopefully help out with questions.
1 day later:
As I see it, Xapian doesn’t need a server running. Is that right?
It means no monitoring and problems with failing requests :)
1 day later:
Xapian looks really cool. I would like to see a variation of acts_as_xapian which has a closer interface to Thinking Sphinx in the way it defines indexes and performs searches.
Full text searching bliss!
1 day later:
Hi,
In one big project we had to evaluate the different full text search engines.
We use a lot of STI models in our projects and every plugin we tried failed to handle this case correctly.
Is anyone already using a plugin that works correctly with STI models?
Thanks, slainer68.
2 days later:
@Ryan – I believe that Francis based acts_as_xapian very closely on acts_as_solr. A “thinking_xapian” type plugin is probably a great idea given how popular thinking_sphinx has become.
@slainer68 – How are these plugins failing? We have an “Author” model that is an STI subclass of our “User” model and acts_as_xapian worked out of the box.
We simply added the following to our Author model and search worked as it did on our other indexed models.
2 days later:
@slainer68, have you tried acts_as_xapian? It looks like it identifies a model by its class name internally, not the name of the table. So I think it would work well with STI.
2 days later:
Hi, thanks for your answers. I have not tested acts_as_xapian yet.
I have tried UltraSphinx and ThinkingSphinx.
In the last days there were some commits in TS to enhance support for STI models but the support is not complete yet.
Will try to remember to post my conclusion :).
5 days later:
I am using acts_as_ferret, but only in development. After reading all the blogs, I am interested in trying Xapian before we move to full production. With Xapian, is it possible to define other qualifiers in the SQL? E.g. in the example above we have Lesson:
class Lesson < ActiveRecord::Base
  acts_as_xapian :texts => [:name, :description]
end
Let’s say Lesson also includes a teacher id and I want to filter the name / description search on a particular teacher. Is this possible?
Thanks
K.
9 days later:
Thanks for this thorough write-up! Will have to give Xapian a try.
9 days later:
Yeah acts_as_xapian looks cool. Thanks for the tutorial!
21 days later:
Nice post! Very helpful.