Mulling Over Our Ruby On Rails Full Text Search Options
by Jim Mulholland, Tue, 15 Jul 2008 22:25:00 GMT (14 comments so far)
There are quite a few choices when it comes to adding full-text search to a Ruby on Rails application. We thought that we had considered all of our options when we settled on Sphinx / Ultrasphinx. We learned otherwise, though, after stumbling across Xapian / Acts_As_Xapian while trying to find a fix for our Sphinx implementation after a production build.
Here are the details of our thought process and how we ultimately ended up deploying with Xapian.
When Mindbites was released a year ago, we wanted a full-text search engine that was easy and quick to set up. We ended up going with Douglas Shearer’s Acts_As_Indexed, which worked out great. It is written entirely in Ruby and is very easy to implement, with automatic indexing (i.e., no cron jobs are needed to keep the index up to date). If you have a simple site and want to implement a basic search very quickly, definitely give Acts_As_Indexed a look.
However, over the past month or so we decided it was time to find something a little more robust. We wanted the features that the full-blown full-text search engines provide, such as spelling correction, stemming (i.e., “connection / connecting / connected” would all search for words containing “connect”), finding “similar” results, and the ability to work across multiple servers. We were also seeing occasional index corruption with Acts_As_Indexed, which we hoped to leave behind by switching tools.
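To make the stemming idea concrete, here is a minimal sketch in plain Ruby. It is a toy suffix stripper written only for illustration; the suffix list and the names naive_stem and matches? are our own assumptions, and real engines use far more careful stemming algorithms than this.

```ruby
# Toy suffix-stripping stemmer, for illustration only.
# This is NOT the algorithm used by any of the search engines discussed here.
SUFFIXES = %w[ions ion ing ed s].freeze

# Strip the first matching suffix, as long as at least 3 letters remain.
def naive_stem(word)
  w = word.downcase
  suffix = SUFFIXES.find { |s| w.end_with?(s) && w.length - s.length >= 3 }
  suffix ? w[0...(w.length - suffix.length)] : w
end

# True if any word in the text shares a stem with the query word.
def matches?(query, text)
  stems = text.scan(/[[:alpha:]]+/).map { |t| naive_stem(t) }
  stems.include?(naive_stem(query))
end

puts naive_stem("connection")
puts naive_stem("connecting")
puts naive_stem("connected")
puts matches?("connecting", "A lost connection")
```

With this sketch, a search for “connecting” finds a document that only contains “connection”, because both reduce to the same stem.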
We went through the usual suspects that are often seen in the Rails Community:
Solr / Acts_As_Solr
Very robust search server based on the Lucene Java library with a mature acts_as_solr Rails plugin. I was blown away by Erik Hatcher’s “Solr On Rails” talk at RailsConf 2007 and thought this could be a good fit. However, we decided to check out other options because, all things being equal, we would rather not deal with installing Java on our servers.
Ferret / Acts_As_Ferret
We nixed Ferret fairly quickly after hearing a few horror stories about corrupted indexes among other issues with Ferret on production servers.
Sphinx / Ultrasphinx
Sphinx appears to be the new de facto standard for full text search among Rails developers. From what we read, it is very powerful, indexes very quickly, and is easy to use with Ultrasphinx. This was our choice to replace Acts_As_Indexed. However, after about a week’s worth of development and a deployment to our production server, there still were a lot of mysteries to the Sphinx.
We were not thrilled with all of the config files involved. With Ultrasphinx, you need an xxx.base file, which a rake task reads to generate a config file that Sphinx can use. This works, but I was hoping for something a little bit simpler.
We were also not big fans of the daemons that run in the background. In our ITG environment, the daemons decided to stop running on a couple of occasions which caused a 500 error to be thrown when searching. (We later fixed this issue by reindexing and restarting the daemon with each Capistrano deployment.) Lastly, I read this quote from the Ultrasphinx deployment notes about recommended cron jobs for Sphinx on your production server:
“The first line reindexes the delta index every 10 minutes. The second line reindexes the main index once a day at 4am. The third line will try to restart the search daemon every three minutes. If it is already running, nothing happens.”
Kicking off a job every 3 minutes just to make sure another job is running did not seem right to me.
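For readers unfamiliar with the pattern, the “try to restart, and do nothing if it is already running” idea can be sketched in plain Ruby as below. This is only an illustration under our own assumptions: the PID file approach and the names daemon_running? and ensure_running are hypothetical and are not taken from Ultrasphinx.

```ruby
# Returns true if the process whose PID is stored in pid_file is alive.
# (Hypothetical helper; assumes the daemon writes its PID to a file.)
def daemon_running?(pid_file)
  pid = Integer(File.read(pid_file).strip)
  return false unless pid.positive?
  Process.kill(0, pid) # signal 0 only checks that the process exists
  true
rescue Errno::ENOENT, Errno::ESRCH, ArgumentError
  false # no PID file, no such process, or an unreadable PID
rescue Errno::EPERM
  true  # the process exists but belongs to another user
end

# Idempotent "start if not running": safe to call every few minutes.
# The block holds whatever command actually starts the daemon.
def ensure_running(pid_file)
  return :already_running if daemon_running?(pid_file)
  yield
  :started
end
```

A scheduled job would call ensure_running with the daemon’s PID file and a block that starts it; when the daemon is alive, the call returns without doing anything.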
Xapian / Acts_As_Xapian
I had never heard of Xapian before this past Thursday. I happened to stumble across this article on Evan Weaver’s blog while I was researching a Sphinx issue with our production build. Always looking for new and better ways to do things, I took a closer look at Xapian. Within 15 minutes, I had Xapian installed on my local Ubuntu machine and was successfully searching Mindbites lessons using the acts_as_xapian plugin. I spent the rest of the weekend replacing our Sphinx code with Xapian.
Installation, configuration, and deployment all went so well over the weekend that we deployed our completely revamped search code to our Mindbites production server this past Monday with spelling corrections, stemming, and “You may also like” functionality intact.
As a side note, that same Evan Weaver blog post has a comment about a plugin called acts_as_searchable, which uses the Hyper Estraier full-text search system. I have not looked into this solution, but I would be curious to hear from other readers who have tried it.
In part 2 of this blog post, I will go into detail about our Rails Xapian implementation.
Comments
about 9 hours later:
As a response to the whole Ultrasphinx cron issue, I’d recommend monit or an equivalent to make sure your daemons don’t die. It’s also handy for keeping your mongrels/thin instances in line. It works quite decently for me.
I found Ultrasphinx’s config prohibitive, so I use the thinking-sphinx plugin. It’s a fantastic piece of work, and coming from ferret, “just works” beautifully.
Xapian seems to have the same issue that Sphinx does, though – offline indexing and only one client at a time writing to the index – except that Sphinx does bridge it with delta indexes for realtime index updates, and thinking-sphinx’s daemon effectively proxies between Sphinx itself and multiple app instances – a step that a Xapian implementation would seem to need as well.
The “you may also like” sounds really cool, though.
about 10 hours later:
re solr: just saw this: http://groups.google.com/group/acts_as_solr/browse_thread/thread/7568ba90be8ce0d5#
about 11 hours later:
I use acts_as_tsearch with Postgresql’s tsearch2. It’s not very portable, and I don’t know if I’d use it in a big production system, but for my needs, it works pretty well.
about 11 hours later:
Nice article. I’m using ferret fairly heavily at the moment, and while it hasn’t caused me any problems so far, I am quite keen to replace it with something else.
I’ve had my eye on Xapian for a while, so it’s good to hear that others have had success with it.
Chris: Unless I misunderstand your comments I don’t think either of the points about xapian are true. From the web site:
And from the API docs:
2 days later:
Busy week for search blog posts.
Rein Henrichs just posted about moving from UltraSphinx to ThinkingSphinx, and Mike Hartl just moved from Ferret to UltraSphinx.
5 days later:
Thanks for the nice writeup Jim!
Chris – if you have multiple app instances, say on different front end servers, then you can get Xapian’s remote backend working http://xapian.org/docs/remote.html i.e. it has a daemon if you really need it.
Richard/Chris – with Xapian you can update and search simultaneously, and updates are immediate. However, only one thread can update a Xapian database at a time. Since I wanted offline indexing anyway (as my index operation is risky, complex and slow, involving parsing Word documents, PDFs etc.), I didn’t try to find a solution that causes a second thread in the web application to, say, wait for the database lock. So acts_as_xapian currently only supports offline indexing.
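As a rough illustration of the “wait for the database lock” idea mentioned above, here is a minimal plain-Ruby sketch that serialises writers with an exclusive file lock. It is an assumption-laden example, not how acts_as_xapian or Xapian works: the helper name with_index_write_lock and the separate lock file are hypothetical.

```ruby
# Hypothetical helper: let only one writer at a time run the given block.
# Other callers block inside flock until the current writer finishes.
# This is an illustration, not part of acts_as_xapian or Xapian.
def with_index_write_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |f|
    f.flock(File::LOCK_EX) # waits here while another writer holds the lock
    yield                  # perform the index update
  end                      # closing the file releases the lock
end
```

A web request that needs to update the index would wrap the update in this helper, at the cost of making that request wait whenever another update is in progress.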
8 days later:
I’m the original author of the acts_as_zoom plugin which allows use of the ZOOM API for a Z39.50 server like Zebra (http://www.indexdata.com/zebra/).
The Z39.50 standard protocol is popular in the library and museums worlds for providing behind the scenes machine to machine access to search indexes.
We use acts_as_zoom as a part of the Kete open source Rails app. You can find out about Kete and acts_as_zoom here:
http://kete.net.nz/ # software community site
http://github.com/kete/ # source for both Kete and acts_as_zoom
Note that there is some newer refactoring of acts_as_zoom in the version included in Kete’s source. We’ll eventually update the plugin (or someone can fork and we’ll pull it) to include those changes.
Cheers,
Walter McGinnis
Kete Project Lead
P.S. – oh yeah, if you are interested in Ruby ZOOM API support, you’ll probably be interested in the ruby-zoom project at http://ruby-zoom.rubyforge.org/
10 days later:
I run ferret on several production servers. It is a bummer you didn’t give it a try.
The one trick is you have to run it as a separate server—if you have more than one instance of your Rails app.
Other than that, I find it easy to use, and it integrates very well into Rails. But I haven’t tried anything else in a while; Ferret just works.
11 days later:
Re: acts_as_solr being discontinued. Quite a few people are working on improving the plugin.
Notably JobsGoPublic http://github.com/jgp/acts_as_solr
and a guy called Look http://github.com/look/acts_as_solr who is trying to combine the best commits from all of the github forks.
We have a Solr server running off our DB machine, and will hopefully move all of our fulltext searching away from Ferret and onto Solr in the next couple of weeks.
Will get back to you if it all goes wrong. But I hope it won’t.
12 days later:
I have to second Mr. Khan’s comment, I also use Ferret in several production servers and have only good things to say about it. Unfortunately plenty of ppl looking for a quick & easy solution tried deploying Ferret without using DRb (which is just silly and explained in the docs) and got corrupted indexes.
Ferret is very fast and although it’s a bit complex it is also very flexible. Or rather, it’s simple if all you care about is indexing documents, but if you want to have very independent search fields with different indexing strategies and ranking weights, it is flexible enough to do it but will require some work and lots of reading of the API.
12 days later:
I have to chime in. Ferret works great, and I find it FAR more flexible and easier to configure than Sphinx. If you are running a mongrel cluster or something, you have to use a DRB server as mentioned in the docs (which is also trivial to set up).
It seems that some time in the past the Rails Envy guys tried to move ferret to a clustered server without using DRB and gave ferret a bad name by having high profile blog posts about its “instability in production” due to their improper implementation.
I like the RailsEnvy guys, but I find that really unfortunate because it’s a great search solution.
12 days later:
I’ve used Ferret in a production environment (yes, with DRB) and it was a nightmare. It wasn’t a bad setup, it wasn’t bad usage, it was just Ferret and DRB, both of them. The problems didn’t start right away, they emerged at the worst possible time: when the app was running in production for a few weeks, holding quite a bit of data. We rebuilt the index, a few weeks later, same problem (newer version of acts_as_ferret and ferret were out and installed by then) and that just kept on going. We were getting frustrated at that time and when two of us had a corrupted index on their development machines (one person interfacing with the index), Ferret was out.
And we are not the only ones having those problems, btw. A colleague (who actually recommended Ferret to us, like some of the previous posters did) started having problems on a production app that had been running fine for over a year. He didn’t want to believe Ferret was the culprit, but after a long and time-consuming search he knew he had to switch to something else. He chose acts_as_solr.
Solr was also the first thing that came to our minds, but we decided to spend a few days trying out the different fulltext indexers before making a final choice. In short, acts_as_searchable (Hyperestraier) should be considered just as much of an option as Solr and Sphinx. It’s great and has been so good to us and our customers.
12 days later:
@peter – Thanks for the detailed analysis on your Hyperestraier / acts_as_searchable experience. As I mentioned at the end of this blog post, I had seen it mentioned before but had not heard of anybody using it on a Rails project.
It is good to know that it is working out very well for you and gives us Ruby / Rails developers another full-text search option.
27 days later:
“Kicking off a job every 3 minutes just to make sure another job is running did not seem right to me.”
I don’t know what seems not-right about it.
If you’ve ever watched a movie on Unix and had XScreensaver not kick in, it’s using the same method.
Apache solves this by continuously creating new processes, and only handling a few requests per process before preemptively killing it. I suspect that’s because it has much less (no?) state to share.
It seems like a good policy to anticipate things which can go wrong before they do. The alternative is “force users to hook it up to their monitoring system before they even know what’s up or down”. You’ll have to do that anyway, but at least the auto-restart means you’re not totally dead before then.