{"id":396,"date":"2010-08-03T16:03:26","date_gmt":"2010-08-03T21:03:26","guid":{"rendered":"http:\/\/dilettantes.code4lib.org\/blog\/?p=396"},"modified":"2010-08-03T16:03:26","modified_gmt":"2010-08-03T21:03:26","slug":"faceted-search-on-a-shoestring","status":"publish","type":"post","link":"https:\/\/rossfsinger.me\/blog\/2010\/08\/faceted-search-on-a-shoestring\/","title":{"rendered":"Faceted Search on a Shoestring"},"content":{"rendered":"<p>There are any number of reasons for <a href=\"http:\/\/lucene.apache.org\/solr\/\" target=\"_blank\">Solr<\/a>&#8217;s status as the standard bearer of faceted full-text searching: it&#8217;s free, fast, works shockingly well out of the box without any tweaking, has a simple and intuitive HTTP API (making it available in the programming language of your choice) and is, by far, the easiest &#8220;enterprise-level&#8221; application to get up and running. None of its &#8220;competitors&#8221; (<a href=\"http:\/\/www.sphinxsearch.com\/\" target=\"_blank\">Sphinx<\/a>, <a href=\"http:\/\/xapian.org\/\" target=\"_blank\">Xapian<\/a>, <a href=\"http:\/\/www.endeca.com\/\" target=\"_blank\">Endeca<\/a>, etc.), despite any individual advantages they might have, can claim all of these features, which goes a long way towards explaining Solr&#8217;s popularity.<\/p>\n<p>The library world has definitely taken a shine to Solr: from discovery interfaces like <a href=\"http:\/\/vufind.org\/\" target=\"_blank\">VuFind<\/a> and <a href=\"http:\/\/www.exlibrisgroup.com\/category\/PrimoOverview\" target=\"_blank\">Primo<\/a>, to repositories like <a href=\"http:\/\/www.fedora-commons.org\/\" target=\"_blank\">Fedora<\/a>, to full-text aggregators like <a href=\"http:\/\/www.serialssolutions.com\/summon\/\" target=\"_blank\">Summon<\/a>, you can find Solr under the hood of most of the hot products and services available right now. 
The fact that a library can install VuFind and have a slick, jaw-droppingly powerful OPAC-replacement that puts their legacy interface to shame <em>in about an hour<\/em> is almost completely a by-product of how simple Solr is to get up and running. It&#8217;s no wonder so many libraries are adopting it (compare it to <a href=\"http:\/\/thesocialopac.net\/\" target=\"_blank\">SOPAC<\/a>, which is also built in PHP and about as old, but uses Sphinx for the full-text indexing and is hardly ever seen in the wild).<\/p>\n<p>Without a doubt, Solr is pretty much a no-brainer if you are able to run Jetty (or Tomcat or JBoss or Glassfish or whatever): with enough hardware, Solr can scale up to pretty much whatever your need might be. The problem (at least the problem in my mind) is that Solr doesn&#8217;t scale <em>down<\/em> terribly well. If you host your content from a cheap, shared web hosting provider or a VPS, for example, Solr is not available or not practical (it doesn&#8217;t live well in small-memory environments). 
The <a href=\"http:\/\/websolr.com\/\">hosted Solr<\/a> <a href=\"http:\/\/acquia.com\/products-services\/acquia-search\" target=\"_blank\">options<\/a> are fairly expensive and while there are <a href=\"http:\/\/www.google.com\/search?ie=UTF-8&amp;q=shared+tomcat+web+hosting\" target=\"_blank\">cheap, shared web hosting providers that do provide Java Application Servers<\/a>, switching vendors to provide faceted search for your mid-size <a href=\"http:\/\/drupal.org\/\" target=\"_blank\">Drupal<\/a> or <a href=\"http:\/\/omeka.org\/\" target=\"_blank\">Omeka<\/a> site might not be entirely practical or desirable.<\/p>\n<p>I find myself proof-of-concept-ing a lot of hacks to projects like VuFind, <a href=\"http:\/\/projectblacklight.org\/\" target=\"_blank\">Blacklight<\/a>, <a href=\"http:\/\/code.google.com\/p\/kochief\/\" target=\"_blank\">Kochief<\/a> and whatnot and running these things off of my shared web server. It&#8217;s older, underpowered and only has 1GB of RAM. Since I&#8217;m not running any of these projects in production (just really making things available for others to see), it was really annoying to have Solr gobbling up 20% of the available RAM for these little pet projects. What I wanted was something that acted more or less like Solr when you pointed an application that expected Solr at it, but I wanted it to have a small footprint that could run (almost) anywhere and more or less disappear when it was idle.<\/p>\n<p>So it was for this scenario that I wrote <a href=\"http:\/\/github.com\/rsinger\/CheapSkate\" target=\"_blank\">CheapSkate<\/a>: a Solr emulator written in Ruby. It uses <a href=\"http:\/\/ferret.davebalmain.com\/\" target=\"_blank\">Ferret<\/a>, the Ruby port of Lucene, as the full-text indexing engine and <a href=\"http:\/\/sinatrarb.com\/\" target=\"_blank\">Sinatra<\/a> to supply the HTTP API. 
Ferret is fast, scales quite well and responds to the same search syntax as Solr, so I knew it could handle the search aspect pretty easily. Faceting (as can be expected) proved the harder part. Originally, I was storing the values of fields in an RDBMS and using that to provide the facets. Read performance was ok, although anything over 5,000 results would start to bog down &#8211; the real problem was the write performance, which was simply woeful. Part of the issue was that this design was completely schemaless: you could send anything to CheapSkate and facet on any field, regardless of size. It also tried to maintain the type of the incoming field value: dates were stored as dates, numbers stored as integers and so on. Basically the lack of constraints made it wildly inefficient.<\/p>\n<p>Eventually, I dropped the RDBMS component, and started playing around with <a href=\"http:\/\/ferret.davebalmain.com\/api\/classes\/Ferret\/Index\/IndexReader.html#M000149\" target=\"_blank\">Ferret&#8217;s terms capabilities<\/a>. If you set a particular field to be untokenized, your field values appear exactly as you put them in. This is perfect for faceting (since you don&#8217;t want stemming and whatnot on your query filters and your strings aren&#8217;t normalized or downcased or anything so they look right in the UI) and is basically the same thing Solr itself does. Instead of a schema.xml, CheapSkate has a schema.yml, but it works essentially the same way: you define your fields, what should be tokenized (that is, which fields allow full-text search) or not (i.e. 
facet fields) and what datatype the field should be.<\/p>\n<p>CheapSkate doesn&#8217;t support all of the field types that Solr does, but it supports strings, numbers, dates and booleans.<\/p>\n<p>One neat thing about Ferret is that you can pass a Ruby Proc to the <a href=\"http:\/\/ferret.davebalmain.com\/api\/classes\/Ferret\/Search\/Searcher.html#M000238\" target=\"_blank\">search method<\/a> as a search option. This proc then has access to the search results as Ferret is finding them. CheapSkate uses this to find the terms in the untokenized fields for each search hit, throws them in a Hash and generates a hit count for each term. This is a <em>lot<\/em> faster than getting all the document ids from the search, looping over them and generating your term hash <em>after<\/em> the search is completed. That said, this is still definitely the bottleneck for CheapSkate. If the search result has more than 10-15,000 hits, performance begins to get pretty heavily impacted by grabbing the facets. I&#8217;m not <em>terribly<\/em> concerned by this: data sets with search results in the 20,000+ range start to creep into the &#8220;you would be better off just using Solr&#8221; domain. For my proofs-of-concept, this has only really raised its head in VuFind when filtering on something like &#8220;Book&#8221; (with no search terms) for a 50,000 record collection. What I mean to say is, this happens for fairly non-useful searches.<\/p>\n<p>Overall, I&#8217;ve been pretty happy with how CheapSkate is working. For regular searching it does pretty well (although, like I said, I&#8217;m not trying to run a production discovery system that pleases both librarians and users). There&#8217;s a very poorly designed &#8220;more like this&#8221; handler that really needs an overhaul and there is no &#8220;did you mean&#8221; (spellcheck). 
This hasn&#8217;t been a huge priority, because I don&#8217;t really like the spellcheck in Solr all that much, anyway. That said, if somebody really wanted this and had an idea of how it would be implemented in Ferret, I&#8217;d be happy to add it.<\/p>\n<p>Ideally, I&#8217;d like to see something like CheapSkate in PHP using <a href=\"http:\/\/framework.zend.com\/manual\/en\/zend.search.lucene.html\" target=\"_blank\">Zend_Search_Lucene<\/a>, since that would be accessible to virtually everybody, but that&#8217;s a project for somebody else.<\/p>\n<p>In the meantime, if you want to see some examples of CheapSkate in action:<\/p>\n<ul>\n<li>Here&#8217;s that <a href=\"http:\/\/dilettantes.code4lib.org\/vufind\/\" target=\"_blank\">VuFind instance with 50,000 MARC<\/a> records (from the California College of the Arts)<\/li>\n<li><a href=\"http:\/\/dilettantes.code4lib.org\/kochief\/\" target=\"_blank\">Kochief with around 10,000 MARC records<\/a> (from the Library of Congress, via Blacklight)<\/li>\n<li><a href=\"http:\/\/jangle.org\/\" target=\"_blank\">Drupal with just over 50 nodes<\/a><\/li>\n<li><a href=\"http:\/\/dilettantes.code4lib.org\/blog\/\" target=\"_blank\">WordPress with just under 150 posts &amp; pages<\/a> (this blog).<\/li>\n<\/ul>\n<p><em>One important caveat to projects like VuFind and Blacklight: CheapSkate doesn&#8217;t work with <\/em><a href=\"http:\/\/code.google.com\/p\/solrmarc\/\" target=\"_blank\"><em>Solrmarc<\/em><\/a><em>, which requires Solr to return responses in the javabin format (it may be possible to hack out something that looks enough like javabin to fool Solrmarc, I just haven&#8217;t figured it out). My workaround has been to populate a local Solr index with Solrmarc and then just dump all of the documents out of Solr into CheapSkate. 
<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>There are any number of reasons for Solr&#8217;s status as the standard bearer of faceted full-text searching: it&#8217;s free, fast, works shockingly well out of the box without any tweaking, has a simple and intuitive HTTP API (making it available in the programming language of your choice) and is, by far, the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,39,73,66],"tags":[],"class_list":["post-396","post","type-post","status-publish","format-standard","hentry","category-coding","category-ruby","category-search","category-solr"],"_links":{"self":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts\/396","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/comments?post=396"}],"version-history":[{"count":2,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts\/396\/revisions"}],"predecessor-version":[{"id":398,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts\/396\/revisions\/398"}],"wp:attachment":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/media?parent=396"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/categories?post=396"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/tags?post=396"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}