In search of… Bigfoot

Before I left for Guatemala, Ian Davis at Talis asked if I could give him a dump of our MARC records to load into Talis Platform. I had been talking in the #code4lib channel about how I was pushing the idea of using Talis Source to make simple, ad-hoc union catalogs; we could make one for Georgia Tech & Emory (we have joint degree programs) or Arche or Georgia Tech/Atlanta-Fulton Public Library, etc. My thinking was that by utilizing the Talis Platform, we could forgo much of the headache of actually making a union catalog for somewhat marginal use cases (the public library one notwithstanding).

About a week after I got back from Guatemala, I had an email from Richard Wallis with some URLs to play around with to access my Bigfoot store. He showed me search services, facet services and augment services. I wasn't able to really dive into it much at the time, but since I'm working on a total site search project for the library, I thought this would be a good chance to kick the tires a bit and include catalog results.

After two days of poking around, I have formed some opinions about it, have some recommendations for it, and have written a Ruby library to access it.

1) The Item Service

This is certainly the most straightforward and, for many people, the most useful service of the bunch. The easiest way to think of the item service is as an HTTP-based Lucene service (a la Solr or Lucene-WS) over your bib records. It returns something OpenSearch-y (it claims to be an RSS 1.0 document), but it doesn't validate. That being said, FeedTools happily consumed it (more on that later), and the semantics should be familiar to anyone who has looked at OpenSearch before. Each item node also contains a Dublin Core representation of the record and a link to a MARCXML representation. I'm not sure if there's a description document for Bigfoot.
Although the query syntax is pure Lucene (title:"The Lexus and the Olive Tree"), the downside is that the available indexes aren't documented anywhere, and I doubt there would be any way to add new ones (for example, my guess is I wouldn't be able to get an index for the 490/440$v that I use for the Umlaut). I don't see returning the results as OAI_DC being too much of a problem, since the RSS item includes a title (which would have been tricky between the DC and the MARCXML). My Ruby library might not generate valid DC; I haven't really looked into it.
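
Just to give a flavor of it, here is roughly what an item search looks like from Ruby using nothing but the standard library. The store URL and the "query" parameter name are placeholders I made up for illustration; you'd substitute whatever URLs Talis gives you for your own store.

    # Sketch of an item search over plain HTTP.  The store URL and the
    # "query" parameter are stand-ins; use the URLs you were given.
    require 'net/http'
    require 'uri'
    require 'cgi'
    require 'rexml/document'

    store_url = 'http://bigfoot.example.com/stores/YOUR-STORE/items'   # placeholder
    query     = 'title:"The Lexus and the Olive Tree"'

    response = Net::HTTP.get_response(URI.parse("#{store_url}?query=#{CGI.escape(query)}"))

    # The body claims to be RSS 1.0, so just walk the item nodes.
    doc = REXML::Document.new(response.body)
    doc.elements.each('//item') do |item|
      title = item.elements['title'] ? item.elements['title'].text : nil
      link  = item.elements['link']  ? item.elements['link'].text  : nil
      puts "#{title} -- #{link}"
    end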

The docs also mention you can POST items to your Bigfoot store, but they don’t mention what your data needs to look like (MARC?) or what credentials you need to add something (I mean, it must be more than just your store name, right?). My hope is to add this functionality to bigfoot-ruby soon (especially since my data is from a bulk export from last October).

2) The Facet Service

This one is definitely intriguing, since faceted searching is all the rage right now. The search syntax is basically the same as the Item Service's, except you also send a comma-delimited list of the fields you would like to query. What you get back is either an XML or XHTML document of your results.

For each field you request, you get back a set of terms (you can specify how many you want; the default is 5) that appear most frequently in that field. You also get an approximation of how many results you would get in that facet and a URL to search on that facet. It's quite fast, although, realistically, you can't do much with the output of a facet search alone.
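
A facet request looks nearly identical from the client side. Again, the endpoint path and the parameter names ("query", "fields") below are my own placeholders rather than documented values, and you would want to eyeball the returned XML before writing a real parser for it.

    # Sketch of a facet request: same Lucene query, plus a comma-delimited
    # list of fields.  Endpoint and parameter names are placeholders.
    require 'net/http'
    require 'uri'
    require 'cgi'

    facet_url = 'http://bigfoot.example.com/stores/YOUR-STORE/services/facet'  # placeholder
    query     = 'subject:economics'
    fields    = 'subject,creator,date'

    uri = URI.parse("#{facet_url}?query=#{CGI.escape(query)}&fields=#{CGI.escape(fields)}")
    puts Net::HTTP.get_response(uri).body   # XML listing each field's top terms,
                                            # approximate counts and search URLs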

Again, it's difficult to know what you can facet on (subject, creator and date are all useful; I'm sure there are others), and the facet that (for me, at least) held the most promise, type, is too broad to do much with (it uses Leader position 7, but lumps the BKS and SER types together under a label called "text"). I would like to see Talis implement something like my MARC::TypedRecord concept so one could facet on things like government document or conference; you could separate newspapers from journals and globes from maps. Still, the text analysis of the non-fixed fields is powerful and useful, and it beats the hell out of trying to implement something like that locally.

In bigfoot-ruby, I have provided two ways to do a faceted search: you can just do the search and get back Facet objects containing the terms and search URLs, or you can facet with items, which executes the item searches automatically (and, in turn, gets a definitive number of results for the query as well). Since I didn't bother to implement threading, getting facets with items can be pretty slow.
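
Something like this is what I have in mind, although the class and method names here are shorthand for the sake of the example rather than the exact bigfoot-ruby interface:

    # Illustrative use of the two faceting modes described above.  The
    # require, class name and method names are approximate, not gospel.
    require 'bigfoot'

    search = Bigfoot::Search.new('YOUR-STORE')

    # Facets alone: fast; each Facet carries its terms and search URLs.
    facets = search.facet('subject:economics', :fields => %w(subject creator date))

    # Facets with items: also runs the follow-up item searches, so you get
    # definitive result counts, but without threading it is noticeably slower.
    facets_with_items = search.facet_with_items('subject:economics',
                                                :fields => %w(subject creator date))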

3) The Augment Service

To be honest, I'm having a hard time figuring out useful scenarios for the augment service. The idea is that you give it the URI of an RSS feed, and the service will enhance it with data from your Bigfoot store (at least, that is roughly how I understand it works). Richard's example for me was to feed it the output of an xISBN query (which isn't in RSS 1.0, AFAIK, but, for the sake of example…) and the augment service would fill in the data for the ISBNs your library holds. The API example page mentions Wikipedia, but I don't know where, other than the Talis Platform, you can get Wikipedia entries formatted properly. I tried sending it the results of an Umlaut2 OpenSearch query, but it didn't do anything with it. Presumably the bib data in the RSS 1.0 feed needs to be formatted a certain way (my guess is as OAI_DC, like the Item Service), but I'm not sure. The only use case I can think of for this service is as a much simpler way to check for ISBN concordance (rather than isbn:(123456789X|223456789X|323456789X|etc.)).
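
Mechanically the call itself is trivial; something along these lines, with the caveat that both the endpoint path and the name of the parameter carrying the feed URI are guesses on my part:

    # Sketch of an augment call: hand the service the URI of an RSS 1.0 feed
    # and get the feed back with holdings data filled in.  The endpoint and
    # the parameter name are assumptions.
    require 'net/http'
    require 'uri'
    require 'cgi'

    augment_url = 'http://bigfoot.example.com/stores/YOUR-STORE/services/augment'  # placeholder
    feed_uri    = 'http://example.org/some-rss-1.0-feed'

    uri = URI.parse("#{augment_url}?uri=#{CGI.escape(feed_uri)}")
    puts Net::HTTP.get_response(uri).body   # the augmented feed, if the bib data
                                            # was in a shape the service understood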

Overall, I'm really impressed with the Talis API. It is a LOT easier to use than, say, Z39.50, and, by using OpenSearch, it seems more natural to integrate into existing web services than SRU.

Bigfoot-ruby is definitely a work in progress. I think I would like to split the Search class into ItemService and FacetService; I don't like how results is an Array for items and a Hash for facets. It just seems sloppy. I need to document it, of course, and I would like to implement Item POST. This project also made me realize how bloody slow FeedTools is. I am currently using it in both the Umlaut and the Finding Aids to provide OpenSearch, but I think it's really too sluggish to justify itself.

Thanks, Talis, for getting me started with Bigfoot and giving me the opportunity to play around with it.  Also, thanks to Ed Summers for fixing SVN on Code4lib.org.  You wouldn’t be able to download it and futz around with it yourself, otherwise.



3 responses to “In search of… Bigfoot”

  1. […] Interesting post from a Bigfoot user Ross Singer makes some interesting observations over on his blog about his experiences with the Talis Platform. […]

  2. Richard Wallis

    Ross, what an excellent review of your first experiences using the Talis Platform and what you achieved in just a couple of days.

    It’s great to see that you achieved much in those two days, including making a significant start on Bigfoot-ruby.

    You raise a few important questions; let me try to clarify a few of these:

    1) The Item Service.

    "the downside is that it's not documented anywhere what the indexes are and I doubt there would be any way to add new ones" – Yes, it is currently an issue that the indexes are not documented. The reason for this is that a Bigfoot Store is data-type agnostic, and the indexes for, say, reviews, or images, or citations will all differ from those for bibliographic information derived from MARC, or Dublin Core, or MODS, etc. If you are storing and indexing in this way, you cannot pre-define what those indexes will be.

    As I implied in my code4lib presentation, several APIs to the stores are still under development and will be available soon. One of these will return information about the configuration and contents of the stores, thus making them effectively self-documenting. I envisage that when you arrive at the default search UI for a store, it will be populated with information such as data types, quantities and indexes. Of course, the APIs called to display that information could also be called externally, enabling intelligence in the client.

    In the same way, a store owner will have access to API calls to fully control the configuration of a store. So if you want an index for the 490/440$v fields from records in a store containing MARC, you will be able to have one.

    There will obviously be some default configurations for some well-known data types, and these will be documented as time goes on. For your store we configured the following indexes: title, description, author, id, publisher, date, subject, type, format, language.

    As to POSTing data to your store: documentation is light on this at the moment, as we are still working in that area, but you will be able to HTTP POST a file in a growing set of formats (ISO2709, MARCXML, etc. for bib data, jpg files for images, etc.) to the item API, and the contents will be absorbed into the store and indexed. Much more on this soon.

    2) The Facet Service
    As with the indexes, because the stores are generic there are no pre-documented fields, but you will be able to find out which are available in real time by using API calls in the future. These fields are based upon the indexes that the owner of the store will have set up using the controlling API.

    So for your store in its current configuration those facets are: title, description, author, id, publisher, date, subject, type, format, language.

    How to build facets on top of a basic item search is covered very well in Rob Styles’ Twenty Minute Union video.

    3) The Augment Service
    When working with only one store, I agree that it is difficult to conceptualize a scenario for the augment service, even though it is potentially one of the most useful and powerful services.
    Let me try one for you. Imagine you had a store containing scholarly reviews, written by academics at your university, of books held in your library, and say another one containing a collection of portrait images of authors. By passing the results of an item search of your bibliographic store through the augmentation API of each of those other stores, your final result set, which you can render in your UI, will contain, where they exist, references to relevant author portraits and the contents of the relevant scholarly reviews.

    The Wikipedia example is based upon loading a store with the publicly available Wikipedia extracts data. The augmentation API for that store uses DC Creator values to match the records and augment where appropriate.

    With the addition of other output formats, JSON for example, and of the informational and configuration APIs, I hope you will see how the Bigfoot Store and its APIs are going to develop to be even simpler and more powerful.

    Thank you for your invaluable feedback, which the developers are already learning from.

    Richard.

  3. art

    What about an option to get a Lucene index (one that contains your data) from the platform as well? That would open the door to mobile apps and deduping collections, and provide an alternative if network latency or system availability is ever a problem.

    I don't know if this would fall under augmentation, but I think there's lots of traction to be had in indexing the content of the objects in addition to the metadata. That's not possible with the majority of the objects in most library collections, but there are subsets where the digital content is starting to match the paper, technology titles for example. If full-text indexing were possible, then maybe there could be some options for compressing what gets sent over the wire for indexing at the item level as well. In Ontario, where I am, there seem to be about 10K public provincial gov docs representing something like 15 gig of PDF content in the last few years, and in cases like this, leveraging the availability of the material in an indexable format would seem to be almost a requirement in a time of Google.

    Questions aside, I think the platform looks really cool, and I am glad to see a developer's notes on how it comes together; I really couldn't get my mind around it before. I am hopeful that the platform will encourage discussion on what the metrics should be for the value of an API, beyond the "got that, next question" response that has been the typical situation for too long in the library community.
