{"id":120,"date":"2008-09-16T16:16:53","date_gmt":"2008-09-16T21:16:53","guid":{"rendered":"http:\/\/dilettantes.code4lib.org\/?p=120"},"modified":"2008-09-16T16:16:53","modified_gmt":"2008-09-16T21:16:53","slug":"working-around-ruby-with-xslt","status":"publish","type":"post","link":"https:\/\/rossfsinger.me\/blog\/2008\/09\/working-around-ruby-with-xslt\/","title":{"rendered":"Working around Ruby with XSLT"},"content":{"rendered":"<p>My relationship with Ruby nowadays is roughly akin to somebody addicted to pain killers. I know it&#8217;s not good for me (since <em>everything<\/em> I work on nowadays is RDF, XML or both) but I&#8217;m still able to be productive and the pain of quitting, while it would be better for everybody in the long run, just isn&#8217;t something I have time for right now. Maybe someday I&#8217;ll make the jump back to Python (since it&#8217;s actually pretty good at dealing with both RDF and XML), but for now I&#8217;ll just find workarounds to my problems (unlike others, I am completely incapable of juggling more than one language).<\/p>\n<p>I first ran into my big XML and Ruby problem a couple of weeks ago while working on the TalisLMS connector for Jangle. I&#8217;ve, of course, run into it before, but it has never been a total show stopper like this. In order to add the Resource entity (Jangle-ese for bibliographic records) to the TalisLMS connector, I am querying the Platform store the OPAC uses. I&#8217;m using the Platform rather than the Zebra index that comes with Alto (the records are indexed in both places) because the modified date isn&#8217;t sortable in Zebra and that would be an issue when serializing everything to Atom. The records are transformed into a proprietary RDF format (called BibRDF) when loaded into the Platform (this is for the benefit of Prism, our OPAC). In order to get the MARC records (there&#8217;s no route back to the MARC from 
BibRDF), I have to pull the UniqueIdentifier field (which is the mapped 001) out of the BibRDF, throw it into a Z39.50 client (<a href=\"http:\/\/ruby-zoom.rubyforge.org\/\" target=\"_blank\">Ruby\/ZOOM<\/a>) and query the Zebra index. In order to get enough metadata to create a valid Atom entry, I needed to be able to parse the BibRDF (which comes out of the Platform as RDF\/XML), since that is the default record format.<\/p>\n<p>And this is where I ran into problems. I have the default number of records returned by Jangle set to 100. That&#8217;s a pretty sweet spot for both servers to handle the load and clients to deal with the resulting Atom document. Well, you&#8217;d think it was, anyway, except REXML was taking about 10 seconds to parse the Platform response into Ruby objects.<\/p>\n<p>I realize the Rubyists out there are already dismissing this and scrolling down to the comment box to write &#8220;well don&#8217;t use REXML, you dumbass&#8221;, but let me explain. I generally <em>don&#8217;t<\/em> use REXML (unless it&#8217;s something very small and simple), instead opting for Hpricot for parsing XML. I&#8217;ve tended to avoid LibXML in Ruby: when I first tried it, it segfaulted a lot, but that was the past&#8230; my reason for avoiding it lately is that I have this stubborn ideal about having things work with JRuby and that&#8217;s just not going to be an option with LibXML (before you scroll down and add another comment about the Ruby\/ZOOM requirement, it will eventually be replaced with Ruby-SRU&#8230; probably). Hpricot was falling flat on its face with the BibRDF namespace prefixes, though (j.0:UniqueIdentifier). It seems to have problems with periods in the prefix, so that was a no go.<\/p>\n<p>So I had REXML and I had horrible performance. Now what?<\/p>\n<p>Well, JSON is fast in Ruby, so I thought that might 
be an option. The Platform has a transform service: if you pass an argument with the URL of an XSLT stylesheet, it will output the result in the format you want. <a href=\"http:\/\/www.google.com\/search?q=xml+to+json+xslt\" target=\"_blank\">Googling found several projects<\/a> that would turn XML into JSON via XSLT (<a href=\"http:\/\/www.bramstein.com\/projects\/xsltjson\/\" target=\"_blank\">this one seems the best<\/a> if you have an XSLT 2.0 processor), but they weren&#8217;t <em>quite<\/em> what I needed. I wanted to preserve the original RDF\/XML since I was just going to be turning around and regurgitating it back to the Jangle server, anyway. I just needed a quick way to grab the UniqueIdentifier, MainAuthor and LastModified fields and shove the rest of the XML into an object attribute.<\/p>\n<p>I have always chafed at the thought of actually doing anything in XSLT. In retrospect (after I&#8217;ve been using it almost exclusively for a month, now), I realize that my opinion was probably actually the result of the data that I was trying to transform (EAD, the metadata format designed to punish technologists) rather than XSLT itself (the project got sucked into a vortex when I tried working with the EAD directly with Ruby, too). Still, I had always resisted. The syntax is weird, variables confused me, I just never got the hang of it.<\/p>\n<p>But, damn, it&#8217;s fast.<\/p>\n<p>And, when I turned the XML into JSON (with XML), it was perfect. <a href=\"http:\/\/jangle.googlecode.com\/svn\/trunk\/xsl\/skywalk2json.xsl\" target=\"_blank\">Here&#8217;s my stylesheet<\/a>. <a href=\"http:\/\/api.talis.com\/tx?xsl-uri=http:\/\/jangle.googlecode.com\/svn\/trunk\/xsl\/skywalk2json.xsl&amp;xml-uri=http:\/\/api.talis.com\/stores\/bib-demo-2\/items?query%3D*:*%26sort%3Ddisplayaslastmodified:d%26offset%3D0%26max%3D100\" target=\"_blank\">Here&#8217;s what the output 
from the Platform<\/a> looks like. <a href=\"http:\/\/anvil.lisforge.net:4567\/resources\/\" target=\"_blank\">Here&#8217;s the output<\/a> from the TalisLMS connector.<\/p>\n<p>I wasn&#8217;t done, yet, though. The DLF ILS-DI Adapter for Jangle&#8217;s OAI-PMH service was sooooo slow. Requests were literally taking around 35 seconds each. This was because I was using <a href=\"http:\/\/www.sporkmonger.com\/projects\/feedtools\/\">FeedTools<\/a> to parse the Atom documents and Builder::XmlMarkup to generate the OAI-PMH output. And this was silly. Atom is a very short hop to OAI-PMH, and there was really no need to manipulate the data itself at all. However, I did need to add stuff to the final XML output that I wouldn&#8217;t know until it was time to render. So I wrote <a href=\"http:\/\/code.google.com\/p\/jangle\/source\/browse\/#svn\/trunk\/external_interfaces\/xsl\" target=\"_blank\">these two XSLTs<\/a>. I have placeholder patterns in there, identified by &#8220;##verb##&#8221; or &#8220;##requestUrl##&#8221;, etc. This way, I can load the XSLT file into my Ruby script, replace the patterns with their real values via regex, and then transform the Atom to OAI-PMH using <a href=\"http:\/\/libxsl.rubyforge.org\/\" target=\"_blank\">libxslt-ruby<\/a>. Requests are now down to about 5 seconds. Not bad.<\/p>\n<p>All in all I&#8217;m pretty happy with this. And I don&#8217;t have to quit my addiction just yet.<\/p>\n<p>For those of you that noticed that libxslt-ruby doesn&#8217;t quite jibe with my JRuby requirement, well, I guess I&#8217;m not very dogmatic at the end of the day (which is right about now).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>My relationship with Ruby nowadays is roughly akin to somebody addicted to pain killers. I know it&#8217;s not good for me (since everything I 
work on nowadays is RDF, XML or both) but I&#8217;m still able to be productive and the pain of quitting, while it would be better for everybody in the long run, just [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[47,39,52],"tags":[],"class_list":["post-120","post","type-post","status-publish","format-standard","hentry","category-jangle","category-ruby","category-xslt"],"_links":{"self":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts\/120","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/comments?post=120"}],"version-history":[{"count":2,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts\/120\/revisions"}],"predecessor-version":[{"id":122,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts\/120\/revisions\/122"}],"wp:attachment":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/media?parent=120"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/categories?post=120"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/tags?post=120"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}