{"id":301,"date":"2009-04-16T10:30:52","date_gmt":"2009-04-16T15:30:52","guid":{"rendered":"http:\/\/dilettantes.code4lib.org\/?p=301"},"modified":"2009-04-16T10:30:52","modified_gmt":"2009-04-16T15:30:52","slug":"parsing-escaped-unicode-in-ruby","status":"publish","type":"post","link":"https:\/\/rossfsinger.me\/blog\/2009\/04\/parsing-escaped-unicode-in-ruby\/","title":{"rendered":"Parsing escaped unicode in Ruby"},"content":{"rendered":"<p>While what I&#8217;m posting here might be incredibly obvious to anyone that understands unicode or Ruby better than me, it was new to me and might be new to you, so I&#8217;ll share.<\/p>\n<p>Since <a href=\"http:\/\/inkdroid.org\/\" target=\"_blank\">Ed<\/a> already <a href=\"http:\/\/delicious.com\/url\/64ad6c81446e299f983894d4e18a011a\" target=\"_blank\">let the cat out of the bag<\/a> about <a href=\"http:\/\/lcsubjects.org\/\" target=\"_blank\">LCSubjects.org<\/a>, I can explain the backstory here. At <a href=\"http:\/\/lcsh.info\/\" target=\"_blank\">lcsh.info<\/a>, Ed made the <a href=\"http:\/\/inkdroid.org\/bzr\/lcsh\/web\/static\/lcsh.nt\" target=\"_blank\">entire dataset available<\/a> as <a href=\"http:\/\/www.w3.org\/2001\/sw\/RDFCore\/ntriples\/\" target=\"_blank\">N-Triples<\/a>, so just before he yanked the site, I grabbed the data and have been holding onto it since. I wrote a simple little N-Triples parser in Ruby to rewrite some of the data before I loaded it into the platform store I have. My first pass at this was really buggy: I wasn&#8217;t parsing N-Triples literals well at all and was leaving out quoted text within the literal and whatnot. I was also inadvertently ignoring the escaped unicode within the literals and sending it along verbatim.<\/p>\n<p>N-Triples escapes unicode the same way Python string literals do (<a href=\"http:\/\/www.w3.org\/2001\/sw\/RDFCore\/ntriples\/#character\" target=\"_blank\">or at least this is how I&#8217;ve 
understood it<\/a>), so 7\u207003\u02b943\u02baN 151\u207056\u02b925\u02baE is serialized into N-Triples as 7\\u207003\\u02B943\\u02BAN 151\\u207056\\u02B925\\u02BAE. Try as I might, I could not figure out how to turn that back into unicode.<\/p>\n<p><a href=\"http:\/\/bibwild.wordpress.com\/\" target=\"_blank\">Jonathan Rochkind<\/a> recommended that I look at the <a href=\"http:\/\/json.rubyforge.org\/\" target=\"_blank\">Ruby JSON library<\/a> for some guidance, since JSON also encodes this way. With that, I took a peek in <a href=\"http:\/\/json.rubyforge.org\/doc\/classes\/JSON\/Pure\/Parser.html\" target=\"_blank\">JSON::Pure::Parser<\/a> and modified parse_string for my needs. So, if you have escaped unicode strings like this, and want them to be unicode, here&#8217;s a simple class to handle it.<\/p>\n<pre>$KCODE = 'u'\r\nrequire 'strscan'\r\nrequire 'iconv'\r\nrequire 'jcode'\r\nclass UTF8Parser &lt; StringScanner\r\n  STRING = \/(([\\x0-\\x1f]|[\\\\\\\/bfnrt]|\\\\u[0-9a-fA-F]{4}|[\\x20-\\xff])*)\/nx\r\n  UNPARSED = Object.new\r\n  UNESCAPE_MAP = Hash.new { |h, k| h[k] = k.chr }\r\n  UNESCAPE_MAP.update({\r\n    ?\"  =&gt; '\"',\r\n    ?\\\\ =&gt; '\\\\',\r\n    ?\/  =&gt; '\/',\r\n    ?b  =&gt; \"\\b\",\r\n    ?f  =&gt; \"\\f\",\r\n    ?n  =&gt; \"\\n\",\r\n    ?r  =&gt; \"\\r\",\r\n    ?t  =&gt; \"\\t\",\r\n    ?u  =&gt; nil,\r\n  })\r\n  UTF16toUTF8 = Iconv.new('utf-8', 'utf-16be')\r\n  def initialize(str)\r\n    super(str)\r\n    @string = str\r\n  end\r\n  def parse_string\r\n    if scan(STRING)\r\n      return '' if self[1].empty?\r\n      string = self[1].gsub(%r((?:\\\\[\\\\bfnrt\"\/]|(?:\\\\u(?:[A-Fa-f\\d]{4}))+|\\\\[\\x20-\\xff]))n) do |c|\r\n        if u = UNESCAPE_MAP[$&amp;[1]]\r\n          u\r\n        else # \\uXXXX\r\n          bytes = ''\r\n          i = 0\r\n          while c[6 * i] == ?\\\\ &amp;&amp; c[6 * i + 1] == ?u\r\n            bytes 
&lt;&lt; c[6 * i + 2, 2].to_i(16) &lt;&lt; c[6 * i + 4, 2].to_i(16)\r\n            i += 1\r\n          end\r\n          UTF16toUTF8.iconv(bytes)\r\n        end\r\n      end\r\n      if string.respond_to?(:force_encoding)\r\n        string.force_encoding(Encoding::UTF_8)\r\n      end\r\n      string\r\n    else\r\n      UNPARSED\r\n    end\r\n  rescue Iconv::Failure =&gt; e\r\n    raise ArgumentError, \"Caught #{e.class}: #{e}\"\r\n  end\r\nend<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>While what I&#8217;m posting here might be incredibly obvious to anyone that understands unicode or Ruby better than me, it was new to me and might be new to you, so I&#8217;ll share. Since Ed already let the cat out of the bag about LCSubjects.org, I can explain the backstory here. At lcsh.info, Ed made [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,39,62],"tags":[],"class_list":["post-301","post","type-post","status-publish","format-standard","hentry","category-coding","category-ruby","category-unicode"],"_links":{"self":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts\/301","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/comments?post=301"}],"version-history":[{"count":2,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts\/301\/revisions"}],"predecessor-version":[{"id":303,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/posts\/301\/revisions\/303"}],"wp:attachment":[{"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/media?parent=301"}]
,"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/categories?post=301"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rossfsinger.me\/blog\/wp-json\/wp\/v2\/tags?post=301"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}