While what I’m posting here might be incredibly obvious to anyone who understands Unicode or Ruby better than I do, it was new to me and might be new to you, so I’ll share.
Since Ed already let the cat out of the bag about LCSubjects.org, I can explain the backstory here. At lcsh.info, Ed made the entire dataset available as N-Triples, so just before he yanked the site, I grabbed the data and have been holding onto it since. I wrote a simple little N-Triples parser in Ruby to rewrite some of the data before I loaded it into the platform store I have. My first pass at this was really buggy: I wasn’t parsing N-Triples literals well at all and was leaving out quoted text within the literal and whatnot. I was also inadvertently ignoring the escaped Unicode within the literals completely and sending the escapes through verbatim.
N-Triples escapes Unicode the same way Python string literals do (or at least this is how I’ve understood it), so 7⁰03ʹ43ʺN 151⁰56ʹ25ʺE is serialized into N-Triples like: 7\u207003\u02B943\u02BAN 151\u207056\u02B925\u02BAE. Try as I might, I could not figure out how to turn that back into Unicode.
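To make the escaping direction concrete, here is a minimal sketch of how a string ends up in that form. It assumes a modern Ruby (1.9 or later) and UTF-8 input, and `nt_escape` is a hypothetical helper name, not something from my parser or from any library. It only deals with non-ASCII characters and leaves quotes, backslashes and control characters alone.

```ruby
# Hypothetical helper: escape non-ASCII characters roughly the way
# N-Triples serializers do. Assumes the input string is valid UTF-8.
def nt_escape(str)
  str.each_char.map do |ch|
    cp = ch.ord
    if cp < 0x80
      ch                        # plain ASCII passes through untouched
    elsif cp <= 0xFFFF
      format('\\u%04X', cp)     # BMP code points: \uXXXX
    else
      format('\\U%08X', cp)     # everything above the BMP: \UXXXXXXXX
    end
  end.join
end

puts nt_escape("7\u207003\u02B943\u02BAN 151\u207056\u02B925\u02BAE")
# prints 7\u207003\u02B943\u02BAN 151\u207056\u02B925\u02BAE
```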
Jonathan Rochkind recommended that I look at the Ruby JSON library for some guidance, since JSON also encodes this way. With that, I took a peek in JSON::Pure::Parser and modified parse_string for my needs. So, if you have escaped Unicode strings like this, and want them to be Unicode, here’s a simple class to handle it.
$KCODE = 'u'
require 'strscan'
require 'iconv'
require 'jcode'

# Unescapes \uXXXX sequences in a string; adapted from
# JSON::Pure::Parser#parse_string.
class UTF8Parser < StringScanner
  STRING = /(([\x0-\x1f]|[\\\/bfnrt]|\\u[0-9a-fA-F]{4}|[\x20-\xff])*)/nx
  UNPARSED = Object.new
  # Maps the character after a backslash to its unescaped form;
  # anything not listed falls back to the character itself.
  UNESCAPE_MAP = Hash.new { |h, k| h[k] = k.chr }
  UNESCAPE_MAP.update({
    ?" => '"',
    ?\\ => '\\',
    ?/ => '/',
    ?b => "\b",
    ?f => "\f",
    ?n => "\n",
    ?r => "\r",
    ?t => "\t",
    ?u => nil,
  })
  UTF16toUTF8 = Iconv.new('utf-8', 'utf-16be')

  def initialize(str)
    super(str)
    @string = str
  end

  def parse_string
    if scan(STRING)
      return '' if self[1].empty?
      string = self[1].gsub(%r((?:\\[\\bfnrt"/]|(?:\\u(?:[A-Fa-f\d]{4}))+|\\[\x20-\xff]))n) do |c|
        if u = UNESCAPE_MAP[$&[1]]
          u
        else # \uXXXX
          # Collect a run of \uXXXX escapes as UTF-16BE bytes, then convert.
          bytes = ''
          i = 0
          while c[6 * i] == ?\\ && c[6 * i + 1] == ?u
            bytes << c[6 * i + 2, 2].to_i(16) << c[6 * i + 4, 2].to_i(16)
            i += 1
          end
          UTF16toUTF8.iconv(bytes)
        end
      end
      if string.respond_to?(:force_encoding)
        string.force_encoding(Encoding::UTF_8)
      end
      string
    else
      UNPARSED
    end
  rescue Iconv::Failure => e
    # GeneratorError belongs to the JSON library and is not defined here,
    # so raise a standard exception instead.
    raise ArgumentError, "Caught #{e.class}: #{e}"
  end
end
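The class above leans on `$KCODE`, `jcode` and `Iconv`, which are Ruby 1.8 facilities; `jcode` went away in Ruby 1.9 and `Iconv` left the standard library in 2.0. For anyone on a newer Ruby, here is a rough sketch of the same idea (treat each run of `\uXXXX` escapes as UTF-16BE code units and convert to UTF-8) using only `Array#pack` and `String#encode`. The name `unescape_unicode` is my own placeholder. It assumes UTF-8 input, handles only `\uXXXX`, not the other backslash escapes or `\UXXXXXXXX`, and it does not check whether the backslash before the `u` was itself escaped.

```ruby
# Hypothetical helper, not part of UTF8Parser: turn runs of \uXXXX
# escapes back into UTF-8 characters. Assumes `str` is UTF-8.
def unescape_unicode(str)
  str.gsub(/(?:\\u\h{4})+/) do |run|
    # Each \uXXXX is one UTF-16 code unit; keeping runs together lets
    # surrogate pairs be decoded as a single character.
    units = run.scan(/\\u(\h{4})/).flatten.map { |hex| hex.to_i(16) }
    units.pack('n*')                          # big-endian 16-bit units
         .force_encoding(Encoding::UTF_16BE)
         .encode(Encoding::UTF_8)
  end
end

puts unescape_unicode('7\u207003\u02B943\u02BAN 151\u207056\u02B925\u02BAE')
# prints 7⁰03ʹ43ʺN 151⁰56ʹ25ʺE
```

An unpaired surrogate in the input makes `encode` raise an error, which plays roughly the role that `Iconv::Failure` plays in the class above.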