A collection of computer systems and programming tips that you may find useful.
 
Brought to you by Craic Computing LLC, a bioinformatics consulting company.

Tuesday, May 12, 2009

Character Encodings and AWS SimpleDB

I've just been bitten by an ISO-8859-1 character lurking in a string I was trying to load into an Amazon Web Services SimpleDB domain.

It is actually the same darn character that gave me issues in Ruby 1.9 the other week.

In this case the non-UTF-8 character meant that the request sent to SimpleDB did not match the signature computed from it. Here is the error message that I kept getting with certain records I was trying to upload:
<?xml version="1.0"?>
<Response><Errors><Error>
<Code>SignatureDoesNotMatch</Code>
<Message>The request signature we calculated does not match the signature you provided.
Check your AWS Secret Access Key and signing method. Consult the service documentation for details.</Message>
</Error></Errors>
<RequestID>021ef5c0-b1fd-73ed-b376-bc292d9736cf</RequestID></Response>


So I spent a good few hours checking my AWS code for errors, trying a different AWS sdb library, trying the latest version of the Signature protocol, and so on, all to no avail. To pinpoint the problem I ran the code again with a different set of input data and it worked fine. That told me the problem was in the data and not the code, per se. Looking more closely at the data with 'pp' I saw a non-ASCII character code that was not valid UTF-8. It turns out SimpleDB has known issues when these appear in input queries.
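A rough illustration of why a single stray byte can break the signature: AWS request signatures are HMACs over the bytes of the request, so the same text in two encodings signs differently. The sketch below is for Ruby 1.9 or later. The `sign` helper, the key, and the sample string are all made up for illustration, and it assumes (this is not confirmed by the error message) that the service treats the request as UTF-8 when it computes its own signature.

```ruby
require 'openssl'

# Hypothetical helper: HMAC-SHA256 over the raw bytes of a string.
# Real AWS signing Base64-encodes the digest; hex is enough for a comparison.
def sign(secret, string_to_sign)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('sha256'), secret, string_to_sign)
end

secret = 'not-a-real-key'                                  # placeholder secret
latin1 = "Name=caf\xE9".dup.force_encoding('ISO-8859-1')   # e-acute as one byte
utf8   = latin1.encode('UTF-8')                            # e-acute as two bytes

puts latin1.bytesize   # => 9
puts utf8.bytesize     # => 10
puts sign(secret, latin1) == sign(secret, utf8)   # => false
```

If the client signs the first form and the server ends up with the second, the two signatures cannot match, whatever the key.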

For me, the fix was fairly simple. I'm 99% sure that all I need worry about are ISO-8859-1 codes in my input. So I look for a non-ASCII code (> 127) in my input strings and, if I find any, I use iconv to convert the string to UTF-8, which SimpleDB can handle. Note that this is for Ruby 1.8. It does not work in 1.9; I'll post a fix for that in due course.
require 'iconv'

# Convert a string to UTF-8 if it contains any non-ASCII bytes (Ruby 1.8 only)
def convert(in_str)
  ascii = 1
  in_str.length.times do |i|
    # In Ruby 1.8, in_str[i] returns the byte value as an integer
    if in_str[i] > 127
      ascii = 0
      break
    end
  end
  out_str = String.new(in_str)
  if ascii == 0
    in_encoding = 'iso-8859-1' # just a guess
    out_encoding = 'utf-8'
    out_str = Iconv.new(out_encoding, in_encoding).iconv(in_str)
  end
  out_str
end

[...]
input_hash.keys.each {|key| input_hash[key] = convert(input_hash[key]) }
[...]


Not pretty but it has got me back up and running.
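For Ruby 1.9 and later, where strings carry an encoding and Iconv is no longer the usual route, something along these lines might do the same job. This is only a sketch under the same guess as above (anything that is not valid UTF-8 is ISO-8859-1), and `convert_to_utf8` is a name invented here. It differs from the 1.8 code in one respect: it checks whether the bytes are already valid UTF-8 rather than just looking for bytes above 127, so strings that are already UTF-8 are not converted a second time.

```ruby
# Sketch for Ruby 1.9+. Assumption: input that is not valid UTF-8 is ISO-8859-1.
def convert_to_utf8(in_str)
  candidate = in_str.dup.force_encoding('UTF-8')
  # Plain ASCII and well-formed UTF-8 pass through unchanged
  return candidate if candidate.valid_encoding?
  # Otherwise reinterpret the bytes as ISO-8859-1 and transcode to UTF-8
  in_str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

input_hash = { 'plain' => 'abc', 'latin1' => "caf\xE9".b }
input_hash.keys.each { |key| input_hash[key] = convert_to_utf8(input_hash[key]) }
```

Because every byte sequence is valid ISO-8859-1, the fallback never raises, but it will quietly produce the wrong characters if the input was really in some other encoding.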

Moral of this story: When arbitrary weird effects arise halfway through processing a large dataset, look closely at the data. Chances are the problem lies in there somewhere. Make sure you have accurate logs to help you pinpoint exactly where things went wrong.
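As one way of putting that into practice, a quick pass over the data before uploading can report exactly which records look suspect. The sketch below is for Ruby 1.9 or later; `log_suspect_records` and the sample records are hypothetical, and it assumes each record is a hash whose values are strings.

```ruby
# Hypothetical helper: write one line per field whose bytes are not valid UTF-8
# and return the indexes of the affected records.
def log_suspect_records(records, log = $stderr)
  suspects = []
  records.each_with_index do |record, index|
    record.each do |key, value|
      next if value.dup.force_encoding('UTF-8').valid_encoding?
      log.puts "record #{index}, field #{key}: not valid UTF-8: #{value.inspect}"
      suspects << index
    end
  end
  suspects.uniq
end

records = [{ 'name' => 'plain' }, { 'name' => "caf\xE9".b }]
suspect_indexes = log_suspect_records(records)
```

With the two sample records, only the second one (index 1) is reported.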
