A collection of computer systems and programming tips that you may find useful.
 
Brought to you by Craic Computing LLC, a bioinformatics consulting company.

Thursday, February 24, 2011

Rails, UTF-8 and Heroku

I've had problems with Ruby character encodings over the years, especially when pulling text with non-ASCII characters in from remote sites. I thought I had it mostly sorted but the past few days showed me that was not the case.

I have a MySQL database on my local machine and a Rails 3 app that pulls in text from remote sources, stores it in the database and does stuff with it. I am deploying the application on Heroku prior to public release. Heroku uses PostgreSQL exclusively.

I was under the belief that all the components in my system were set up to use UTF-8 encoding and therefore moving text with non-ASCII characters around should be fine. But in practice that was not the case - characters like 'α' that looked fine on my local machine showed up as 'α' on Heroku, etc. So clearly I was doing something wrong. Rather than go into all the gory details, this is the way to do it right...

Bottom-line: Make everything use UTF-8 explicitly ... EVERYTHING

0: Backup your current database!

1: MySQL
Here are the contents of my /etc/my.cnf file:
[mysqld]
datadir=/usr/local/mysql/data
bind-address = 127.0.0.1
character-set-server = utf8
max_allowed_packet = 32M
[client]
default-character-set = utf8
[mysql]
max_allowed_packet = 32M

Even if your tables are held in utf8, you should add these lines. You want the following mysql command to look as shown:
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
+--------------------------+--------+

2: Rails
I'm using Rails 3 - can't tell you how this works in Rails 2.x
a: In config/application.rb make sure this line is uncommented:
    # Configure the default encoding used in templates for Ruby 1.9.
config.encoding = "utf-8"
MySQL and Rails use different variants of utf8/utf-8 - make sure you are using the right one. And note the comment above this line - this sets up utf-8 encoding for templates ONLY.
b: In your database.yml, specify the encoding for the databases - for example:
development:
adapter: mysql2
host: localhost
encoding: utf8
[...]
Here you are telling the database adapter that the database uses utf-8.
c: mysql2
Notice that I am using the mysql2 adapter instead of mysql. At this point (Feb 2011) the mysql gem is NOT encoding aware. Replace mysql with mysql2 in your Gemfile and run bundle.
d: In each Model that uses text add this line at the very top of the file:
# encoding: UTF-8
This tells Ruby that we're using utf-8 in this model. I don't see a way to set this at the application level so you have to have to add it in all relevant model .rb file. I also don't like defining something with a comment line. I can't see how to define this in, say, an irb interactive session.

With all those components in place, you should be all set. Try entering non-ASCII characters into a form - such as accented characters or greek/math symbols. These should be displayed correctly in the browser and in the mysql command line client.

With regards to Heroku, assuming you have your app already set up, you should be able to do a 'heroku db:push' to copy the database into PostgreSQL on Heroku and the characters should display correctly on the remote pages. You will see reference to using 'heroku db:push' with explicit database URLs that include an encoding option, such as '?encoding=utf8'. If your MySQL is set up correctly then this should be unnecessary.

A critical part of running apps on Heroku is the ability to pull the database back to your local database using 'heroku db:pull'. Before getting all my components set up with utf-8, this step failed for me. With everything using utf-8, and after adding the 'max_allowed_packet' lines to my my.cnf file, this process works fine.

But because I was working with data in before everything was truly utf-8, I had some instances of text in the database that had been incorrectly encoded - and thereby effectively corrupted. I could see what the 'corrupt' characters looked like and I knew what the correct versions should be. Because everything is now using utf-8 I could simply do a substitution on the text. For example:
str.sub!(/ß/, 'β')
I gathered up the character mappings that I needed (which were not may in my case) and wrote up a class method that I cloned in each model with the issue. I then ran those in the Rails console to correct the bad characters. The method is:
  def self.make_utf8_clean
mappings = [ ['α', 'α'],
['ß', 'β'],
['β', 'β'],
['’', '’'],
['“', '“'],
['â€\u009D;', '”'],
['â€', '”'],
['ö', 'ö'],
['®', '®']
]
# Get the list of String columns
columns = Array.new
self.columns.each do |column|
if column.type.to_s == 'string'
columns << column
end
end

# Go through each object -> column against all mappings
self.all.each do |obj|
columns.each do |column|
mappings.each do |mapping|
value = obj.attributes[column.name]
if value =~ /#{mapping[0]}/
s = value.gsub(/#{mapping[0]}/, "#{mapping[1]}")
obj.update_attribute(column.name.to_sym, s)
end
end
end
end
end
This looks at your model and figures out which columns are of type String. It goes through all records and all character mappings, replacing text and updating the database as needed. Your mappings array could be much larger. There may be a better source of these, but this is a start.
You run this in a rails console like this:
Loading development environment (Rails 3.0.4)
ruby-1.9.2-p0 > YourModel.make_utf8_clean
It's a hack but it helped my 'fix' quite a few records that would have been a pain to recreate.

Character encodings are HARD - Yehuda Katz wrote a nice article on the issues. For most purposes (unless you work with Japanese text) UTF-8 is your best choice for encoding and so I'm using it exclusively. Java and Python both made the same choice and things are probably easier to set up in those worlds. Ruby has it's roots in Japan and so it is not surprising that it could not go down that path.

From now on, I'm going to make sure everything I touch is configured for UTF-8. There are fews reasons not to at this stage and it allows you to handle most languages.



 

3 comments:

daze said...

Hey, this was a great post. I'm facing a utf-8 encoding issue right now, and I think your blog post could significantly help.

Great job!

Mario Zigliotto said...

Yeah, this was just a fantastic post. Running into a lot of utf-8 issues from scraped data sources and this helps.

Thanks for taking the time to publish it!

Unknown said...

Hi,

I am facing the same issue right ow with the rails application on Heroku.
Did you find any permanent solution for this other than substitution?

Thanks,
Mounika

Archive of Tips