I work with large text files representing DNA sequences, patents, etc. These are typically plain ASCII text and that is how I treat them. Under Ruby 1.8 everything seemed fine. But running the same code under Ruby 1.9 on a 2GB text file I got this error:
$ ./test.rb myfile
./test.rb:9:in `block in <main>': invalid byte sequence in US-ASCII (ArgumentError)
from ./test.rb:3:in `each_line'
from ./test.rb:3:in `<main>'
Here is code that gave rise to that:
#!/usr/bin/env ruby
open(ARGV[0], 'r').each_line do |line|
  if line =~ />(\S+)/
    puts line
  end
end
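Somewhere in the middle of the input file is a non-ASCII character and Ruby 1.9 won't take it. It turns out that 1.9 takes a much stricter line on interpreting text than 1.8, which just took what you gave it. Every string now carries an encoding, and unless you tell it otherwise Ruby assumes a default for file input, which on my system was US-ASCII, as the error message shows. A byte outside that range goes unnoticed until an operation such as the regex match has to interpret the characters, and that is where the error is raised.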
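To make the role of the encoding tag concrete, here is a minimal sketch that tags the same bytes three different ways and asks Ruby whether they are valid. The helper name valid_as? and the sample bytes are my own illustration, not part of the original script.

```ruby
# Raw bytes for "M", 0xFC, "l", "l". 0xFC is "ü" in ISO-8859-1.
raw = [0x4D, 0xFC, 0x6C, 0x6C].pack('C*')

# Hypothetical helper: are these bytes valid text under the named encoding?
def valid_as?(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).valid_encoding?
end

puts Encoding.default_external     # the encoding Ruby assumes for file input
puts valid_as?(raw, 'US-ASCII')    # false
puts valid_as?(raw, 'ISO-8859-1')  # true
puts valid_as?(raw, 'UTF-8')       # false
```

The bytes never change; only the label does. That is why the same file can be fine under one encoding and an error under another.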
If you know you will be reading UTF-8 or ISO-8859-1 text then you can explicitly tell your script to handle it. There are several ways to do this but in this simple example you can change the 'r' in the open statement like this:
#!/usr/bin/env ruby
open(ARGV[0], 'r:utf-8').each_line do |line|
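One of the other ways is to name both the file's encoding and the encoding you want to work in, so Ruby transcodes as it reads. The sketch below assumes the file really is ISO-8859-1; the helper name read_latin1_lines and the temporary test file are made up for the illustration.

```ruby
require 'tempfile'

# Hypothetical helper: read an ISO-8859-1 file and return its lines as UTF-8.
def read_latin1_lines(path)
  File.open(path, 'r:iso-8859-1:utf-8') { |f| f.each_line.to_a }
end

Tempfile.open('latin1_demo') do |tmp|
  tmp.binmode
  # ">M\xFCller sample\n" written as raw bytes; 0xFC is a Latin-1 "ü".
  tmp.write([0x3E, 0x4D, 0xFC].pack('C*'))
  tmp.write("ller sample\n")
  tmp.flush

  lines = read_latin1_lines(tmp.path)
  puts lines.first.encoding == Encoding::UTF_8  # true
  puts lines.first =~ />(\S+)/                  # 0
end
```

This only helps when the guess about the source encoding is right. Reading a Latin-1 file as 'r:utf-8' would still leave the umlaut as an invalid byte.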
That's OK if you know the encoding, but the public data files I work with pick up occasional non-ASCII characters, such as German umlauts. I don't know what to expect and I don't want to clutter my code with rescue clauses to handle all possibilities.
The solution for my problem is to treat the text as binary by using the 'rb' mode in the open statement. I can still process text data line by line, and Ruby passes any non-ASCII bytes through untouched instead of raising an error. So this version of the code takes the input data with no problems:
#!/usr/bin/env ruby
open(ARGV[0], 'rb').each_line do |line|
  if line =~ />(\S+)/
    puts line
  end
end
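To check what binary mode actually does to the data, here is a small sketch that writes a test file containing one umlaut byte and reads it back with 'rb'. The helper name header_lines_binary and the sample contents are assumptions for the illustration.

```ruby
require 'tempfile'

# Hypothetical helper: return the lines matching the header pattern,
# reading the file in binary mode.
def header_lines_binary(path)
  File.open(path, 'rb') do |f|
    f.each_line.select { |line| line =~ />(\S+)/ }
  end
end

Tempfile.open('binary_demo') do |tmp|
  tmp.binmode
  # ">M\xFCller\n" as raw bytes, then a plain sequence line.
  tmp.write([0x3E, 0x4D, 0xFC, 0x6C, 0x6C, 0x65, 0x72, 0x0A].pack('C*'))
  tmp.write("ACGT\n")
  tmp.flush

  hits = header_lines_binary(tmp.path)
  puts hits.length                              # 1
  puts hits.first.encoding == Encoding::BINARY  # true
  puts hits.first.bytes.include?(0xFC)          # true
end
```

The umlaut byte is still in the line that comes back. Ruby has simply stopped trying to interpret it as a character, so anything printed later carries the original bytes along.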
My problem stemmed from two umlaut characters buried deep in the file. To figure out which lines were causing the problem I used this variant of the code to output bad lines.
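#!/usr/bin/env ruby
open(ARGV[0], 'r').each_line do |line|
  begin
    # The match is only here to trigger the error on a badly encoded line.
    line =~ />(\S+)/
  rescue ArgumentError
    # Only lines that failed the match end up here.
    puts line
  end
end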
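An alternative that avoids the rescue is to ask each line directly whether it is valid and to record its line number. This is a rough sketch rather than what I ran at the time; the helper name invalid_lines and the US-ASCII default are assumptions.

```ruby
require 'tempfile'

# Hypothetical helper: return [line_number, line] pairs for every line
# that is not valid in the given encoding.
def invalid_lines(path, encoding_name = 'US-ASCII')
  bad = []
  File.open(path, "r:#{encoding_name}") do |f|
    f.each_line do |line|
      bad << [f.lineno, line] unless line.valid_encoding?
    end
  end
  bad
end

Tempfile.open('invalid_demo') do |tmp|
  tmp.binmode
  tmp.write(">seq1\n")
  # Second line holds a raw 0xFC byte.
  tmp.write([0x3E, 0x4D, 0xFC, 0x0A].pack('C*'))
  tmp.write("ACGT\n")
  tmp.flush

  p invalid_lines(tmp.path).map { |number, _line| number }  # [2]
end
```

On a real file, calling inspect on each returned line shows the offending bytes as escapes such as \xFC, which makes them easier to spot than the raw output.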
Look up the issue and you'll find plenty of debate on the merits or otherwise of this new feature in 1.9. It took me by surprise.