I need to get rid of occasional non-ASCII characters in otherwise plain ASCII text, such as 'curly quotes' like “ and ”. I don't know the real encoding of my source text, but I can tell that these characters show up as bytes outside the ASCII range, such as \x94 in hexadecimal escape notation.
Here is the regular expression I use to remove them:
str.gsub!(/[\x80-\xff]/, '')
I'm sure this won't work in many cases but with my text it does the job just fine.
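On Ruby 1.9 and later, strings and regular expressions carry an encoding, and the one-liner above can fail when the script itself is UTF-8. Here is a minimal sketch of the same idea for those versions, assuming the goal is still to drop every byte from 0x80 to 0xFF. The helper name strip_non_ascii is my own invention for illustration, not something built into Ruby.

```ruby
# Hypothetical helper (the name is an assumption, not a Ruby built-in).
# Removes every byte in the range 0x80-0xFF, whatever the real encoding is.
def strip_non_ascii(text)
  # Work on a binary copy so the regexp sees individual bytes,
  # regardless of the encoding label the string carries.
  bytes = text.dup.force_encoding(Encoding::BINARY)

  # The /n flag makes the regexp binary (ASCII-8BIT) as well.
  cleaned = bytes.gsub(/[\x80-\xff]/n, '')

  # Only ASCII bytes are left, so relabeling the result is safe.
  cleaned.force_encoding(Encoding::US_ASCII)
end

puts strip_non_ascii("He said \x93hi\x94")  # prints "He said hi"
```

Like the original, this works byte by byte, so a multi-byte UTF-8 character is removed entirely rather than replaced with a plain ASCII equivalent.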