Chances are, some of you have run into the invalid byte sequence in UTF-8 error when dealing with user-submitted data. A Google search shows that my hunch isn’t off.
Among the search results are plenty of answers—some using the deprecated iconv library—that might lead you to a sufficient fix. However, among the slew of queries are few answers on how to reliably replicate and test the issue.
In developing the Griddler gem we ran into some cases where the data being posted back to our controller had invalid UTF-8 bytes. For Griddler, our failing case needs to simulate the body of an email having an invalid byte, and encoded as UTF-8.
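What are valid and invalid bytes? This table on Wikipedia tells us bytes 192, 193, and 245-255 are off limits. In a Ruby string literal we can embed a raw byte with an escape. Note that \255 is an octal escape, so it produces byte 0xAD (decimal 173) rather than 255. That byte is a UTF-8 continuation byte, which is just as invalid when it appears without a leading byte:
> "hi \255"
=> "hi \xAD"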
There’s our string with the invalid byte! How do we know for sure? In that IRB session we can simulate a comparable issue by sending the string a message it won’t like, such as split:
> "hi \255".split(' ')
ArgumentError: invalid byte sequence in UTF-8
  from (irb):9:in `split'
  from (irb):9
  from /Users/joel/.rvm/rubies/ruby-1.9.3-p125/bin/irb:16:in `<main>'
Yup. It certainly does not like that.
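A more direct check than waiting for a method to blow up is String#valid_encoding?. The sketch below builds the string from raw byte values with Array#pack, so the result does not depend on the file's source encoding, and then labels it as UTF-8. The variable names are only illustrative.

```ruby
# Build "hi " followed by the byte 0xAD (decimal 173).
# Array#pack('C*') returns a binary (ASCII-8BIT) string.
raw = [0x68, 0x69, 0x20, 0xAD].pack('C*')

# force_encoding relabels the bytes as UTF-8 without changing them.
suspect = raw.dup.force_encoding('UTF-8')

puts suspect.encoding         # UTF-8
puts suspect.bytes.to_a.last  # 173
puts suspect.valid_encoding?  # false: 0xAD is a continuation byte with no lead byte

# A plain ASCII string passes the same check.
clean = 'hi there'
puts clean.valid_encoding?    # true
```

Checking valid_encoding? in a test setup is a cheap way to confirm that a fixture string really is broken before relying on it.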
Let’s create a very real-world, enterprise-level, business-critical test case:
require 'rspec'

def replace_name(body, name)
  body.gsub(/joel/, name)
end

describe 'replace_name' do
  it 'removes my name' do
    body = "hello joel"
    replace_name(body, 'hank').should eq "hello hank"
  end

  it 'clears out invalid UTF-8 bytes' do
    body = "hello joel\255"
    replace_name(body, 'hank').should eq "hello hank"
  end
end
The first test passes as expected, and the second will fail as expected but not with the error we want. By adding that extra byte we should see an exception raised similar to what we simulated in IRB. Instead it’s failing in the comparison with the expected value.
1) replace_name clears out invalid UTF-8 bytes
   Failure/Error: replace_name(body, 'hank').should eq "hello hank"
     expected: "hello hank"
          got: "hello hank\xAD"
     (compared using ==)
   # ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'
Why isn’t it failing properly? If we pry into our running test we find out that inside our file the strings being passed around are encoded as ASCII-8BIT instead of UTF-8:
pry(#<RSpec::Core::ExampleGroup::Nested_1>)> body.encoding
=> #<Encoding:ASCII-8BIT>
As a result we’ll have to force that string’s encoding to UTF-8:
it 'clears out invalid UTF-8 bytes' do
  body = "hello joel\255".force_encoding('UTF-8')
  replace_name(body, 'hank').should_not raise_error(ArgumentError)
  replace_name(body, 'hank').should eq "hello hank"
end
By running the test now we will see our desired exception:
1) replace_name clears out invalid UTF-8 bytes
   Failure/Error: body.gsub(/joel/, name)
   ArgumentError: invalid byte sequence in UTF-8
   # ./invalid_byte_spec.rb:4:in `gsub'
   # ./invalid_byte_spec.rb:4:in `replace_name'
   # ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'

Finished in 0.00426 seconds
2 examples, 1 failure
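The difference between the two test runs comes down to the encoding label on the string, not its bytes. As a rough illustration (built from byte values so it should behave the same regardless of the file's source encoding), the same bytes pass through gsub when tagged ASCII-8BIT and raise when tagged UTF-8. The variable names here are made up for the example.

```ruby
# "hello joel" plus the stray byte 0xAD, as a binary (ASCII-8BIT) string.
binary_body = ('hello joel'.bytes.to_a + [0xAD]).pack('C*')

# Same bytes, relabelled as UTF-8.
utf8_body = binary_body.dup.force_encoding('UTF-8')

# In ASCII-8BIT every byte is a valid character, so gsub succeeds
# and the stray byte is carried along untouched.
replaced = binary_body.gsub(/joel/, 'hank')
puts replaced.bytes.to_a.last == 0xAD # true

# In UTF-8 the stray byte makes the string invalid, so the regex match raises.
begin
  utf8_body.gsub(/joel/, 'hank')
  outcome = 'no error'
rescue ArgumentError => e
  outcome = e.message
end
puts outcome
```

This is why force_encoding('UTF-8') was needed in the spec: without it, the failing case never exercised the code path that breaks in production.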
Now that we’re comfortably in the red part of red/green/refactor we can move on to getting this passing by updating our replace_name method:
def replace_name(body, name)
  body
    .encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
    .gsub(/joel/, name)
end
And the test?
Finished in 0.04252 seconds 2 examples, 0 failures
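One caveat worth knowing: converting from 'binary' treats every byte above 0x7F as undefined, so this fix also strips valid multibyte characters such as é. On Ruby 2.1 and later, String#scrub is a possible alternative that drops only the invalid bytes. The sketch below compares the two; replace_name_scrub is a hypothetical name used for illustration, not something taken from Griddler.

```ruby
# The approach from the post: reinterpret as binary, drop anything non-ASCII.
def replace_name(body, name)
  body
    .encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
    .gsub(/joel/, name)
end

# Hypothetical alternative (Ruby 2.1+): drop only the invalid byte sequences.
def replace_name_scrub(body, name)
  body.scrub('').gsub(/joel/, name)
end

# "café joel" as UTF-8 bytes, followed by the stray byte 0xAD.
bytes = [0x63, 0x61, 0x66, 0xC3, 0xA9] + ' joel'.bytes.to_a + [0xAD]
body = bytes.pack('C*').force_encoding('UTF-8')

puts replace_name(body, 'hank')        # caf hank
puts replace_name_scrub(body, 'hank')  # café hank
```

Which behavior is preferable depends on the data: stripping to ASCII is blunt but predictable, while scrub keeps legitimate non-ASCII text intact.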
For such a small piece of code we admittedly had to jump through some hoops. Through that process, however, we learned a bit about character encoding and how to put ourselves in the right position—through the red/green/refactor cycle—to fix bugs we will undoubtedly run into while writing software.
Today’s release of Ruby Science includes two new chapters. If you’re already reading Ruby Science, make sure to log into GitHub and download the latest version.
In this week’s updates, we cover composition and inheritance. You’ll learn about the uses and drawbacks of Single Table Inheritance (STI), as well as how to convert an STI hierarchy to use composition through polymorphic associations.
The book is a work in progress, and currently contains around 104 pages of content. A $49 purchase gets you access to the current release of the book, all future updates, and the companion example application. In addition, purchasers have the ability to send thoughtbot their toughest Ruby, Rails, and refactoring questions.
Get your copy of Ruby Science today.
For all the likes, shares, tweets, pokes, follows, and friends, there’s a fundamental core to the internet that, no matter how hard some might hope, will never go away—email. Rails has built-in support for outgoing mail with ActionMailer, but nothing on the omakase menu handles incoming mail. To help with that, we extracted Griddler from Trajectory and are now happy to release it—hot off the… ahem… presses.
Griddler is a Rails engine that provides an endpoint for the SendGrid Parse API. It hands off a preprocessed email object to a class implemented by you. We’re happy to look at pull requests that interface with other email services.
To get Griddler integrated with your app, add Griddler to your Gemfile.
Griddler automatically adds an endpoint to your routes table resembling the following:
post '/email_processor' => 'griddler/emails#create'
But you may copy, paste, and modify that anywhere else in your routes for the purposes of your application.
Once SendGrid posts to your endpoint, Griddler will take care of packaging up the important bits of that data and providing a nice Griddler::Email object for you. The contract we expect you to go in on with Griddler at this point is that you will implement a class called EmailProcessor, containing a class method called process, to which we will pass that packaged-up instance of Griddler::Email.
For example:
class EmailProcessor
  def self.process(email)
    # all of your application-specific code here - creating models,
    # processing reports, etc
  end
end
The email object contains several attributes. Some, such as subject, fall on the obvious side as to their purpose.
What isn’t entirely obvious (but very cool) is that Griddler helps you handle the email body by cleaning up replies and providing the important parts of an email before -- Reply ABOVE THIS LINE -- in the body attribute. Note that the reply delimiter is adjustable in the configuration. We keep raw_body around, as it contains everything before Griddler scrubs it into body, so that you may use the contents for other purposes.
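To make the relationship between raw_body and body concrete, here is a rough sketch that does not use Griddler itself. StubEmail is a hypothetical stand-in for the object Griddler passes in, and text_above_delimiter is an illustrative helper, not Griddler's actual reply-parsing code, which likely handles more cases.

```ruby
# Hypothetical stand-in exposing the attributes discussed above.
StubEmail = Struct.new(:subject, :body, :raw_body)

REPLY_DELIMITER = '-- Reply ABOVE THIS LINE --'

# Illustrative only: keep whatever precedes the delimiter.
def text_above_delimiter(raw_body, delimiter = REPLY_DELIMITER)
  raw_body.split(delimiter).first.to_s.strip
end

class EmailProcessor
  def self.process(email)
    # Application-specific work would go here; this sketch just summarises.
    "#{email.subject}: #{email.body}"
  end
end

raw = "Sounds good, ship it.\n\n-- Reply ABOVE THIS LINE --\n\nOriginal message..."
email = StubEmail.new('Re: Deploy', text_above_delimiter(raw), raw)

puts EmailProcessor.process(email) # Re: Deploy: Sounds good, ship it.
```

A stand-in like this can also be handy in unit tests for EmailProcessor, since process only depends on the attributes it reads.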
There is much more information in the Griddler README explaining the details, configuration, testing, and other bits.
If you like it, let us know what you think! As always, you can find the code on GitHub. We look forward to hearing all of the ways you use Griddler!
We have two new chapters to announce this week in Ruby Science. If you’re already reading Ruby Science, make sure to log into GitHub and download the latest version.
With this week’s updates, you’ll learn how to keep your classes from becoming junk drawers by learning to avoid Divergent Change. You’ll also see an example of using Convention Over Configuration to remove tedious boilerplate and avoid Duplicated Code.
The book is a work in progress, and currently contains around 82 pages of content. Purchasing the book also gets you access to the companion example application, as well as the ability to send thoughtbot your toughest Ruby, Rails, and refactoring questions.
If you haven’t already purchased it, you can still get access for the early purchase price of $39.
This Friday, the price will increase to $49.
Get your copy of Ruby Science today.