As we wrote just over two weeks ago, Hoptoad was having a hard time keeping up performance when certain websites were submitting thousands of errors at the same time.
Fixing this became our highest priority and, as I promised then, we will outline the changes we made that have helped us weather the error storm.
Delayed Counter Cache Updating
The biggest problem we were having was that when a lot of errors came in at the same time, the database backed up and slowed way down. Even though saving one particular record wasn't slow at all, saving thousands at once was problematic. It turns out that Rails was working against us here: the caching that normally speeds everything up was causing a bottleneck. Each time an error came in, we updated the counter cache (and some other caches) on the error's group record.
The problem was that each of those updates locks the group's row, and since a flood of one particular error wants to update that same row over and over, all the other error notifications had to wait for the one before them to finish. If you've ever stood in line for the bathroom, you know how impatient queueing can make you.
Thankfully, there's a way to use the counter caches without actually using the counter caches. You can have them in the database and even have ActiveRecord respect them when calling #size, but at the same time not actually update the columns. Simply pass :counter_cache => false as an option to the association.
Of course this doesn't get you the actual caching you want. To do that, we now have a rake task that we run every minute that updates these counter caches. Fortunately it's pretty speedy, so we don't have to worry about it overloading everything. The task counts up the errors that came in since the last time it ran and updates the counter caches on the related error groups accordingly.
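The batching idea behind that task can be sketched in plain Ruby. This is only an illustration, not our actual rake task: the Notice struct, the cached_counts hash, and refresh_counter_caches are hypothetical stand-ins for the notices table, the counter column on the group, and the task body.

```ruby
# Hypothetical in-memory stand-ins for the notices table and the cached
# counter column on error groups; none of these names come from Hoptoad.
Notice = Struct.new(:group_id, :created_at)

# Adds the number of notices created in [since, now) to each group's cached
# count and returns `now`, the checkpoint to pass in as `since` next time.
def refresh_counter_caches(notices, cached_counts, since, now)
  new_per_group = Hash.new(0)
  notices.each do |notice|
    next unless notice.created_at >= since && notice.created_at < now
    new_per_group[notice.group_id] += 1
  end

  # One write per group, however many notices arrived for it.
  new_per_group.each do |group_id, count|
    cached_counts[group_id] = cached_counts.fetch(group_id, 0) + count
  end

  now
end

start = Time.at(0)
notices = [
  Notice.new(1, start + 5),
  Notice.new(1, start + 10),
  Notice.new(2, start + 20),
  Notice.new(1, start + 70)
]

cached_counts = {}
# First run covers the first minute, second run picks up where it left off.
checkpoint = refresh_counter_caches(notices, cached_counts, start, start + 60)
checkpoint = refresh_counter_caches(notices, cached_counts, checkpoint, start + 120)
```

The point of the sketch is that each group is written once per run rather than once per incoming notice, so a flood of one error no longer queues up behind a single row.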
Extra Saves and Database Queries
To assist with identifying performance issues we've been using New Relic. When you host with Engine Yard, you get a Bronze-level account for free, but we've upgraded our account to Silver in order to get transaction traces.
Transaction traces allow you to see the specific SQL that's being called for a slow action. As a result of working with New Relic, we were able to identify that an after_create callback was causing an unnecessary extra save of each error, and that a stray call to current_user could be removed. While neither of these was a huge problem, removing them gives us the breathing room we need when traffic spikes.
Once we had made those changes, we saw a fairly dramatic decrease in the latency of the error creation action (the green vertical lines are our deploys of these changes).
Changed the Error Matching Mechanism
Hoptoad groups duplicate notices so that you don't get bombarded with e-mails when a single issue causes hundreds of exceptions. When grouping notices, Hoptoad identifies unique exceptions based on their error class, file, line number, action, controller, and Rails environment. This isn't particularly complicated behavior, but it resulted in a search for all those properties every time a new exception came in. Eventually, the indexes for this operation started to get out of hand, and operations on notice groups started to get expensive.
In order to ease the congestion, we came up with a hashing mechanism to throw out most of the notices right away when matching. Each notice now has a fingerprint that results from its unique columns, so most notices can be ignored as potential matches quickly by comparing fingerprints. This allowed us to remove several indexes, speeding up inserts and other selects on that table.
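As a rough sketch of the fingerprint idea (not our exact implementation), the grouping properties can be hashed into a single value that is compared first. The field names, the choice of SHA1, and the fingerprint helper below are all assumptions made for illustration.

```ruby
require 'digest/sha1'

# Hypothetical field names, based on the grouping properties listed above.
FINGERPRINT_FIELDS = [:error_class, :file, :line_number,
                      :action, :controller, :rails_env]

# Builds one digest from all of the grouping fields. The fields are joined
# with a NUL byte so that adjacent values cannot run together ambiguously.
def fingerprint(notice)
  values = FINGERPRINT_FIELDS.map { |field| notice[field].to_s }
  Digest::SHA1.hexdigest(values.join("\0"))
end

first = {
  :error_class => 'NoMethodError',
  :file        => 'app/models/user.rb',
  :line_number => 42,
  :action      => 'show',
  :controller  => 'users',
  :rails_env   => 'production'
}
duplicate  = first.dup
other_line = first.merge(:line_number => 43)

same_group      = fingerprint(first) == fingerprint(duplicate)
different_group = fingerprint(first) != fingerprint(other_line)
```

With a value like this stored on each notice, a lookup can compare one indexed column instead of six, which is what makes it possible to drop the wider indexes.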
Things Are Looking Up
Since we've made these changes, we have been able to successfully weather several surges of error traffic, and performance was not adversely affected.
While things are looking good now, we're keeping our eye on the ball and have other changes in the works to make sure that Hoptoad's performance keeps pace with the number of errors you all are creating out there.