Hopping in the cloud
Hoptoad has been running on the Engine Yard cloud for more than a week now, with excellent performance. We've been looking forward to this for quite some time, and want to share a little about our motivations and experiences.
## Why'd we do it?
Currently, Hoptoad is a monolithic application, with the notification API, data API, and user-facing website all running in the same Rails application. This fits well into a server environment with several identical application servers and a shared database server.
However, this coupled application design has a few implications:
- Traffic in one component, like the exception notifier API, can adversely affect the performance of another component, like the web interface.
- You can't tune the resource allocation of one component, like the notifier API endpoint, independently of other components, like the web interface.
The performance characteristics of the web interface and the error notifier endpoint are drastically different. The notifier endpoint is a very performance-sensitive, write-heavy component that processes a few thousand requests per minute. It is responsible for validating the incoming error XML, applying any rate-limiting rules, and determining the correct error group to bucket the error into.
The web interface, on the other hand, is almost entirely database reads. It also has much lower traffic overall, compared to the notifier endpoint. However, it is probably the place where you'll notice performance dips the most.
Together, these properties make Hoptoad a ripe target for modularization. Our goal is to move toward a system where the notification endpoint is separated from the user-facing web UI, so that each can be scaled and tuned in isolation.
Having a flexible architecture will also allow us to move some processing steps into a separate queue, which opens up an interesting possibility for batch processing.
### Batch processing with utility slices
Hoptoad's focus is on identifying and reducing duplicate information into unique records. Currently, Hoptoad will insert your exception into the database, and then determine the group of similar exceptions that it would be assigned to. When your application issues a high volume of exceptions, this can result in a large number of INSERT and UPDATE statements. We've had to rate limit these cases, simply because the high INSERT/UPDATE rates would otherwise make the site unusable for other users. This is still less than ideal though, as high-volume bursts will lose some exception instances due to rate limiting.
But! We can use this to our advantage, and are working on a queue processing system that works on batches of exceptions at once, identifying duplicates in the Redis queue and folding them down before inserting into the database. This should dramatically ease the INSERT/UPDATE rate for high-volume exception situations, making Hoptoad much better equipped to handle bursts of duplicate exceptions.
For example, if your application's database goes down, your application may send thousands of exceptions to Hoptoad per minute. Currently, that would result in a similar rate of database INSERTs and UPDATEs to record and group each exception individually, which is very disk intensive. If these duplicates queue up over the course of a few seconds, the batch processor can precompute duplicate counts and fold hundreds of duplicates down into a handful of INSERT and UPDATE statements.
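To make the folding step concrete, here is a minimal Python sketch of collapsing a batch of queued notices into one count per group. It is not Hoptoad's actual code: the `Notice` fields, the `group_key` rule, and the `fold_batch` name are all hypothetical, and the batch is a plain list rather than a Redis queue.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Notice(NamedTuple):
    """One incoming exception notification (fields are hypothetical)."""
    project_id: int
    error_class: str
    message: str


def group_key(notice: Notice) -> tuple:
    # Assumed grouping rule: same project, error class, and message.
    return (notice.project_id, notice.error_class, notice.message)


def fold_batch(notices: Iterable[Notice]) -> dict:
    """Collapse a batch of notices into a duplicate count per group."""
    return dict(Counter(group_key(n) for n in notices))


# A burst of identical errors plus a couple of unrelated ones.
batch = (
    [Notice(1, "Mysql::Error", "Can't connect to server")] * 300
    + [Notice(1, "NoMethodError", "undefined method for nil")] * 2
)
folded = fold_batch(batch)
```

In this sketch the 302 queued notices fold down to two groups, so the database would see two writes (each carrying a count) instead of 302.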
The ability to add utility slices for longer-running reporting and processing tasks also opens up the door for some interesting features that could not currently be computed during the notification request/response lifecycle.
### Realistic benchmarking with environment cloning
With the clone environment operation available in the Engine Yard cloud dashboard (video), it's appealingly straightforward to performance test a new feature by duplicating your production environment, and forking live traffic to the clone in realtime using a tool like em-proxy to see how it performs.
We've hit a few stumbling blocks with this approach, mostly due to having a large-ish database (several hundred GB) to clone. We're currently in the process of benchmarking the addition of a Redis-backed worker queue to implement the "Batch processor workers" component of the above diagram.
## What did we learn?
The first time we planned to move over to the cloud, we ran into unplanned performance issues, and had to roll back to our prior hosting. We learned a few things from this, and our second cutover was smooth as silk.
### Realistic load testing is invaluable
We load tested the production configuration with synthetic traffic that approximated our live traffic. The synthetic load testing indicated a large amount of performance headroom.
Later, we load tested against live traffic in realtime, using em-proxy to fork live traffic over to the cloud environment in parallel to the previous environment, discarding the cloud responses. This tactic revealed a different performance picture, and allowed us to benchmark various hardware configurations and choose one with a great deal more confidence.
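The idea of forking traffic can be sketched in a few lines of Python. This is only an in-process illustration of the pattern, not em-proxy itself (which is a Ruby tool that works at the network level): the `make_forking_proxy`, `old_env`, and `cloud_env` names are hypothetical, and the shadow call here runs sequentially rather than in parallel.

```python
from typing import Callable

Handler = Callable[[str], str]


def make_forking_proxy(primary: Handler, shadow: Handler) -> Handler:
    """Return a handler that answers from primary and mirrors to shadow."""

    def handle(request: str) -> str:
        response = primary(request)
        try:
            shadow(request)  # the shadow's response is intentionally discarded
        except Exception:
            pass  # a failing shadow must never affect live users
        return response

    return handle


shadow_seen = []


def old_env(request: str) -> str:
    # Stand-in for the existing production environment.
    return f"old:{request}"


def cloud_env(request: str) -> str:
    # Stand-in for the new cloud environment under test.
    shadow_seen.append(request)
    return f"cloud:{request}"


proxy = make_forking_proxy(old_env, cloud_env)
reply = proxy("POST /notices")
```

The caller only ever sees the old environment's reply, while the cloud environment receives the same request and can be measured under real load.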
However, when we ran these live parallel load tests, we intentionally disabled email delivery so as not to deliver duplicate exception notifications to our customers. This could have left us blind to the performance of a critical component, and it's easy to assume that delivering an email will be a low-latency operation. By default, we were using ssmtp in our new configuration, which only runs in interactive mode. In that setup, our application would block on every SMTP delivery request for hundreds of milliseconds, which is much too long. We switched from ssmtp to exim as a queueing MTA in front of SendGrid, to minimize the time to deliver transactional emails.
### Buttressing the DNS cascade: swinging with iptables
When the scheduled cutover approached, we made sure to drop our DNS TTL, so that when it came time to repoint our DNS for hoptoadapp.com to the new server, the DNS change would cascade quickly. Engine Yard went one step further, though, and used iptables to redirect traffic from our old IP to the new IP, so that users would have a seamless experience after we brought up the cloud environment, regardless of whether the new DNS entry had reached them or not.
### Database caches are key
The first time we switched over, we roughly took the following steps:
- Use MySQL replication to keep the new cloud database in sync with the old production database.
- Put up a maintenance page on the old web server
- Allow replication to catch up
- Redirect traffic (DNS and iptables)
- Receive new traffic on the cloud environment
Once we opened the floodgates of traffic, the database thrashed as its query caches were completely cold. The caches began to fill up as we handled traffic, but the user experience was very poor, and we watched the response time grow to hundreds of seconds in NewRelic.
The second time we switched, we still kept the cloud database up to date with MySQL replication. However, we also removed the application's write privileges to the cloud database, and ran em-proxy to fork live traffic to the cloud application servers. This ensured that the read caches were full, all the way up to the point of cutover. When we completed the replication during downtime, we did not have to restart the database server, leaving the caches fat and happy, ready to serve normal traffic levels.
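The difference between the two cutovers can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not a model of MySQL's internals: `CachedDatabase` and `serve` are hypothetical names, the cache is a plain dictionary that never evicts, and the traffic is synthetic.

```python
class CachedDatabase:
    """Toy stand-in for a database with a read cache (not MySQL internals)."""

    def __init__(self, rows: dict):
        self.rows = rows
        self.cache: dict = {}
        self.slow_reads = 0  # reads that missed the cache

    def read(self, key: str) -> str:
        if key not in self.cache:
            self.slow_reads += 1
            self.cache[key] = self.rows[key]
        return self.cache[key]


def serve(db: CachedDatabase, requests: list) -> int:
    """Serve the requests and return how many of them missed the cache."""
    before = db.slow_reads
    for key in requests:
        db.read(key)
    return db.slow_reads - before


rows = {f"error-{i}": f"details {i}" for i in range(100)}
live_traffic = [f"error-{i % 100}" for i in range(1000)]

# First cutover: traffic arrives at a cold cache.
cold = CachedDatabase(rows)
cold_misses = serve(cold, live_traffic)

# Second cutover: mirrored reads warm the cache before the switch.
warm = CachedDatabase(rows)
serve(warm, live_traffic)
warm_misses = serve(warm, live_traffic)
```

In this model the cold database takes a slow read for every distinct row right as traffic arrives, while the pre-warmed one takes none after cutover.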
### Doing lots with lots of email
Hoptoad sends a reasonably large amount of email - tens of thousands of messages per day.
The switch to Engine Yard cloud also gave us a convenient time to reconsider our email delivery. We were previously using an internal Engine Yard SMTP server that is not available to cloud customers. We evaluated a variety of hosted transactional SMTP providers and, because we wanted the switch to be as low-impact as possible, went with SendGrid. We also checked out Postmark, which looks very promising. However, Postmark provides an HTTP API for mail delivery, and we decided to stick with an SMTP interface to minimize the impact on our codebase.
So that's where we're at. We're looking forward to improving the architecture as we handle more traffic, being able to add interesting features that take advantage of our new flexibility, and continuing to refine Hoptoad as a useful service.
What have your experiences been with hosting "in the cloud?" What have you learned? What benefits have you gained, or would you like to gain, with flexible hosting?