We've been working with a client who recently launched a new service. The launch entailed their marketing team sending batches of emails to a 1 million+ person mailing list over 2 days. In the email, there's a link to the homepage.
The client wanted some confidence that the home page of the Rails app, which is hosted on Heroku, would be able to handle the load generated from that traffic.
They didn't need a heavy-duty load test, just a little assurance. In turn, I wanted something that was quick to set up and execute.
It doesn't get quicker than Apache Bench (ab). Here's the ab command I ended up with:
ab -n 50000 -c 50 -A user:password https://staging.ourapp.com/
That's 50,000 requests with 50 concurrent users. Basic auth is used on staging to keep the outside world from seeing the app before it's unveiled. The trailing / is necessary: ab rejects a URL that has no path.
I maxed out at 50 concurrent users because I read in Deploying Rails Applications by Ezra Zygmuntowicz that's about the most that apache bench can reasonably simulate.
If I were testing a particular workflow, I might have used the -C flag with a session cookie grabbed from a browser. That way, every request would use the same session. For this scenario, however, I wanted to generate a new session on each request because I was testing many new users hitting the home page.
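For reference, here is a minimal sketch of what the session-pinned variant could look like. ab's -C flag takes a cookie as name=value. The cookie name `_ourapp_session` and the placeholder value are assumptions for illustration; the real name depends on the app's session configuration. The function only prints the command so it can be reviewed before running.

```shell
# Print an ab command that sends the same session cookie on every request.
# The cookie name and value passed in are hypothetical; copy the real pair
# from a browser's cookie inspector.
session_ab_command() {
  cookie_name=$1
  cookie_value=$2
  url=$3
  # Single quotes around the cookie keep special characters in the
  # session value away from the shell when the command is run.
  printf "ab -n 5000 -c 5 -A user:password -C '%s=%s' %s\n" \
    "$cookie_name" "$cookie_value" "$url"
}

session_ab_command _ourapp_session PASTE_VALUE_FROM_BROWSER https://staging.ourapp.com/
```

Pipe the printed line to `sh` once it looks right.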
To get more visibility into what was happening, I added a logging add-on:
heroku addons:upgrade logging:expanded --remote staging
While the tests ran, I had a shell open tailing the log:
heroku logs -t --remote staging
It was mildly entertaining to watch the foreman-style logs fly by:
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.9 queue=0 wait=0ms service=49ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 app[web.6]:
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.6 queue=0 wait=0ms service=156ms status=200 bytes=11323
2011-07-12T16:43:37+00:00 app[web.6]:
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.2 queue=0 wait=0ms service=51ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 app[web.6]: Started GET "/" for 188.8.131.52 at 2011-07-12 09:43:37 -0700
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.15 queue=0 wait=0ms service=29ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.7 queue=0 wait=0ms service=162ms status=200 bytes=11323
2011-07-12T16:43:37+00:00 app[web.7]: Started GET "/" for 184.108.40.206 at 2011-07-12 09:43:37 -0700
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.10 queue=0 wait=0ms service=73ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 app[web.10]: Started GET "/" for 220.127.116.11 at 2011-07-12 09:43:37 -0700
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.12 queue=0 wait=0ms service=179ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 app[web.3]: Started GET "/" for 18.104.22.168 at 2011-07-12 09:43:37 -0700
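Watching the log scroll is fun, but the router lines also carry a service= time that can be summarized. The following is a rough sketch, not something from the original test run: a hypothetical helper named `avg_service_ms` that reads log text on stdin and averages every service=NNNms field it finds on router lines. It assumes the router log format shown above.

```shell
# Average the service= times reported on heroku[router] lines.
# Reads log text on stdin; app[...] lines are ignored.
avg_service_ms() {
  awk '
    /heroku\[router\]/ {
      # Scan every field so several entries on one physical line still count.
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^service=[0-9]+ms$/) {
          value = $i
          sub(/^service=/, "", value)
          sub(/ms$/, "", value)
          total += value
          count++
        }
      }
    }
    END {
      if (count > 0)
        printf "%d requests, %.1f ms average service time\n", count, total / count
    }
  '
}

# Example with two router lines taken from the output above.
printf '%s\n' \
  '2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.9 queue=0 wait=0ms service=49ms status=200 bytes=11322' \
  '2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.6 queue=0 wait=0ms service=156ms status=200 bytes=11323' \
  | avg_service_ms
```

Against a real app, something like `heroku logs --remote staging | avg_service_ms` (without -t, so the input ends and the summary prints) should work the same way.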
We use New Relic in production so I figured we should use it for these tests:
heroku addons:add newrelic:standard --remote staging
I started small: 5000 requests, 5 concurrent users, 2 dynos. Then, I added concurrent users until I could see the "request queuing" portion of the New Relic add-on:
The left-hand mountains represent when I got up to 4 dynos and was hitting the app with unlikely amounts of traffic. The green portion is the "request queuing" time.
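The ramp-up that produced those left-hand mountains can be sketched as a loop over concurrency levels. This is only an illustration: the starting point (5,000 requests, 5 concurrent users) comes from the test above, but the intermediate levels of 10 and 25 are assumptions, and `ramp_commands` is a hypothetical helper. It prints one ab command per level rather than running them, so each step can be checked against New Relic before moving to the next.

```shell
# Print one ab invocation per concurrency level, lowest first.
# Usage: ramp_commands URL LEVEL [LEVEL...]
ramp_commands() {
  url=$1
  shift
  for concurrency in "$@"; do
    printf 'ab -n 5000 -c %s -A user:password %s\n' "$concurrency" "$url"
  done
}

# Levels other than 5 and 50 are illustrative guesses.
ramp_commands https://staging.ourapp.com/ 5 10 25 50
```

Running the printed commands one at a time, and pausing between them, makes it easier to see which level first shows up as request queuing.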
The right-hand hills represent when I cranked the dynos up to 12 and was hitting the app with best-case-scenario traffic (a 100% click-through rate on the emails) from three laptops. There was no request queuing time, and the numbers were pretty nice:
- 5,250 requests per minute
- 50ms average response time
Those numbers and the chart above come from what New Relic calls the "app server" stats. The "end user" stats look a little different:
You can see that even though we're using the Rails asset pipeline for asset packaging, there's still an opportunity to improve DOM processing and page rendering.
Ideally, we'd be under 2 seconds end user time.
However, this was enough information in combination with their historical email click-through rates to give the team confidence. In total, this took less than half an hour and most of that time was spent working on other things while the tests ran.
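To make that comparison concrete, here is a back-of-the-envelope calculation. It is a sketch under loud assumptions: exactly 1,000,000 recipients, every one of them clicking once (100% click-through), and clicks spread evenly over the full two days. Real traffic arrives in bursts after each batch, so treat the result as a lower bound on peak load, not a prediction.

```shell
emails=1000000                  # mailing list size ("1 million+")
send_minutes=$((2 * 24 * 60))   # two days; even spread is an assumption
measured_rpm=5250               # requests per minute measured at 12 dynos

# One homepage request per email, integer division rounds down.
needed_rpm=$((emails / send_minutes))
headroom=$((measured_rpm / needed_rpm))

echo "needed: ${needed_rpm} rpm, measured: ${measured_rpm} rpm, headroom: ${headroom}x"
```

Under those assumptions the app needs to serve about 347 requests per minute, roughly one fifteenth of what it handled in the test, which leaves room for the bursts that batches create.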
Post Script: Right action, right time
I didn't add caching (page, action, fragment, or otherwise) at all. Split testing code already kept the homepage from being trivial to cache so if it wasn't necessary, I wanted to avoid it. The data said it wasn't necessary.