We’re pleased to announce our newest screencast video on Learn, Improving Performance for Real-time Requests.
This video, with thoughtbot CTO Joe Ferris, shows you how to tune performance using a real world example. Starting with an action that takes 3 seconds to respond and performs over 1600 database queries, Joe is able to reduce response time to 300ms and drop database usage to just 5 queries.
This video specifically focuses on what Joe calls ‘real-time requests’. That is, requests that must return all of their data to the user in the request-response cycle, without relying on background jobs or persistent caching.
Prime subscribers get this new video included with their monthly subscription. You can also purchase it individually for $15 or for your whole company for just $49.
Slow Rails applications annoy users and lead to lost revenue. Achieve a competitive advantage by making your application UNCOMFORTABLY FAST. Learn how today!
In the previous article, we explored different techniques to customize the look and feel of UIButton, assigning each a difficulty level based on the complexity of the Objective-C code involved in the implementation. What I intentionally left out, however, is that some of these methods come with non-trivial performance ramifications that should be taken into consideration when choosing one over another.
In order to understand how performance is affected, we need to have a closer look at the technology stack behind graphics in iOS. This block diagram represents the different frameworks and libraries and how they relate to each other:

In the topmost layer, there is UIKit—a high-level Objective-C framework that manages the graphical user interface in iOS. It is made up of a collection of classes, each corresponding to a specific UI control such as UIButton and UILabel. UIKit itself is built on top of Core Animation, a framework introduced in OS X Leopard and ported to iOS to power the smooth transitions that it later became known for.
Deeper in the stack we have OpenGL ES, an open-standard library for rendering 2D and 3D computer graphics on mobile devices. It is widely used for game graphics and powers both Core Animation and UIKit. The last piece in the software stack is Core Graphics—historically referred to as Quartz—which is a CPU-based drawing engine that made its debut on OS X. These two low-level frameworks are both written in the C programming language.
The bottom row in the diagram represents the hardware stack, composed of the graphics card (GPU) and the main processor (CPU).
We talk about hardware acceleration when the GPU is used for compositing and rendering graphics, as is the case for OpenGL and the Core Animation/UIKit implementations built on top of it. Until recently, hardware acceleration was a major advantage that iOS held over Android; most animations in the latter felt noticeably choppier as a result of its reliance on the CPU for drawing.
Offscreen drawing on the other hand refers to the process of generating bitmap graphics in the background using the CPU before handing them off to the GPU for onscreen rendering. In iOS, offscreen drawing occurs automatically in any of the following cases:
- The drawRect() method, even with an empty implementation.
- The shouldRasterize property set to YES.
- Masks (setMasksToBounds) and dynamic shadows (setShadow*).
- Group opacity (UIViewGroupOpacity).
As a general rule, offscreen drawing affects performance when animation is involved. You can inspect which parts of the UI are being drawn offscreen using Instruments with an iOS device:



Update: As Alex pointed out in the comments, you can also inspect offscreen rendering by checking the Debug > Color Offscreen-Rendered option in the iOS Simulator. Unless you are doing performance tests—which was the case here—using the simulator is the easiest and most straightforward way to inspect offscreen rendering.

Let’s now have a look at the performance footprint of each of the previously introduced approaches.
Customizing our button with a UIImage background relies entirely on the GPU for rendering the image assets saved on disk. The resizable background image variant is considered the least resource-hungry approach since it results in smaller app bundles and takes advantage of hardware acceleration when stretching or tiling pixels.
The CALayer-based method we implemented requires offscreen-drawing passes as it uses masking to render rounded corners. We also had to explicitly disable the animation that comes turned on by default when using Core Animation. Bottom line, unless you need animated transitions, this technique is not adequate for custom drawing.
The drawRect method relies on Core Graphics to do the custom drawing, but its main drawback lies in the way it handles touch events: each time the button is pressed, setNeedsDisplay forces it to redraw; not only once, but twice for every single tap. This is not a good use of CPU and memory, especially if there are multiple instances of our UIButton in the interface.
So, does this mean that using pre-rendered assets is the only viable solution? The short answer is no. If you still need the flexibility of drawing with code, there are techniques to optimize your code and reduce its performance footprint. One way is to generate a stretchable bitmap image and reuse it across all instances.
We’ll start by creating a new subclass of UIButton following the same steps detailed in the previous tutorial, then we’ll define our class-level static variables:
// In CBHybrid.m
#import "CBHybrid.h"
@implementation CBHybrid
// Resizable background image for normal state
static UIImage *gBackgroundImage;
// Resizable background image for highlighted state
static UIImage *gBackgroundImageHighlighted;
// Background image border radius and height
static int borderRadius = 5;
static int height = 37;
Next we will move our drawing code from drawRect in CBBezier to a new helper method, with a couple of changes: we will generate a resizable image instead of a full-sized one, then we will save the output to a static variable for later reuse:
- (UIImage *)drawBackgroundImageHighlighted:(BOOL)highlighted {
// Drawing code goes here
}
First, we need to get the width of our resizable image. For optimal performance, we want a 1pt stretchable area in the vertical center of the image.
float width = 1 + (borderRadius * 2);
The height matters less in this case, as long as the button is tall enough for the gradient to be visible. The value of 37pt was picked to match the height of the other buttons.
Moving on, we need a bitmap context to draw into, so let’s create one:
UIGraphicsBeginImageContextWithOptions(CGSizeMake(width, height), NO, 0.0);
CGContextRef context = UIGraphicsGetCurrentContext();
CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
Setting the second boolean argument to NO will ensure that our image context is not opaque. The last argument is for the scale factor (screen density). When set to 0.0 it defaults to the scale factor of the device.
The next block will be exactly like our previous Core Graphics implementation in CBBezier, save for updated values and the use of the highlighted argument instead of the default self.highlighted property:
// Gradient Declarations
// NSArray *gradientColors = ...
// Draw rounded rectangle bezier path
UIBezierPath *roundedRectanglePath = [UIBezierPath bezierPathWithRoundedRect: CGRectMake(0, 0, width, height) cornerRadius: borderRadius];
// Use the bezier as a clipping path
[roundedRectanglePath addClip];
// Use one of the two gradients depending on the state of the button
CGGradientRef background = highlighted? highlightedGradient : gradient;
// Draw gradient within the path
CGContextDrawLinearGradient(context, background, CGPointMake(140, 0), CGPointMake(140, height-1), 0);
// Draw border
// [borderColor setStroke...
// Draw Inner Glow
// UIBezierPath *innerGlowRect...
The only step we will need to add compared to CBBezier is a call that saves the output in a UIImage, followed by a call to UIGraphicsEndImageContext to clean up after ourselves.
UIImage* backgroundImage = UIGraphicsGetImageFromCurrentImageContext();
// Cleanup
UIGraphicsEndImageContext();
Now that we have a method to generate our background images, we will have to implement a common initializer method that will instantiate these images and set them up as the background for our CBHybrid instance.
- (void)setupBackgrounds {
// Generate background images if necessary
if (!gBackgroundImage && !gBackgroundImageHighlighted) {
gBackgroundImage = [[self drawBackgroundImageHighlighted:NO] resizableImageWithCapInsets:UIEdgeInsetsMake(borderRadius, borderRadius, borderRadius, borderRadius) resizingMode:UIImageResizingModeStretch];
gBackgroundImageHighlighted = [[self drawBackgroundImageHighlighted:YES] resizableImageWithCapInsets:UIEdgeInsetsMake(borderRadius, borderRadius, borderRadius, borderRadius) resizingMode:UIImageResizingModeStretch];
}
// Set background for the button instance
[self setBackgroundImage:gBackgroundImage forState:UIControlStateNormal];
[self setBackgroundImage:gBackgroundImageHighlighted forState:UIControlStateHighlighted];
}
We’ll proceed by setting the button type to custom and implementing initWithCoder (or initWithFrame if the button instance is created in code):
+ (CBHybrid *)buttonWithType:(UIButtonType)type
{
return [super buttonWithType:UIButtonTypeCustom];
}
- (id)initWithCoder:(NSCoder *)aDecoder {
self = [super initWithCoder:aDecoder];
if (self) {
[self setupBackgrounds];
}
return self;
}
To make sure that the new subclass is working properly, duplicate one of the buttons in Interface Builder and change its class to CBHybrid. Change the button content to CGContext-generated image then build and run.

The full subclass code can be found here.
When all is said and done, pre-rendered assets would still perform better than any code-based solution. Then again, there is much to gain in terms of flexibility and efficiency once Core Graphics is tamed, and a hybrid approach like the one we just covered would not affect performance to any noticeable degree on today’s hardware.
Update: Andy Matuschak, a member of the UIKit team, was kind enough to provide more clarifications about offscreen rendering as well as some good insights about cache-purging in the comments section.
This February 6th we are launching a new workshop: Advanced Rails. The workshop draws content from the Scaling Rails and Rails Antipatterns workshops, replacing them and creating best-of-breed content that will take your skill to the next level in creating well-crafted Rails applications that scale.
One of the topics we touch on is profiling and benchmarking your app. There are a number of tools available to achieve this, one of which is baked into Rails itself. Although we do discuss all of the great ways you can perform caching in a Rails app, experience has shown us that caching should be the last resort in your scaling strategy. Remember the two hardest things in computer science: cache invalidation, naming things and off-by-one errors.
On to benchmarking: say you have identified an expensive method in one of your models that needs to be tuned. One easy and straightforward way to measure your refactored code is to use the built-in benchmarker to run quick tests. To get set up, you need to add the ruby-prof gem to your Gemfile and have a properly patched Ruby interpreter. I’m using the gcdata patch for MRI 1.9.2:
rvm install 1.9.2-p180 --patch gcdata --name gcdata
Now let’s assume the following expensive method in the Account class:
class Account
def self.expensive_method
sleep(1)
end
end
We can now run a quick benchmark on that method by running it 10 times and taking some benchmarking measurements:
bundle exec rails benchmarker --runs 10 'Account.expensive_method'
Loaded suite script/rails
Started
BenchmarkerTest#test_10 (0 ms warmup)
wall_time: 0 ms
memory: 0 Bytes
objects: 0
gc_runs: 0
gc_time: 0 ms
BenchmarkerTest#test_user_expensive_method (1.10 sec warmup)
wall_time: 1.00 sec
memory: 0 Bytes
objects: 0
gc_runs: 0
gc_time: 0 ms
Finished in 24.933979 seconds.
You can even run a profiler with rails profiler 'Account.expensive_method' 10 flat and get more information on what’s being called and which components in your system are taking longer.
With this quick benchmark, you can now create a second, hopefully optimized, Account.expensive_method_fast and run them side-by-side, allowing you to quickly measure two implementations of the same behavior, and allowing you to quickly iterate to find the best solution.
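If you just want a rough side-by-side comparison without the patched interpreter, Ruby's standard Benchmark module can do it too. This is a minimal sketch, not the Rails benchmarker: the method bodies are placeholders (short sleeps so the script finishes quickly), and expensive_method_fast is the hypothetical optimized variant mentioned above.

```ruby
require "benchmark"

# Stand-in for the model; the sleeps are placeholders for real work.
class Account
  def self.expensive_method
    sleep(0.05)
  end

  # Hypothetical optimized variant to compare against.
  def self.expensive_method_fast
    sleep(0.01)
  end
end

RUNS = 5

# Benchmark.realtime returns the elapsed wall-clock time in seconds.
slow = Benchmark.realtime { RUNS.times { Account.expensive_method } }
fast = Benchmark.realtime { RUNS.times { Account.expensive_method_fast } }

puts format("expensive_method:      %.3f s over %d runs", slow, RUNS)
puts format("expensive_method_fast: %.3f s over %d runs", fast, RUNS)
```

This only measures wall time; it won't report memory or GC statistics the way the patched interpreter does.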
This is just the tip of the iceberg. If you have some Rails experience and want to take it to the next level to grow your app into a well-factored and scalable system, check out our new Advanced Rails workshop.
We’ve been working with a client who recently launched a new service. The launch entailed their marketing team sending batches of emails to a 1 million+ person mailing list over 2 days. In the email, there’s a link to the homepage.
The client wanted some confidence that the home page of the Rails app, which is hosted on Heroku, would be able to handle the load generated from that traffic.
They didn’t need a heavy-duty load test, just a little assurance. In turn, I wanted something that was quick to set up and execute.
To repeat, I don’t consider this a rigorous load test. For that, look at something like Blitz.io or Tsung. This is a quick-and-dirty alternative.
It doesn’t get quicker than Apache Bench:
man ab
The ab command I ended up with:
ab -n 50000 -c 50 -A user:password https://staging.ourapp.com/
50000 requests with 50 concurrent users. Basic auth is used on staging to keep the outside world from seeing the app before it’s unveiled. The trailing / is necessary.
I maxed out at 50 concurrent users because I read in Deploying Rails Applications by Ezra Zygmuntowicz that that’s about the most Apache Bench can reasonably simulate.
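To make the relationship between -n (total requests) and -c (concurrent users) concrete, here is a rough Ruby sketch of the same idea: a fixed pool of workers draining a shared queue of requests. This is not how ab is implemented, and the request here is a stand-in lambda that sleeps rather than a real HTTP call; run_load and fake_request are names I made up for illustration.

```ruby
require "benchmark"

# Run `total` calls of `request` with at most `concurrency` in flight.
# `request` is a stand-in callable; a real test would make an HTTP call.
def run_load(total:, concurrency:, request:)
  jobs = Queue.new
  total.times { |i| jobs << i }
  jobs.close # once drained, pop returns nil instead of blocking

  timings = Queue.new # thread-safe collection of per-request durations

  workers = Array.new(concurrency) do
    Thread.new do
      while (job = jobs.pop)
        timings << Benchmark.realtime { request.call(job) }
      end
    end
  end
  workers.each(&:join)

  Array.new(timings.size) { timings.pop }
end

# Placeholder "request" that takes about a millisecond.
fake_request = ->(_job) { sleep(0.001) }

timings = run_load(total: 200, concurrency: 50, request: fake_request)
puts "completed: #{timings.size} requests"
puts format("slowest:   %.1f ms", timings.max * 1000)
```

Each worker plays the role of one "concurrent user": it picks up the next request as soon as its previous one finishes, so the number in flight never exceeds the pool size.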
If I had been testing a particular workflow, I might have used the -C flag with a session value grabbed from a browser. That way, every test would use the same session. For this scenario, however, I wanted to generate a new session on each request because I was testing many new users hitting the home page.
To get more visibility into what was happening, I added a logging add-on:
heroku addons:upgrade logging:expanded --remote staging
While the tasks ran, I had a shell open tailing the log:
heroku logs -t --remote staging
It was mildly entertaining to watch the foreman-style logs fly by:
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.9 queue=0 wait=0ms service=49ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 app[web.6]: 2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.6 queue=0 wait=0ms service=156ms status=200 bytes=11323
2011-07-12T16:43:37+00:00 app[web.6]: 2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.2 queue=0 wait=0ms service=51ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 app[web.6]: Started GET "/" for 75.150.96.93 at 2011-07-12 09:43:37 -0700
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.15 queue=0 wait=0ms service=29ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.16 queue=0 wait=0ms service=58ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 heroku[router]: GET dev.testkitchenschool.com/ dyno=web.3 queue=0 wait=0ms service=159ms status=200 bytes=11323
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.7 queue=0 wait=0ms service=162ms status=200 bytes=11323
2011-07-12T16:43:37+00:00 app[web.7]: Started GET "/" for 75.150.96.93 at 2011-07-12 09:43:37 -0700
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.10 queue=0 wait=0ms service=73ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 app[web.10]: Started GET "/" for 75.150.96.93 at 2011-07-12 09:43:37 -0700
2011-07-12T16:43:37+00:00 heroku[router]: GET staging.ourapp.com/ dyno=web.12 queue=0 wait=0ms service=179ms status=200 bytes=11322
2011-07-12T16:43:37+00:00 app[web.3]: Started GET "/" for 75.150.96.93 at 2011-07-12 09:43:37 -0700
We use New Relic in production so I figured we should use it for these tests:
heroku addons:add newrelic:standard --remote staging
I started small: 5000 requests, 5 concurrent users, 2 dynos. Then, I added concurrent users until I could see the “request queuing” portion of the New Relic add-on:

The left-hand mountains represent when I got up to 4 dynos and was hitting the app with unlikely amounts of traffic. The green portion is the “request queuing” time.
The right-hand hills represent when I cranked the dynos up to 12 and was hitting the app with best-case scenario traffic (100% click-through rate on the emails) from three laptops. No request queuing time and pretty nice numbers:
Those numbers and the chart above come from what New Relic calls the “app server” stats. The “end user” stats look a little different:

You can see that even though we’re using the Rails asset pipeline for asset packaging, there’s still an opportunity to improve DOM processing and page rendering.
Ideally, we’d be under 2 seconds end user time.
However, this was enough information in combination with their historical email click-through rates to give the team confidence. In total, this took less than half an hour and most of that time was spent working on other things while the tests ran.
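For a sense of scale, here is the kind of back-of-the-envelope arithmetic that turns a click-through rate into an average request rate. It is a sketch under simplifying assumptions that are mine, not part of the original test: exactly 1 million recipients, one home page request per click, and clicks spread evenly over the 2 days. The 5% rate is purely illustrative, not the client's historical figure.

```ruby
# Average requests per second if clicks arrive evenly over the send window.
# Simplifying assumptions: one home page request per click, and no traffic
# spike right after each batch of emails goes out.
def average_requests_per_second(recipients:, click_through_rate:, window_days:)
  clicks = recipients * click_through_rate
  window_seconds = window_days * 24 * 60 * 60
  clicks / window_seconds.to_f
end

# Best case from the test above: every recipient clicks through.
best_case = average_requests_per_second(
  recipients: 1_000_000, click_through_rate: 1.0, window_days: 2
)

# A 5% click-through rate, chosen purely for illustration.
illustrative = average_requests_per_second(
  recipients: 1_000_000, click_through_rate: 0.05, window_days: 2
)

puts format("100%% click-through: %.2f requests/second", best_case)
puts format("5%% click-through:   %.2f requests/second", illustrative)
```

Real traffic is bursty, so peaks shortly after each batch will be higher than these averages.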
I didn’t add caching (page, action, fragment, or otherwise) at all. Split testing code already kept the homepage from being trivial to cache so if it wasn’t necessary, I wanted to avoid it. The data said it wasn’t necessary.
Written by Dan Croak.
At thoughtbot, we’re working on an exciting piece of code that will someday be shared with the rest of the world. Suffice it to say, it’s a very intensive bit of JavaScript code that stresses the boundaries of all browsers. After initially solving a lot of the performance problems by offloading a lot of the calculations to CSS instead of JavaScript (lots of the elements on the page affect the position of all the others), we were cruising along, only to hit a brick wall.
While performance in IE, Firefox on Windows, and Safari was acceptable (the performance of Safari has been running circles around the other browsers), performance on Firefox on Mac was incredibly poor. Amazingly so.
After the initial panic, I set about tracking down the precise cause of the performance issues. By commenting out large sections of the code, I was able to determine that we were calling offsetHeight on some DIVs repeatedly each time an event fired (and it fires a lot).
A quick Google search indicated that yes, some people have documented performance issues with offsetHeight (here and here).
While I can understand why offsetHeight is slow, I don’t understand why its performance in Firefox on Mac (MacBook Pro) was so much worse than in any other browser.