Graceful Switching of Worker Processes

Oli Peate

A recent client project featured a subtle misconfiguration which left the application susceptible to losing jobs performed by worker processes. This post explores the issue and builds a demonstration on Heroku.

A Primer: Why Use Worker Processes?

When a user makes an HTTP request, it's sometimes necessary to perform a slow task like sending an email or generating a report. If this slow task is performed by the web server before sending an HTTP response, the user can be left staring at a blank page. That's a poor experience for the user.

It’s also accompanied by a bunch of potential headaches: What if the request times out? And what happens if the user hits refresh in their browser?

To keep a web application responsive for users it’s common to offload slow tasks to worker processes which are distinct from the web processes handling HTTP requests & responses. This is outlined in the twelve-factor app guide:

[…] the developer can architect their app to handle diverse workloads by assigning each type of work to a process type. For example, HTTP requests may be handled by a web process, and long-running background tasks handled by a worker process

Using worker processes can provide a slicker user experience: the web process returns an HTTP response quickly, stating the task is underway, and the user is later updated with the outcome of the asynchronous task. Behind the scenes, one or more workers crunch through the tasks, which are passed to them via a job queue.
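
To make the pattern concrete, here's a minimal sketch using Sidekiq (the ReportWorker class, Report model and generate! method are hypothetical names for illustration):

# Enqueued from the web process; performed later by a worker process.
class ReportWorker
  include Sidekiq::Worker

  def perform(report_id)
    Report.find(report_id).generate! # the slow task happens here, off the request path
  end
end

# In the web process (e.g. a controller action):
#   ReportWorker.perform_async(report.id) # pushes a job onto the queue and returns immediately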

Offloading tasks to workers is particularly important if your application runs on a Platform-as-a-Service (PaaS) like Heroku or Google App Engine because request durations are capped at the routing layer. Slow responses to HTTP requests will be timed out automatically. For example this timeout is thirty seconds on Heroku.

Reliability Is Key

In adopting this pattern it's important that the background job processing system is robust. We don't want jobs to be lost, or to crash without completing their task. Many background processing frameworks, like Sidekiq or Faktory, feature excellent retry mechanisms, integrations with exception trackers, and web dashboards to help monitor and maintain reliability.

One under-appreciated aspect of reliable job processing is having a clean switchover when a new version of the code is deployed. Teams embracing continuous deployment may release new code to production anywhere from tens to hundreds of times per day.

Often new code is deployed for both web and worker processes at the same time, even if only the web-related code has changed, to ensure compatibility. This involves stopping the existing processes running the previous version and starting fresh processes with the new version.

As we are about to discover, the switching technique is important…

[Image: A graceful switch by Indiana Jones in Raiders of the Lost Ark]

& Now… A Problem

The misconfiguration in question lay in the Procfile, a common file format used on platforms like Heroku to describe which process types should be run as part of the application. The worker definition was launching two processes:

web: bundle exec puma -C ./config/puma.rb
worker: bundle exec sidekiq -q critical & bundle exec sidekiq -q default

What’s happening on line #2 is:

  1. The first Sidekiq process (handling the ‘critical’ queue) is started in the background
  2. The second Sidekiq process (handling the ‘default’ queue) is then started

The & informs the shell to run the preceding command in the background (a feature of Unix job control). Background in this context relates to the execution of the Unix process, not to be confused with ‘background’ workers, as in the code which is being run. Author mops brow as the overlapping terms pile up left, right and center. Deep breaths.
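
For a rough sense of what that worker line does, here's a simplified Ruby sketch of the shell's behaviour (real shell job control does more than this):

# Simplified sketch of what the shell does for `./tick.rb & ./tock.rb`
tick_pid = Process.spawn("./tick.rb") # backgrounded: the shell does not wait for this one
tock_pid = Process.spawn("./tock.rb") # started straight afterwards, in the foreground
Process.wait(tock_pid)                # only the foreground command is waited on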

What's wrong with this? When a new version of the code is deployed, the existing processes won't be shut down gracefully. Graceful in this context means the work is either finished within the grace period or put back on the queue to be picked up by another worker.

If the two worker processes don't shut down gracefully then there's potential for jobs to be lost between deployments. Running workers in the background can also lead to crashed processes not being restarted by the platform.

Is This Really a Problem? Prove It!

For this experiment highly sophisticated tick & tock worker scripts are required. Here’s tick.rb (and you can imagine how tock.rb works):

#!/usr/bin/env ruby

STDOUT.sync = true

def info
  "PID=#{Process.pid} WORKER_VERSION=1"
end

puts "TICK: Starting process... #{info}"

trap "TERM" do
  puts "TICK: Received SIGTERM #{info}"
  sleep 5 # Pretend to be finishing some hard work
  puts "TICK: Graceful shutdown #{info}"
  exit
end

loop do
  puts "TICK #{info}"
  sleep 2 # Pretend to be doing hard work
end

If I run this script locally in a terminal window:

chmod +x ./tick.rb
./tick.rb

It produces output like this indefinitely:

TICK: Starting process... PID=10346 WORKER_VERSION=1
TICK PID=10346 WORKER_VERSION=1
TICK PID=10346 WORKER_VERSION=1
TICK PID=10346 WORKER_VERSION=1

If one terminates the tick process using its PID (in another terminal window), noting that kill sends SIGTERM by default:

kill 10346

Then the original terminal window displays:

TICK: Received SIGTERM PID=10346 WORKER_VERSION=1
TICK: Graceful shutdown PID=10346 WORKER_VERSION=1

The first line is displayed immediately - showing the request for termination has been received by the trap line of the Ruby script. The second line is displayed five seconds later, which is the delay the script requested. This proves the tick process was given time to complete its ‘hard work’.

Lift-off: To the Cloud!

Now it’s time to deploy tick and tock to Heroku. First up craft a Procfile like this:

worker: ./tick.rb & ./tock.rb

After committing the Procfile along with tick.rb and tock.rb, the app is ready to be deployed. Once the worker process type is scaled to one dyno, the Heroku logs should show alternating lines from the tick and tock processes:

app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1

This works as expected until it’s time to deploy the next version of the workers:

app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[api]: Build started by user oli@redacted.com
app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[api]: Deploy 46be478c by user oli@redacted.com
app[api]: Release v2 created by user oli@redacted.com
heroku[worker.1]: Restarting
heroku[worker.1]: State changed from up to starting
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[worker.1]: TOCK: Received SIGTERM PID=6 WORKER_VERSION=1
heroku[worker.1]: Stopping all processes with SIGTERM
heroku[worker.1]: Process exited with status 143
heroku[worker.1]: Starting process with command `./tick.rb & ./tock.rb`
heroku[worker.1]: State changed from starting to up
app[worker.1]: TICK: Starting process... PID=5 WORKER_VERSION=2
app[worker.1]: TICK PID=5 WORKER_VERSION=2
app[worker.1]: TOCK: Starting process... PID=6 WORKER_VERSION=2
app[worker.1]: TOCK PID=6 WORKER_VERSION=2

The exit status was 143, which indicates the process was terminated by SIGTERM rather than exiting cleanly (by convention the status is 128 plus the signal number, and SIGTERM is signal 15). More importantly, there's no graceful shutdown message from either the tick or tock process. If the processes had been handling real jobs, those jobs may have been lost.

To address this the Procfile needs to be updated to declare two separate process types:

tick: ./tick.rb
tock: ./tock.rb

After scaling these new process types to one dyno each, it’s time to deploy yet another new version to observe the deployment transition:

app[tick.1]: TICK PID=4 WORKER_VERSION=3
app[tock.1]: TOCK PID=4 WORKER_VERSION=3
app[api]: Build started by user oli@redacted.com
app[tick.1]: TICK PID=4 WORKER_VERSION=3
app[tock.1]: TOCK PID=4 WORKER_VERSION=3
app[api]: Deploy 3546fe29 by user oli@redacted.com
app[api]: Release v4 created by user oli@redacted.com
heroku[tick.1]: Restarting
heroku[tick.1]: State changed from up to starting
heroku[tock.1]: Restarting
heroku[tock.1]: State changed from up to starting
heroku[tick.1]: Stopping all processes with SIGTERM
heroku[tock.1]: Stopping all processes with SIGTERM
app[tick.1]: TICK: Received SIGTERM PID=4 WORKER_VERSION=3
app[tock.1]: TOCK: Received SIGTERM PID=4 WORKER_VERSION=3
heroku[tick.1]: Starting process with command `./tick.rb`
heroku[tock.1]: Starting process with command `./tock.rb`
heroku[tick.1]: State changed from starting to up
heroku[tock.1]: State changed from starting to up
app[tick.1]: TICK: Starting process... PID=4 WORKER_VERSION=4
app[tick.1]: TICK PID=4 WORKER_VERSION=4
app[tock.1]: TOCK: Starting process... PID=4 WORKER_VERSION=4
app[tock.1]: TOCK PID=4 WORKER_VERSION=4
app[tick.1]: TICK PID=4 WORKER_VERSION=4
app[tick.1]: TICK: Graceful shutdown PID=4 WORKER_VERSION=3
app[tock.1]: TOCK: Graceful shutdown PID=4 WORKER_VERSION=3
heroku[tick.1]: Process exited with status 0
heroku[tock.1]: Process exited with status 0
app[tick.1]: TICK PID=4 WORKER_VERSION=4
app[tock.1]: TOCK PID=4 WORKER_VERSION=4

Hooray! Now the graceful shutdown message is printed for both the tick and tock worker processes. The exit status of both outgoing processes was 0, indicating there was no error. This means Heroku waited five seconds for the workers to shut down (as requested by our scripts).

On Heroku you get up to 30 seconds to shut down a process:

The application processes have 30 seconds to shut down cleanly (ideally, they will do so more quickly than that). During this time they should stop accepting new requests or jobs and attempt to finish their current requests, or put jobs back on the queue for other worker processes to handle. If any processes remain after that time period, the dyno manager will terminate them forcefully with SIGKILL.
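
For a hand-rolled worker, that guidance might translate into something like the sketch below. The in-memory queue and fake jobs stand in for a real job queue; frameworks like Sidekiq implement this behaviour for you.

# Sketch of a hand-rolled worker loop honouring the grace period.
queue = Queue.new
5.times { |i| queue << i } # fake jobs standing in for a real job queue

shutdown = false
trap "TERM" do
  shutdown = true # just record the request; wind down outside the trap handler
end

until shutdown || queue.empty?
  job = queue.pop            # stop taking new jobs once shutdown has been requested
  sleep 1                    # pretend the job is slow; an in-flight job is allowed to finish
  puts "Processed job #{job}"
end
# Anything still on the queue is untouched and can be picked up by the replacement worker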

Multiple Processes Per Dyno

Why were two processes being declared together on one line of the Procfile? This decision was probably made to increase the resource utilization of the dyno running the worker process, which can be more cost-effective than paying for more dynos that are less heavily utilized. In practice this implementation made the system less robust, so I'd argue it's not a good trade-off.

It is safe to run multiple processes per dyno if there's a master process to coordinate a bunch of child processes. This is how the Puma web server runs in clustered mode. Sidekiq can also coordinate multiple processes in its enterprise version.
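
For example, a config/puma.rb along these lines puts Puma into clustered mode, with the master process supervising its child workers (the numbers here are illustrative):

# config/puma.rb (illustrative values)
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2)) # a worker count above zero enables clustered mode
threads 5, 5                                     # min and max threads per child process
preload_app!                                     # load the app in the master before forking children

The important property is that a single supervising process is responsible for the lifecycle of its children, rather than a shell backgrounding unrelated processes.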

Time for a Checkup

It's worth checking that any long-running process, not just job queue workers, exits gracefully when it receives a signal to shut down. Time to go review the Procfiles (and equivalents) in your codebases!