hendra.dev - Supervisord and Job Retries

When you are deploying an application to production that includes a background queue, it is important to also setup a process monitor to make sure the worker stays up.

One of the most popular process monitor is supervisord, which also happens to be the one included in the Laravel guide on queues which makes it pretty much the default process monitor on a Laravel deployment.

Many background queue libraries, including the one that came with Laravel will automatically retry jobs that fails due to an exception. They usually also include a way to control how many times the jobs should be retried before giving up and moving it to the failed jobs store.

php artisan queue:listen connection-name --tries=3

One thing to keep in mind is that supervisord also includes a way to control the maximum number of attempts to start a process after serial crashes. When this limit is reached, it will stop trying to restart the process and the process will be put into the FATAL state. This number is controlled via the configuration entry called startretries, and it defaults to 3.

In most cases, one wouldn’t need to change this setting. But, if you set the number of retries on job queue to be larger than 3, you need to make sure that the startretries is also increased accordingly. If the startretries is set to a number that is less than the tries attempt, it could cause an issue where the supervisord retry limit is reached before the job retries limit is reached. When this happens, the process is put into a FATAL state, and the worker process doesn’t get started anymore, halting the entire job queue.

[program:laravel-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /home/forge/app.com/artisan queue:work sqs --sleep=3 --tries=5
autostart=true
autorestart=true
startretries=5
user=forge
numprocs=8
redirect_stderr=true
stdout_logfile=/home/forge/app.com/worker.log

So, remember, make sure to keep the queue worker tries equal to or smaller than the supervisord’s startretries to make sure the failing jobs get moved to the failed jobs queue before they halt the entire job queue. Of course this won’t help if the queue is full of jobs that keeps failing, but in that scenario, you should really take a look at your queue worker code anyway.