Resilient Queue Handlers
Posted on 14 March 2026 by Mirko Janssen — 7 min

I remember the first time I thought a background worker was "done." The queue was draining, the jobs were being processed, the logs looked clean. Shipped. A few days later, something went wrong in production — a downstream service timed out halfway through a job, and suddenly we had records in a half-updated state, no clear way to tell which jobs had actually completed, and a retry mechanism that cheerfully kept re-running handlers that were never built to run twice. Working with message queues taught me early on that a handler that merely does its job can quickly become a liability in production.
That experience made me realise something that sounds obvious in retrospect: "runs" and "resilient" are two very different things. A handler can work perfectly in happy-path conditions and still fall apart the moment something unexpected happens. In distributed systems, unexpected things are not the exception, they are the norm.
What finally pushed me to write this down was stumbling across Temporal not too long ago. I came across it almost by accident and was surprised by how elegantly it approaches the problem of long-running, multi-step workflows. But to really appreciate what makes it interesting, you first need to understand the fundamentals it builds on. That is what this post is about.
The Problem with "It Works"
Sooner or later, everyone working with distributed systems ends up at the same question: what actually happens when something goes wrong in the middle? Not before a job starts, and not cleanly after it finishes — but right in the middle, when half the work is done. A payment partially booked. A notification sent but the database not yet updated. An external API call completed but the job marked as failed and therefore queued for retry.
The instinctive answer is often "we just retry on failure" — reasonable, but not enough on its own. Retrying a handler that isn't designed for it can make things worse rather than better. I started with Kafka and BullMQ and later got to know RabbitMQ as well. In every case, the queue itself was never the problem; when something broke, it was the handler behind it. The queue just faithfully delivered the message. What happened next was entirely up to the code on the other side.
The Fundamentals of Resilient Handlers
I don't want to go too far into the specifics of any particular queue system here, because the core ideas apply broadly regardless of whether you're using BullMQ, RabbitMQ, or something else entirely. What I do want to talk about are four concepts that, in my experience, make the real difference between a handler that works and one that holds up.
Idempotency
The most important property a resilient handler can have is idempotency: the guarantee that running the same job twice has the same effect as running it once. In any system with retries, which every serious system should have, this isn't optional. Queue systems typically offer at-least-once delivery by design, meaning your handler will occasionally run more than once for the same message.
What this means in practice depends entirely on what your handler does. Reading data and returning a result is trivially safe to repeat. But a handler that charges a credit card, reserves an inventory item, or triggers a shipment is not, unless you explicitly design it that way. If a job processing a purchase fails halfway and gets retried, you don't want the customer charged twice. If a job provisioning a new user crashes after creating the account but before finishing setup, a blind retry shouldn't create a second one. The consequences range from mildly annoying (a duplicate confirmation email) to genuinely damaging (a double charge, corrupted stock counts, or inconsistent data across services).
The good news is that this is mostly a matter of how you structure the logic. Most operations can be made safe to repeat through database constraints, conditional checks before side effects, or writing each step so its outcome is stable when applied more than once. Queue systems often provide helpful mechanisms here, but the responsibility ultimately lies with the handler itself.
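To make the "conditional check before side effects" idea concrete, here is a minimal sketch of an idempotent payment handler. The job ID doubles as an idempotency key, so a retried job becomes a no-op. The `chargeCard` function and the in-memory `charges` map are stand-ins I made up for a real payment API and a database table with a unique constraint on the key; they are not from any specific library.

```typescript
type Job = { id: string; customerId: string; amountCents: number };

// Stand-in for a database table with a unique constraint on the job ID.
const charges = new Map<string, { customerId: string; amountCents: number }>();
let cardChargeCount = 0;

// Stand-in for the external, non-idempotent side effect.
function chargeCard(customerId: string, amountCents: number): void {
  cardChargeCount++;
}

function handlePayment(job: Job): void {
  // Conditional check before the side effect: if this job already ran to
  // completion, do nothing. With a real database you would rely on the
  // unique constraint rather than a Map lookup.
  if (charges.has(job.id)) return;

  chargeCard(job.customerId, job.amountCents);
  charges.set(job.id, { customerId: job.customerId, amountCents: job.amountCents });
}

// The same job delivered twice (at-least-once delivery) charges the card once.
const job = { id: "job-42", customerId: "cust-7", amountCents: 1999 };
handlePayment(job);
handlePayment(job);
console.log(cardChargeCount); // 1
```

In real code, the check and the recording of the key should happen atomically (for example, an insert guarded by the unique constraint) so two concurrent deliveries can't both pass the check.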
Retry-Capable Queues
Most mature queue systems have retry mechanisms built in, whether you're using Kafka, BullMQ, or RabbitMQ. But there's more to getting retries right than just enabling them. Two things matter a lot: exponential backoff, so that a struggling downstream service isn't immediately hammered with repeated requests, and a sensible maximum attempt count, so that a job that genuinely cannot succeed doesn't loop forever. Both of these are usually configurable, but if you don't think about them, they can lead to some interesting moments ;-).
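The two settings can be sketched in a few lines. The base delay, cap, jitter strategy, and `maxAttempts` value below are illustrative choices of mine, not defaults from any particular queue system.

```typescript
// Exponential backoff with full jitter: the delay window doubles each
// attempt (1s, 2s, 4s, ...) up to a cap, and the actual wait is a random
// value inside that window so retries don't all hit at the same instant.
function backoffDelayMs(attempt: number, baseDelayMs = 1000, capMs = 60_000): number {
  const windowMs = Math.min(capMs, baseDelayMs * 2 ** (attempt - 1));
  return Math.floor(Math.random() * windowMs);
}

const maxAttempts = 5;

// After maxAttempts the job should go to the dead letter queue instead.
function shouldRetry(attempt: number): boolean {
  return attempt < maxAttempts;
}

for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  const windowMs = Math.min(60_000, 1000 * 2 ** (attempt - 1));
  console.log(`attempt ${attempt}: wait up to ${windowMs}ms, retry afterwards: ${shouldRetry(attempt)}`);
}
```

In BullMQ, for instance, this lives in the job options (`attempts` plus a `backoff` setting); RabbitMQ setups typically build it from TTLs and dead letter exchanges. Either way, the shape of the policy is the same as the sketch above.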
Explicit Status Tracking

This one gets overlooked surprisingly often, maybe because it feels almost too simple. The queue system knows whether a job is sitting in the queue, but it has no idea what happened inside your handler. A status field in the database changes that. Suddenly you can answer questions like "which jobs are stuck in processing?", "which ones failed three times and need attention?", or simply "did this one actually complete?". In debugging sessions after a production incident, that field has saved me more time than I can count.
But I'd push this a step further: it's not just the overall job status (pending, processing, completed, failed) that matters, but the individual steps within it. A job that processes an order might do several things in sequence — charge the payment, reserve the inventory, trigger the fulfillment, send the confirmation. Tracking which of those steps have succeeded, not just whether the job finished overall, makes a real difference when something goes wrong. If you know the payment went through but the fulfillment was never triggered, you can retry exactly the right piece without risking a double charge. Without that information, you're left guessing.
This kind of step-level tracking also pairs naturally with structured logging. A log entry per completed step, with a consistent job ID and enough context to reconstruct what happened and in what order, turns an incident investigation from archaeology into a straightforward read-through. The combination of explicit intermediate state in the database and structured logs is, in my experience, one of the most underrated things you can build into a background job system.
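A sketch of what step-level tracking for the order job above could look like. The `jobSteps` map stands in for a database table keyed by job ID, and the step names and `runStep` helper are my own illustration, not part of any framework: each step checks whether it already completed before running, which gives you per-step idempotency and lets a retry resume exactly where the previous attempt stopped.

```typescript
type StepName = "charge_payment" | "reserve_inventory" | "trigger_fulfillment" | "send_confirmation";

const jobSteps = new Map<string, Set<StepName>>(); // stand-in for a DB table
const log: string[] = []; // stand-in for structured log entries

function runStep(jobId: string, step: StepName, effect: () => void): void {
  const done = jobSteps.get(jobId) ?? new Set<StepName>();
  jobSteps.set(jobId, done);
  if (done.has(step)) {
    log.push(`${jobId} skip ${step}`); // completed on a previous attempt
    return;
  }
  effect();
  done.add(step); // record completion only *after* the side effect succeeds
  log.push(`${jobId} done ${step}`);
}

let failFulfillment = true; // simulate a downstream outage on the first attempt

function processOrder(jobId: string): void {
  runStep(jobId, "charge_payment", () => {});
  runStep(jobId, "reserve_inventory", () => {});
  runStep(jobId, "trigger_fulfillment", () => {
    if (failFulfillment) throw new Error("downstream timeout");
  });
  runStep(jobId, "send_confirmation", () => {});
}

// First attempt fails at fulfillment; payment and inventory are recorded.
try { processOrder("order-1"); } catch {}
failFulfillment = false;
// The retry skips the completed steps and finishes the rest: no double charge.
processOrder("order-1");
console.log(log);
```

The log alone tells the whole story of the job: two steps done, a gap where the crash happened, two skips on the retry, then the remaining steps. That is the "straightforward read-through" instead of archaeology.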
Dead Letter Queues
Even with retries and backoff, some jobs will never succeed: the data is malformed, the downstream service is gone, or there's a bug in the handler itself. A Dead Letter Queue is the place where these jobs land once they've exhausted their retry attempts. The alternative is silently dropping them, which is almost always the wrong choice. Having a Dead Letter Queue gives you visibility: you can inspect failed jobs, understand why they failed, fix the underlying issue, and reprocess them manually if needed. It's not glamorous, but it's the difference between a system that fails quietly and one that fails in a way you can actually respond to.
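The routing logic itself is small. This sketch uses an in-memory array as the dead letter queue and an inline retry loop; in a real system the queue broker would handle both, and the `maxAttempts` value and job shape here are illustrative.

```typescript
type FailedJob = { id: string; payload: unknown; attempts: number; lastError: string };

const deadLetterQueue: FailedJob[] = []; // stand-in for a real DLQ
const maxAttempts = 3;

function processWithRetries(id: string, payload: unknown, handler: (p: unknown) => void): void {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(payload);
      return; // success
    } catch (err) {
      if (attempt === maxAttempts) {
        // Retries exhausted: park the job together with its last error for
        // inspection and manual reprocessing, instead of dropping it silently.
        deadLetterQueue.push({ id, payload, attempts: attempt, lastError: String(err) });
      }
    }
  }
}

// A malformed job that can never succeed lands in the DLQ after 3 attempts.
processWithRetries("job-9", { userId: null }, (p) => {
  if ((p as { userId: unknown }).userId == null) throw new Error("malformed payload: missing userId");
});
console.log(deadLetterQueue.length); // 1
```

Storing the attempt count and last error alongside the payload is what makes the DLQ actionable: you can see at a glance whether you're looking at bad data, a dead dependency, or a bug in the handler.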
What If This Isn't Enough?
The four concepts above cover a lot of ground and will handle the vast majority of real-world scenarios. But there are cases where they start to feel like you're building a state machine by hand: multi-step workflows, long-running jobs, processes that need compensation logic when something fails halfway through. Before thinking about tools like Temporal, you should understand these fundamentals. Not because Temporal is complicated, but because otherwise you don't know what problem it actually solves. That is a topic for a follow-up post.
Lessons Learned
Resilient handlers are not complicated, but they are easy to skip — especially when things work fine in development and staging. Idempotency, properly configured retries, explicit status tracking, and a dead letter queue are the foundation. They don't require a specific framework or infrastructure setup; they're patterns that apply regardless of what queue system you're using. Getting these right won't make your code exciting, but it will make it boring in the best possible way.