Resilient Queue Handlers

Posted on 14 March 2026 by Mirko Janssen · 5 min read

If you work in software development long enough and deal with larger software systems, you will sooner or later run into event brokers, message queues, and things like that. A phenomenon you come across pretty quickly is that the queue handlers you write work wonderfully at first, almost like magic, until they suddenly turn into monsters and nothing seems to work anymore. All of a sudden something breaks in the middle, the incoming data no longer looks right, and so on...

The worst part is standing there with no information about why nothing works anymore, and, in the worst case, watching other parts of the system start throwing errors one after another.

Note: In the following I will talk about jobs and queue handlers, but you can apply all of this to events, messages, event brokers, message queues, and so on.

Those experiences quickly teach you the difference between a queue handler that runs and one that is resilient. Of course any worker can handle the happy path just fine, but anyone who has worked with distributed systems knows that unusual events are not the exception but the rule.

But how did I end up writing about this topic in the first place? Simple: I recently stumbled across Temporal and found the system genuinely interesting. Still, I figured that even if the tool makes it fairly easy to re-run jobs at their failure points, it is still very important to have a solid understanding of resilient handlers first.

Running vs. Resilient

So as I said, at some point when implementing queue handlers you have to think about what happens when something goes wrong in the middle. For example: an order was only half processed, the confirmation was sent but the database was not updated, an API request succeeded but the job then fails and goes back into the queue for a retry, and so on...

Then instinct kicks in and says "let's just try again". Understandable, and somehow the typical reaction of every developer :D, but when it comes to queues, that alone rarely leads to success. I have worked with quite a few different queues and in the end the problem was never the queue itself. No, everything stands or falls with the queue handlers.

Idempotency

The most important property a handler should have is idempotency. This strange-sounding word means that it should make no difference whether I run the handler once or twice. In theory I should be able to run it any number of times without the overall system having a problem with it. For example: when processing an order in an online shop, the customer should not have their account charged ten times and possibly receive eight shipments. Setting aside the fact that it would be nonsense to put all of that into a single queue handler, I think you get what I mean.

Making a handler idempotent might take a little practice at first, but it is actually quite straightforward. Reading data and returning it is relatively easy to repeat (though of course you should think about what happens if the data has changed in the meantime). For everything else, the key is to use database constraints or limit actions through conditional checks. You just go through the handler step by step and imagine it has already run successfully once.
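
To make that concrete, here is a minimal sketch of such a handler in TypeScript. The `OrderJob`, `OrderStore`, and `sendConfirmation` names are placeholders I made up for illustration; the point is simply that every step first checks whether its work has already been done.

```typescript
// Minimal sketch of an idempotent order handler. The types below are hypothetical
// placeholders, not part of any specific library.

interface OrderJob {
  orderId: string;
}

interface OrderState {
  charged: boolean;
  confirmationSent: boolean;
}

interface OrderStore {
  getOrder(orderId: string): Promise<OrderState | undefined>;
  markCharged(orderId: string): Promise<void>; // ideally backed by a DB constraint
  markConfirmationSent(orderId: string): Promise<void>;
}

async function handleOrder(
  job: OrderJob,
  store: OrderStore,
  sendConfirmation: (orderId: string) => Promise<void>,
): Promise<void> {
  // Imagine the handler already ran once: every step checks before it acts.
  const order = await store.getOrder(job.orderId);
  if (!order) {
    throw new Error(`Order ${job.orderId} not found`);
  }

  // Charge only if we have not charged yet (conditional check instead of blind action).
  if (!order.charged) {
    await store.markCharged(job.orderId);
    // ...call the payment provider here...
  }

  // Send the confirmation only once, no matter how often the job is retried.
  if (!order.confirmationSent) {
    await sendConfirmation(job.orderId);
    await store.markConfirmationSent(job.orderId);
  }
}
```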

Retry-Capable Queues

Tools like Kafka, BullMQ, RabbitMQ, and so on come with built-in support for retrying jobs. The important thing to keep in mind is that a broken job should not keep running in an endless loop, causing even more damage than it already does. That is what exponential backoff and a maximum attempt count are for, so that a job does not run forever.
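
As a rough sketch, with BullMQ that configuration could look something like this (the queue name, payload, and local Redis connection are assumptions on my part):

```typescript
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // assumed local Redis

const orders = new Queue('orders', { connection });

// Retry up to 5 times with exponential backoff (1s, 2s, 4s, ...),
// so a broken job does not hammer the system forever.
// (Top-level await assumes an ES module context.)
await orders.add(
  'process-order',
  { orderId: 'abc-123' },
  {
    attempts: 5,
    backoff: { type: 'exponential', delay: 1000 },
  },
);

new Worker(
  'orders',
  async (job) => {
    // the idempotent handler from above would run here
    console.log('processing', job.id);
  },
  { connection },
);
```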

A simple status field can save a lot of debugging

Status Tracking

From the time a job is created until it finishes processing, it goes through various statuses (pending, processing, completed, failed, ...). But that alone does not help at all when it comes to figuring out what went wrong and where. One thing that helps a lot is logging. Ideally you can see when processing started, when each step ran, and what kind of error a job failed with. Especially when you cannot just watch jobs being processed in real time, logs are the only way to understand what actually happened.
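
Here is a small sketch of what I mean: a status field plus a log line per transition. The types and the `setStatus` callback are hypothetical; plug in whatever storage and logger you already use.

```typescript
// Sketch of per-job status tracking plus step-level logging.

type JobStatus = 'pending' | 'processing' | 'completed' | 'failed';

interface JobRecord {
  id: string;
  status: JobStatus;
}

async function trackStatus(
  record: JobRecord,
  run: () => Promise<void>,
  setStatus: (id: string, status: JobStatus) => Promise<void>,
): Promise<void> {
  await setStatus(record.id, 'processing');
  console.log(`[${new Date().toISOString()}] job ${record.id} started`);

  try {
    await run();
    await setStatus(record.id, 'completed');
    console.log(`[${new Date().toISOString()}] job ${record.id} completed`);
  } catch (err) {
    await setStatus(record.id, 'failed');
    // Log the error so you can see afterwards what the job actually failed with.
    console.error(`[${new Date().toISOString()}] job ${record.id} failed:`, err);
    throw err; // rethrow so the queue can retry or dead-letter it
  }
}
```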

Dead Letter Queues

Finally we need to talk about dead letter queues. Some jobs will never complete successfully, no matter how often you retry them or how much you adjust the handler. Why? Because the data is corrupted, an underlying service is no longer reachable, or there is a deeper bug somewhere. The dead letter queue is exactly the place where those jobs end up. You could of course just delete and forget them, but that is generally the wrong approach. With a dead letter queue you can always go back and look more closely at the failed jobs and the handler, understand what is going wrong, and figure out how to get them to run after all. It is maybe not the most glamorous thing to set up, but it is the difference between "errors don't matter to us" and "we learn from our mistakes".
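
One way to wire this up, sketched here with BullMQ's failed event (the queue names and the idea of re-adding exhausted jobs to a separate queue are my own assumptions, not a built-in dead letter feature):

```typescript
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // assumed local Redis

const deadLetter = new Queue('orders-dead-letter', { connection });

const worker = new Worker(
  'orders',
  async (job) => {
    // normal processing happens here
    console.log('processing', job.id);
  },
  { connection },
);

// When a job has used up all its attempts, park it in the dead letter queue
// together with the error, so it can be inspected and replayed later.
worker.on('failed', async (job, err) => {
  if (!job) return;
  const maxAttempts = job.opts.attempts ?? 1;
  if (job.attemptsMade >= maxAttempts) {
    await deadLetter.add(job.name, {
      originalData: job.data,
      failedReason: err.message,
    });
  }
});
```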

Lessons Learned

With the four concepts above you should be able to solve the large majority of problems when processing jobs. And building resilient handlers is not even that complicated; you just might not think about these things upfront. The really great thing is that none of this requires special tools or anything like that; it just takes a conscious approach during development, which probably does not hurt in other areas of software development either ;-).

There are also cases where tools like Temporal really do make a difference. If you have more complex workflows with many processing steps, long run times, or processes where you might need to undo steps at some point, then using such a tool makes a lot of sense. Still, it is important to understand these principles and build your handlers to be as resilient as possible.