Error handling on a service bus

August 31, 2020

In my previous article I described how easy it is to get things wrong when migrating away from a legacy platform. One of the tools that has proved very useful for scalability (both vertical and horizontal) is a service bus.

A service bus is an architectural component that allows you to:

  • decouple your services' responsibilities
  • offload requests without losing your data
  • avoid capacity bottlenecks

No global standard exists for enterprise service bus concepts or implementations. You can find different flavours: RabbitMQ, Apache Kafka, Azure Service Bus, Redis, etc. Some of them are queue-based, some of them find their best usage as streaming platforms… however, all of them support an event-driven architecture and expose an asynchronous API for pub/sub communication.
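To make the pub/sub idea concrete, here is a minimal TypeScript sketch built around a hypothetical in-memory bus (the class and topic names are mine; real brokers such as RabbitMQ or Kafka offer the same contract over the network, with persistence and delivery guarantees):

  // Hypothetical in-memory bus: it only illustrates the pub/sub contract.
  type Handler = (message: unknown) => Promise<void>;

  class InMemoryBus {
    private handlers = new Map<string, Handler[]>();

    // Consumers register interest in a topic without knowing who publishes to it.
    subscribe(topic: string, handler: Handler): void {
      const list = this.handlers.get(topic) ?? [];
      list.push(handler);
      this.handlers.set(topic, list);
    }

    // Producers publish without knowing who (if anyone) is listening.
    async publish(topic: string, message: unknown): Promise<void> {
      for (const handler of this.handlers.get(topic) ?? []) {
        await handler(message);
      }
    }
  }

  // The order service and the billing service never call each other directly.
  const bus = new InMemoryBus();
  bus.subscribe("order.created", async (msg) => console.log("billing received", msg));
  bus.publish("order.created", { orderId: 42 }).catch(console.error);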

One of the most important aspects when implementing an event-driven architecture is learning how to leverage the service bus to manage errors. Think, for instance, of the following scenarios:

  1. your service receives a message, tries to store some of its data in your database, and it turns out the database is down and throws an error
  2. your service receives a message, tries to post it to a third-party service, and the request fails with a 500 error code
  3. your service gets a corrupted message that can’t be saved because it’s missing its ID

Think about these scenarios for a second. Why are they different? How would you act in each case?

Error handling strategies

The nature of these errors is very different in each case:

  1. The error is generated from infrastructure under my control
  2. The error is generated from infrastructure outside of my control
  3. The error is generated by a scenario that is impossible to resolve

Thus, we can observe that errors 1 and 2 could be resolved (provided the failing dependencies start working properly again), whereas scenario 3 is hopeless. Each of these problems requires a different strategy to manage the situation.
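One way to make that distinction explicit in code is to classify the error at the moment it is caught. The class and function names below are purely illustrative, not part of any bus SDK:

  // Transient: infrastructure hiccups that may heal on their own (errors 1 and 2).
  // Permanent: the message itself can never be processed (error 3).
  class TransientError extends Error {}
  class PermanentError extends Error {}

  function classify(err: unknown, message: { id?: string }): Error {
    if (!message.id) {
      return new PermanentError("message is missing its ID");
    }
    // Assumption: anything else (database down, third-party 500) may recover.
    return new TransientError(String(err));
  }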

Retry strategy

For the first error, this would be a suitable course of action to follow. Essentially, once the anomaly and its root cause are detected, we automatically send the message back to the service bus to reprocess, hoping that the next time we receive the same message, our dependency works again. Typically a fixed delay is kept between retries (e.g. 1 minute), and there should be a maximum number of attempts. Once that limit is reached, the message is sent to a special “bucket” named the dead letter queue. Messages stored in dead letter queues are messages that could not be processed, and they typically require human intervention to analyse the reasons behind the failure. It’s the end of the journey in the system for those events.
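A minimal sketch of that flow, assuming a hypothetical bus client that exposes requeue and dead-letter operations (the names are illustrative; Azure Service Bus, RabbitMQ and most other brokers offer equivalent primitives):

  interface BusMessage {
    body: unknown;
    deliveryCount: number; // how many times this message has been delivered so far
  }

  interface BusClient {
    requeue(msg: BusMessage, delayMs: number): Promise<void>;   // redeliver after a delay
    deadLetter(msg: BusMessage, reason: string): Promise<void>; // park for human analysis
  }

  const MAX_ATTEMPTS = 5;
  const FIXED_DELAY_MS = 60_000; // retry every minute

  async function handleWithRetry(
    bus: BusClient,
    msg: BusMessage,
    processBody: (body: unknown) => Promise<void>,
  ): Promise<void> {
    try {
      await processBody(msg.body); // e.g. write to our own database
    } catch (err) {
      if (msg.deliveryCount >= MAX_ATTEMPTS) {
        // End of the journey: a human looks into why it kept failing.
        await bus.deadLetter(msg, `failed after ${MAX_ATTEMPTS} attempts: ${err}`);
      } else {
        await bus.requeue(msg, FIXED_DELAY_MS); // try again in one minute
      }
    }
  }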

Retry strategy with exponential backoff

The only difference between the second type of error and the first one is our knowledge about the resource that is failing. I cannot control a third-party service, and therefore I won't be able to tell whether its health is recoverable or not. For that reason, we would apply this second strategy in those cases.

When we retry with exponential backoff, we follow the same principle as with the regular “retry” strategy. The main difference is that the delay between attempts is not fixed (1 minute…1 minute…1 minute…1 minute…) but exponential (1 minute…2 minutes…4 minutes…8 minutes…). This is a healthy exercise because stressing a service that is already struggling, with tens of thousands of messages retried too often and all at the same time, would create a bottleneck that ends up breaking the dependency down completely. The exponential backoff gives it more room to recover.

Equally, when the maximum number of retries has been reached, the message should be sent to the dead letter queue for analysis.
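With respect to the previous sketch, only the delay calculation changes. A small example of how the exponential schedule could be computed, plus an optional jitter (my addition, not mandatory) so that thousands of retries don't all land at the same instant:

  const BASE_DELAY_MS = 60_000; // 1 minute for the first retry

  // Attempt 0, 1, 2, 3, ... waits 1 min, 2 min, 4 min, 8 min, ...
  function backoffDelayMs(attempt: number): number {
    return BASE_DELAY_MS * 2 ** attempt;
  }

  // Optional: random jitter spreads the retried messages out in time, so the
  // struggling dependency isn't hit by all of them at exactly the same moment.
  function backoffWithJitterMs(attempt: number): number {
    return backoffDelayMs(attempt) * (0.5 + Math.random() / 2);
  }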

Dead letter queue strategy

What is particular about the third error is the fact that no matter how many times we retry processing the message, it will always produce the same error (its data, in this case, is missing a fundamental premise). When those cases are detected, don't bother to retry: it's wiser to send those irrecoverable messages straight to the dead letter queue. You won't overload the system unnecessarily and you will detect the anomaly sooner.
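A sketch of that check, again with the hypothetical client used in the earlier examples; the up-front validation is the only new piece:

  interface BusMessage {
    body: { id?: string };
    deliveryCount: number;
  }

  interface BusClient {
    deadLetter(msg: BusMessage, reason: string): Promise<void>;
  }

  async function handleMessage(
    bus: BusClient,
    msg: BusMessage,
    save: (body: { id: string }) => Promise<void>,
  ): Promise<void> {
    const id = msg.body.id;
    if (!id) {
      // No amount of retrying will ever give this message an ID: dead-letter it
      // immediately instead of bouncing it around the bus.
      await bus.deadLetter(msg, "missing ID, cannot be saved");
      return;
    }
    await save({ ...msg.body, id });
  }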

Conclusion

If you want to build a resilient, robust system, errors of a different nature should be managed with different strategies. Regardless of the service bus you choose, you can always implement the aforementioned patterns.