Lessons from an event-driven architecture

RGilchrist · ‎Oct 08, 2019

It is tempting. Why not load everything into a database or even into memory if it fits, process the data and store the complete result? It is simple, and quite fast considering how much data there was to process. Now how does it scale? Well we can give it more computing power and memory and if that is not enough then there are always additional databases….right? Now there are quite a few resources to manage but performance is acceptable, but not outstanding. Then we need to aggregate across multiple result sets because….well….we are trying to solve a complex business problem. Here the options are to wait until all the results are ready before the aggregation begins or aggregate frequently every time one of the result sets is ready. That is some heavy processing so let’s add a queue to manage the load. Done!

This is how to design a back-end process that is simple, robust and easy to build. However, it is not likely to perform or scale well. Even worse, if your resources start to reach their limits, all the advantages will disappear: it will fail intermittently so it is no longer robust, and fixing these issues will add complexity so it is no longer simple or easy to maintain. Still, in some cases batch processing may be sufficient. However, if a back-end process is mission critical, has a huge amount of information to process, and scalability is of upmost importance, then an event-driven architecture may well be a better choice.

Of course, every architecture has its own challenges, so here are some of the lessons we have learnt whilst designing and building an event-driven architecture.

Look before you leap

Even the most well-thought-out architecture is not going to behave exactly as expected. Validate your design as early as possible and keep validating whenever a significant piece is added. Gathering metrics in everything you build is a huge help (some would say mandatory) to see where your bottlenecks are, but validation starts even before the ink is dry on that napkin; take an educated guess at your size and shape of data and, if possible, gather statistics from existing data sources or people whose job it is to know. Try to learn the limits of your technology stack so you can quickly discard designs that will simply not scale mathematically. A calculator is a first-class tool for an architect, as it is a lot cheaper to tear up a napkin than it is to waste 6 months of a team’s effort.

Example: You know a computer can have on average 2,000 executables and your largest customer has 1,000,000 computers. Your database of choice gives you about 20K read/write IOPS. A calculator can quickly tell you that a system that stores individual executables will take over 27 hours, and this is likely a best case. Time to rip up that napkin!

It’s good to be lazy

Storage may be cheap but computing power and memory can cost a fortune. Ideally an event-driven architecture should only ever need to process changes. Depending on your use-case, you can get orders of magnitude more performance when compared to processing the complete data set, which means you use far less resources to achieve the same result.

There are things for which you should be on the look out. Firstly, there may be occasions when a system still needs to process a complete data set. Perhaps this data is new and no differential can be calculated. For this reason, you should not depend on the performance benefits of only processing changes in order to have a functioning system; your system needs to be able to scale up and out to handle the occasional spikes. Secondly, what happens if a change is missed? If possible, try to calculate changes based on the current state of the system. This allows for missed events to be replayed at a later time until the change is successfully applied.

Example: Mapping all the software on a device can be a costly process, and most likely 95% of those applications have not changed since yesterday. If hashes are stored with all previously mapped applications, they can be used to identify the small number of applications that have changed and require re-mapping. Adding a new attribute to an application could cause everything to change at once so ensure your system can handle the increase in load.

Not too big, not too small

If possible, an event should contain as much of the information necessary for it to be processed. Every additional lookup reduces throughput or requires complexity to lookup concurrently. If there are a large number of lookups for each event then you will need a very fast cache (a database may not have fast enough random access performance), and then what will happen if the data is not available yet because, it too, is still being processed?

It can be time-consuming and complex to split data into events, then have to group those events together again at a later stage so that they can be processed. To avoid this, consider if related information can be grouped together right from the start into a single event, without becoming too large whereby resource limits will cause failure.

Example: Optimizing the resource utilization of a virtual machine hierarchy consisting of a host and multiple virtual machines may become complex if the hierarchy is first broken up into tiny events per machine, then reconstructed in order to be processed. It would be simpler to represent the hierarchy as a single event for processing, and likely faster as well.

When it all goes pear-shaped

When a system is in production it is inevitable that there will be issues at times, and any robust architecture will need ways to handle these cases. In the case of a catastrophic error, where perhaps a mission-critical service is not responding, it is important to limit and contain the damage. A well-placed queue or stream with a reasonable time of persistence will allow time to resolve the issue without losing data or creating a traffic jam of micro-services. This is a common approach in Domain Driven Design to help separate different sub-systems of a large and complex system.

What if the error only affects a relatively small number of events? This is one area where an event-driven architecture can really shine; breaking up the workload into many events still allows for most of the data to be processed successfully. For the problematic events it is usually better to do nothing than to do the wrong thing. For example, yesterday’s out-of-date but still correct result will likely be more useful to a customer than an incorrect one. If possible, design your system so that failed events are re-tried at a later time. This allows time for the issue to be resolved and the system to self-heal, while keeping the disruption to a minimum.

Example: An application usage event contains an incorrect time range (perhaps there is a bug triggered by a daylight savings clock adjustment) and a usage summary for that one application cannot be updated. All other usage events are successfully processed so the impact is minimal, however the corruption should certainly set off alarms so that developers can begin diagnosing the issue immediately. They may need to stop the system temporarily in order to safely remove the corruption, then resume processing with the knowledge that the application usage will be corrected on the next import. Hopefully having just one out-of-date application for a short time did not impact the customer too greatly.

And finally…

Of course, there is an exception to every rule and every architecture has its own unique challenges, so consider whether these lessons are applicable for your system. No matter what architecture you end up with, ensure you validate its design both before and as you build it.