At State Farm® we’re about helping people. Helping people manage the risks of everyday life and recover from the unexpected is at the core of what we do. As it applies to technology, the Event Driven Architecture (EDA) pattern is especially useful for orchestrating multiple services in unforeseen situations. However, deciding on the best AWS product to use as the event bus can be daunting. This article will cover common considerations (such as scalability), as well as those that can make a bad situation even worse if not planned for.
What is Event Driven Architecture
EDA is a pattern for decoupling the outcome of an event from the process that creates or defines the event. Specifically, EDA will be used here in the context of business events.
Events are the notifications or messages that something has happened, usually associated with a data change. As covered in the sections below, there are different perspectives on what data should be included in the event itself. They typically contain the context of the change, such as “Policy Issued” or “Bill Paid,” and the specific data elements that were altered. For an insurance company like State Farm, these business events can include a policy quote being created, a quote being bound into a policy, or a claim being opened, for example.
To understand why this would be useful, imagine a company that sells cruises. When a customer purchases a cruise, there are several actions that need to be completed, such as reserving a room, sending the customer a boarding pass, and ordering additional food and amenities. Following good design principles, each of these actions is split into separate services. To keep things simple, the Shopping Cart executes these services. This is the typical approach for API based interactions.
However, what happens if one of the services is unavailable? If the room reservation service is unavailable, the order should not go through because the company cannot guarantee a room. However, if the boarding pass and supply services aren’t available at the time of the order, the company may move forward and book a room, as long as the remaining services are executed prior to the cruise. This can be handled by placing a queue, message table, or other asynchronous mechanism in-between the Shopping Cart service and any of the services that don’t need to be completed right away.
That’s better. However, there’s another issue. The company has just decided to run a temporary promotion where each customer is sent a novelty captain hat with each purchase. To enable this, the Shopping Cart service will need to be modified to call the promotion service and then, at some point after the promotion, it will need to be changed again to remove it.
This is where EDA comes in. The Shopping Cart service only needs to call the services that are necessary to complete the order. Once that is done, the Shopping Cart service simply publishes an event, notifying any consuming services (i.e., boarding pass, supply and promotion) about what it has done. The Shopping Cart service no longer needs to be aware of the subscribers or captain hats. The promotion service can listen to the event while active and then stop listening when it is over. Any number of services in the future can do the same; such as an analytic service that wishes to gather data for a dashboard or a service that accrues points for a loyalty program.
In order to discuss EDA, it is important to be familiar with several key terms:
- Business Events - These are changes to the data related to business activity, such as a customer creating a new quote. This will be the main focus of this article, but much of the guidance applies to System Events as well. Business, Domain, and Command events are lumped together in this article.
- System Events - These are changes to infrastructure resources, such as a service being deployed, storage being unavailable, etc. Basically, they are the type of things that will typically show up in logs. Access related system events, such as login attempts, are frequently referred to as Security Events and may be handled separately (ex. CloudWatch versus CloudTrail).
- Event Driven Architecture (EDA) - This is designing a system so that it reacts to events rather than chaining together processes directly. It is important that the event-triggering mechanism supports multiple subscribers, such as leveraging a publish and subscribe or fan-out pattern.
- Event Bus - This is the product or service that takes the events from the producer and makes them available to the subscribers.
- Producer - This is the service or system that adds events to the Event Bus. It is also known as a publisher or source.
- Subscriber - This is the service or system that receives or retrieves the events from the Event Bus. It is also known as a listener, consumer, client, sink, or target. There can be many subscribers per producer, so point-to-point solutions, such as SQS, won’t be covered here.
EDA is certainly not new, and State Farm has used it for decades in many different forms and on many different platforms, with the most recent being AWS. AWS provides a variety of services that can be used for this. The following AWS services are covered in the table below:
EventBridge - This is a lightweight serverless option evolved from handling system events. The service started with CloudWatch Events, where subscribers could define patterns for which AWS system events to receive. These system events are still published to the default bus in every account, but the re-branded offering allows for custom buses which can support custom messages from your own producer or other third-parties. There is also support for discovering and managing schemas for the different events.
Simple Notification Service (SNS) - This is a lightweight serverless option for notifying many subscribers of an event. A producer posts a message to a topic, which can have multiple subscribers that have been signed up to receive it. Subscribers can receive the events through email, a custom http endpoint, one of many AWS services, or as a mobile push notification.
Kinesis - This is a serverless option for handling continuous data streams in real time. At a high level, Kinesis saves all incoming events into a sequential list and retains them for the specified time period regardless of whether any of the subscribers have handled them or not, which makes the product more like a database than a queue. This allows numerous subscribers to consume events at different speeds and also supports the same subscriber going back to a previous point and replaying events. For performance, a stream can be subdivided into shards with the user-provided partition key determining which shard a given event will be added to. Kinesis plugs into several other AWS services for easier integration across their platform but at the cost of lock-in.
Managed Streaming for Apache Kafka (MSK) - This is a server based option for handling continuous data streams in real time using the open source Apache Kafka product. Kafka was the inspiration for Kinesis and they have similar designs with the main difference being their approach to configuration. Kinesis is opinionated but simple to setup and consume, whereas Kafka is highly configurable but more complex to manage and use. MSK manages Kafka on a cluster of servers, but not the configuration. This simplifies aspects of server management but limits extensibility since JARs can’t be installed in the cluster.
While Event Driven Architectures solve some problems nicely, they can introduce their own set of drawbacks depending upon the requirements. Ideal subscribers can tolerate some number of dropped events, can process events in any order, and are idempotent, meaning they can handle safely the same event multiple times. Unfortunately, most subscribers are far from ideal, so the Event Bus implementation can make a big difference. Below are several key considerations when selecting the AWS service.
When events cannot be successfully handled by a subscriber, it is expected they should be retried. All of the event bus services listed have retry capabilities. With EventBridge and SNS the retry is completely automatic and mostly outside of the subscriber’s control. For Kinesis and MSK there are more options in how it can be done.
There are also times where it may be desirable to replay messages that have already been processed. For instance, if data is lost with incomplete or missing backups, it can sometimes be possible to rebuild the data by replaying the events that created it originally. Or there may have been a bug in the event subscriber where it successfully ran but the results were not correct. EventBridge and SNS are similar to queues and delete the events once they have been acknowledged by the subscribers, making it impossible to replay events without republishing. This can often be mitigated by building additional wrappers around the Event Bus that save a copy of the event before it is published. Kinesis and MSK are durable, meaning they support persisting events for some time by design. Subscribers are able to essentially restart at some previous point in the stream, assuming they have not expired. However, that can often lead to replaying other events in order to catch back up, which may lead to duplication.
Whether due to replay mentioned above or the retry process of the service, there are times when an event will be sent again when it has already been acknowledged. Both Kinesis and Kafka support exactly once delivery, where the event will only be acknowledged by the subscriber once, at least in theory, versus the at least once or at most once delivery of the other services, where a subscriber may rarely get duplicates or nothing at all. One relevant scenario to consider is when a subscriber partially handles an event, but fails in a way that can’t be rolled back, such as calling a REST update service.
Duplicates can be mitigated by ensuring the subscriber is designed to be idempotent; meaning it ensures the same result given the same input data regardless of the number of times it is run. However, that usually comes at the expense of additional processing. For instance, an event subscriber that refunds a customer should check that the customer has not already been refunded for that specific policy. Or a subscriber could check the timestamp of the last change committed to the database, ignoring or poisoning (i.e. placing it to the side for a support person to review) any events older than that. That logic really depends on what the subscriber is doing and how precise the data must be. For example, a metrics report may tolerate less precise data than a payment system.
In some systems it is important that events are processed in exactly the same sequence, or order, that they were created in. For instance, ensuring a bill submitted comes before the bill paid event. While this is a great feature, there can be large trade-offs. Since events must be processed in order, it is difficult to parallelize execution, and a single event with a problem will block processing of all of the events after it. For instance, with Kinesis and MSK the subscriber has an offset into the shard/partition which moves sequentially so the subscriber cannot, at least unintentionally, process a later event before handling all those before it.
By default, events are assumed to be in the same order they were published in, so if the source system has no order in how it publishes the event (ex. using different partition keys) then the event bus will not correct that. Similarly, event buses will not catch a producer mistakenly republishing the same events later. Any event that is replayed later, via a republish or otherwise, would be out-of-order since later events for that same entity may have already been processed. Given these issues and support for ordering limiting concurrency and performance, it is typically better to account for unordered events in the workflow than to rely on the ordering promises of the event bus. That being said, State Farm tends to opt for ordering support as an additional layer of protection since it can be difficult to ensure all downstream systems will properly handle events correctly.
If ordering is a priority for your business operations, there are other custom approaches, such as not allowing any new events to be processed for a data entity by locking it until the event that took the lock clears it. These techniques are outside the scope of this article.
The payload of the event creates an informal interface between the producer and subscriber. For synchronous calls between services, the subscriber can reject the call if the data is in an unrecognized or unsupported format; however, in asynchronous systems the producer is not listening for any responses, so it will not know there was a problem. Therefore, it is very important for the producer to implement backward compatibility and for the subscriber to be tolerant of both new elements and events that it does not recognize. A schema, or definition of the message’s data structure, is often used to communicate to the subscriber what will be in the event. It is the equivalent of an API specification for synchronous services. Those unfamiliar with managing and versioning schemas should research this topic further before implementing.
Ideally, data structure problems would be caught prior to the event being published. While some of the services support the notion of a schema, none currently offer the ability to automatically validate it prior to publishing. Producers needing to validate the events against a schema must implement this as custom functionality.
It is important for both producers and subscribers to be in agreement on whether the events will contain a delta, which are just the elements that changed, or a full snapshot of the entity that changed. It can be tempting to pack as much data as possible in the event itself to avoid the subscribers needing to retrieve data from the System of Record (a.k.a. SoR, which is the data store that “owns” the data), but this can cause performance issues both with gathering the data and sending it through the event bus. The other extreme of passing only keys that then must always be looked up elsewhere can also cause extra stress on the System of Record and may return the current state of the entity rather than how it looked at the time of the event. A related approach is to set aside the payload in a different location and just pass keys. For instance, SNS can use the Extended Client Library to offload a large payload to S3. Like most things, finding the middle ground is usually the best solution, but it can be a lengthy process and will likely evolve over time. It is best to start small as it is typically easier to add data than to remove or modify it.
Depending upon the service, there can be a variety of technical boundaries that may bring additional limitations and latency. This can include whether the service will need to publish or consume events across accounts, non-AWS platforms, or networks, such as Virtual Private Clouds (VPCs). For instance, Kinesis can be consumed directly across accounts via a Kinesis Client Library (KCL) subscriber, but not currently from a Lambda.
Many of the services are tuned for handling large amounts of data by encouraging multiple events to be sent together as part of the same request. Depending upon ones use case, this may or may not be desireable. Some producer libraries, such as the Kinesis Producer Library (KPL) and Kafka’s, will hold the requests until they have several queued before sending. This can help performance but can also cause data loss should the producer crash or have problems before the events are acknowledged by the event bus. Likewise, some subscriber libraries will either commit or rollback a batch of events which may cause duplicate processing later or overwhelm the subscriber. For instance, a large batch requiring more memory or time to process than the consuming Lambda was allocated.
The blocking of events used to ensure ordering can cause problems, especially if large batch sizes are used. When handling errors in the subscriber, it is best to offload those that will likely never work (such as malformed events with missing data) to a queue or other location where they can be handled separately rather than failing the whole batch repeatedly. While this will unblock the stream and allow the batch to go through, it also means subsequent events for that entity may now be processed potentially executing them out-of-order.
There are also other ways to group events together. For instance, the Kinesis KPL supports record aggregation, which is packing multiple events into a single record, instead of or in addition to multiple records per request. This is more efficient, but can also be difficult for subscribers to handle if not prepared for it.
Combinations and Augmentation
The aforementioned services can often be combined with themselves or each other. For instance, having a Lambda push from one Kinesis stream to another is a very common pattern. It can be useful to filter down events to limit processing on events a group of subscribers are not interested in or workaround some of the concurrent subscriber limits or account boundaries. In large organizations it is also common to have a mix of different services on different platforms being used by different areas or systems. Having subscribers specific to each of those services that then funnel it back to single bus, topic, or stream for an application can sometimes simplify the design.
Another useful approach is to find a service that mostly meets your requirements and then augment it with additional services. The AWS Event Fork Pipeline patterns show several examples of this for SNS.
Below is the comparison of the AWS products based on the criteria from above. This information is a summarized version of what AWS has published as of late 2020. These services are constantly evolving, so there may be discrepancies from what is shown versus what is available. Some limits can be increased, check with Amazon if uncertain.
|What You Get||Nothing, unless matched - Rules are part of the config||Everything, unless filtered - Filters are part of the config||Everything - Subscribers ignore what they don’t want||Everything - Subscribers ignore what they don’t want|
|Producer||AWS API||AWS API||AWS API, Kinesis Producer Library (KPL)||Not AWS specific - Just need an implementation of the Kafka producer|
|Subscriber||EventBridge, API Gateway, Lambda, Kinesis, Kinesis Firehose, SNS, SQS, Step Functions, etc||Email, SMS, mobile push, http, Lambda, & SQS||Lambda, Kinesis Firehose, Kinesis Analytic App, & Kinesis Client Library (KCL) implementations (KCL has DynamoDB & CloudWatch dependencies)||Not AWS specific - Just need an implementation of the Kafka consumer|
|Scaling||Automatic||Automatic||Manual (add shards)||Manual (add partitions)|
|Message Size Limit||256 KB||256 KB||1 MB||1 MB, is configurable|
|Batch Size Maximums||10 events per req||No batch, 1 record||Lessor of 1000 records OR 1MB per request||Configurable|
|Config Limits||100 buses per account, 200 rules per bus, 5 targets per rule||100K topics per account, 12M subscriptions per topic, 200 filters per account per region.||Unlimited number of streams, 10K shards per stream. 20 subscribers per stream (each subscriber can only be registered for 1 stream)||Technically 200K partitions, but partitions are limited by brokers (90 brokers per account, 30 per cluster, etc)|
|Frequency Limits||2400 publishes/second, 4500 invocations/second||~30K events/second||1MB/sec or 1000 records/second per shard, 5 concurrent reads (@ 2MB each) per second per shard. Enhanced fan-out increases to 2MB/sec per shard per subscriber||Configurable, no hard limits|
|Retry||Yes, up to 24 hours||Yes, up to 24 days||Up to subscriber, but limited to retention of event in stream||Up to subscriber, but limited to retention of event in topic|
|Retry Configurable||No||Yes, but only for http||Up to subscriber||Up to subscriber|
|DR/BC||Multi-AZ, 1 Region||Multi-AZ, 1 Region||Multi-AZ, 1 Region||Configurable|
|Data Encrypted at rest||Unknown (data is not accessible)||Yes||Yes||Yes|
|Order Guaranteed||No, best effort||FIFO supported||Configurable||Configurable|
|Delivery guarantee||At least once||At least once||Exactly once or At least once||Exactly once, At least once, At most once, etc|
|Persist after processing||No||No, but optionally supports writing to SQS dead letter/poison queue||Yes, 24 hours by default, up to 7 days for extra cost||Yes, technically only limited by storage (max 16 TB)|
|Schema Support||Schema discovery, no validation||No||No, but Kinesis Analytics does schema discovery||No, but there are 3rd party extensions|
|Message transform||Yes, by target using input transformers||No||No||No|
As a general rule of thumb, EventBridge or SNS work best with subscribers that are idempotent and can tolerate a lack of ordering and additional latency, otherwise lean towards Kinesis and MSK, especially if ordering or durability is important.
- Great for handling large volumes of messages of which very few are expected to be consumed.
- Easily allows account-to-account sharing of events.
- Currently no ordering or durability support, although the latter can be mitigated with other services.
- Default is NOT to match, so a misconfiguration will result in a lot of lost events before an error is noticed. There is no count for how many events went unmatched and publishing to a non-existent hub just drops the events without any errors.
- Great for handling large volumes of small messages.
- Supports the largest variety of subscribers, include non-AWS http endpoints.
- Currently no durability support, although this can be mitigated by combining with other services.
- Supports more advanced features, like ordering, exactly once delivery, and durability, and ties in additional services such as Kinesis Analytics and Kinesis Firehose.
- Serverless, so minimal support needed especially when paired with Lambdas as subscribers.
- Depending upon the producer and subscriber, it may require the KPL and KCL, which can add complexity and will encourage the service be on AWS. Be aware that KCL currently uses DynamoDB and CloudWatch services behind the scenes to track its position within the stream and provide logging. Also, Lambda subscribers are great; however, they don’t currently support listening across accounts.
- Supports more advanced features, like ordering, exactly once delivery, and durability, and most extensions to Kafka. But, it doesn’t allow uploading JAR files to the cluster, so that does limit which extensions can be used.
- All the functionality and many of the community tools of Kafka, although it doesn’t always support the latest open source versions.
- Producers and subscribers require Kafka libraries which can add complexity. It doesn’t play well with many serverless options.
- Recommended only for the following scenarios: cloud independence is a top priority, maximum performance and flexibility is needed for large volumes of data, the use case is implementing Event Sourcing, or the company using it is already very familiar with supporting Kafka.
To learn more about technology careers at State Farm, or to join our team visit, https://www.statefarm.com/careers.