Orchestrating Event-Driven Analytics with Ease

Evan Stalter
Senior Technology Engineer
Tim Nickel
Senior Technology Engineer

At State Farm our mission is to help more customers in more ways. To achieve that goal, a vast network of components across multiple platforms (PaaS, AWS, Salesforce, and legacy systems, to name a few) must operate flawlessly behind the scenes so that our agents and call center associates can help new and existing customers. Much of this architecture today relies on synchronous point-to-point calls across these platforms, which introduces potential challenges: an outage in any one of them could negatively impact our daily operations.

The Event Hub

To increase resiliency, flexibility, availability, and performance of this intricate system, we are exploring the implementation of an asynchronous “event hub” in AWS that can manage several inbound producers and outbound consumers simultaneously. The event hub allows us to guarantee eventual delivery of crucial events related to our customers and our associate-facing systems, regardless of whether any of the related on-prem or AWS services encounter outages or delays. This gives our associates more visibility into the end-to-end customer experience and simplifies the vast array of tools and systems they must use in their daily work.

The event hub also lets us use AWS S3 as an “event reservoir” (smaller in scope than a “data lake”) to retain events for retry processing, posterity, and, most importantly, analytics. We have never had a better opportunity to gain critical new insights into our business than we do now, thanks to the analytic capabilities AWS offers around the event hub architecture.

Why Analytics?

Analytics on events passing through the event hub are different from analytics on data at rest in a data store. While an event in transit to a consumer may not be an entire record, it carries pertinent information: what data it relates to, why the event is happening, and additional event details that can feed analytics.

Some initial ideas we had around analytics on the events include:

  • Overall health of the event hub
  • Discovery of anomalies including unexpected increases/decreases and gaps in the published events
  • Analytics on the event detail data
  • Analytics when there is correlation between different but related events

Performing analytics on events can give us insight into where machine learning could be built into the event hub to take immediate, automated action on events as they are published. Having related but different events flowing through the event hub makes this machine learning idea even more promising as the event hub becomes a “crossroads of events.”

Generating Data with Chaos

The initial proof-of-concept work with the event hub focused on building out the architecture and simulating events. So how can we start to envision the potential of analytics on events without real events flowing through the architecture? To address this, we began generating events with data created using the chance.js open source library. We quickly learned that while each generated event was unique, the events had no relationships between them.

Take the case of a workflow that generates multiple events, such as the order, shipping, and delivery of a product. These events are tied together by a customer attribute, but our initial data generation was purely random and never created related customer events published at different times. Our approach was to pre-generate many days of events that share the same customer attribute across the different workflow stages, so that they would be published as events at different times. The pre-generated events are chunked into files, and each file is “played” at a different time to create the entire workflow for a customer, as sketched below.
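
The sketch below shows the general shape of that pre-generation step using chance.js; the field names and day offsets are illustrative rather than our actual schema:

// Sketch of pre-generating related workflow events with chance.js.
// Field names (clientId, eventType, day) are illustrative only.
const Chance = require('chance');
const chance = new Chance();

function generateCustomerWorkflow(dayOffsets) {
  // One customer attribute ties the order, shipping, and delivery events together.
  const clientId = chance.natural({ min: 1000000, max: 9999999 });
  const stages = ['ordered', 'shipped', 'delivered'];

  // Each stage lands in a different day's file so the events are "played" at different times.
  return stages.map((eventType, i) => ({
    clientId,
    eventType,
    day: dayOffsets[i], // which pre-generated file this event is written to
  }));
}

// Example: order on day 1, ship on day 3, deliver on day 6.
console.log(generateCustomerWorkflow([1, 3, 6]));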


While generating this data helped us begin to see the potential of event analytics, fabricated data only yields limited analytic insights. To help with this, we built variability into the generation process so that more interesting patterns would appear in the analytics. This variability is configurable and determines the likelihood of an event happening. In the case of shipping a product, the variability is set to a percentage; over time, as more files are created, it becomes more likely that the product will be shipped.

By using the chance.js weighted function, the process still randomly decides whether the product ships, but the likelihood of shipping increases over time. For example, if the variability is set to 80%, this would mean:

DAY    CHANCE OF SHIPPING    CHANCE OF DELAY
1      20%                   80%
2      60%                   40%
3      74%                   26%
4      80%                   20%



This code determines whether a product will be shipped on a given day:

// The chance of a shipping delay shrinks as more days pass since the order was placed
let chanceTimeFactor = chanceDelayShipping / daysSinceOrdered;

// Weighted random pick: 'yes' means the product ships today, 'no' means it is delayed again
let willShip = chance.weighted(['yes', 'no'], [1 - chanceTimeFactor, chanceTimeFactor]);

This type of variability is still somewhat predictable in that everything will eventually be shipped (most likely in the first few days after being ordered), but it does introduce variation that shows up when performing analytics.

To simulate unexpected disruption in a workflow we came up with the idea of causing chaos in the data. This means that when data is pre-generated there will be a period when no shipping will occur, depending on how it’s configured. You could think of this as being caused by a physical disruption in how a product is shipped, or a disruption somewhere in the technical systems that control when things are shipped or how the events are being published into the event hub.

Similar to how the variability is configured, the chaos behavior is driven by three settings: a boolean that indicates whether a chaos event should happen at all, the earliest day it can start, and the chance of it starting after that initial day. These three factors are used in the code to determine when the data chaos starts.

// The chance of chaos starting grows as the day count moves past the configured minimum start day
let chanceTimeFactor = 1 / (days - (chaosShippingMinStartFile - 1));

// Weighted random pick: 'yes' starts the chaos window for shipping events
let startChaos = chance.weighted(['yes', 'no'], [1 - chanceTimeFactor, chanceTimeFactor]);
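
Putting those settings together, a minimal, self-contained sketch of how the chaos window could gate a day's shipping events might look like this; the configuration names and values are illustrative, not our actual code:

// Sketch of gating a day's shipping events behind a chaos window.
// The configuration names and values are illustrative.
const Chance = require('chance');
const chance = new Chance();

const chaosShippingEnabled = true;   // should a chaos event happen at all?
const chaosShippingMinStartFile = 5; // earliest day/file the disruption may begin
const totalDays = 10;                // number of pre-generated files

let chaosActive = false;

for (let day = 1; day <= totalDays; day++) {
  if (chaosShippingEnabled && !chaosActive && day > chaosShippingMinStartFile) {
    // Same weighting as above: the chance of chaos starting grows each day past the minimum.
    const chanceTimeFactor = 1 / (day - (chaosShippingMinStartFile - 1));
    chaosActive = chance.weighted(['yes', 'no'], [1 - chanceTimeFactor, chanceTimeFactor]) === 'yes';
  }
  console.log(`day ${day}: shipping events ${chaosActive ? 'suppressed (chaos)' : 'generated normally'}`);
}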

The variability and chaos logic in the code could be made much more complex in terms of the variables and factors that influence the calculations. We expect far more interesting analytics from real events, which bring their own variability and chaos due to real-world problems, but this at least provides a way to start thinking about the potential for analytics on published events from day one.

Once the events are generated, they are sent to a staging bucket and processed with AWS Step Functions into the event hub, where they are then placed into the S3 reservoir for analytics.
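
For illustration, landing one of those pre-generated files in the staging bucket can be as simple as an S3 put with the AWS SDK for JavaScript; the bucket name and key layout below are placeholders, not our actual configuration:

// Sketch: upload one pre-generated file of events to the staging bucket.
// Uses the AWS SDK for JavaScript v3; bucket name and key layout are placeholders.
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const s3 = new S3Client({});

async function stageEventFile(day, events) {
  await s3.send(new PutObjectCommand({
    Bucket: 'event-hub-staging-bucket',      // placeholder staging bucket
    Key: `generated-events/day-${day}.json`, // one file per simulated day
    Body: JSON.stringify(events),
    ContentType: 'application/json',
  }));
}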

Data Partitioning with Kinesis Firehose

Once our data arrives in our S3 event reservoir, how do we prepare it for analytics? Before the fall of 2021, there was only one conventional answer: create AWS Glue jobs and crawlers to assemble the data into tables within a Glue Catalog so that the Amazon Athena service could query the data using SQL.


Fast-forward to 2022: AWS now offers a more expedient way to achieve the same end with dynamic partitioning in Kinesis Firehose. This capability enables us to select the primary “columns” of data within our JSON payloads that we’d like to partition on (such as the data source or some other unique id) so that Athena can efficiently query the data. A Glue Catalog is still used, but no Glue crawlers or jobs are required. Firehose also offers inline parsing (via jq expressions) and new line delimiter options, so even complicated JSON payloads can take advantage of this feature. Overall, this approach has saved us considerable time and effort.
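
As a rough sketch, the relevant delivery stream settings look something like the following when created with the AWS SDK for JavaScript; the names, ARNs, prefix, and jq expression are illustrative values, and your configuration will differ:

// Sketch of a Firehose delivery stream with dynamic partitioning on the event's
// "source" field. Names, ARNs, prefix, and the jq query are illustrative values.
const { FirehoseClient, CreateDeliveryStreamCommand } = require('@aws-sdk/client-firehose');
const firehose = new FirehoseClient({});

async function createPartitionedStream() {
  await firehose.send(new CreateDeliveryStreamCommand({
    DeliveryStreamName: 'event-reservoir-stream',
    DeliveryStreamType: 'DirectPut',
    ExtendedS3DestinationConfiguration: {
      BucketARN: 'arn:aws:s3:::event-reservoir-bucket',
      RoleARN: 'arn:aws:iam::123456789012:role/firehose-delivery-role',
      // Partitioned prefix that Athena can query efficiently
      Prefix: 'events/source=!{partitionKeyFromQuery:source}/',
      ErrorOutputPrefix: 'errors/',
      DynamicPartitioningConfiguration: { Enabled: true },
      ProcessingConfiguration: {
        Enabled: true,
        Processors: [
          {
            // Inline parsing: extract the partition key from the JSON payload with a jq expression
            Type: 'MetadataExtraction',
            Parameters: [
              { ParameterName: 'MetadataExtractionQuery', ParameterValue: '{source: .source}' },
              { ParameterName: 'JsonParsingEngine', ParameterValue: 'JQ-1.6' },
            ],
          },
          {
            // New line delimiting so each JSON record lands on its own line for Athena
            Type: 'AppendDelimiterToRecord',
            Parameters: [{ ParameterName: 'Delimiter', ParameterValue: '\\n' }],
          },
        ],
      },
    },
  }));
}

The partitioned prefix is what lets Athena prune by source instead of scanning the entire reservoir.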

Analytics with Athena and QuickSight

Because the different workflow events (ordered, shipped, delivered) are published at different times, the data must be prepared through joins before it can be visualized in QuickSight for analytics. This is done with an Athena query configured in a QuickSight dataset, which produces one row per customer with a timestamp for each workflow event:

clientId    orderTime              shippedTime            deliveredTime
7310221     2022-02-22 00:30:49    2022-02-24 01:30:48    2022-02-28 04:30:48
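
A simplified version of that kind of query, submitted through the Athena API, might look like the sketch below; the table and column names are illustrative, not our actual Glue Catalog schema:

// Sketch of the kind of join query behind the QuickSight dataset, submitted via the
// Athena API. Table and column names are illustrative, not our actual schema.
const { AthenaClient, StartQueryExecutionCommand } = require('@aws-sdk/client-athena');
const athena = new AthenaClient({});

const joinQuery = `
  SELECT o.clientid  AS clientId,
         o.eventtime AS orderTime,
         s.eventtime AS shippedTime,
         d.eventtime AS deliveredTime
  FROM   events_ordered o
  LEFT JOIN events_shipped   s ON s.clientid = o.clientid
  LEFT JOIN events_delivered d ON d.clientid = o.clientid
`;

async function runJoinQuery() {
  await athena.send(new StartQueryExecutionCommand({
    QueryString: joinQuery,
    QueryExecutionContext: { Database: 'event_reservoir' },                 // placeholder database
    ResultConfiguration: { OutputLocation: 's3://athena-results-bucket/' }, // placeholder results bucket
  }));
}

The left joins keep a row even for customers whose orders have not yet shipped or been delivered, which is exactly where the chaos-induced gaps show up.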



Joining the events allows for analytics and trend discovery across each published event type by customer. This technique is helpful not only for seeing the progression of related events through a process like this, but also for:

  • Correlation between published events coming from different but related producers
  • Event timelines that may not be captured or readily available in the producer or consumer datasets

But how can we really understand the data? This is where Amazon QuickSight comes in. At its core, QuickSight is an easily configurable business intelligence tool that can provide invaluable insights about your data in the form of evolving dashboards and machine learning. All we had to do to set it up was sign in to QuickSight with our admin role (you do need a role with permissions to access QuickSight) and ensure QuickSight had access to Athena, Athena’s S3 query results bucket, and our S3 data reservoir. Once all that was in place, we were able to start building the following charts for our product shipment events.

These first charts show the counts of events published over time, with the chart on the right broken out by event type.

The next two charts show the average time to ship and the average time to deliver to the customer. These are only possible by joining the events by customer as discussed earlier. Notice the gaps in both charts where we injected chaos into the data generation process.

QuickSight automatically detected the shipping delays on March 8 and prepared an insight using its machine learning capabilities.


The more data you have from different sources, the more QuickSight can learn, aggregate, and deliver cross-product insights for you. It has been a truly simple way for us to open the door to the possibilities machine learning can provide for our complex system.

Containing Analytic Testing with Lambda Event Filtering

Let’s now focus on how to safely test our analytics flow. For our test, hundreds of events (with chaos) are generated each night just to give us a solid base of data to test how QuickSight performs against our expectations. We don’t want any of our actual downstream consumers to receive these events and have their processing corrupted, even in our test environments.

To solve this issue, we applied the new “Lambda event filtering” pattern offered by AWS to our consuming services. With this pattern, each downstream consumer only receives events from the source(s) it cares about. The filter is baked into the SQS trigger on the lambda (see below). If an event doesn’t match the required pattern, the trigger’s event source mapping drops it before it ever reaches the lambda.

{
  "filters": [
    {
      "pattern": "{\"body\":{\"source\":[\"productshipment\"]}}"
    }
  ]
}
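
For reference, the same kind of filter can be supplied when the SQS trigger (event source mapping) is created programmatically; the sketch below uses the AWS SDK for JavaScript with a placeholder queue ARN and function name:

// Sketch: supplying the filter when creating the SQS -> Lambda event source mapping
// with the AWS SDK for JavaScript v3. The queue ARN and function name are placeholders.
const { LambdaClient, CreateEventSourceMappingCommand } = require('@aws-sdk/client-lambda');
const lambda = new LambdaClient({});

async function attachFilteredTrigger() {
  await lambda.send(new CreateEventSourceMappingCommand({
    EventSourceArn: 'arn:aws:sqs:us-east-1:123456789012:product-shipment-queue',
    FunctionName: 'product-shipment-consumer',
    FilterCriteria: {
      // Only messages whose body has source = "productshipment" ever reach the function
      Filters: [{ Pattern: JSON.stringify({ body: { source: ['productshipment'] } }) }],
    },
  }));
}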

This is important because it removes the need to implement filtering logic inside your consuming lambdas, and it saves money because they are not invoked when there is no match. It also saves labor, because otherwise this filtering logic might have had to live within the event hub itself, which could become a maintenance nightmare. Instead, we have a clean, cost-effective solution that also enables us to test our analytics events at large scale, labeled with a different source, in a completely contained fashion.

Wrap-Up

Disruption is an unavoidable part of life. Why not do everything to prepare our systems for it? Event-driven analytics is a way that State Farm can be better prepared to help more customers recover from the unexpected.

To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.