Embracing Chaos

A Product Team's Journey

Ranjita Sahu
Technology Manager

On a frigid night in late February 2012, I received a call about an emergency production issue. One of the critical batch jobs that processes data for the next day's business had failed. At 9 PM, there I was, back in the office with 11 other State Farm developers. My then-product manager got us pizzas, coffee, and hot chocolate to cheer up the room, and the team spent the next couple of hours pushing an emergency fix to production. Over the past decade, system failure analysis and maintenance strategies have evolved. We have come a long way from documenting failure causes, impacts, and mitigations in Excel spreadsheets to experimenting in production to avoid problems.

As a technology manager at State Farm®, I often deal with system complexity. My team is consistently focused on system reliability and the ability to detect and recover from issues with minimal to no disruption. Breaking things on purpose to shrink the unknown helps us avoid unexpected impacts to our systems. Our goal has been to uncover potential weaknesses in our products and address vulnerabilities before pushing code to production. In other words, chaos engineering has been a focus for us. Chaos engineering provides a disciplined approach to test the resiliency of a system, build trust in it, and build trust among the team members who contribute to it. My hope is that chaos engineering will train our teams to deal with failures in large-scale systems.

Experiments

Chaos engineering … just another buzzword?

  • Is it an add-on or nice-to-have?
    • Every year, State Farm processes millions of insurance policies to help customers in many ways. My team owns the Risk and Consumer Reports products, which are critical components of the quote process. For products like these, chaos engineering is not a nice-to-have; it is a core part of our Site Reliability Engineering effort.
  • How is it different from testing your systems?
    • Chaos engineering focuses on dynamically learning how the system reacts to failures instead of merely validating expected behaviors.
  • Is it for test, production, or both?
    • Chaos engineering gives us an opportunity to rehearse the steps we would take in an emergency production situation. Though not all systems can run chaos experiments in production, it is worth carefully considering the possibility.

The beginning

Our initial focus was to introduce small failures through chaos experiments and observe how the system reacted to them. Starting small not only helped us manage resources and competing priorities efficiently, but also helped us understand the value of this new engineering practice. Chaos Monkey for Spring Boot is an open source tool that attacks your Spring Boot applications. Enabling it required only a few modifications: we added Chaos Monkey as a Maven dependency and configured it to attack our application.

pom.xml:

<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>1.0.0</version>
</dependency>

Once we had the Maven dependency set up for our project, we made the necessary configuration changes to activate the application with a profile named chaos-monkey. By starting the application with this profile, we could leverage the Chaos Monkey endpoints without restarting the application.

application.yml:

spring:
  profiles:
    active: chaos-monkey
chaos:
  monkey:
    enabled: false
management:
  endpoints:
    web:
      exposure:
        # Include specific endpoints:
        include: health,info,chaosmonkey
  endpoint:
    chaosmonkey:
      enabled: true

By setting management.endpoint.chaosmonkey.enabled=true, we were able to use the endpoints to change assault settings, set watchers at a granular level, and enable or disable Chaos Monkey. It is also important to note that chaos.monkey.enabled was initially set to false until we were ready to release chaos on our application. This gave us the ability to enable Chaos Monkey externally through a chaos endpoint as needed.
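The watchers mentioned above control which Spring beans Chaos Monkey attacks. As a minimal sketch (the property names come from the Chaos Monkey for Spring Boot documentation; the values here are illustrative rather than our exact setup), a watcher section in application.yml might look like this:

chaos:
  monkey:
    watcher:
      # Attack @RestController beans, leave the other stereotypes alone
      controller: false
      restController: true
      service: false
      repository: false
      component: false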

A sample of how to make the configuration changes in application.properties instead of application.yml is provided below.

application.properties:

spring.profiles.active=chaos-monkey
chaos.monkey.enabled=false
management.endpoint.chaosmonkey.enabled=true

Simulating the failures that teach us the most about the system is important, so it is crucial to plan the experiments that give the team the biggest bang for their buck.

Chaos Monkey for Spring Boot provides the following assaults:

  • Latency Assaults - to apply a specified amount of latency to a request to simulate a delayed request
  • Exception Assaults - to simulate a database being down or any other type of exception that you might want to force
  • AppKill Assaults - to observe the app behavior when an application is terminated
  • Memory Assaults - to attack the memory of the Java Virtual Machine

To continue with our theme of starting small, we chose the following three types of attacks to help us understand system behavior during failures.

  • Latency Assaults - added 1-3 seconds of latency to the request
  • Exception Assaults - simulated a ConnectException for a downstream database connection
  • AppKill Assaults - stopped the application to observe its behavior

Steps to release the monkeys

Assaults can be configured with a POST request carrying a JSON body to an endpoint such as https://{app_address}/actuator/chaosmonkey/assaults:

{
    "level": 1,
    "latencyRangeStart": 1000,
    "latencyRangeEnd": 3000,
    "latencyActive": true,
    "exceptionsActive": false,
    "killApplicationActive": false
}
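Since our exception assault simulated a ConnectException, it is worth noting that newer releases of Chaos Monkey for Spring Boot also let you choose the exception that gets thrown. A sketch of what that assault body might look like (the exact schema depends on the Chaos Monkey version, so treat this as illustrative rather than our exact request):

{
    "level": 1,
    "exceptionsActive": true,
    "exception": {
        "type": "java.net.ConnectException",
        "arguments": [
            {
                "className": "java.lang.String",
                "value": "Connection refused (simulated)"
            }
        ]
    }
}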

To enable the Chaos Monkey to "go bananas" (i.e., disrupt the normal application behavior), all you need is a simple POST request to /actuator/chaosmonkey/enable. This was an easy way to try out Chaos Monkey in a local setting before setting up chaos experiments in a formal test environment. Disabling Chaos Monkey was as easy as sending a POST request to /actuator/chaosmonkey/disable.
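Putting the pieces together, the whole local try-out loop can be driven with curl. The commands below assume the assault JSON above is saved as assaults.json and that {app_address} is your application's host and port:

# Configure the assaults defined above
curl -X POST https://{app_address}/actuator/chaosmonkey/assaults \
  -H "Content-Type: application/json" \
  -d @assaults.json

# Release the monkey
curl -X POST https://{app_address}/actuator/chaosmonkey/enable

# ...exercise the application and observe its behavior...

# Call off the attack
curl -X POST https://{app_address}/actuator/chaosmonkey/disable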

Chaos GameDays

Breaking things for benefit was never fun before. For those unfamiliar with the term, a Chaos GameDay is a dedicated time for teams to collaborate on chaos experiments. Chaos GameDays are a perfect way to become more comfortable with the chaos engineering concept. One of our product team members was designated the Master of Disaster and planned the failure scenarios that were run on the GameDay. It was all done in secret to simulate an emergency.

[Figure: GameDays]

Dear Chaos Toolkit, Thank you for making our lives easier by giving us the experience of writing and running chaos experiments to test a hypothesis. - Sincerely, Ranjita

While Chaos Monkey creates the mindset of easily disrupting the application through quick curl commands or Postman requests, we were looking for something with more control: a tool with a rollback mechanism that lets you define your own experiments.

Designing an experiment requires some brainstorming. We came up with this structure:

Hypothesis: Terminate an instance of our app. The goal is to see what happens when that instance is terminated. Will the platform automatically restart the app?

Contributions: This experiment contributes primarily to reliability, because it shows how the application holds up when an instance is lost. Does the stopped instance automatically restart? Does the application behave well while the stopped instance is coming back up? What happens to the in-flight requests that were being processed when the instance stopped?

Experiment: Get application stats for the running instance, terminate an instance of the application, and get the application stats again (which should now report the instance as down).

Expected result: When the instance is terminated, the platform automatically restarts the app (schedules a new instance), so manually re-creating an instance of the app is not needed.

With Chaos Toolkit, experiments are written in JSON files and split into several sections.

It’s important to provide a clear title and description for the experiment. For example:

    "version": "1.0.0",
    "title": "Terminate ${app_name} on ${space_name}",
    "description": "Experiment to terminate a random instance of PRM to see if it automatically restarts",

Once that is out of the way, an experiment must define the method and rollback properties.

 "method": [
      {
        "name": "fetch-app-statistics",
        "type": "probe",
        "provider": {
          "type": "python",
          "secrets": ["appsecret"],
          "module": "chaoscf.probes",
          "func": "get_app_stats",
          "arguments": {
            "app_name": "${app_name}",
            "org_name": "${org_name}",
            "space_name": "${space_name}"
          }
        }
      },
      {
        "name": "terminate-random-instance",
        "type": "action",
        "provider": {
          "type": "python",
          "secrets": ["appsecret"],
          "module": "chaoscf.actions",
          "func": "terminate_some_random_instance",
          "arguments": {
            "app_name": "${app_name}",
            "org_name": "${org_name}",
            "space_name": "${space_name}"
          }
        }
      },
      {
        "ref": "fetch-app-statistics"
      }
    ],
    "rollbacks": [
    ]
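We left the rollbacks array empty because the platform reschedules the terminated instance on its own. For experiments that need explicit cleanup, a rollback entry has the same shape as an action. Purely for illustration, a rollback that restarts the app might look like the sketch below; the start_app function is an assumption on my part, so check the Chaos Toolkit Cloud Foundry extension for the exact action name:

"rollbacks": [
  {
    "name": "restart-app",
    "type": "action",
    "provider": {
      "type": "python",
      "secrets": ["appsecret"],
      "module": "chaoscf.actions",
      "func": "start_app",
      "arguments": {
        "app_name": "${app_name}",
        "org_name": "${org_name}"
      }
    }
  }
]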

Finally, we also defined a few other properties to organize the experiment properly.

"tags": [
      "cloud",
      "tp"
    ],
    "contributions": {
      "reliability": "high",
      "security": "none",
      "scalability": "medium"
    },
    "configuration": {
      "api_url": "https://api.this.is.test.net",
      "verify_ssl": false,
      "org_name": "autoreports",
      "app_name": "auto-management",
      "space_name": "env-test",
      "prm_rest_url": "https://autoreports.this.is.test.net/auto-management"
    },
    "secrets": {
      "appsecret": {
        "username": {
          "type": "env",
          "key": "USERNAME"
        },
        "password": {
          "type": "env",
          "key": "PASSWORD"
        }
      }
    }

The Master of Disaster wrote the Chaos Toolkit experiments prior to the GameDay and ran the Chaos Toolkit command line interface from a local machine to execute the experiments defined in the JSON files. To execute the terminate-instance experiment, the command would be:

chaos run {terminate-experiment-file-name}.json
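Because the secrets in the experiment file are declared with "type": "env", the credentials have to be exported into the environment before running the CLI. Using an illustrative file name and placeholder values:

# Chaos Toolkit resolves "type": "env" secrets from environment variables
export USERNAME=chaos-gameday-user   # placeholder, matches the "USERNAME" key above
export PASSWORD='********'           # placeholder, matches the "PASSWORD" key above
chaos run terminate-instance.json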

To be honest, our first GameDay was as disastrous as it could have been, and we weren't ready to exhibit grace under pressure. Our system dashboard didn't show the expected metrics, and a few of the alerts didn't respond to specific events, but this helped the developers carefully choose strategies to avoid failures and improve response times.

Concluding the GameDay

We conducted a postmortem with a detailed discussion of each experiment:

Experiment 1: Stop the app in test environment

  • Hypothesis: Experiment to stop the application. The goal is to see what happens when an app is stopped. Will it start automatically? We also expected our dashboards to show that the service is down.
  • Result: When the app was stopped, it didn't start automatically, and we had to manually start it to continue with the experiment. The dashboards didn't display the service outage, so we couldn't validate that part of the hypothesis.

Experiment 2: Throw an exception in test environment

  • Hypothesis: Experiment to forcefully throw a database connection exception. The goal is to see what happens when we encounter that exception. As a team, we expected the exception to show up in the log management system. We also expected our dashboards to show the outage.
  • Result: When the connection exception was thrown, it wasn't logged in the log management system. This is likely because the exception was injected by the chaos tool rather than raised from within our application code, so it bypassed our normal logging path. We also couldn't validate the behavior in our dashboard since it wasn't connected to the test environment.

Postmortem

  • Are the alerts set up properly in the log management system?
  • Are our dashboards properly set up to display service status when it is down?
  • How does load balancing work and how do we manage resources if a server instance goes down?
  • Is there value in adding chaos to applications beyond a single product team, to see how a broader audience reacts to disruption?

Bringing chaos to the pipeline

I strongly recommend integrating chaos experiments into the pipeline to help ensure new code performs as expected. We built a pipeline that injects a controlled amount of failure and is scheduled to run at intervals.

[Figure: chaos pipeline]
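Our internal pipeline tooling is specific to State Farm, so the sketch below is only an approximation of the idea in a GitLab CI-style configuration; the stage name, image, schedule rule, and experiment file name are all assumptions for illustration:

stages:
  - chaos

chaos-experiments:
  stage: chaos
  image: python:3.11
  script:
    - pip install chaostoolkit chaostoolkit-cloud-foundry
    - chaos run terminate-instance.json
  rules:
    # Only run when triggered by a pipeline schedule, not on every commit
    - if: $CI_PIPELINE_SOURCE == "schedule"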

Putting observability to use

While chaos engineering helped us explore the ways a system could fail, observability dashboards helped us uncover the causes of those failures. This further helped us measure mean time to recovery and plan strategies to fix outages more quickly.

[Figure: observability dashboard]

Conclusion

In summary, here are the high-level steps for implementing chaos engineering for a product:

  1. Add the necessary configurations to the application.
  2. Start small by running simple POST requests to experiment with Chaos Monkey.
  3. Set up a Game Day to run experiments using Chaos Toolkit.
  4. Build a chaos pipeline to strengthen confidence in the resiliency of the system.
  5. Leverage Observability through chaos engineering to inspect and debug failures.

This approach has helped build additional confidence in our system. It is always a challenge to introduce new technologies while the product teams juggle competing priorities, but it’s worth it. Even though hot chocolate is one of my favorite drinks, I would prefer a good night’s sleep over it any day.

To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.