AWS Multi-Account Management using Cloud Custodian and Serverless Compute

A simple and scalable approach to using Cloud Custodian for AWS governance, security and cost controls.

Ryan Ash
Senior Technology Engineer

Whether administering three or 300 AWS accounts, it is essential to implement consistent security and governance policies. The goal is to enable users to leverage all the AWS services while remaining within the guardrails defined by your company. Policies should be relatively easy to create and implement. Having a stable deployment and runtime for these policies allows administrators to focus their valuable time elsewhere. Cloud Custodian policies provide the flexibility to write governance rules in YAML, a data serialization language, while remaining simple to deploy and support.

In this article we will discuss a sample deployment of Cloud Custodian which is fully automated, scalable, and surprisingly simple. We will also delve into some customizations which have allowed us to better utilize Cloud Custodian. Let’s jump right into it!

Previous Architecture:

Before discussing the latest architecture, it is important to understand its predecessor. Cloud Custodian is intentionally non-prescriptive about how it’s implemented. This provides flexibility to use various implementation models, each with its own pros and cons. It is also possible to combine models, such as event-driven and periodic scans. Initially, I loved the idea of a decentralized model, where policies were run from the member accounts and results were sent to a centralized event bus for processing. This model utilizes Cloud Custodian’s ability to deploy Lambda functions for each policy.

Cloud Custodian Architecture Before

This is a high-level diagram which does not include all the aspects of a complete security and governance solution.

Let’s draw attention to a few components of the architecture above. Routing all the detection results back through a single SQS queue provides an aggregation point for downstream processing. If you have additional element managers such as GuardDuty which feed your ticketing system, you may require an additional queue which does the necessary transforms and processing for that source. This architecture also uses a combination of event-based and periodic policies. Event-based policies are primarily used for more severe or time-sensitive situations. The Git CI/CD pipeline is ultimately responsible for deploying Cloud Custodian resources into member accounts. A key component of this deployment was c7n-org, which uses policy tags to apply policies to the correct environments.

Policy Examples

Example of a periodic policy which will deploy a Lambda and associated CloudWatch trigger:

- name: aws-periodic-rds-encrypted
  description: "Notify of unencrypted RDS instance"
  resource: aws.rds
  tags:
    - environment:Test
    - severity:major
  mode:
    type: periodic
    role: arn:aws:iam::{account_id}:role/custodian-lambda-role
    schedule: "rate(1 day)"
  filters:
    - StorageEncrypted: false
  actions:
    - type: notify
      transport:
        type: sqs
        queue: arn:aws:sqs:us-east-1:111122223333:my_queue

Example of an event-based policy which will deploy a Lambda and associated CloudWatch trigger watching for the CloudTrail events:

- name: aws-event-rds-encrypted
  description: "Notify of unencrypted RDS instance"
  resource: aws.rds
  tags:
    - environment:Test
    - severity:major
  mode:
    type: cloudtrail
    events:
      - CreateDBInstance
  filters:
    - StorageEncrypted: false
  actions:
    - type: notify
      transport:
        type: sqs
        queue: arn:aws:sqs:us-east-1:111122223333:my_queue

While successful, there were two main concerns with this model. First, it requires the deployment pipeline to deploy multiple Lambda functions into each member account. Removing a policy required custom automation to loop through each account and then delete the related Lambda and CloudWatch resources. When our business partners logged into their freshly built AWS account, they had to navigate around some 100 Cloud Custodian Lambda functions. This approach is not very efficient.
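As a rough illustration of the cleanup logic such automation needs, the sketch below works out which deployed functions no longer correspond to a policy. It assumes the policies were deployed with Cloud Custodian's default `custodian-` function-name prefix; the function name `stale_functions` and the sample data are hypothetical, and the actual deletion calls in each account are left out.

```python
"""Sketch: find Cloud Custodian Lambda functions whose policy was removed."""

# Assumption: policies were deployed with Custodian's default name prefix.
FUNCTION_PREFIX = "custodian-"


def stale_functions(deployed_functions, current_policies):
    """Return deployed Custodian function names that match no current policy."""
    wanted = {FUNCTION_PREFIX + name for name in current_policies}
    return sorted(
        fn for fn in deployed_functions
        if fn.startswith(FUNCTION_PREFIX) and fn not in wanted
    )


if __name__ == "__main__":
    # Hypothetical function names found in one member account.
    deployed = [
        "custodian-aws-periodic-rds-encrypted",
        "custodian-old-policy",
        "team-app",
    ]
    print(stale_functions(deployed, ["aws-periodic-rds-encrypted"]))
```

Functions without the prefix are ignored, so workloads owned by the account's users are never candidates for deletion.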

Reporting is the second concern. While Cloud Custodian had been doing a wonderful job identifying where violations existed, we wanted to better understand additional dimensions of the detection policies overall:

  • What does the trend look like per region, per policy, per resource?
  • Are detections increasing or decreasing over time?
  • How could we include metrics from other element managers (data sources)?

Our goal was to address these questions with the new architecture.

New Architecture:

The new architecture fulfills all of our must-have requirements, with the added benefit of scalability by utilizing serverless resources. The deployment pipeline will build all the necessary components for the container to run as an ECS task, providing target information as environment variables when the tasks are queued up by the invoking Lambda.
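To make the hand-off between the invoking Lambda and the ECS task more concrete, here is a minimal sketch of how the per-target environment variables might be assembled. The variable names `TARGET_ACCOUNT` and `TARGET_REGION`, the container name, and the function name are assumptions for illustration, not the names used in our deployment.

```python
"""Sketch: build one ECS task override per account/region target."""
from itertools import product


def build_task_requests(accounts, regions, container_name="custodian"):
    """Return one set of container overrides for every account/region pair.

    TARGET_ACCOUNT and TARGET_REGION are hypothetical variable names.
    """
    requests = []
    for account_id, region in product(accounts, regions):
        requests.append({
            "overrides": {
                "containerOverrides": [{
                    "name": container_name,
                    "environment": [
                        {"name": "TARGET_ACCOUNT", "value": account_id},
                        {"name": "TARGET_REGION", "value": region},
                    ],
                }]
            }
        })
    return requests


if __name__ == "__main__":
    for request in build_task_requests(["111122223333"], ["us-east-1", "us-west-2"]):
        print(request)
```

Each dictionary has the shape of the `overrides` argument accepted by the ECS RunTask API, so the invoking Lambda could combine it with a cluster and task definition when queuing a task.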

Cloud Custodian Architecture After

We currently have multiple paths for output from our policies. The default action sends a message to SQS. This is the initial step of an event-driven framework whose results ultimately end up in the single-pane-of-glass dashboard. The secondary path also uses SQS, but carries the custom reporting data discussed below.

Ideally, Cloud Custodian would be the only tool needed to track internal compliance across various AWS accounts. Some complex compliance scenarios combine multiple resources, which can be difficult, if not impossible, to accomplish with Cloud Custodian. In these rare instances where Cloud Custodian is not a good fit for the desired compliance policy, we’ve created a simple process for introducing Lambda-based detections. The same master process which invokes hourly Custodian checks for each account and region will publish a similar entry to SNS with target information. All custom Lambda policies subscribe to this topic and run against each account, with the same output as a native Cloud Custodian policy (i.e., SQS).
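A custom detection of this kind could be structured along the following lines. This is only a sketch: the message fields (`account_id`, `region`) and the finding schema are hypothetical, and it does not reproduce the message format that Cloud Custodian's own notify action emits. Only the `Records[].Sns.Message` envelope is the standard shape of an SNS-triggered Lambda event.

```python
"""Sketch: parse SNS target messages and shape a finding for SQS."""
import json


def parse_targets(event):
    """Extract the account/region targets from an SNS-triggered Lambda event."""
    targets = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        # Field names are assumptions about what the master process publishes.
        targets.append({
            "account_id": message["account_id"],
            "region": message["region"],
        })
    return targets


def build_finding(policy_name, target, resource_ids):
    """Serialize a detection result (hypothetical schema) as an SQS message body."""
    return json.dumps({
        "policy": policy_name,
        "account_id": target["account_id"],
        "region": target["region"],
        "resources": sorted(resource_ids),
    }, sort_keys=True)
```

A Lambda handler would call `parse_targets(event)`, run its detection logic against each target, and send the result of `build_finding` to the shared queue.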

Reporting Model Before and After:

One of the primary drivers for refactoring our Cloud Custodian architecture was to improve reporting. Previously when asked about our ‘moment in time’ internal compliance rate, we’d simply show Alerta, our single pane of glass for alerts. Unfortunately, this implementation could only provide visibility into events where a resource violated a defined policy. In other words, we lacked the ability to say ‘99% of our EC2 servers are compliant with rule X’.

Cloud Custodian Reporting

It turns out that doing this with Cloud Custodian requires a bit of additional work. Cloud Custodian has native features for writing metrics to CloudWatch, but they didn’t exactly match our needs. To satisfy internal reporting requirements, we run a blank policy for each active resource type to get a full list of resources. The primary script of the ECS container task compares the results of each policy against the resources found by this baseline policy. Since we have a different policy file for each AWS resource type, there is a baseline policy within each policy file. For example, imagine you have 10 RDS instances, nine of which are encrypted and passing, and one which is not. This baseline policy will return 10 instances, and the encryption policy will return one failure, allowing us to publish results indicating a 90% pass rate and a 10% failure rate.

- name: aws-baseline
  resource: aws.rds
  description: "Baseline"
  filters:
    - "tag:this-tag-will-never-exist": absent

For visualizations we often leverage Amazon QuickSight, which is simple to implement and provides quality business intelligence features. QuickSight also has native integration with Amazon Timestream, and pushing our metrics into Timestream created a simple metrics storage solution. As the detections are processed, we attach all the necessary dimensions such as account name, region, policy/detection, aligned product, etc. As a whole, this provides our team with a view of our overall security and governance internal compliance state, which can be trended over time.
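As an illustration of attaching dimensions to a metric, the sketch below builds a single record in the shape accepted by the Timestream WriteRecords API. The dimension names, the measure name `pass_rate`, and the function name are assumptions; the actual write call and the database and table names are omitted.

```python
"""Sketch: shape one compliance metric as a Timestream record."""
import time


def to_timestream_record(pass_rate, dimensions, measure_name="pass_rate", now_ms=None):
    """Build a record dictionary carrying the metric and its dimensions."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return {
        # Timestream expects dimension and measure values as strings.
        "Dimensions": [
            {"Name": name, "Value": str(value)}
            for name, value in sorted(dimensions.items())
        ],
        "MeasureName": measure_name,
        "MeasureValue": str(pass_rate),
        "MeasureValueType": "DOUBLE",
        "Time": str(now_ms),
        "TimeUnit": "MILLISECONDS",
    }


if __name__ == "__main__":
    # Hypothetical dimension values for one policy in one account and region.
    dims = {"account": "sandbox", "region": "us-east-1", "policy": "aws-rds-encryption"}
    print(to_timestream_record(90.0, dims))
```

A list of such records could then be passed as the `Records` argument of a Timestream write request.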

Plan Ahead for Exceptions:

Below is a simple policy filter which demonstrates how we use tags as an exception process for a given policy. These tags can also be reported, allowing us to create metrics while helping ensure that both users and administrators at State Farm understand the accepted risk.

 - name: aws-rds-encryption
   resource: aws.rds
   filters:
     - not:
       - "tag:aws-rds-encryption": "PlatformException"
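To report on exceptions as well as violations, resources can be split by the same tag that the filter checks. The sketch below assumes each resource dictionary carries a `Tags` list of `Key`/`Value` pairs and that the tag key matches the policy name, as in the filter above; the function name is hypothetical.

```python
"""Sketch: separate resources with an approved exception tag from the rest."""


def split_exceptions(resources, policy_name, exception_value="PlatformException"):
    """Return (excepted, in_scope) lists based on the policy's exception tag."""
    excepted, in_scope = [], []
    for resource in resources:
        # Assumption: tags are stored as a list of {"Key": ..., "Value": ...}.
        tags = {t["Key"]: t["Value"] for t in resource.get("Tags", [])}
        if tags.get(policy_name) == exception_value:
            excepted.append(resource)
        else:
            in_scope.append(resource)
    return excepted, in_scope


if __name__ == "__main__":
    sample = [
        {"DBInstanceIdentifier": "db-1",
         "Tags": [{"Key": "aws-rds-encryption", "Value": "PlatformException"}]},
        {"DBInstanceIdentifier": "db-2", "Tags": []},
    ]
    excepted, in_scope = split_exceptions(sample, "aws-rds-encryption")
    print(len(excepted), "excepted,", len(in_scope), "in scope")
```

Counting the excepted list per policy gives a simple metric for how much accepted risk each policy carries.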

Docker Container

Behind our hourly Cloud Custodian checks is a Docker container, which is run as an ECS task. Cloud Custodian maintains a Docker image on Docker Hub which provides a good starting place, and additional examples can be found on GitHub. If you run into “Access Denied” issues, it is important to remember how IAM roles work for non-PID 1 processes within these containers. In a future article, I will provide more detail on how we addressed this issue for our GitLab pipeline using ECS tasks on Fargate. For our container, we have a custom ENTRYPOINT script which controls the execution of the c7n-org and c7n commands. This script also collects and processes our custom data for reporting.
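An entrypoint of this kind might assemble its c7n-org invocation from the environment roughly as follows. This is a sketch under assumptions, not our actual script: the environment variable names and default file names are hypothetical, while `-c`, `-u`, `-s`, `--accounts` and `--region` are c7n-org's options for the accounts file, policy file, output directory, account filter and region.

```python
"""Sketch: build the c7n-org command line from task environment variables."""
import os
import shlex


def build_c7n_org_command(env):
    """Return the c7n-org argument list for the target described in env."""
    cmd = [
        "c7n-org", "run",
        "-c", env.get("ACCOUNTS_FILE", "accounts.yml"),  # hypothetical variable
        "-u", env.get("POLICY_FILE", "policies.yml"),    # hypothetical variable
        "-s", env.get("OUTPUT_DIR", "output"),           # hypothetical variable
    ]
    if env.get("TARGET_ACCOUNT"):
        cmd += ["--accounts", env["TARGET_ACCOUNT"]]
    if env.get("TARGET_REGION"):
        cmd += ["--region", env["TARGET_REGION"]]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; a real entrypoint would
    # execute it and then post-process the output directory.
    print(shlex.join(build_c7n_org_command(os.environ)))
```

Keeping the command construction in a small function makes it straightforward to test the entrypoint logic without invoking Cloud Custodian itself.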

Conclusion

There are many native, commercial, and open source tools to address cloud security and governance. Cloud Custodian is a flexible, simple-to-use option that’s sophisticated enough to tackle most common requirements. It is worth noting that the creators of Cloud Custodian have recently started a new company called Stacklet, with Cloud Custodian at its core.

To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.