Identify repeated events and prevent duplicated data with AWS services
Avoid data duplication
Is such a thing possible in the AWS serverless space?
Recently I experienced something that you usually hear about when reading up on cloud and distributed systems: everything fails, and you need to plan for it. This doesn’t just mean infrastructure failure but also software failure. If you have ever gone through the documentation for AWS messaging services like SNS and SQS, I am sure you have read that they promise at-least-once message delivery and do not guarantee exactly-once delivery. In other words, it is your job as a programmer to handle duplicate deliveries safely, because even AWS resources can fail.
It all came back to “everything fails, and you need to plan for it” when I recently ran into an issue where idempotency turned out to be the key to solving it. Let me first try to define idempotency in my own words: a functionality is idempotent when invoking it multiple times with the same input produces the same result as invoking it once. My process was not idempotent; a message holding the same values resulted in multiple documents.
How it all panned out
The setup I was working with looks like this: a Lambda function, scheduled to run once a day, queries a batch of data from DynamoDB and publishes a message to an SNS topic. Another Lambda function is subscribed to that topic and processes the message. At the end of processing each message, exactly one document should be stored in the table based on the message data received.
This implementation failed on me, and it was because I didn’t keep a couple of things in mind that revolve around “things fail all the time”: a message can be received more than once in rare situations, and I had modeled the document without enforcing uniqueness or making the business process idempotent. The unique document that should have been stored in the database once was stored twice.
A step towards achieving the needed state
The cool thing is that with some changes I was able to overcome this issue completely, or so I thought. At the time I had no clue what caused it: was the scheduler sending two different messages with the same record because of data manipulation, was the message being received more than once by the processor because SNS was having issues, or did the processor have some edge case that caused it to process the same message twice? When in doubt, apply safeguards at every layer until it’s fixed, so the approaches I took had to cover multiple layers and overlap with each other.
Leveraging Topics & Queues
Let’s start with the messaging service. The good thing is that SNS has a capability that helps decrease the chances of receiving the same message more than once: the SNS FIFO topic. FIFO topics support message deduplication, and you can configure it either to deduplicate based on the content of the message being published or by providing a deduplication ID. In my case, every message is unique: it consists of two properties holding UUIDs for two different types of documents, and the combination of the two makes the message unique. So I went for deduplication based on the content of the message.
Distributed systems (like SNS) and client applications sometimes generate duplicate messages. You can avoid duplicated message deliveries from the topic in two ways: either by enabling content-based deduplication on the topic, or by adding a deduplication ID to the messages that you publish. With message content-based deduplication, SNS uses a SHA-256 hash to generate the message deduplication ID using the body of the message. After a message with a specific deduplication ID is published successfully, there is a 5-minute interval during which any message with the same deduplication ID is accepted but not delivered. If you subscribe a FIFO queue to a FIFO topic, the deduplication ID is passed to the queue and it is used by SQS to avoid duplicate messages being received.
To achieve exactly-once message delivery, some conditions must be met:
- An SQS FIFO queue is subscribed to the SNS FIFO topic
- The SQS queue consumer processes the messages and deletes them before the visibility timeout expires
- There is no message filtering on the SNS subscription
- There must be no network disruptions that could prevent acknowledgment of the message delivery
All of these conditions can be met through the configuration we control, except for the last one, “There must be no network disruptions”, which we will take care of in the next section. With some adjustments to our architecture, it should look like the flow diagram below.
In the code snippet below, we pass the MessageDeduplicationId property a value composed of the two values whose combination is always unique.
import SNS from 'aws-sdk/clients/sns';

const sns = new SNS({ region: process.env.AWS_REGION });

// Two UUIDs that, combined, uniquely identify the document to be created.
const xValue = 'x_unique_value';
const yValue = 'y_unique_value';

await sns.publish({
  TopicArn: process.env.example_fifo_TOPIC_ARN,
  Message: JSON.stringify({ x: xValue, y: yValue }),
  // Explicit deduplication ID built from the two unique values.
  MessageDeduplicationId: `${xValue}#${yValue}`,
  // FIFO topics also require a message group ID; a single static group is used here for illustration.
  MessageGroupId: 'documents',
}).promise();
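One thing worth calling out in the snippet above: publishing to a FIFO topic also requires a MessageGroupId, set to a single static value here purely for illustration. And because the topic has content-based deduplication enabled, the explicit MessageDeduplicationId is technically optional; when provided, it takes precedence over the SHA-256 hash of the message body.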
Using the serverless framework, it should look something like the following:
exampleFifoSNSTopic:
  Type: AWS::SNS::Topic
  Properties:
    FifoTopic: true
    ContentBasedDeduplication: true
    TopicName: exampleFifoSNSTopic.fifo

exampleFifoSQSQueue:
  Type: AWS::SQS::Queue
  Properties:
    FifoQueue: true
    ContentBasedDeduplication: true
    QueueName: exampleFifoSQSQueue.fifo

exampleSNSTopicSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    RawMessageDelivery: true
    TopicArn:
      Ref: exampleFifoSNSTopic
    Protocol: sqs
    Endpoint:
      Fn::GetAtt:
        - exampleFifoSQSQueue
        - Arn

exampleSQSSNSPolicy:
  Type: AWS::SQS::QueuePolicy
  Properties:
    Queues:
      - Ref: exampleFifoSQSQueue
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Action: SQS:SendMessage
          Effect: Allow
          Principal:
            Service: "sns.amazonaws.com"
          Resource:
            - Fn::GetAtt:
                - exampleFifoSQSQueue
                - Arn
          Condition:
            ArnEquals:
              aws:SourceArn:
                Ref: exampleFifoSNSTopic
There is a caveat though: the deduplication only happens within a 5-minute interval. Simply put, if the same message (same deduplication ID or message content) is sent again more than 5 minutes after a previous one, SNS and SQS won’t deduplicate it for us and will treat it as a new message that was not sent before. This is tackled in the next section.
Leveraging Composite Keys in DynamoDB
We can take things one step further to make sure we only ever store a single document by taking advantage of composite keys in DynamoDB. The combination of the partition key (PK) and sort key (SK) attribute values forms the table’s primary key, which means no two items with the same combined PK and SK values can exist in the table.
The good thing about composite keys is that if anything goes wrong during publishing and consuming and we end up receiving the same message more than once, the composite key guarantees idempotency as a last resort when storing the document in the table.
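To make that concrete, below is a minimal sketch of what the consuming Lambda’s write could look like. It assumes the aws-sdk v2 style used in the publish snippet above, a hypothetical EXAMPLE_TABLE_NAME environment variable, and a PK/SK layout built from the two UUIDs carried in the message; adjust the names to your own setup. The ConditionExpression rejects the write when an item with the same composite key already exists, so a duplicate delivery is simply skipped instead of producing a second document.

import { SQSEvent } from 'aws-lambda';
import DynamoDB from 'aws-sdk/clients/dynamodb';

const documentClient = new DynamoDB.DocumentClient({ region: process.env.AWS_REGION });

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    // With RawMessageDelivery enabled, the SQS body is the JSON we published to SNS.
    const { x, y } = JSON.parse(record.body) as { x: string; y: string };

    try {
      await documentClient.put({
        TableName: process.env.EXAMPLE_TABLE_NAME!, // hypothetical table name
        Item: {
          PK: x, // partition key: first UUID
          SK: y, // sort key: second UUID
          processedAt: new Date().toISOString(),
        },
        // Reject the write if an item with this composite key already exists.
        ConditionExpression: 'attribute_not_exists(PK)',
      }).promise();
    } catch (error) {
      // A duplicate delivery trips the condition; treat it as already processed.
      if ((error as { code?: string }).code === 'ConditionalCheckFailedException') {
        continue;
      }
      throw error;
    }
  }
};

Even without the condition, a plain PutItem with an existing composite key would overwrite the item rather than create a second one, so the write stays idempotent; the condition just makes the “already processed” case explicit and keeps a duplicate delivery from touching the stored document at all.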
Conclusion
In programming, idempotency refers to the capacity of an application or component to identify repeated events and prevent duplicated, inconsistent, or lost data. Idempotency can be ensured in multiple ways. One of them is what we discussed in this article: using FIFO topics and queues to deduplicate the exact same message within a certain time interval, and guaranteeing idempotency all the way down to the data layer with DynamoDB composite keys, which won’t allow storing a duplicate of a document whose composite key already exists.
This all came out of an experience during a data migration, while looking for ways to avoid data duplication and, at the same time, reduce the cost of storing unneeded copies of data.
To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.