AWS DynamoDB - Best Practices for “Exponential Backoff Retry” to Improve Resource Accessibility
Introduction
The post outlines best practices for dealing with retry techniques in AWS DynamoDB, focusing on optimizing resource accessibility, effectively calculating sufficient items for operations, and utilizing auto-scaling for capacity management. These strategies aim to ensure better performance and responsiveness in handling requests to the database.
Key Points
Exponential backoff retry is recommended as a method to manage request failures in AWS DynamoDB.
It is crucial to calculate the exact number of items needed for operations to avoid unnecessary load on the database.
Setting up auto-scaling can dynamically adjust capacity units based on demand, optimizing performance and cost.
Proper error handling strategies are essential to improve the overall reliability and efficiency of database interactions.
Leveraging retries can significantly enhance resource accessibility, especially during high-load scenarios.
Monitoring performance metrics is vital to fine-tune the retry mechanism and scaling policies.
Using best practices ensures compliance with AWS recommendations, enhancing the integration of DynamoDB in cloud applications.
Overview of our system (Stampless) with DynamoDB
Stampless is a Money Forward product which supports contract lifecycle management as a non-stop platform, covering tasks such as creating, applying for, and signing contracts. In addition, the application lets users manage organizational structures such as offices and employees. To improve the quality of service and define an appropriate roadmap, a daily report of office usage containing information such as user_count and contract_count is generated to give the development team a clearer view.
In Stampless, we leverage AWS DynamoDB - a fully managed, serverless NoSQL database service - to store office reports, as it is well suited to high write volumes and flexible schema design.
However, like any distributed system, it can encounter transient errors due to network issues, server load, or other unexpected events when serving requests. As a result, AWS DynamoDB employs a robust retry mechanism that automatically handles temporary failures and retries requests until they succeed.
Understanding the advantages and disadvantages of AWS DynamoDB's retry mechanism is crucial for optimizing application performance, ensuring data consistency, and building resilient DynamoDB services[1].
AWS DynamoDB offerings
Read/write capacity unit
Firstly, it is important to understand how DynamoDB controls its resources. DynamoDB utilizes a unique resource management system based on Read Capacity Units (RCUs) and Write Capacity Units (WCUs)[2] to handle access from applications effectively.
Each capacity unit determines how much data can be read or written per second: the limit is 4 KB per read operation and 1 KB per write operation. If an item does not fit the limit exactly, DynamoDB rounds its size up to the next whole unit.
For example, a BatchWriteItem call with 10 items of around 500 bytes each: DynamoDB rounds each item up to the 1 KB limit, so the operation requires 10 write capacity units.
→ AWS will throw a ProvisionedThroughputExceededException if the provisioned write capacity is set to only 5 WCUs.
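To make the rounding concrete, here is a minimal back-of-the-envelope sketch (plain Python, no AWS calls; the item sizes are just illustrative values matching the example above):

```python
import math

WRITE_UNIT_BYTES = 1024  # one write capacity unit covers up to 1 KB

def wcu_for_batch(item_sizes_bytes):
    """Each item is rounded up to the next full 1 KB before consuming WCUs."""
    return sum(math.ceil(size / WRITE_UNIT_BYTES) for size in item_sizes_bytes)

# 10 items of ~500 bytes each: every item is rounded up to 1 KB
print(wcu_for_batch([500] * 10))  # -> 10 WCUs, above a provisioned throughput of 5
```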
This capacity unit setting is one cause of the errors we observed during implementation, as described below.
Exponential backoff retry [3]
The following diagram illustrates how the server can retry the calls to DynamoDB until a successful response is returned. If DynamoDB doesn't return a successful response after a few tries, the server can stop retrying and return a failure to its caller.
For example, suppose a maximum of three retries is configured with an initial backoff delay of 3 seconds and a multiplier of 1.5. Each retry waits for the previous delay multiplied by 1.5:
- First retry: occurs after 3 seconds.
- Second retry: occurs after 3 × 1.5 = 4.5 seconds.
- Third retry: occurs after 4.5 × 1.5 = 6.75 seconds.
- Success: the successful response is returned to the caller.
- Failure: a 429 error code is returned once the retries are exhausted.
In the AWS SDK[4] for DynamoDB, exponential backoff retry is already implemented with a default configuration (3 retry attempts). The backoff delay is randomized using jitter because AWS wants to avoid successive collisions: randomness prevents many operations from retrying at exactly the same time. The backoff delay is usually less than 1 second, while R/WCUs are only restored after a full second.
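For reference, in the Python SDK (boto3) this retry behaviour can be tuned through the client configuration. The sketch below is only an illustration of where these settings live; the table name is hypothetical and the values mirror the defaults described above rather than recommended settings:

```python
import boto3
from botocore.config import Config

# "standard" retry mode applies exponential backoff with jitter;
# max_attempts is the total number of attempts, including the first call.
retry_config = Config(retries={"max_attempts": 3, "mode": "standard"})

dynamodb = boto3.client("dynamodb", config=retry_config)
dynamodb.describe_table(TableName="office_reports")  # hypothetical table name
```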
We rely on this built-in exponential backoff retry in our implementation because it gives better results than the alternatives (“no retry” or retrying instantly), but it can still lead to errors, as described in the observed errors below.
Scenario
In order to manage and evaluate feature usage, our system (Stampless) generates a web-based report containing the user_count and contract_count of each office, as mentioned at the beginning. Suppose there are around 1,000 offices and each office report is around 500 bytes; we need to write the daily reports for all of these offices to DynamoDB in a single running process. The provisioned write capacity is 5 WCUs, with no auto-scaling in the AWS environment.
Let's assume the process sends 25 write item requests per second to generate the report described above. Here are the possible approaches:
“No retry”:
Only 5 out of 25 records succeed due to the write capacity unit limitation.
Cannot save 100% of the records.
“Retry”:
The application catches the failed items and retries until all items are written.
Only 5 records can be written per second, so the application needs around 5 seconds to complete.
However, the application sends many wasted requests in the meantime.
“Exponential backoff retry”:
The application retries after a short, increasing delay.
After a few attempts, the application finally completes the process.
It reduces redundant calls to AWS compared to normal retry.
Here is an example where the number of attempts is 3 and the backoff delay is 500 ms:
Initial request: 5 successful records (20 failed).
1st retry - at 500 ms: 0 successful records, because less than 1 second has passed and no capacity has been restored.
2nd retry - at 1 s: 5 successful records (15 failed).
3rd retry - at 1.5 s: 0 successful records (same as the 1st retry).
After the 3rd retry, AWS throws an error.
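To see why this schedule runs out of attempts, here is a simplified simulation in Python. It only models 5 WCUs being refilled at each full second and ignores jitter, so the numbers are illustrative rather than exact:

```python
# Simplified model: 5 WCUs are refilled at the start of every full second,
# and each of the 25 report items consumes 1 WCU.
WCU_PER_SECOND = 5
attempt_times = [0.0, 0.5, 1.0, 1.5]   # initial request + 3 backoff retries (seconds)

remaining = 25
spent_seconds = set()                   # seconds whose capacity is already used up

for i, t in enumerate(attempt_times):
    second = int(t)
    written = 0 if second in spent_seconds else min(WCU_PER_SECOND, remaining)
    if written:
        spent_seconds.add(second)
    remaining -= written
    label = "initial request" if i == 0 else f"retry {i}"
    print(f"t={t:.1f}s {label}: wrote {written}, {remaining} still failing")

if remaining:
    print("Retry attempts exhausted -> throttling error")
```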
Observed errors
Here are common errors encountered when our system relies too heavily on the (exponential backoff) retry mechanism.
Throttling exception
The exponential backoff retry method comes with a default number of retries. As the example above shows, this magic number will not always be enough, and AWS will throw a throttling exception. We could set this number to a higher value; however, that can violate the rate limiting described below.
QuotaExceededError[5]
AWS limits resources from being accessed inefficiently by external applications. When an application abuses backoff retries, another error is thrown - QuotaExceededError. This error signals that the retry logic is not efficient, and the application is temporarily blocked from accessing the resource.
Here is the breakdown:
By default, there are 500 points for the external service to access the AWS resource.
Each failed retry reduces the available points.
An operation that succeeds on the 1st attempt adds 1 point.
→ When the points run out, the application is not allowed to access the resource for a while.
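As a rough mental model of this point system (the exact costs and refunds inside the AWS SDK differ per implementation, so the numbers below are assumptions for illustration only):

```python
class RetryQuota:
    """Illustrative token bucket following the rules listed above (values are assumed)."""

    def __init__(self, capacity=500, retry_cost=5, success_refund=1):
        self.capacity = capacity          # 500 points available by default
        self.available = capacity
        self.retry_cost = retry_cost      # assumed cost deducted per failed retry
        self.success_refund = success_refund

    def can_retry(self):
        return self.available >= self.retry_cost

    def on_failed_retry(self):
        self.available = max(0, self.available - self.retry_cost)

    def on_first_attempt_success(self):
        self.available = min(self.capacity, self.available + self.success_refund)
```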
Given these observed errors, it is not ideal to rely solely on the exponential backoff retry mechanism, which is best suited to occasional failures. Therefore, we came up with some solutions and optimizations to improve the efficiency of our system.
Best Practices
Here are some solutions that can be applied together or individually depending on the configuration of DynamoDB and the amount of data that needs to be processed.
NOTE: These long-running processes should not be coupled to APIs or other user-facing functionality because that may result in longer wait times for end users. Consider implementing them with cron jobs or queue processors.
Increase retry attempts
We catch the failed items returned by the batch API and apply exponential retry logic with a fixed base backoff delay of 500 ms. To avoid prolonging the long-running process, the number of attempts should be kept small, as the AWS SDK already performs 3 attempts by default. We increased the attempts by 2, because the second additional retry waits more than 1 second, by which time the R/WCUs are assumed to be restored. A sketch of this logic follows the note below.
→ This solves the problem described above for a small number of reports (~1,000). However, there is still a chance of violating the rate limiting logic.
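A minimal sketch of this approach with boto3 is shown below. It assumes a hypothetical table named office_reports and items already serialized in the DynamoDB attribute-value format; the 500 ms base delay and the 2 extra attempts follow the description above:

```python
import random
import time
import boto3

dynamodb = boto3.client("dynamodb")

def batch_write_with_backoff(table_name, items, extra_attempts=2, base_delay=0.5):
    """Write items in batches of 25, retrying UnprocessedItems with exponential backoff."""
    for start in range(0, len(items), 25):
        request_items = {
            table_name: [{"PutRequest": {"Item": item}} for item in items[start:start + 25]]
        }
        for attempt in range(1 + extra_attempts):
            response = dynamodb.batch_write_item(RequestItems=request_items)
            request_items = response.get("UnprocessedItems", {})
            if not request_items:
                break
            # Exponential backoff with jitter before the next attempt: ~0.5 s, ~1.0 s, ...
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.8, 1.2))
        else:
            raise RuntimeError("Items were still unprocessed after all retry attempts")
```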
Calculate sufficient items
Since QuotaExceededError can occur when many retry attempts fail, it is ideal to calculate an appropriate number of items for each batch API call. We simply pick an item to calculate a sample item size and fetch the provisioned throughput configured in the AWS environment via the DescribeTable API.
→ This solution will not violate throttling and quota limit logic. However, it may take a long time to finish the process.
→ Besides, one process can consume all the read/write capacity units so others may get starved. Therefore, it is ideal to combine with setting auto-scaling for DynamoDB.
→ The function to calculate item size[6] may not be exactly correct due to variance in the data, but this is not a common case, and we can monitor the workload to estimate the variance if needed.
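A rough sketch of this calculation with boto3 might look like the following; the sample-size estimate based on a JSON dump and the hypothetical table name are simplifying assumptions, not a precise DynamoDB item-size formula:

```python
import json
import math
import boto3

dynamodb = boto3.client("dynamodb")

def items_per_batch(table_name, sample_item):
    """Estimate how many items one batch can hold without exceeding the provisioned WCUs."""
    table = dynamodb.describe_table(TableName=table_name)["Table"]
    provisioned_wcu = table["ProvisionedThroughput"]["WriteCapacityUnits"]

    # Rough per-item size estimate from a serialized sample (attribute names + values).
    sample_size = len(json.dumps(sample_item).encode("utf-8"))
    wcu_per_item = math.ceil(sample_size / 1024)   # each write unit covers 1 KB

    # Stay within one second of provisioned capacity and the 25-item BatchWriteItem limit.
    return min(25, max(1, provisioned_wcu // wcu_per_item))
```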
Beyond these solutions, it is also worth improving other aspects of the service, such as the two additional practices below.
Reduce record size
Decreasing the size of each record allows more records to be read or written at a time. One way is to shorten the attribute (column) names, since every record sent to DynamoDB contains both the attribute names and their values.
→ This only helps when the item is larger than the limit of one capacity unit. For example, an item of 1.5 KB can be reduced to less than 1 KB so that it consumes only 1 unit.
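As a small illustration (the long and short attribute names below are made up, not our actual schema), mapping verbose attribute names to short aliases before writing shaves bytes off every item:

```python
# Hypothetical mapping from readable attribute names to short aliases.
SHORT_NAMES = {
    "office_identifier": "oid",
    "daily_user_count": "uc",
    "daily_contract_count": "cc",
}

def shorten_attributes(report: dict) -> dict:
    """Rename attributes so that the names stored alongside each value stay small."""
    return {SHORT_NAMES.get(key, key): value for key, value in report.items()}
```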
Set auto-scaling for capacity unit
Check the DynamoDB monitoring dashboards and pick the highest consumed capacity value observed when throttling issues happen. Set the auto-scaling for the read/write capacity units to that value.
→ This solution is good for occasional throttling issues that do not happen regularly. It incurs the cost of the additional units, and the highest consumed value is not stable over time.
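DynamoDB auto-scaling is driven by the Application Auto Scaling service. The sketch below registers the table's write capacity as a scalable target and attaches a target-tracking policy; the table name, the min/max capacities and the 70% target are placeholder values, not recommendations:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's write capacity as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/office_reports",                      # hypothetical table
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=50,       # e.g. the highest consumed WCU observed during throttling
)

# Scale so that consumed capacity stays around 70% of the provisioned value.
autoscaling.put_scaling_policy(
    PolicyName="office-reports-wcu-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/office_reports",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```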
Closing thoughts
In this discussion, we outlined the exponential backoff retry technique employed within AWS DynamoDB. Retries provide a valuable safety net in the face of transient errors, but they should not be the only solution. By incorporating the optimization techniques mentioned in this article, we can further enhance performance, minimize resource consumption, and improve the overall efficiency of our DynamoDB services. The implementation has yielded positive results, showing improvement in performance.