Between approximately 5:15 PM CT and 8:15 PM CT on July 30th, 2024, the US instance of Snapsheet Claims experienced degraded response times, resulting in timeouts and pages failing to load for users.
The event was triggered by an AWS outage that affected 64 AWS services in the us-east-1 region. The outage started at approximately 5:00 PM CT, was first reported by AWS on their status page at 5:40 PM CT, and was marked as resolved by AWS at 11:55 PM CT.
We apologize for the inconvenience this caused and are committed to continually improving the resiliency of the Snapsheet Claims platform.
All times are CT.
5:10 PM: An automated alert indicated that response times were degraded for one of the Snapsheet Claims backend services and engineering began investigating immediately (an illustrative sketch of this kind of alert follows the timeline)
5:15 PM - 5:35 PM: Several support tickets were raised by clients indicating that they were experiencing issues loading pages within Snapsheet Claims
5:35 PM: Response times continued to escalate across two different Snapsheet Claims backend services and an incident was published on the Snapsheet status page
5:40 PM: AWS reported that an ongoing incident in the us-east-1 region was impacting multiple services
6:00 PM: Snapsheet engaged directly with AWS support to get more information on the impact and mitigation options for the incident
7:00 PM: Response times returned to normal for one of the two impacted Snapsheet Claims backend services
8:00 PM: Response times started improving for the remaining degraded Snapsheet Claims backend service
8:15 PM: Response times returned to normal for Snapsheet Claims and the incident was considered resolved after a period of monitoring
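For illustration only, the sketch below shows the kind of CloudWatch alarm on load balancer response time that can generate an automated alert like the one at 5:10 PM. It is a minimal boto3 example; the alarm name, load balancer dimension, threshold, and notification topic are all hypothetical and do not describe Snapsheet's actual monitoring configuration.

    # Hypothetical sketch: a CloudWatch alarm that notifies on-call engineers when
    # average target response time on an Application Load Balancer stays elevated.
    # All names, thresholds, and ARNs below are placeholders for illustration.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="claims-backend-response-time-degraded",  # hypothetical name
        Namespace="AWS/ApplicationELB",
        MetricName="TargetResponseTime",
        Dimensions=[
            # The dimension value comes from the ALB ARN suffix; placeholder here.
            {"Name": "LoadBalancer", "Value": "app/claims-backend/0123456789abcdef"},
        ],
        Statistic="Average",
        Period=60,               # evaluate 1-minute averages
        EvaluationPeriods=5,     # ...over 5 consecutive minutes
        Threshold=2.0,           # seconds; example threshold only
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder topic
    )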
The root cause of the AWS outage was an issue with the Amazon Kinesis service. Kinesis is a fully managed AWS service that enables real-time collection, processing, and analysis of streaming data at scale.
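As a concrete illustration of that streaming model, the minimal sketch below writes a single record to a Kinesis data stream with boto3. The stream name and payload are hypothetical, and, as noted below, Snapsheet does not use Kinesis directly.

    # Minimal, hypothetical sketch of a Kinesis producer: applications push records
    # into a stream and downstream consumers process them in near real time.
    # The stream name and payload are placeholders for illustration only.
    import json

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    kinesis.put_record(
        StreamName="example-event-stream",  # hypothetical stream
        Data=json.dumps({"event": "page_view", "ts": 1722380400}).encode("utf-8"),
        PartitionKey="user-123",            # determines which shard receives the record
    )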
Amazon CloudWatch, which most AWS services depend on for logging and monitoring, experienced elevated error rates and latencies due to its dependency on the degraded Kinesis service. This led to cascading failures across the 64 AWS services identified in the AWS outage details.
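To make that dependency concrete, the hedged sketch below publishes a custom metric to CloudWatch, the same style of metrics-and-monitoring call path that many AWS services rely on internally. The namespace, metric name, and dimensions are placeholders, not values from this incident.

    # Hypothetical sketch of publishing a custom metric to CloudWatch, the kind of
    # logging/monitoring path that many AWS services depend on internally.
    # Namespace, metric name, and dimensions are placeholders for illustration.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_data(
        Namespace="Example/ClaimsBackend",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "RequestLatencyMs",
                "Dimensions": [{"Name": "Service", "Value": "claims-api"}],
                "Value": 184.0,
                "Unit": "Milliseconds",
            }
        ],
    )

When CloudWatch degraded, this kind of call path slowed down or failed for the AWS services that depend on it, which is how the problem spread well beyond Kinesis itself.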
Snapsheet does not use Amazon Kinesis directly but was impacted by CloudWatch's internal dependency on Kinesis. Two Snapsheet Claims backend services that rely on Amazon Elastic Load Balancing and Amazon Elastic Container Service were affected despite being deployed across six availability zones (discrete data centers); both of those AWS services were impacted because they depend on CloudWatch for logging.
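As a rough illustration of that multi-AZ layout, the sketch below lists the availability zones an Application Load Balancer is attached to. The load balancer name is a placeholder, not one of the affected Snapsheet services.

    # Hypothetical sketch: list the availability zones an Application Load Balancer
    # is deployed into. The load balancer name is a placeholder for illustration.
    import boto3

    elbv2 = boto3.client("elbv2", region_name="us-east-1")

    response = elbv2.describe_load_balancers(Names=["example-claims-alb"])
    for lb in response["LoadBalancers"]:
        zones = [az["ZoneName"] for az in lb["AvailabilityZones"]]
        print(lb["LoadBalancerName"], "spans", len(zones), "AZs:", ", ".join(zones))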
Because we did not have direct visibility into the AWS issue, we worked with AWS, whose back-end tooling and logging confirmed that the two Snapsheet Claims backend services were impacted by the AWS outage.
All Snapsheet platform services are deployed across multiple availability zones within each region and are configured to fail over automatically when a disruption occurs. Unfortunately, in this case, the internal AWS dependency on Amazon CloudWatch caused failures across all availability zones simultaneously.
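For context on what that multi-AZ configuration can look like, the sketch below creates an ECS service whose tasks are spread across subnets in different availability zones behind a load balancer. Every identifier in it is a hypothetical placeholder rather than Snapsheet's actual setup.

    # Hypothetical sketch of a multi-AZ ECS service: tasks are placed across subnets
    # in different availability zones and registered behind a load balancer, so the
    # loss of a single zone is normally absorbed automatically.
    # All identifiers below are placeholders for illustration.
    import boto3

    ecs = boto3.client("ecs", region_name="us-east-1")

    ecs.create_service(
        cluster="example-claims-cluster",
        serviceName="claims-backend",
        taskDefinition="claims-backend:42",
        desiredCount=6,
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                # One subnet per availability zone; ECS spreads tasks across them.
                "subnets": ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"],
                "securityGroups": ["sg-0123456789abcdef0"],
                "assignPublicIp": "DISABLED",
            }
        },
        loadBalancers=[
            {
                "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/claims-backend/abc123",
                "containerName": "claims-backend",
                "containerPort": 8080,
            }
        ],
    )

Spreading tasks this way protects against the loss of a single data center, but, as this incident showed, it does not help when a shared regional dependency degrades in every zone at once.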
Snapsheet also has the ability to restore services in another AWS region. Because this incident originated in internal AWS service dependencies, Snapsheet did not have visibility into the source of the issue until AWS provided additional information, which made it difficult to determine whether to begin shifting certain services to a different region.
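One common mechanism for that kind of cross-region shift is DNS failover. The sketch below defines a hypothetical Route 53 primary/secondary record pair to illustrate the idea; it is not a description of Snapsheet's actual configuration, and all zone IDs, names, and health check IDs are placeholders.

    # Hypothetical sketch of DNS-based cross-region failover with Route 53: a primary
    # record points at the us-east-1 endpoint and a secondary record points at another
    # region; Route 53 serves the secondary if the primary's health check fails.
    # All IDs and names below are placeholders for illustration.
    import boto3

    route53 = boto3.client("route53")

    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000EXAMPLE",  # placeholder hosted zone
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "claims.example.com",
                        "Type": "CNAME",
                        "SetIdentifier": "primary-us-east-1",
                        "Failover": "PRIMARY",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "claims-us-east-1.example.com"}],
                        "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "claims.example.com",
                        "Type": "CNAME",
                        "SetIdentifier": "secondary-us-west-2",
                        "Failover": "SECONDARY",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "claims-us-west-2.example.com"}],
                    },
                },
            ]
        },
    )

With records like these in place, traffic can shift to the secondary region automatically once the primary health check fails, though deciding when to fail over deliberately, and keeping application data in sync across regions, still requires the kind of visibility that was missing during this incident.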
Snapsheet will be working with AWS as additional details of their root cause analysis become available and will be investigating multiple options for preventing and mitigating similar issues in the future.