Snapsheet Claims Outage

Incident Report for Snapsheet

Postmortem

Incident summary

Between approximately 5:15 PM CT and 8:15 PM CT on July 30th, 2024, the US instance of Snapsheet Claims experienced degraded response times, resulting in timeouts and pages failing to load for users.

The event was triggered by an AWS outage that affected 64 AWS services in the us-east-1 region. The outage started at approximately 5:00 PM CT, was first reported by AWS on their status page at 5:40 PM CT, and was marked as resolved by AWS at 11:55 PM CT.

We apologize for the inconvenience this caused and are committed to continually improving the resiliency of the Snapsheet Claims platform.

Timeline

All times are CT.

5:10 PM: An automated alert indicated that response times were degraded for one of the Snapsheet Claims backend services, and engineering began investigating immediately

5:15 PM - 5:35 PM: Several support tickets were raised by clients indicating that they were experiencing issues loading pages within Snapsheet Claims

5:35 PM: Response times continued to escalate across two different Snapsheet Claims backend services and an incident was published on the Snapsheet status page

5:40 PM: AWS reported that an ongoing incident in the us-east-1 region was impacting multiple services

6:00 PM: Snapsheet engaged directly with AWS resources to get more information on the impact and mitigation options for the incident

7:00 PM: Response times returned to normal for one of the two impacted Snapsheet Claims backend services

8:00 PM: Response times started improving for the remaining degraded Snapsheet Claims backend service

8:15 PM: Response times returned to normal for Snapsheet Claims and the incident was considered resolved after a period of monitoring
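
The intervals implied by the timeline above can be checked with a short calculation. The following is a minimal sketch in Python, not part of any Snapsheet tooling: the event labels and the `minutes_between` helper are illustrative names, and the times are copied from the timeline as written.

```python
from datetime import datetime

# Times (CT, July 30th, 2024) copied from the timeline above.
# The dictionary keys are illustrative labels, not Snapsheet terminology.
TIMELINE = {
    "aws_outage_start": "5:00 PM",
    "first_alert": "5:10 PM",
    "status_page_incident": "5:35 PM",
    "aws_status_report": "5:40 PM",
    "resolved": "8:15 PM",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both on the same day."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Gap between the approximate AWS outage start and the first automated alert.
alert_lag = minutes_between(TIMELINE["aws_outage_start"], TIMELINE["first_alert"])

# Gap between the first alert and the incident appearing on the status page.
publish_lag = minutes_between(TIMELINE["first_alert"], TIMELINE["status_page_incident"])

# Time from the first alert until response times returned to normal.
time_to_resolve = minutes_between(TIMELINE["first_alert"], TIMELINE["resolved"])

print(alert_lag, publish_lag, time_to_resolve)  # prints: 10 25 185
```

Under these figures, the automated alert fired about 30 minutes before AWS first reported the outage on its own status page.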

Root Cause

The event was triggered by an AWS outage that affected 64 AWS services in the us-east-1 region. The outage started at approximately 5:00 PM CT, was first reported by AWS on their status page at 5:40 PM CT, and was marked as resolved by AWS at 11:55 PM CT.

AWS had an issue with the Kinesis service. Kinesis is a fully managed AWS service that enables real-time collection, processing, and analysis of streaming data at scale.

Amazon CloudWatch experienced elevated error rates and latencies due to its dependency on the degraded Kinesis service. Amazon CloudWatch is a dependency across most Amazon services for logging and monitoring. This led to cascading failures across 64 AWS services as indicated in the AWS outage details.

Snapsheet does not use Amazon Kinesis directly but was impacted by the internal dependency that Amazon CloudWatch has on Kinesis. Two Snapsheet Claims backend services that rely on Elastic Load Balancing and Amazon Elastic Container Service were impacted despite being deployed across six availability zones (discrete data centers). Both AWS services were affected because they depend on CloudWatch for logging.

Because we did not have direct visibility into the AWS issue, we relied on correspondence with AWS, who confirmed through their back-end tooling and logging that the two Snapsheet Claims backend services were impacted by the AWS outage.

Preventative Measures

All Snapsheet platform services are available across multiple availability zones within each region and are configured to fail over automatically when a disruption occurs. Unfortunately, in this case, the AWS internal dependency on Amazon CloudWatch caused failures across all availability zones simultaneously.
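
A toy model can illustrate why redundancy across availability zones did not help in this case. The sketch below is a deliberate simplification under stated assumptions: the `service_available` function and the zone labels are hypothetical and do not describe Snapsheet's actual architecture or AWS internals. It only shows that a dependency shared by every zone removes the benefit of having several zones.

```python
def service_available(zone_healthy: dict[str, bool],
                      shared_dependency_healthy: bool) -> bool:
    """Toy model: the service is up only if at least one zone is healthy
    AND the region-wide dependency that every zone relies on is healthy."""
    return shared_dependency_healthy and any(zone_healthy.values())


# Six hypothetical zone labels, all healthy to begin with.
zones = {f"zone-{i}": True for i in range(1, 7)}

# Case 1: a single zone fails. The remaining zones can absorb the traffic.
one_zone_down = {**zones, "zone-1": False}
print(service_available(one_zone_down, shared_dependency_healthy=True))  # True

# Case 2: every zone is healthy, but the shared dependency is degraded.
# No zone can serve requests, so there is nothing to fail over to.
print(service_available(zones, shared_dependency_healthy=False))  # False
```

In this simplified view, automatic failover between zones addresses the first case only; the second case requires moving to a different region or removing the shared dependency.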

Snapsheet also has the ability to restore services across AWS regions. Because this incident originated in internal AWS service dependencies, Snapsheet did not have visibility into the source of the issue until Amazon provided additional information. This made it difficult to determine whether we should start moving certain services to a different region.

Snapsheet will work with AWS as additional details of their root cause analysis become available and will investigate multiple options for preventing and mitigating similar issues in the future.

Posted Jul 31, 2024 - 14:02 CDT

Resolved

This incident has been resolved. We will continue to monitor performance and provide a post-mortem as soon as possible.
Posted Jul 30, 2024 - 21:34 CDT

Monitoring

We have seen improved response times and normal performance within Snapsheet Claims over the past half hour. We will continue working with AWS until they confirm that the issue is resolved, and we will produce a full post-mortem for the issue.
Posted Jul 30, 2024 - 21:00 CDT

Identified

We are remaining engaged with the AWS team as they continue to address the widespread AWS outage. They have identified the root cause and are actively working on multiple parallel paths to mitigate the issue. We will provide updates as they become available.
Posted Jul 30, 2024 - 20:00 CDT

Update

We are continuing to investigate, but the issue appears to be caused by the AWS outage. We are working with the AWS team to get more information and will provide updates as they become available.
Posted Jul 30, 2024 - 18:56 CDT

Update

We have identified that one of our underlying vendors, AWS, is reporting issues across multiple services: https://health.aws.amazon.com/health/status. We are continuing to investigate and will continue to provide updates as they become available.
Posted Jul 30, 2024 - 17:57 CDT

Investigating

We are currently experiencing issues with accessing Snapsheet Claims. We are investigating urgently and will provide an update as soon as possible.
Posted Jul 30, 2024 - 17:35 CDT
This incident affected: Snapsheet Payments (US, EU), Snapsheet Claims (US, EU), and Snapsheet Appraisal Services (US, EU).