Partial service failure due to non-volatile Redis saturation

Incident Report for Gemfury

Resolved

Earlier today, we deployed a release containing a bug that exposed a legacy system to excessive traffic. The resulting load filled our core Redis instance and caused cascading failures in some functionality. In the past, we occasionally used a legacy internal system to capture debug information for rare data states, which helped us track and reproduce customer issues. This information was stored in our core non-volatile Redis instance. That approach worked for rare conditions and low-traffic code paths, but this incident occurred because we mistakenly added such tracking to heavily trafficked functionality.

May 29th, 12:43 UTC - Deployment of a release with the tracking bug

The release containing the tracking bug passed preflight and was deployed to production. Initially, everything appeared stable. However, the utilization of our non-volatile Redis, which usually hovers below 5%, slowly started to increase. Unfortunately, this increase went unnoticed.

May 29th, 15:55 UTC - Redis storage reached maximum utilization

When the non-volatile Redis storage reached 100% utilization, write operations began receiving "OOM command not allowed" error responses, resulting in 500 errors for certain user-facing APIs. Most read operations still succeeded, but not all. Regrettably, the tracking code was present in the API layer serving the Dashboard and the CLI, so those read requests also attempted a tracking write to Redis and failed with it. Worse, the overall error rate remained too low to trigger any alarms.
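One way to keep a failed debug write from failing the read request that triggered it is to treat tracking as best-effort. The following is a minimal sketch of that idea, not our actual implementation; `track_best_effort`, `handle_read`, and the `load` and `write_debug` callables are hypothetical names used for illustration only.

```python
import logging

logger = logging.getLogger("debug_tracking")


def track_best_effort(write, payload):
    """Call a hypothetical write(payload) callable and swallow any failure.

    Returns True if the write succeeded, False if it was dropped.
    """
    try:
        write(payload)
        return True
    except Exception as exc:  # e.g. a storage error such as "OOM command not allowed"
        logger.warning("debug tracking write dropped: %s", exc)
        return False


def handle_read(load, write_debug, key):
    """Serve a read, recording debug info without letting tracking fail the request."""
    value = load(key)  # the actual read; errors here still propagate
    track_best_effort(write_debug, {"key": key})  # optional side effect
    return value
```

With this shape, a rejected tracking write is logged and dropped, while the read itself still returns its value.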

May 29th, 23:04 UTC - Redis instance cleaned to restore service

Customer service noticed an increase in error reports and promptly notified engineering to investigate. Engineering quickly identified the issue and cleared excess data from Redis, restoring service.

May 29th, 23:40 UTC - Fix deployed to remove the tracking bug

We deployed a fix to remove the tracking bug.

Further steps

Later in the day, we removed the legacy tracking system and migrated that functionality to our standard error and metrics tracking. This change should prevent similar storage issues in the future and consolidates our monitoring infrastructure. Moving forward, we will implement more alarms for storage utilization and introduce finer-grained tracking of error rates.
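As an illustration of the kind of storage utilization alarm described above, here is a minimal sketch. It assumes the input is a dictionary carrying the `used_memory` and `maxmemory` fields reported by the Redis `INFO memory` command. The function names and the 25% and 80% thresholds are assumptions chosen for the example, not our production values.

```python
def memory_utilization(info):
    """Return used_memory / maxmemory as a fraction, or None if no limit is set.

    `info` is assumed to be a dict with the `used_memory` and `maxmemory`
    fields from Redis `INFO memory`. A maxmemory of 0 means "no limit".
    """
    maxmemory = int(info.get("maxmemory", 0))
    if maxmemory <= 0:
        return None
    return int(info["used_memory"]) / maxmemory


def alarm_level(utilization, warn=0.25, critical=0.80):
    """Map a utilization fraction to an alarm level (thresholds are hypothetical)."""
    if utilization is None:
        return "unknown"
    if utilization >= critical:
        return "critical"
    if utilization >= warn:
        return "warn"
    return "ok"
```

For an instance that usually hovers below 5%, a low warning threshold like this would flag a slow climb long before writes start to fail. With the redis-py client, the input dictionary could come from `client.info("memory")`.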

Posted Jun 29, 2023 - 09:00 PDT