Summary
First, we want to thank you all for your patience as we worked through this incident. After restoring platform availability, we noticed that some services (e.g., uploads) did not immediately recover. This postmortem explains what happened, what we found, and what we’re doing to prevent similar issues in the future.
What Happened
The platform experienced a temporary outage that affected several services. While core systems came back online as expected, a subset of services continued to experience availability issues after the upstream outage had been resolved.
Root Cause
The affected services were still using DNS cache entries that had gone stale during the outage, which prevented them from reaching other internal components. The caching itself behaved as designed; the gap was that our monitoring should have detected the resulting failures and prompted corrective action. However, our existing checks performed only shallow availability tests and did not verify internal reachability, so they kept reporting the affected services as healthy (a false negative).
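For illustration, the sketch below contrasts the kind of shallow check we relied on with the internal reachability check that was missing. The service URL and dependency address are hypothetical placeholders, not our actual endpoints.

```python
import socket
import urllib.request

# Hypothetical names for illustration only; not our real endpoints.
SERVICE_URL = "http://uploads.internal.example:8080/"
DEPENDENCY = ("object-store.internal.example", 443)


def shallow_check() -> bool:
    """What our old monitoring did: confirm the service answers HTTP.
    This can pass even while the service cannot reach its backend."""
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def internal_reachability_check() -> bool:
    """What was missing: verify that an internal dependency is actually
    resolvable and reachable, not just that the front door answers."""
    try:
        with socket.create_connection(DEPENDENCY, timeout=5):
            return True
    except OSError:
        return False
```

A service in the state described above would pass the first check and fail the second, which is exactly the signal our monitoring lacked.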
Resolution
Once we identified the stale entries, clearing the DNS caches restored full connectivity and service availability. To prevent recurrence, we’ve extended our monitoring to include internal reachability probes, so that similar issues are surfaced automatically in the future.
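As a rough sketch of what such a probe can look like, the example below exposes a deep health endpoint that attempts a connection to each internal dependency from inside the service, so loss of internal reachability shows up as a failing check that monitoring can alert on. The dependency list, hostnames, ports, and endpoint path are illustrative assumptions, and the alerting backend is omitted.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical dependencies this service must reach; the real list
# would come from configuration.
INTERNAL_DEPENDENCIES = [
    ("object-store.internal.example", 443),
    ("metadata-db.internal.example", 5432),
]


class DeepHealthHandler(BaseHTTPRequestHandler):
    """Serves /healthz by attempting a connection to each internal
    dependency, so unreachable components surface as a 503 instead of
    going unnoticed behind a shallow availability test."""

    def do_GET(self) -> None:
        if self.path != "/healthz":
            self.send_error(404)
            return
        unreachable = [
            f"{host}:{port}"
            for host, port in INTERNAL_DEPENDENCIES
            if not self._reachable(host, port)
        ]
        if unreachable:
            self.send_response(503)
            body = ("unreachable: " + ", ".join(unreachable)).encode()
        else:
            self.send_response(200)
            body = b"ok"
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    @staticmethod
    def _reachable(host: str, port: int, timeout: float = 3.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False


if __name__ == "__main__":
    # Monitoring polls this endpoint; a 503 triggers an alert.
    HTTPServer(("0.0.0.0", 8081), DeepHealthHandler).serve_forever()
```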
Closing
We understand how disruptive downtime can be, and we’re committed to learning from each incident to make our platform more resilient. Thank you again for your patience and continued trust.