Date: May 14, 2024
Status: Resolved
Summary
On May 14, 2024, our service experienced downtime because multiple frontend versions were being served concurrently. The incident was caused by a discrepancy in our build process that produced different JavaScript chunks from the same Git commit. We redeployed the code to resolve the issue and have implemented measures to prevent recurrence.
Impact
Users attempting to access our site were unable to load the app. The incident primarily impacted customers on the main version of the app; those on Canary were unaffected.
Root Causes
The incident was triggered by a container restart, after which one of our webApp containers served different JavaScript chunks from the others. Although every build used the same Git commit, the build process was not deterministic: rebuilding the same commit produced different chunks, so the restarted container served chunks that did not match the rest of the deployment.
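One way to confirm this kind of non-reproducible build is to hash every file in two build outputs of the same commit and list the paths that differ. The sketch below is illustrative only: the function names `chunk_manifest` and `diff_builds` are hypothetical, and it assumes the two build outputs are available as local directories, which this report does not state.

```python
import hashlib
from pathlib import Path


def chunk_manifest(build_dir):
    """Map each file's relative path to the SHA-256 of its contents."""
    root = Path(build_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def diff_builds(build_a, build_b):
    """Return the sorted relative paths that differ between two build outputs.

    A path is reported if it exists in only one build or if its contents differ.
    """
    a = chunk_manifest(build_a)
    b = chunk_manifest(build_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty result for two builds of the same commit suggests the build is reproducible; any listed path is a chunk that a restarted container could serve inconsistently.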
Resolution and Improvements
Immediate actions included redeploying the rollback branch, ensuring all containers served the correct chunks, and implementing a fix that skips the build and push of the container image if its SHA tag already exists. Additionally, the following improvements are planned:
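As an illustration of the immediate fix described in this section (skipping the build and push when the SHA tag already exists), here is a minimal sketch. It assumes the Docker CLI and uses `docker manifest inspect` to check the registry; the report does not say which tooling the pipeline actually uses, and the function names and the `run` parameter are hypothetical.

```python
import subprocess


def image_exists(image_ref, run=subprocess.run):
    """Return True if the registry already has a manifest for image_ref.

    Assumption: `docker manifest inspect` exits non-zero when the tag is absent.
    """
    result = run(
        ["docker", "manifest", "inspect", image_ref],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def build_and_push_if_missing(repo, git_sha, run=subprocess.run):
    """Build and push repo:git_sha only when that tag is not already published."""
    image_ref = f"{repo}:{git_sha}"
    if image_exists(image_ref, run=run):
        # Reuse the published image so every container runs identical chunks.
        return "skipped"
    run(["docker", "build", "-t", image_ref, "."], check=True)
    run(["docker", "push", image_ref], check=True)
    return "built"
```

Because an existing tag is never rebuilt, a container restart pulls the image that was originally published for that commit rather than a fresh and possibly different build.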
Description of Events