Date: November 12, 2024
Status: Resolved
On November 12, 2024, the application experienced a rolling service disruption that intermittently impacted our customers when attempting to create new items in our application, such as tasks, assets, and parts. The incident was caused by a database migration deployed into production, which overloaded our databases and caused delays in propagating requests. Immediate action was taken to terminate the migration and restore service to customers. to prevent a recurrence, new deploy procedures and monitors are being implemented.
For a period of approximately 3 hours, customers intermittently encountered delays or failures when attempting to create items, such as tasks, assets, and parts in our application. In some cases, the action appeared to fail, but the items were successfully created, appearing in the application after some delay.
The incident was caused by a database migration which overloaded our production database. Overloading of the database directly led to a increase in ‘replication lag’. When this metric exceeded 1 second, our applications’ workflows began failing or timing out.
Once discovered the offending database migration was immediately terminated, restoring service to all customers. Next, that migration was corrected by our Engineers, thoroughly tested using improved protocols, and re-executed without a recurrence of service disruption. Additionally, the following improvements will be implemented: