Issues with creating Tasks, Parts, Assets
Incident Report for Limble
Postmortem

Date: November 12, 2024

Status: Resolved

Summary

On November 12, 2024, the application experienced a rolling service disruption that intermittently impacted customers attempting to create new items in our application, such as tasks, assets, and parts. The incident was caused by a database migration deployed into production, which overloaded our databases and caused delays in propagating requests. Immediate action was taken to terminate the migration and restore service to customers. To prevent a recurrence, new deploy procedures and monitors are being implemented.

Impact

For a period of approximately 3 hours, customers intermittently encountered delays or failures when attempting to create items, such as tasks, assets, and parts, in our application. In some cases, the action appeared to fail, but the items were successfully created and appeared in the application after some delay.

Root Cause

The incident was caused by a database migration that overloaded our production database. The overload directly led to an increase in ‘replication lag’. When this metric exceeded 1 second, our application’s workflows began failing or timing out.
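
The 1-second threshold described above can be illustrated with a minimal sketch of a lag check. This is not our production monitor: the `LagSample` type, the `breaches` function, and the sample values are hypothetical and used only for illustration. Only the 1-second threshold comes from this report.

```python
from dataclasses import dataclass
from typing import List

# Threshold taken from this report: workflows began failing above 1 second.
LAG_THRESHOLD_SECONDS = 1.0


@dataclass(frozen=True)
class LagSample:
    """One replication-lag reading (hypothetical shape, for illustration)."""
    label: str          # free-form identifier for the reading
    lag_seconds: float  # measured replication lag in seconds


def breaches(samples: List[LagSample],
             threshold: float = LAG_THRESHOLD_SECONDS) -> List[LagSample]:
    """Return the samples whose lag is strictly above the threshold."""
    return [s for s in samples if s.lag_seconds > threshold]


if __name__ == "__main__":
    # Made-up readings, not data from the incident.
    readings = [
        LagSample("reading-1", 0.2),
        LagSample("reading-2", 1.0),
        LagSample("reading-3", 4.5),
    ]
    for sample in breaches(readings):
        print(f"ALERT {sample.label}: lag {sample.lag_seconds}s")
```

A reading of exactly 1 second is not flagged here, matching the wording "exceeded 1 second"; a real monitor would likely alert at a lower value to leave a safety margin.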

Resolution and Improvements

Once discovered, the offending database migration was immediately terminated, restoring service to all customers. Our engineers then corrected the migration, tested it thoroughly using improved protocols, and re-executed it without a recurrence of the service disruption. Additionally, the following improvements will be implemented:

  • Monitoring and Alerting Improvements
  • Stricter testing requirements for all migrations, using a near-production sandbox database
  • High-risk migrations will be executed during planned maintenance windows
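
The maintenance-window requirement in the list above could be enforced by a pre-deploy gate along the following lines. This is a sketch under stated assumptions, not a description of our deploy tooling: the `Window` type, the `may_run` function, the `high_risk` flag, and the example window times are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Sequence


@dataclass(frozen=True)
class Window:
    """A planned maintenance window (hypothetical shape, for illustration)."""
    start: datetime
    end: datetime

    def contains(self, moment: datetime) -> bool:
        # Start is inclusive, end is exclusive.
        return self.start <= moment < self.end


def may_run(high_risk: bool, now: datetime, windows: Sequence[Window]) -> bool:
    """Allow low-risk migrations any time; high-risk ones only inside a window."""
    if not high_risk:
        return True
    return any(w.contains(now) for w in windows)


if __name__ == "__main__":
    # Example window only; not an actual maintenance schedule.
    windows = [Window(datetime(2025, 1, 5, 1, 0), datetime(2025, 1, 5, 3, 0))]
    print(may_run(True, datetime(2025, 1, 5, 2, 0), windows))    # inside the window
    print(may_run(True, datetime(2025, 1, 5, 11, 29), windows))  # outside the window
```

How a migration is classified as high-risk is left open here; the sketch only shows the time check that would follow that decision.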

Timeline of Events

  • 11:29 AM MST: Database migration initiated.
  • 12:03 PM MST: Customers begin reporting disruptions.
  • 12:30 PM MST: Investigation and communication initiated.
  • 12:57 PM MST: Incident is declared.
  • 2:15 PM MST: Root cause and solution identified.
  • 2:22 PM MST: Solution implemented and verified in production.
  • 2:37 PM MST: Incident resolved following further monitoring.

Key Points

  • No customer historical data was lost.
  • Not all customers were impacted at the same time. This was a rolling disruption.
Posted Nov 14, 2024 - 13:15 MST

Resolved
This incident is now resolved.
Posted Nov 12, 2024 - 14:59 MST
Monitoring
A fix has been implemented and we are monitoring results.
Posted Nov 12, 2024 - 14:31 MST
Update
We have identified the issue and have taken steps toward a fix.
Posted Nov 12, 2024 - 14:10 MST
Update
We are continuing to investigate this issue.
Posted Nov 12, 2024 - 13:13 MST
Investigating
We are currently investigating this issue.
Posted Nov 12, 2024 - 13:11 MST
This incident affected: Limble CMMS Web Application and Limble CMMS API.