Date: March 12, 2025
Status: Resolved
Impacted Region(s) or Services: Limble CMMS Web Application
Note: All times are in Mountain Daylight Time (MDT)
On March 12 at approximately 5 AM, we identified that our backend service responsible for generating scheduled tasks had failed to initialize. Further investigation found that the scheduler had been offline since 5:30 PM the previous evening. After initiating our incident response plan, engineers discovered the cause to be an expired container image. Manual steps were taken to update the container image and restart the scheduled task. Further steps were taken to implement detailed monitoring that will alert us if another image expires in the future.
Scheduled PMs and Cycle Counts experienced a temporary delay of approximately 14 hours, from 5:30 PM on March 11th to 9:00 AM on March 12th. The service was then successfully restored, and all scheduled tasks resumed normal operation.
The scheduler container image failed to propagate during a deployment. Consequently, the scheduler attempted to run with an expired image, which resulted in a service startup failure.
Engineers quickly identified the root cause and its relation to the missing container image. A rollback of the deployment was initiated, which resolved the issue.
To prevent future occurrences, the following improvements have been implemented: