Intermittent issues with new PMs and Cycle Counts launching at scheduled times

Incident Report for Limble

Postmortem

Date: March 12, 2025

Status: Resolved

Impacted Region(s) or Services: Limble CMMS Web Application

Note: All times are in Mountain Daylight Time (MDT)

Summary

On March 12 at approximately 5 AM, we identified that the backend service responsible for generating scheduled tasks had failed to initialize. Further investigation, after initiating our incident response plan, found that the scheduler had been offline since 5:30 PM the previous evening because its container image had unexpectedly expired. Engineers manually updated the container image and restarted the scheduled task. We have since implemented detailed monitoring to alert us if another image expires in the future.

Impact

Scheduled PMs and Cycle Counts were delayed for approximately 15.5 hours, from 5:30 PM on March 11 to 9:00 AM on March 12. The service was then successfully restored, and all scheduled tasks resumed normal operation.

Root Cause

The scheduler container image failed to propagate during a deployment. Consequently, the scheduler attempted to run with an expired image, which resulted in a service startup failure.

Resolution and Improvements

Engineers quickly identified the root cause and its relation to the missing container image. A rollback of the deployment was initiated, which resolved the issue.

To prevent future occurrences, the following improvements have been implemented:

  • Enhanced Monitoring: We have deployed detailed monitoring systems specifically designed to detect and alert us to any future container image expirations, ensuring proactive intervention.
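As an illustration only, an expiration check of this kind could be sketched as below. This is not Limble's actual monitoring: the `ImageRecord` type, its fields, the 7-day warning window, and the sample inventory are all hypothetical, and a real check would read expiry data from whatever registry or inventory is in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical inventory entry; the field names are illustrative."""
    name: str
    expires_at: datetime


def classify(image, now, warn_window):
    """Return 'expired', 'expiring', or 'ok' for a single image."""
    if image.expires_at <= now:
        return "expired"
    if image.expires_at - now <= warn_window:
        return "expiring"
    return "ok"


def images_needing_attention(images, now, warn_window=timedelta(days=7)):
    """Collect (name, status) pairs for every image that is not 'ok'."""
    findings = []
    for image in images:
        status = classify(image, now, warn_window)
        if status != "ok":
            findings.append((image.name, status))
    return findings


# Made-up sample data, fixed so the result is reproducible.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
inventory = [
    ImageRecord("image-a", now + timedelta(hours=6)),   # inside the warning window
    ImageRecord("image-b", now + timedelta(days=30)),   # healthy
    ImageRecord("image-c", now - timedelta(days=1)),    # already expired
]

findings = images_needing_attention(inventory, now)
for name, status in findings:
    print(f"ALERT: image '{name}' is {status}")
```

Warning ahead of expiry, rather than only at expiry, is the design choice that allows intervention before a service fails to start.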

Timeline of Events

  • 3/12/2025 at 4:59 AM: Customer reports that scheduled items did not send as expected
  • 7:30 AM: Incident is declared and engineering team is alerted
  • 8:15 AM: Container image is updated
  • 8:45 AM: Process is manually re-run
  • 9:15 AM: All processes re-run successfully
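For readers who want the intervals between these events, the short sketch below computes them from the times listed above. The event labels are paraphrased for brevity, and the calculation assumes all five events occurred on 3/12/2025 in the same time zone.

```python
from datetime import datetime

# Times copied from the timeline above; labels are shortened paraphrases.
events = [
    ("customer report", "4:59 AM"),
    ("incident declared", "7:30 AM"),
    ("image updated", "8:15 AM"),
    ("manual re-run", "8:45 AM"),
    ("re-run succeeded", "9:15 AM"),
]


def parse(clock):
    """Parse a 12-hour clock time on the (assumed) shared date."""
    return datetime.strptime(f"2025-03-12 {clock}", "%Y-%m-%d %I:%M %p")


def minutes_between(timeline):
    """Return (from_label, to_label, minutes) for each consecutive pair."""
    gaps = []
    for (prev_label, prev_time), (label, time) in zip(timeline, timeline[1:]):
        delta = parse(time) - parse(prev_time)
        gaps.append((prev_label, label, int(delta.total_seconds() // 60)))
    return gaps


gaps = minutes_between(events)
for start, end, minutes in gaps:
    print(f"{start} -> {end}: {minutes} min")
```

On these figures, the longest single interval is the one between the customer report and the incident declaration.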

Key Points

  • No loss of customer data
Posted Mar 21, 2025 - 10:46 MDT

Resolved

The incident has been resolved. A post-mortem will follow.
Posted Mar 12, 2025 - 10:00 MDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 12, 2025 - 09:30 MDT

Identified

The issue has been identified and a fix is being implemented.
Posted Mar 12, 2025 - 09:15 MDT

Investigating

Some customers are experiencing issues with new PMs and Cycle Counts launching at their scheduled times.
Posted Mar 12, 2025 - 07:30 MDT
This incident affected: Limble CMMS Web Application and Limble CMMS API.