Meaning
This alert fires when the number of error events in the state machine processing queues within Knovvu Analytics keeps growing for 5 consecutive minutes. It indicates that one or more components responsible for processing conversation states are encountering failures.
Full context
Knovvu Analytics uses a state machine architecture to orchestrate the processing of conversations through different stages (e.g., ingestion, analysis, indexing). Each stage has its own queue, and a dedicated handler processes each queue. Errors in these queues may indicate failures in handling specific steps of the conversation lifecycle.
This alert checks for any state machine queue with a growing number of errors over a short period. A consistent rise in error events likely points to a systemic issue in one of the conversation pipelines.
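The alert condition described above can be sketched as a simple check. This is only an illustration, not the actual alert rule, which is not shown here. It assumes one cumulative error-count sample per minute for a single queue and treats "growing for 5 consecutive minutes" as five strictly increasing one-minute intervals. The function name `errors_growing` and the `window` parameter are hypothetical.

```python
from typing import Sequence


def errors_growing(samples: Sequence[int], window: int = 5) -> bool:
    """Return True if the error count rose in each of the last `window` intervals.

    `samples` is assumed to hold one cumulative error count per minute,
    oldest first, for a single state machine queue (hypothetical layout).
    """
    # window intervals require window + 1 samples.
    if len(samples) < window + 1:
        return False
    recent = samples[-(window + 1):]
    # Compare each sample with the one that follows it.
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Six samples, five consecutive increases: the condition holds.
growing = errors_growing([10, 12, 15, 16, 20, 23])
# One flat interval breaks the streak.
flat = errors_growing([10, 12, 12, 16, 20, 23])
```

A queue whose error count plateaus for even one interval resets the streak under this interpretation, which matches the intent of alerting only on sustained growth.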
Impact
If errors in the state machine increase:
- Conversations may get stuck at various stages and never complete processing.
- Downstream data (e.g., search indexes, dashboards, analytics) may become incomplete or inconsistent.
- Recovery or reprocessing might be required to handle failed items.
- Operational visibility may be impaired if processing status is not up to date.
Diagnosis
- Identify which specific queue(s) are reporting errors by examining the affected state machine queue names.
- Review the logs and metrics for the ca-state-manager service, which manages the state transitions between processing stages.
- Look for root causes in the related processing component (e.g., ingestion, analysis, indexing) tied to the failing queue.
- Inspect recent deployments, configuration changes, or infrastructure issues that may have disrupted the normal flow.
- Correlate the error spike with data patterns, e.g., certain tenants, conversation types, or time-based events.
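To illustrate the correlation step, the following sketch counts error events by queue and tenant to show where a spike is concentrated. It assumes the error events have already been exported from the logs as dictionaries. The field names `queue` and `tenant` and the function `top_error_sources` are hypothetical and would need to be adapted to the real log schema.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Sequence, Tuple


def top_error_sources(
    events: Iterable[Mapping[str, str]],
    keys: Sequence[str] = ("queue", "tenant"),
    limit: int = 3,
) -> List[Tuple[Tuple[str, ...], int]]:
    """Count error events per combination of `keys`, most frequent first."""
    # Missing fields are grouped under "unknown" rather than dropped.
    counts = Counter(
        tuple(event.get(key, "unknown") for key in keys) for event in events
    )
    return counts.most_common(limit)


# Hypothetical sample of exported error events.
sample_events = [
    {"queue": "indexing", "tenant": "tenant-a"},
    {"queue": "indexing", "tenant": "tenant-a"},
    {"queue": "analysis", "tenant": "tenant-b"},
]
ranked = top_error_sources(sample_events)
```

If most errors fall into a single queue and tenant pair, the cause is more likely to be specific input data than a platform-wide failure.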
Mitigation
- If errors are caused by malformed or unexpected input, enhance validation and error-handling logic to prevent retries or crashes.
- Restart or scale the ca-state-manager service if it appears stuck or overloaded.
- Quarantine or discard repeatedly failing messages to unblock the queues.
- Coordinate with the engineering team to resolve underlying bugs or integration issues in downstream services.
- Monitor the queue length and error rate to confirm that the backlog is decreasing after action is taken.
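The final monitoring step can be approximated with a small check like the one below. It is a sketch under stated assumptions: queue length and cumulative error count are sampled at the same times after the mitigation, oldest first, and "recovering" is taken to mean the queue has shrunk while no new errors were recorded. The function name `backlog_recovering` is hypothetical, and real thresholds may need to tolerate some noise.

```python
from typing import Sequence


def backlog_recovering(
    queue_lengths: Sequence[int], error_counts: Sequence[int]
) -> bool:
    """Return True if the queue shrank and no new errors appeared.

    Both sequences are assumed to be sampled after the mitigation,
    oldest first; `error_counts` is assumed to be cumulative.
    """
    # A trend needs at least two samples of each metric.
    if len(queue_lengths) < 2 or len(error_counts) < 2:
        return False
    queue_shrinking = queue_lengths[-1] < queue_lengths[0]
    # With a cumulative counter, an unchanged value means no new errors.
    no_new_errors = error_counts[-1] == error_counts[0]
    return queue_shrinking and no_new_errors


# Queue draining while the error counter stays flat.
recovering = backlog_recovering([500, 420, 310, 150], [87, 87, 87, 87])
```

If the queue is shrinking but errors are still accumulating, messages may be failing out of the queue rather than completing, so both signals should be checked together.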