Meaning
This alert fires when a service within Knovvu Analytics generates an unusually high number of error-level logs in a short period. It indicates that the component is likely failing or repeatedly hitting the same operational issue. The alert is marked as critical when a service produces more than 250 error logs in a 5-minute window.
Full context
Knovvu Analytics monitors the volume of error-level log entries for each service. A sudden spike in errors is a strong signal of service degradation, misconfiguration, failing dependencies, or unhandled exceptions. This alert serves as an early warning to investigate and remediate before further impact propagates through the system.
The alert is service-specific and triggers when a particular component logs more than 250 error-level events within 5 minutes.
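The threshold rule above can be illustrated with a minimal sliding-window counter. This is only a sketch of the logic, not how Knovvu Analytics evaluates the alert internally: the `ErrorRateWindow` class and its method names are hypothetical, and it assumes error timestamps for a single service arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # window length taken from the alert description
THRESHOLD = 250                # critical when the count exceeds this value


class ErrorRateWindow:
    """Hypothetical helper: counts error-level events for one service
    over a sliding time window."""

    def __init__(self, window=WINDOW, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self._timestamps = deque()

    def record(self, ts):
        """Record one error event at time ts.

        Returns True when the number of events inside the window
        is strictly greater than the threshold.
        """
        self._timestamps.append(ts)
        cutoff = ts - self.window
        # Drop events that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold
```

Keeping one such window per service mirrors the service-specific nature of the alert: a noisy component crosses the threshold on its own, regardless of how quiet the others are.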
Impact
If a service is generating excessive error logs:
- It may be experiencing runtime issues such as unhandled exceptions, resource exhaustion, or misconfigured inputs.
- Upstream or downstream processing could be affected due to cascading failures.
- User-facing features may become unreliable or unresponsive.
- The logs may fill storage rapidly, potentially impacting observability or system performance.
Diagnosis
- Identify the affected service from the alert label.
- Inspect the logs for the specific service and look for repeating stack traces, exceptions, or error messages.
- Correlate the error spike with recent activity (e.g., deployment, configuration change, traffic surge).
- Use service-level dashboards to check health metrics such as CPU, memory, thread pool usage, and request latency.
- Review dependency health if errors indicate failure in communication with external services (e.g., database, message queue, downstream APIs).
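When looking for repeating error messages, it can help to collapse messages that differ only in variable parts (IDs, durations, addresses) into a common signature and count them. The sketch below assumes the error messages have already been exported as plain strings; the function names and the normalisation rules are illustrative assumptions, not part of any Knovvu tooling.

```python
import re
from collections import Counter

# Assumed normalisation rules: replace variable parts so that
# repeated errors collapse into the same signature.
_UUID = re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b")
_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
_NUM = re.compile(r"\d+")


def signature(message):
    """Return the message with UUIDs, hex values, and numbers masked."""
    message = _UUID.sub("<uuid>", message)
    message = _HEX.sub("<hex>", message)
    return _NUM.sub("<n>", message)


def top_error_signatures(messages, limit=5):
    """Return the most frequent error signatures with their counts."""
    return Counter(signature(m) for m in messages).most_common(limit)
```

For example, `Timeout after 30 ms calling db-7` and `Timeout after 45 ms calling db-2` both reduce to `Timeout after <n> ms calling db-<n>`, so a single dominant failure mode stands out even when every log line is textually unique.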
Mitigation
- Roll back recent changes if the issue coincides with a deployment or configuration update.
- Restart the affected service if it appears to be in a faulted or degraded state.
- Apply hotfixes or patch known bugs causing recurring errors.
- Tune error handling to reduce the impact of recoverable issues and prevent cascading failures.
- If the error volume is expected (e.g., due to a spike in invalid input), document the reason and silence the alert temporarily with clear justification.
- Escalate to engineering if the root cause cannot be resolved quickly or if user impact is significant.