Traditional monitoring alerts trigger on fixed thresholds: CPU above 90%, error rate above 1%, latency above 500ms. These thresholds generate alert fatigue during busy periods and miss slow degradation that stays just under the threshold.

The Problem

Static thresholds do not account for normal variation. A 2% error rate might be normal during a deployment window but alarming at 3am on a Saturday. Alert fatigue from false positives causes teams to ignore alerts, increasing the risk of missing genuine incidents.

The AI Approach

An LLM can analyze metric patterns, understand what is normal for a given time of day and context, and generate alerts only when behavior genuinely deviates from expected patterns. When it does alert, it provides context about what changed and potential causes.

Three-Step Build

Step 1 - Metric baseline. Collect 4-8 weeks of historical metrics to establish baseline patterns. Include time-of-day, day-of-week, and known event patterns (deployments, traffic spikes).

Step 2 - Anomaly detection. Compare current metrics against the learned baseline. Flag deviations that exceed normal variance for the current context. Use the LLM to assess whether correlated changes across multiple metrics suggest a common cause.

Step 3 - Contextual alerting. When an anomaly is detected, generate an alert that includes: what metric deviated, how it compares to normal for this time period, what else changed at the same time, and a suggested investigation path.

Where It Breaks

Baselines shift over time as your system evolves, requiring regular retraining. Novel situations (new feature launch, migration) have no baseline. The LLM may suggest incorrect root causes, sending investigators down the wrong path.

The Production Path

Run anomaly detection alongside existing threshold alerts for a month to compare signal quality. Gradually replace threshold alerts for metrics where the anomaly detector proves more accurate. Keep threshold alerts as a safety net for critical metrics.