Incident Management for AI Systems

How to handle incidents in AI systems: on-call rotations, escalation policies, AI-specific runbooks, and post-incident reviews for model and infrastructure failures.

Added 28 Mar 2026 4 min read Updated 30 May 2026

#incident-management #on-call #sre #runbooks #observability #ai-engineering

AI systems fail in ways that traditional software does not. A model can produce confidently wrong answers without raising errors. Inference latency can degrade gradually as GPU memory fragments. Retrieval quality can drop silently when embedding drift goes undetected. Incident management for AI systems must handle both infrastructure failures and model quality degradation.

On-Call Structure

Who Is On-Call

AI systems span multiple domains. A single on-call rotation rarely covers all failure modes:

Infrastructure on-call - Handles compute, networking, storage, and Kubernetes issues. Responds to node failures, scaling problems, and resource exhaustion.
ML/AI on-call - Handles model quality, evaluation failures, data drift, and retrieval degradation. Requires domain knowledge to distinguish a model bug from expected behaviour on edge cases.
Data on-call - Handles pipeline failures, data quality issues, and source system outages. Responds when training data or feature store updates fail.

Small teams may combine these into a single rotation. Larger organisations split them with clear escalation paths between tiers.

Escalation Policy

Define escalation clearly before incidents occur:

L1 (5 minutes) - Alert fires, on-call engineer is paged. Acknowledge within 5 minutes.
L2 (15 minutes) - If no acknowledgment, escalate to secondary on-call.
L3 (30 minutes) - If unresolved, escalate to the team lead and relevant domain expert.
Executive (60 minutes) - For customer-impacting incidents lasting more than 60 minutes, notify engineering leadership.

Use PagerDuty, Opsgenie, or Incident.io for automated escalation.
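The policy above can also be written down as a small, testable function before it is configured in a paging tool. The following is a minimal sketch, not the API of any of those products: the function name, arguments, and role strings are hypothetical, and it assumes the executive notification fires at the 60-minute mark given in the tier label.

```python
from typing import List


def escalation_targets(
    minutes_elapsed: float,
    acknowledged: bool,
    resolved: bool,
    customer_impacting: bool,
) -> List[str]:
    """Return who should have been notified this long after the alert fired.

    Hypothetical helper: thresholds follow the escalation tiers in the text,
    role names are illustrative.
    """
    if resolved:
        return []

    # L1: the primary on-call is paged as soon as the alert fires.
    targets = ["primary on-call"]

    # L2: no acknowledgment after 15 minutes pulls in the secondary.
    if minutes_elapsed >= 15 and not acknowledged:
        targets.append("secondary on-call")

    # L3: still unresolved after 30 minutes pulls in lead and domain expert.
    if minutes_elapsed >= 30:
        targets.append("team lead and domain expert")

    # Executive: assumed 60-minute threshold, customer-impacting incidents only.
    if minutes_elapsed >= 60 and customer_impacting:
        targets.append("engineering leadership")

    return targets
```

Keeping the policy in a pure function like this makes it easy to review in a pull request and to check against the configuration in the paging tool.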

AI-Specific Runbooks

Standard runbooks cover infrastructure failures. AI systems need additional runbooks for model and data failures.

Model Quality Degradation

Symptoms: Evaluation metrics drop, user complaints increase, downstream system accuracy decreases.

Steps:

Check if a new model version was recently deployed. Compare evaluation metrics between current and previous versions.
Check data freshness. Is the feature store receiving updates? Is the vector index current?
Check for data distribution shift. Compare recent input distributions against the training distribution.
If a recent model deployment caused the issue, roll back to the previous version.
If data drift is the cause, trigger a retraining pipeline with recent data.
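For the distribution shift check, one common option is the Population Stability Index (PSI) computed per numeric feature. The sketch below is one possible way to do it using only the standard library; the function name, the equal-width binning, and the smoothing constant are assumptions for illustration, not a method prescribed by this runbook.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
) -> float:
    """PSI between a reference sample (e.g. training data) and recent inputs.

    Bins are equal-width over the range of the reference sample. Values in
    the recent sample that fall outside that range go into the edge bins.
    """
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread; cannot bin it")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp into the edge bins
            counts[idx] += 1
        eps = 1e-6  # smoothing so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

A frequently quoted rule of thumb treats a PSI above roughly 0.2 as a shift worth investigating, but the right threshold depends on the feature and should be tuned against past incidents.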

Inference Latency Spike

Symptoms: P95 or P99 latency exceeds SLO, timeout rate increases.

Steps:

Check GPU utilisation and memory. Are GPUs saturated? Is memory fragmented?
Check request queue depth. Is the system overloaded? Scale up replicas if auto-scaling has not triggered.
Check for input anomalies. Are requests suddenly larger (longer prompts, bigger documents)?
Check upstream dependencies. Is the vector database, feature store, or model provider slow?
If GPU memory fragmentation is the cause, perform a rolling restart of inference pods.
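When confirming the symptom, it helps to compute the tail percentiles directly from raw request latencies rather than relying on a dashboard average. This is a minimal sketch using the nearest-rank method; the function names and the shape of the SLO dictionary are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_breaches(
    latencies_ms: Sequence[float],
    slo_ms: Dict[int, float],
) -> Dict[int, float]:
    """Return {percentile: observed latency} for every percentile whose
    observed value exceeds its SLO target."""
    breaches = {}
    for pct, target in slo_ms.items():
        observed = percentile(latencies_ms, pct)
        if observed > target:
            breaches[pct] = observed
    return breaches
```

For example, `slo_breaches(latencies, {95: 800, 99: 2000})` reports only the percentiles that are currently over target, which is usually what the responder needs to decide between scaling out and looking for a single slow dependency.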

Retrieval Quality Failure

Symptoms: Retrieved documents are irrelevant, or the RAG system returns wrong information.

Steps:

Check if the embedding model or index was recently updated.
Verify the vector database is healthy and returning results.
Run a set of known-good queries against the retrieval system and compare results to expected output.
Check if source documents were recently updated or deleted.
If the index is corrupted, trigger a reindex from the source data.
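The known-good query check is easier to run under pressure if it is already scripted. The sketch below assumes a hypothetical `search(query, k)` callable that returns document IDs from your retrieval system, and a hand-maintained golden set mapping each query to the IDs it should return; the 0.8 recall threshold is an illustrative default.

```python
from typing import Callable, Dict, List, Set


def recall_at_k(retrieved: List[str], expected: Set[str], k: int) -> float:
    """Fraction of the expected document IDs found in the top-k results."""
    if not expected:
        raise ValueError("golden query has no expected documents")
    hits = len(set(retrieved[:k]) & expected)
    return hits / len(expected)


def failing_golden_queries(
    search: Callable[[str, int], List[str]],  # hypothetical retrieval client
    golden: Dict[str, Set[str]],
    k: int = 5,
    min_recall: float = 0.8,  # assumed threshold; tune for your corpus
) -> List[str]:
    """Return the golden queries whose recall@k is below min_recall."""
    failing = []
    for query, expected_ids in golden.items():
        results = search(query, k)
        if recall_at_k(results, expected_ids, k) < min_recall:
            failing.append(query)
    return failing
```

If most golden queries fail at once, the index or embedding model is the likely culprit; if only a few fail, check whether their source documents were changed or deleted.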

Severity Levels

Define severity relative to user impact:

Severity	Description	Response Time	Example
SEV1	Complete service outage	Immediate	Inference endpoint returns 500 for all requests
SEV2	Major degradation	15 minutes	Primary evaluation metric dropped 30%, affecting most users
SEV3	Partial degradation	1 hour	Latency elevated for 10% of requests
SEV4	Minor issue	Next business day	Evaluation pipeline failed for non-critical test set
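Encoding the table makes severity assignment consistent across responders. The sketch below is one possible encoding: the response targets come from the table, while the `classify` function and its thresholds are illustrative assumptions loosely derived from the example column and should be replaced with your own criteria.

```python
from datetime import timedelta
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    SEV1 = "Complete service outage"
    SEV2 = "Major degradation"
    SEV3 = "Partial degradation"
    SEV4 = "Minor issue"


# Response targets from the table; None stands for "next business day".
RESPONSE_TARGET: Dict[Severity, Optional[timedelta]] = {
    Severity.SEV1: timedelta(0),
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
    Severity.SEV4: None,
}


def classify(
    error_rate: float,
    quality_drop: float,
    affected_fraction: float,
) -> Severity:
    """Map impact signals (fractions in [0, 1]) to a severity level.

    Thresholds are assumptions for illustration, not part of the table.
    """
    if error_rate >= 1.0:  # every request is failing
        return Severity.SEV1
    if quality_drop >= 0.30 and affected_fraction > 0.5:
        return Severity.SEV2
    if affected_fraction > 0.0:  # some users see degraded service
        return Severity.SEV3
    return Severity.SEV4
```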

Post-Incident Review

After every SEV1 and SEV2 incident, conduct a blameless post-incident review:

Timeline - What happened, when, and what actions were taken
Root cause - Why did the failure occur? Distinguish between proximate cause and contributing factors
Detection - How was the incident detected? Could it have been detected earlier?
Resolution - What fixed the issue? How long did it take?
Action items - What changes will prevent recurrence or improve detection? Assign owners and deadlines.

Post-incident reviews are not blame exercises. They are learning opportunities. If your review identifies a person as the root cause, you have not dug deep enough. Ask why the system allowed the error to propagate.
