Traditional application monitoring tracks uptime, latency, and error rates. AI systems need all of that plus observability into prediction quality, model behavior, and data characteristics. An AI system can be up, fast, and returning 200 status codes while producing completely wrong answers. Full-stack AI observability closes this gap.

The Observability Stack

Infrastructure metrics. GPU utilization, memory usage, request queue depth, and instance health. These are table stakes. Use your existing infrastructure monitoring tools (CloudWatch, Datadog, Prometheus).

Application metrics. Request latency (p50, p95, p99), throughput, error rates, and availability. Standard application monitoring applies to AI systems just as it does to any service.

Model metrics. Prediction distributions, confidence scores, feature values, token usage, and model version in use. These are specific to AI and require custom instrumentation.

Quality metrics. Accuracy, relevance, faithfulness, and task-specific evaluation scores. These require either ground truth labels (often delayed) or automated evaluation (LLM-as-judge, heuristic checks).

Tracing for LLM Applications

LLM applications involve multiple steps: retrieval, prompt construction, model inference, output parsing, and tool use. A single user request might trigger several LLM calls, database queries, and API calls. Without tracing, debugging a bad response is nearly impossible.

Implement distributed tracing that captures:

  • The user’s input and the final output
  • Each LLM call with its full prompt, completion, model used, token counts, and latency
  • Retrieval queries and results (for RAG systems)
  • Tool calls and their results (for agent systems)
  • Any post-processing or filtering applied to the output

Tools like LangSmith, Arize Phoenix, and OpenTelemetry-based solutions provide tracing frameworks designed for LLM applications. The key is capturing enough context to diagnose issues without the storage costs of logging every token for every request.

Evaluation in Production

Production evaluation differs from offline evaluation because you rarely have ground truth labels in real time. Strategies include:

User feedback. Thumbs up/down, ratings, corrections, and regeneration requests. This is noisy and biased (unhappy users are more likely to give feedback) but provides real signal about user-perceived quality.

Automated evaluation. Use a separate LLM to judge response quality on dimensions like relevance, faithfulness to sources, and helpfulness. This is imperfect but scalable and provides immediate signal.

Sampled human evaluation. Route a sample of production requests to human reviewers. This provides high-quality labels for calibrating automated metrics and catching issues that automated evaluation misses.

Functional tests. Run a suite of known queries through the system regularly and verify outputs meet expectations. This catches regressions introduced by model updates, data changes, or code deployments.

Alerting Strategy

Latency alerts. Alert on p95 and p99 latency breaches. LLM latency is highly variable, so set thresholds based on observed distributions rather than arbitrary targets.

Quality alerts. Alert when automated evaluation scores drop below thresholds or when user feedback sentiment shifts negative. Use sliding windows to smooth noise.

Cost alerts. Alert on unexpected cost increases that may indicate prompt bloat, routing failures, or traffic spikes.

Drift alerts. Alert when input distributions or prediction distributions shift beyond expected bounds. Integrate with your drift detection system.

Dashboards

Build dashboards at multiple levels of granularity:

  • Executive dashboard showing request volume, overall quality, cost, and incident count
  • Operational dashboard showing latency, errors, resource utilization, and active alerts
  • Debug dashboard showing individual traces, prompt/response pairs, and retrieval results for investigating specific issues

Retention and Sampling

Storing every prompt and response is expensive. Implement tiered retention: store metadata for all requests, full traces for a sample, and full traces for all errors and low-quality responses. Adjust sampling rates based on traffic volume and your debugging needs.