Observability is the ability to understand the internal state of a system from its external outputs. For traditional software, three categories of output provide this understanding: logs (discrete events), metrics (numeric measurements over time), and traces (the path a request takes through a distributed system). AI systems generate all three but require additional instrumentation to capture the information that matters: token usage, response quality, cost per request, and model version attribution.

Why AI Systems Need Specialized Observability

A standard web application becomes unresponsive or throws HTTP 500 errors when something goes wrong. AI systems can fail silently. A model can return a response with a 200 status code while:

  • Hallucinating incorrect information
  • Exceeding the token budget by 10x due to a prompt injection
  • Returning responses that violate your content policy
  • Degrading in quality due to model drift over time

Standard application monitoring catches the first category (technical failures). Specialized AI observability catches the others.

Pillar 1: Metrics

Metrics are numeric measurements aggregated over time. For AI systems, three categories matter.

Infrastructure metrics (standard CloudWatch):

  • Lambda invocation count, error rate, duration p50/p95/p99
  • API Gateway 4xx and 5xx error rates, latency
  • SQS queue depth (for async AI pipelines)

AI-specific metrics (custom CloudWatch metrics):

import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_ai_metrics(model_id, input_tokens, output_tokens,
                        latency_ms, request_id):
    cloudwatch.put_metric_data(
        Namespace='AI/LLMMetrics',
        MetricData=[
            {
                'MetricName': 'InputTokens',
                'Value': input_tokens,
                'Unit': 'Count',
                'Dimensions': [
                    {'Name': 'ModelId', 'Value': model_id},
                    {'Name': 'Environment', 'Value': ENVIRONMENT}
                ]
            },
            {
                'MetricName': 'OutputTokens',
                'Value': output_tokens,
                'Unit': 'Count',
                'Dimensions': [
                    {'Name': 'ModelId', 'Value': model_id},
                    {'Name': 'Environment', 'Value': ENVIRONMENT}
                ]
            },
            {
                'MetricName': 'InferenceLatencyMs',
                'Value': latency_ms,
                'Unit': 'Milliseconds',
                'Dimensions': [
                    {'Name': 'ModelId', 'Value': model_id}
                ]
            }
        ]
    )

Cost tracking metrics: Calculate cost per request using the token counts and published pricing:

# Anthropic Claude Sonnet example pricing (verify current pricing)
INPUT_TOKEN_COST = 0.003 / 1000   # per token
OUTPUT_TOKEN_COST = 0.015 / 1000  # per token

cost = (input_tokens * INPUT_TOKEN_COST) + (output_tokens * OUTPUT_TOKEN_COST)

Publish cost as a CloudWatch metric and set a budget alarm to alert when daily spend exceeds a threshold.

Pillar 2: Logs

Logs capture discrete events with context. For AI systems, every inference call should produce a structured log entry.

Enable Bedrock model invocation logging (this is off by default):

bedrock_client = boto3.client('bedrock')
bedrock_client.put_model_invocation_logging_configuration(
    loggingConfig={
        'cloudWatchConfig': {
            'logGroupName': '/aws/bedrock/model-invocations',
            'roleArn': BEDROCK_LOGGING_ROLE_ARN,
            'largeDataDeliveryS3Config': {
                'bucketName': LOG_BUCKET,
                'keyPrefix': 'bedrock-logs/'
            }
        },
        'textDataDeliveryEnabled': True,
        'imageDataDeliveryEnabled': False
    }
)

Structured application log entry:

import json, logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_inference(request_id, user_id, model_id, prompt_template_version,
                   input_tokens, output_tokens, latency_ms, guardrail_triggered):
    logger.info(json.dumps({
        "event": "inference_complete",
        "request_id": request_id,
        "user_id": user_id,           # for debugging, not prompt content
        "model_id": model_id,
        "prompt_template_version": prompt_template_version,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "guardrail_triggered": guardrail_triggered,
        "environment": ENVIRONMENT
    }))

Use CloudWatch Logs Insights to query these structured logs for operational analysis.

Pillar 3: Traces

Distributed traces show the full path of a request through multiple services. For multi-agent AI systems, a trace might span: API Gateway -> Lambda orchestrator -> Bedrock agent -> Knowledge base -> Lambda tool -> DynamoDB.

OpenTelemetry for distributed tracing:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

tracer = trace.get_tracer(__name__)

def handle_user_request(user_query):
    with tracer.start_as_current_span("handle_user_request") as span:
        span.set_attribute("user_query_length", len(user_query))

        with tracer.start_as_current_span("bedrock_invoke") as child_span:
            response = invoke_bedrock(user_query)
            child_span.set_attribute("model_id", MODEL_ID)
            child_span.set_attribute("output_tokens", response['usage']['outputTokens'])

        return process_response(response)

AWS X-Ray integrates with Lambda and API Gateway for automatic trace capture, with OpenTelemetry SDK support for custom instrumentation.

LLM-Specific Tracing with Langfuse

Langfuse is an open-source LLM observability platform that captures prompt/response pairs, token counts, cost, and user feedback in a searchable interface. It is the recommended tool for production LLM quality monitoring.

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

langfuse = Langfuse()

@observe()
def generate_response(user_query: str) -> str:
    langfuse_context.update_current_observation(
        input=user_query,
        metadata={"prompt_template_version": PROMPT_VERSION}
    )

    response = call_bedrock(user_query)

    langfuse_context.update_current_observation(
        output=response['text'],
        usage={
            "input": response['usage']['inputTokens'],
            "output": response['usage']['outputTokens']
        }
    )

    return response['text']

Langfuse provides a dashboard showing: average latency per model, token usage trends, cost per day, user sessions, and the ability to search and replay specific conversations for debugging.

Dashboard Design

A well-designed AI observability dashboard has three sections.

Operational health (visible at all times):

  • Current error rate vs 7-day baseline
  • p95 latency vs SLA threshold
  • Requests per minute with trend

Cost and usage:

  • Daily token spend with 30-day trend
  • Cost per request by model
  • Projected monthly spend vs budget

Quality indicators:

  • Guardrail block rate
  • Response length distribution
  • User feedback score (if collected)

Sources and Further Reading