SLA, SLO, and SLI

What SLAs, SLOs, and SLIs are, how they relate to each other, and how to define them for AI services.

Added 28 Mar 2026 3 min read Updated 30 May 2026

Learn this your way

SLA, SLO, and SLI form a hierarchy of reliability concepts. SLIs measure service behavior, SLOs set internal reliability targets, and SLAs define contractual commitments to customers. Together, they provide a structured approach to defining, measuring, and committing to service reliability.

Definitions

SLI (Service Level Indicator) is a quantitative measure of service behavior. Examples: request latency (p99 under 500ms), availability (percentage of successful requests), error rate (percentage of requests returning errors), throughput (requests per second). SLIs are the raw measurements.

SLO (Service Level Objective) is an internal target for an SLI. “99.9% of requests will return a successful response within 2 seconds over a 30-day window.” SLOs define what “reliable enough” means for your team. They are more aggressive than SLAs and drive engineering priorities through error budgets.

SLA (Service Level Agreement) is a contractual commitment to customers with defined consequences (credits, refunds) for violations. “99.5% monthly uptime; if violated, customer receives 10% service credit.” SLAs are less aggressive than SLOs because violating an SLA has business consequences.

The Hierarchy

SLOs should be stricter than SLAs. If your SLA promises 99.5% availability, your SLO might target 99.9%. The gap between SLO and SLA is your safety margin. If you breach the SLO, you have time to fix the issue before it becomes an SLA violation.

Defining SLIs for AI Services

AI services need SLIs that capture what users experience:

Availability - percentage of inference requests that return a response (not a timeout or 5xx error)
Latency - inference response time at p50, p95, p99
Quality - percentage of responses with confidence above a threshold, or human-rated quality scores
Freshness - time since the knowledge base or model was last updated (relevant for RAG systems)

Practical Guidance

Start with 2-3 SLIs that directly reflect user experience. Avoid too many SLIs - they become impossible to reason about. Set SLOs based on user needs, not technical capabilities (users do not care about CPU utilization). Measure SLIs continuously and automatically. Review SLOs quarterly and adjust based on operational experience. Only commit to SLAs when you have a track record of meeting the corresponding SLOs with margin to spare.

Sources

Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (Eds.). (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media. Chapter 4: Service Level Objectives. (Primary reference defining SLI, SLO, and SLA and their relationships; the foundation of reliability engineering practice.)
Beyer, B., Murphy, N. R., Rensin, D. K., Kawahara, K., & Thorne, S. (Eds.). (2018). The Site Reliability Workbook. O’Reilly Media. Chapter 2: Implementing SLOs. (Practical guidance on choosing SLIs, setting SLO targets, and creating measurement infrastructure.)

Open source projects

Freelancer Templates Contracts, proposals, SOWs

Freelancer Automation Workflow recipes, AI playbooks

Work with Linda

Workshop Series €2,000/mo x 3

1:1 Consulting 60 min session