The Well-Architected Framework - Why Every Cloud Provider Has One
What the Well-Architected Framework is, its origins at AWS, how Azure and GCP adopted it, its six pillars, and why it matters especially for AI workloads.
Every major cloud provider now publishes a Well-Architected Framework. AWS, Azure, and Google Cloud have each built their own version, and while the names and pillar counts differ slightly, the underlying logic is identical: cloud workloads fail in predictable ways, and a structured set of best practices can prevent most of those failures. This document explains what the framework is, where it came from, and why it matters especially for AI workloads.
What the Well-Architected Framework Is
The Well-Architected Framework is a structured set of best practices for designing and evaluating cloud workloads. It covers reliability, security, cost management, operational effectiveness, performance, and - in more recent versions - sustainability. It is not a compliance checklist or a certification program. It is a lens for asking the right questions about how a system is built and what risks it carries.
The framework is typically applied through a Well-Architected Review: a structured conversation between a solutions architect and the team responsible for a workload. The review surfaces risks against each pillar, produces a prioritized list of issues, and results in an improvement plan. AWS provides the Well-Architected Tool in the console to run these reviews and track progress over time.
Origin: AWS in 2015
AWS published the first Well-Architected Framework in 2015. It was created by the AWS Solutions Architecture team, drawing on patterns observed across thousands of customer architecture reviews. The team noticed that the same categories of problems appeared repeatedly: workloads that failed under load because reliability was not designed in, systems that were compromised because security was bolted on after the fact, and projects that ran out of budget because cost was never modeled.
The original framework had four pillars: Security, Reliability, Performance Efficiency, and Cost Optimization. Operational Excellence was added in 2016. Sustainability was added in 2021, bringing the total to six. The framework has been updated continuously as cloud patterns evolved and as AI workloads introduced new categories of risk.
Source: AWS Well-Architected Framework
Why All Hyperscalers Adopted It
The pattern proved universal because cloud workloads face the same fundamental challenges regardless of provider. A poorly architected workload on Azure fails for the same structural reasons as a poorly architected workload on AWS: no redundancy across availability zones, overprivileged IAM roles, no cost alerting, no runbooks for incident response.
Azure followed with the Azure Well-Architected Framework, organized around five pillars: Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency. Google Cloud published the Google Cloud Architecture Framework with pillars covering Operational Excellence, Security/Privacy/Compliance, Reliability, Cost Optimization, and Performance Optimization.
The convergence across providers reflects a consensus in the industry: these are not vendor-specific concerns. They are properties that any well-designed distributed system should have, and a framework for evaluating them systematically reduces the chance of missing something important.
Source: Azure Well-Architected Framework | Google Cloud Architecture Framework
The Six AWS Pillars
Operational Excellence covers the ability to run and monitor systems to deliver business value, and to continually improve supporting processes and procedures. Key practices include runbooks, automation of operational tasks, observability through logs and metrics, and structured incident response.
Security covers the ability to protect data, systems, and assets while delivering business value through risk assessments and mitigation strategies. Key practices include least-privilege access through IAM, encryption at rest and in transit, network segmentation, and automated security detection.
Reliability covers the ability of a workload to perform its intended function correctly and consistently. Key practices include designing for fault tolerance, implementing health checks, automated recovery, and planning for disaster recovery with defined recovery time and recovery point objectives.
Performance Efficiency covers the ability to use computing resources efficiently to meet system requirements. Key practices include selecting the right compute type for each workload, reviewing performance regularly, and using managed services to reduce operational burden.
Cost Optimization covers the ability to run systems at the lowest price point while still meeting business requirements. Key practices include right-sizing instances, using reserved or spot capacity where appropriate, setting cost budgets and alerts, and attributing costs to specific teams or workloads.
Sustainability covers minimizing the environmental impact of cloud workloads. Key practices include using managed services that achieve higher utilization than self-managed alternatives, implementing data lifecycle policies to avoid storing data longer than necessary, and selecting efficient workload architectures.
Source: AWS Well-Architected Framework
The Review Process
A Well-Architected Review follows a structured pattern. The reviewing architect asks a set of questions across each pillar - typically 50 to 70 questions for a full review. Each question maps to a best practice. Gaps produce findings categorized as High Risk Issues or Medium Risk Issues. The output is a prioritized improvement plan with specific remediation steps.
AWS provides the Well-Architected Tool in the AWS console to formalize this process. It stores review results, tracks remediation progress, and provides milestone snapshots to show how a workload has improved over time.
Why It Matters for AI Workloads
AI workloads amplify traditional cloud challenges in several ways. Cost is less predictable: model inference on foundation models can scale dramatically with usage, and training runs can consume significant GPU compute with costs that are difficult to estimate in advance. Security risk is higher: training data may contain PII, models can be manipulated through prompt injection, and model outputs may inadvertently expose sensitive information from training data.
Reliability is more complex: an AI-powered feature has two failure modes - the infrastructure can fail, and the model can produce poor outputs. Operational complexity increases with ML pipelines: data pipelines, training jobs, model registries, and deployment endpoints each require monitoring and runbooks. Sustainability matters: large model training runs have significant energy footprints.
AWS published a dedicated ML Lens for the Well-Architected Framework to address these AI-specific concerns. It maps each pillar to the ML lifecycle and provides ML-specific best practices.
Source: AWS Well-Architected AI/ML Lens
Need help implementing this?
Turn this knowledge into a working prototype. Our structured workshop methodology takes you from idea to deployed AI solution in three sessions.
Explore AI Workshops