The Cloud Architecture Behind Every AI App

A plain-English tour of the production cloud stack behind a real AI application, from the app you touch down to inference, retrieval, and cost control.

Added 25 Jun 2026 · 7 min read · Updated 25 Jun 2026

#architecture #cloud #inference #rag #finops

Image: a dark data center corridor lined with red-lit server racks fading into the distance. Caption: The chat box you type into is the front door. Most of an AI app lives in racks like these, far from view.

When you ask an AI app a question, the reply feels like it comes from one place. It does not. Behind the chat box sits a stack of cloud services, each with a narrow job, wired together so a single request can travel through all of them in a second or two. This guide walks through that stack layer by layer, in plain language, so you can picture what runs underneath the apps you already use.

The seven layers at a glance

A production AI app is built from distinct layers. Each layer can be swapped, scaled, or priced on its own. The diagram below shows the path a single request takes from top to bottom.

Layer	Components	Role
Client / App	Web app, mobile app	What the user sees and types into
API Gateway	Auth, rate limits, routing	Front door that checks who you are and where to send the request
Orchestration / Agent	Workflow logic, tool calls, memory	Decides the steps and coordinates everything below
Retrieval (RAG)	Vector database, embedding model	Finds the company facts the model was never trained on
Model / Inference	Hosted model API, self-hosted GPU	The part that actually writes the answer
Observability / Evaluation	Traces, logs, quality scores	Records what happened and whether the answer was good
Cost / FinOps	Token tracking, caching, model tiering	Keeps the bill in check as usage grows

Layer 1: The client and app you touch

This is the only layer most people ever see. It is the web page, the mobile app, or the chat panel inside a bigger product. Its job is narrow. It collects what you type, shows the reply, and handles the screen. It holds no model and stores little. Everything that makes the answer happens elsewhere, which is why the same AI app works on a phone, a laptop, and a browser at once.

Layer 2: The API gateway, the front door

Every request from the app first hits an API gateway. Think of it as the reception desk of a building. It checks your identity, confirms you are allowed in, and counts how often you knock so no single user can overwhelm the system. The gateway also decides where to send each request next. On the major clouds this maps to Amazon API Gateway, Azure API Management, and Google Cloud API Gateway. Many AI teams add a dedicated LLM gateway here too, a control point that routes traffic across several model providers and tracks spend in one place.
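The three gateway duties described above (identity check, knock counting, routing) can be sketched in a few lines. This is a minimal illustration, not how any of the named cloud gateways are implemented: the key store `API_KEYS`, the route table `ROUTES`, and the `Gateway` and `TokenBucket` classes are all hypothetical names, and the rate limiter uses a token bucket, one common way to cap request rates.

```python
import time
from typing import Dict, Optional, Tuple

# Hypothetical data for illustration only.
API_KEYS = {"demo-key-123": "alice"}
ROUTES = {"/chat": "orchestrator", "/embed": "embedding-service"}


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Add back the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Gateway:
    def __init__(self, capacity: float = 3, refill_per_sec: float = 1.0) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._buckets: Dict[str, TokenBucket] = {}

    def handle(self, api_key: str, path: str, now: Optional[float] = None) -> Tuple[int, str]:
        """Return an (HTTP-style status, detail) pair for one request."""
        now = time.monotonic() if now is None else now
        user = API_KEYS.get(api_key)  # 1. check identity
        if user is None:
            return 401, "unknown API key"
        bucket = self._buckets.get(user)  # 2. count how often this user knocks
        if bucket is None:
            bucket = self._buckets[user] = TokenBucket(self.capacity, self.refill_per_sec, now)
        if not bucket.allow(now):
            return 429, "rate limit exceeded"
        backend = ROUTES.get(path)  # 3. decide where to send the request
        if backend is None:
            return 404, "no route"
        return 200, backend
```

Calling `Gateway().handle("demo-key-123", "/chat")` returns the name of the backend to forward to. A managed gateway would do the same checks from configuration rather than code.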

Layer 3: Orchestration, the part that plans the work

A single question often needs more than one step. The orchestration layer is the planner. It might decide to search a database, call an external tool, run the model twice, then combine the results. When the app can choose its own next step rather than follow a fixed script, this layer is often called the agent layer. Industry write-ups describe AI orchestration as coordinating models, tools, memory, and human checkpoints into one reliable flow that no single model call could handle alone. Common building blocks here include LangChain, LlamaIndex, and Haystack. See AI agents and agentic loops for how this layer behaves when it runs on its own.

Step 1: Question arrives. Gateway authenticates and forwards the request.
Step 2: Plan the steps. Orchestrator decides whether to search, call a tool, or answer directly.
Step 3: Retrieve facts. Vector database returns relevant company documents.
Step 4: Generate answer. The model reads the question plus the retrieved facts and writes a reply.
Step 5: Return and log. Answer goes back to the app and the trace is recorded.
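The plan, retrieve, generate, and log steps can be sketched as one small function. This is a toy under stated assumptions: `needs_retrieval` is a made-up keyword rule standing in for real planning logic, and `fake_retrieve` and `fake_generate` are stubs standing in for a vector database and a model. Frameworks such as LangChain or LlamaIndex provide their own abstractions, which this does not attempt to mirror.

```python
from typing import Callable, List


def needs_retrieval(question: str) -> bool:
    """Toy planning rule: search company docs only when the question mentions them."""
    keywords = ("policy", "handbook", "ticket")
    return any(word in question.lower() for word in keywords)


def orchestrate(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
    trace: List[str],
) -> str:
    """Run one request through plan, optional retrieval, generation, and logging."""
    trace.append("plan")
    passages: List[str] = []
    if needs_retrieval(question):
        trace.append("retrieve")
        passages = retrieve(question)
    trace.append("generate")
    if passages:
        context = "\n".join(passages)
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
    else:
        prompt = question
    answer = generate(prompt)
    trace.append("return")
    return answer


# Stubs standing in for the retrieval and inference layers.
def fake_retrieve(question: str) -> List[str]:
    return ["Vacation policy: 25 days per year."]


def fake_generate(prompt: str) -> str:
    return "ANSWER based on: " + prompt
```

Passing the retriever and the model in as plain functions keeps the planner independent of the layers below it, which is the property that lets each layer be swapped on its own.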

Layer 4: Retrieval, how the app knows your data

A base model only knows what it learned during training. It has never seen your company handbook or last week’s tickets. Retrieval fixes that. The technique is called retrieval-augmented generation, or RAG. Your documents are turned into numbers that capture meaning, then stored in a vector database. When a question comes in, the system converts it the same way and pulls back the closest matching passages, then hands them to the model as context. Reports put retrieval latency at roughly 50 to 300 milliseconds, depending on the database and the embedding model. Common vector stores include Pinecone, Weaviate, and pgvector on Postgres.
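The store-then-search idea can be shown with a toy example. The big assumption here is the `embed` function: it just counts words, whereas a real system would call an embedding model that captures meaning rather than exact wording. `ToyVectorStore` is a hypothetical in-memory class and does not reflect the API of Pinecone, Weaviate, or pgvector. The similarity measure, cosine similarity, is one that vector stores commonly offer.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self) -> None:
        self._rows: List[Tuple[str, Counter]] = []

    def add(self, passage: str) -> None:
        # Documents are embedded once, at indexing time.
        self._rows.append((passage, embed(passage)))

    def search(self, query: str, k: int = 2) -> List[str]:
        # The question is embedded the same way, then compared to every passage.
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [passage for passage, _ in ranked[:k]]
```

The passages returned by `search` are what gets pasted into the prompt as context. Production stores avoid comparing against every row by using approximate nearest-neighbor indexes.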

Layer 5: Inference, the part that writes the answer

This is the engine. Inference is the act of running a trained model to produce output. Teams reach it in one of two ways. The most common route is a hosted model API, where a provider runs the hardware and you pay per token of text in and out. The other route is self-hosting, where you rent GPUs and run the model yourself, often with a serving engine such as vLLM. On the major clouds the managed route maps to Amazon Bedrock, Azure AI Foundry, and Google Vertex AI. All three keep pace on the popular models, including Claude, Llama, and Mistral; Azure AI Foundry has the deepest GPT-family integration through the OpenAI partnership.

Aspect	Hosted model API	Self-hosted GPUs
You manage	API keys and prompts	Servers, scaling, updates
Pricing unit	Per token	Per GPU-hour
Best for	Most teams, fast start	High volume, strict data control
Cloud examples	Bedrock, Foundry, Vertex AI	EC2, Azure VMs, GCE plus vLLM
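The two pricing units in the table can be compared with simple arithmetic. The sketch below computes the monthly token volume at which a rented GPU costs the same as a per-token API. Every price in it is a made-up placeholder, and the comparison ignores whether one GPU can actually serve that volume, along with engineering time, so treat it as a first-pass estimate only.

```python
def hosted_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Per-token pricing: you pay only for the text processed."""
    return tokens / 1_000_000 * price_per_million_tokens


def self_hosted_cost(hours: float, gpus: int, price_per_gpu_hour: float) -> float:
    """Per-GPU-hour pricing: you pay for the hardware whether it is busy or idle."""
    return hours * gpus * price_per_gpu_hour


def break_even_tokens(
    hours: float,
    gpus: int,
    price_per_gpu_hour: float,
    price_per_million_tokens: float,
) -> float:
    """Token volume at which both routes cost the same over the given hours."""
    fixed = self_hosted_cost(hours, gpus, price_per_gpu_hour)
    return fixed / price_per_million_tokens * 1_000_000


# Hypothetical numbers: one GPU at 2.00 per hour for a 720-hour month,
# against an API priced at 3.00 per million tokens.
monthly_gpu_bill = self_hosted_cost(720, 1, 2.00)
tokens_needed = break_even_tokens(720, 1, 2.00, 3.00)
```

With these placeholder prices the GPU bill is 1,440 per month, which buys 480 million tokens on the hosted route. Below that volume the hosted API is cheaper, which is consistent with the table's "high volume" guidance for self-hosting.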

Layer 6: Observability and evaluation

Once an app is live, the team needs to see what happened inside each request and whether the answer was any good. Observability for AI has three parts. Infrastructure metrics track machines and latency. Telemetry traces the path of each request through every layer. Quality evaluation scores the output itself, catching wrong or low-quality answers that no server metric would reveal. Tools in this space include Langfuse and the cloud providers’ own tracing services. This layer is what turns a black box into something a team can debug and improve.
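Two of the three parts, tracing and quality evaluation, can be sketched briefly. `Trace` and `keyword_score` are hypothetical names, not the API of Langfuse or any cloud tracing service, and the quality check is deliberately crude: it only tests whether required facts appear in the answer. Infrastructure metrics are left out because they come from the hosting platform rather than the app code.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class Trace:
    """Records how long one request spent in each layer."""

    def __init__(self, request_id: str) -> None:
        self.request_id = request_id
        self.spans: List[Dict[str, object]] = []

    @contextmanager
    def span(self, layer: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the wrapped step raised an error.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.spans.append({"layer": layer, "ms": elapsed_ms})


def keyword_score(answer: str, required: List[str]) -> float:
    """Toy quality check: share of required facts that appear in the answer."""
    if not required:
        return 1.0
    hits = sum(1 for fact in required if fact.lower() in answer.lower())
    return hits / len(required)
```

Wrapping each layer call in `with trace.span("retrieval"):` yields a per-request timeline, while a score such as `keyword_score(answer, ["25 days"])` flags answers that were fast but wrong.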

Layer 7: Cost and FinOps

AI apps bill in a unit most cloud teams had never managed before: the token, a fragment of text the model reads or writes. The FinOps Foundation frames the cost of any cloud service as price times quantity, and notes that cost-per-token is now the dominant metric for generative AI services. The share of FinOps teams managing AI spend has risen sharply, from under a third two years ago to nearly all of them today, according to the foundation’s community reports. Three habits keep the bill down:

Model tiering: send easy requests to a small cheap model and only escalate hard ones to a large model.
Semantic caching: reuse a stored answer when a near-identical question has already been asked. Reports cite reductions of 30 to 60 percent in model calls.
Context management: trim long prompts so the model is not charged for text it does not need.
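The first two habits can be sketched with toy rules. Everything named here is an assumption for illustration: the prices in `PRICE_PER_MILLION` are placeholders, `pick_tier` uses prompt length as a stand-in for real difficulty detection, and `NearDuplicateCache` matches questions by word overlap (Jaccard similarity) rather than by embedding similarity, which is what a true semantic cache would use.

```python
from typing import List, Optional, Tuple

# Hypothetical prices per million tokens; real prices vary by provider and model.
PRICE_PER_MILLION = {"small": 0.50, "large": 10.00}


def pick_tier(prompt: str, max_easy_words: int = 30) -> str:
    """Toy tiering rule: short prompts go to the small model, long ones escalate."""
    return "small" if len(prompt.split()) <= max_easy_words else "large"


def request_cost(tier: str, tokens: int) -> float:
    """Cost is price times quantity, with quantity measured in tokens."""
    return PRICE_PER_MILLION[tier] * tokens / 1_000_000


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two questions, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


class NearDuplicateCache:
    """Returns a stored answer when a new question closely matches an old one."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[str, str]] = []

    def get(self, question: str) -> Optional[str]:
        for cached_question, answer in self._entries:
            if jaccard(question, cached_question) >= self.threshold:
                return answer  # cache hit: no model call, no token cost
        return None

    def put(self, question: str, answer: str) -> None:
        self._entries.append((question, answer))
```

A request handler would call `cache.get` first, fall back to the model chosen by `pick_tier` on a miss, and then `cache.put` the result. The threshold is a trade-off: set it too low and users receive answers to questions they did not ask.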

The FinOps Foundation suggests growing this discipline in three stages: crawl, walk, then run. The first stage is simple visibility into spend, and the last ties cost to return on investment.

Vendor-neutral summary

Most AI apps follow the same shape regardless of cloud. The names change, the layers do not.

Layer	AWS	Azure	Google Cloud
API gateway	API Gateway	API Management	API Gateway
Model / inference	Bedrock	AI Foundry	Vertex AI
Vector store	OpenSearch, pgvector	AI Search	Vertex AI Vector Search, pgvector
Observability	CloudWatch	Monitor	Cloud Operations

Sources

FinOps Foundation: FinOps for AI overview, cost-per-token as the dominant unit, and the crawl-walk-run model.
Usage.ai: FinOps X 2026 takeaways, including the rise in teams managing AI spend.
Finout: token economics, semantic caching, and model tiering as cost levers.
GroovyWeb: definition of AI orchestration and the production stack layers.
Spheron: the layered view of AI infrastructure, from compute to governance.
Atlan: retrieval latency ranges and enterprise RAG platform components.
Bits Lovers: comparison of Amazon Bedrock, Azure AI Foundry, and Google Vertex AI.
