How ChatGPT Actually Works Behind the Scenes

A plain-words walk through the request lifecycle of ChatGPT: tokenization, prefill and decode, the GPU inference fleet, custom chips, and streaming.

Added 25 Jun 2026 · 7 min read · Updated 25 Jun 2026

#chatgpt #inference #llm #infrastructure #gpu


Image: a dark server room on the left and a red-lit processor chip on the right. Behind a chat window sits a fleet of servers and accelerator chips that turn your words into a stream of predicted tokens.

When you type a message into ChatGPT and press enter, the reply that streams back is the visible end of a long chain of steps. Your text is routed across the internet to a data center, broken into tokens, processed by a large language model running on specialised chips, and sent back one piece at a time. This guide explains that chain in plain words, including the infrastructure layer that most explainers skip. If you want the product overview instead, the basics page on what ChatGPT is covers the user-facing side.

The request lifecycle at a glance

A single message passes through several distinct stages before any text appears on your screen. The front end never talks to the model directly. Instead, layers of routing, safety, and scheduling sit in between.

Step 1: Edge and gateway. Your request hits a nearby network edge, then an API gateway that authenticates you and checks rate limits.

Step 2: Orchestration. A service assembles the full prompt from the system message, prior conversation, and your input, then runs moderation checks.

Step 3: Tokenize. The assembled text is split into tokens, the numeric units the model reads.

Step 4: Inference. A scheduler places the request on a GPU cluster, which predicts output tokens one at a time.

Step 5: Stream back. Each token is turned back into text and pushed to your screen as it is produced.

The rest of this guide unpacks the stages that matter most: tokenization, the two phases of inference, and the hardware fleet underneath.
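Step 5 in the list above can be pictured as a generator that hands over each piece of text as soon as its token exists, rather than waiting for the whole reply. The sketch below is only an illustration: the token table and the `generate_token_ids` stand-in are hypothetical, and a real service would push these chunks over a network connection.

```python
from typing import Iterable, Iterator

# Hypothetical id-to-text table; real tokenizers hold far more entries.
ID_TO_TEXT = {0: "Hello", 1: ",", 2: " world", 3: "!"}


def generate_token_ids() -> Iterator[int]:
    """Stand-in for the model's decode loop, which yields one id per step."""
    yield from [0, 1, 2, 3]


def stream_reply(token_ids: Iterable[int]) -> Iterator[str]:
    """Convert each token id to text and yield it immediately."""
    for token_id in token_ids:
        yield ID_TO_TEXT[token_id]


if __name__ == "__main__":
    for chunk in stream_reply(generate_token_ids()):
        print(chunk, end="", flush=True)  # text appears piece by piece
    print()
```

Because `stream_reply` is lazy, the first chunk can be shown while later tokens are still being produced, which is the behaviour you see in the chat window.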

Step 1: your text becomes tokens

A language model does not read words. It reads tokens, which are short chunks of text mapped to numbers. A token can be a whole short word, part of a longer word, or a piece of punctuation. The sentence “Write a short poem” might split into chunks like “Write”, “ a”, “ short”, “ poem”, where the leading space is part of the token.

Tokenization is the first translation step. The model only ever sees these numeric tokens, and it only ever produces tokens, which are converted back into readable text at the end. The number of tokens in your conversation matters for two reasons. It sets how much the request costs to serve, because compute and API pricing scale per token. It also counts against the context window, the fixed limit on how much text the model can consider at once. For a deeper look at how splitting works, see the glossary entry on tokenization.
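To make the idea concrete, here is a minimal sketch of splitting text into token ids and back. It uses a greedy longest-match rule over a tiny hypothetical vocabulary, which is a simplification: production tokenizers typically use byte-pair encoding with tens of thousands of entries. The helper names and the price argument are assumptions for illustration only.

```python
# Hypothetical toy vocabulary; the ids carry no meaning beyond this example.
VOCAB = {"Write": 0, " a": 1, " short": 2, " poem": 3, " po": 4, "em": 5}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Split text into token ids, always taking the longest matching chunk."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so " poem" wins over " po".
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def decode(ids: list[int]) -> str:
    """Turn token ids back into readable text."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def fits_context(ids: list[int], context_window: int) -> bool:
    """Check the token count against a fixed context limit."""
    return len(ids) <= context_window


def estimate_cost(ids: list[int], price_per_million_tokens: float) -> float:
    """Per-token pricing: cost grows linearly with the token count."""
    return len(ids) / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    ids = encode("Write a short poem")
    print(ids, repr(decode(ids)), fits_context(ids, context_window=8))
```

The same token count feeds both checks, which is why long conversations cost more and eventually hit the context limit.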

Step 2: the orchestration service assembles the full prompt

The model does not just see your latest message. An orchestration service builds a single combined prompt before anything reaches the hardware. That prompt usually contains three parts.

Part	What it is	Where it comes from
System message	Hidden instructions that set tone and rules	Set by OpenAI for the product
Conversation history	Earlier turns in the same chat	Stored from your session
Your input	The message you just sent	Typed by you

In some modes the orchestration step adds more. With web search or file upload enabled, a retrieval step fetches outside text and inserts it into the prompt. This pattern is called retrieval-augmented generation, covered in the glossary entry on RAG. Moderation checks also run here, before and after the model, to screen for unsafe content.
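The assembly step can be sketched as a function that stacks the three parts in order and optionally slots retrieved text in before the user's message. Everything here is hypothetical: the `Turn` type, the placeholder moderation rule, and the way retrieved text is labelled are assumptions, not a description of OpenAI's actual service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    role: str  # "system", "user", or "assistant"
    text: str


def is_allowed(text: str) -> bool:
    """Placeholder moderation check using a made-up blocklist."""
    blocked_terms = {"forbidden-example"}
    return not any(term in text.lower() for term in blocked_terms)


def assemble_prompt(
    system_message: str,
    history: list[Turn],
    user_input: str,
    retrieved: Optional[list[str]] = None,
) -> list[Turn]:
    """Build one combined prompt from the parts listed in the table above."""
    if not is_allowed(user_input):
        raise ValueError("input rejected by moderation")
    prompt = [Turn("system", system_message)]
    prompt.extend(history)
    if retrieved:
        # Retrieval-augmented generation: outside text is added as context.
        context = "\n".join(retrieved)
        prompt.append(Turn("system", "Reference material:\n" + context))
    prompt.append(Turn("user", user_input))
    return prompt


if __name__ == "__main__":
    history = [Turn("user", "Hi"), Turn("assistant", "Hello")]
    for turn in assemble_prompt("Be concise.", history, "What is a token?"):
        print(turn.role, "|", turn.text)
```

A second moderation pass on the model's output would follow the same shape, applied to the generated text before it is streamed back.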

Step 3: inference has two phases

Inference is the act of running a trained model to produce an output. For a definition in isolation, see the glossary entry on inference. Inside the GPU, the work splits into two phases that behave very differently.

The prefill phase reads the entire prompt at once. Because every token of the prompt can be processed in parallel, this phase keeps the chip busy and is compute-heavy. During prefill the model builds an internal table called the KV cache (key-value cache), which stores the attention keys and values computed for every token it has seen.

The decode phase produces the answer one token at a time. The model predicts the next token, appends it, and repeats. Each new token reuses the KV cache rather than recomputing everything, which is why the cache matters so much. Decode is memory-heavy rather than compute-heavy, because each step has to read the model weights and the growing cache from memory to produce a single token.

Prefill

Reads the full prompt. Parallel. Compute-heavy. Builds the KV cache for every prompt token.

Decode

One token at a time. Sequential. Memory-heavy. Reuses the cache so it never recomputes past tokens.

This two-phase split explains a behaviour you can observe. There is a short pause after you press enter while prefill runs, then text streams out steadily as decode produces tokens. The first-token delay and the steady stream are two separate stages of the same process. The split is described in detail in the survey work on LLM inference serving.
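The prefill and decode split, and the role of the cache, can be shown with a deliberately tiny attention loop. This is a toy under heavy assumptions: one attention head, no layers, no learned weights, and keys and values that are just made-up sine embeddings. It only demonstrates that decoding against a cache gives the same tokens as recomputing everything each step.

```python
import math

DIM = 4          # toy embedding width (hypothetical)
VOCAB_SIZE = 10  # toy vocabulary size (hypothetical)


def embed(token_id):
    """Deterministic stand-in for learned embeddings."""
    return [math.sin(token_id * (i + 1)) for i in range(DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scores = [dot(query, k) / math.sqrt(DIM) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(DIM)]


def pick_next(output):
    """Greedy choice: the token whose embedding best matches the output."""
    return max(range(VOCAB_SIZE), key=lambda t: dot(output, embed(t)))


def generate_with_cache(prompt_ids, steps):
    # Prefill: store an entry for every prompt token.
    # (Real systems do this in parallel; this toy builds the lists in order.)
    keys = [embed(t) for t in prompt_ids]
    values = [embed(t) for t in prompt_ids]
    last, produced = prompt_ids[-1], []
    for _ in range(steps):
        # Decode: one query against the cache, then append one new entry.
        last = pick_next(attend(embed(last), keys, values))
        keys.append(embed(last))
        values.append(embed(last))
        produced.append(last)
    return produced, len(keys)


def generate_without_cache(prompt_ids, steps):
    sequence, produced = list(prompt_ids), []
    for _ in range(steps):
        # Recompute keys and values for the whole sequence at every step.
        keys = [embed(t) for t in sequence]
        values = [embed(t) for t in sequence]
        nxt = pick_next(attend(embed(sequence[-1]), keys, values))
        sequence.append(nxt)
        produced.append(nxt)
    return produced


if __name__ == "__main__":
    cached, cache_len = generate_with_cache([1, 2, 3], steps=5)
    print(cached == generate_without_cache([1, 2, 3], steps=5), cache_len)
```

The cached version does one new `embed` per step, while the uncached version redoes the whole sequence, so its per-step work grows with the length of the conversation.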

Step 4: many users share each chip

A single user does not get a dedicated chip. That would waste most of the hardware, because decode reads memory for one token at a time and leaves compute capacity idle. Serving systems solve this with continuous batching, where many users’ requests are processed together on the same GPU and the batch changes shape every step.

When one user’s answer finishes, that slot is freed and a new request is admitted. New prompts run their prefill phase while other requests are still decoding. Interleaving the two phases keeps expensive accelerators near full utilisation. This is the main reason a service can answer millions of people at once without a chip per person. The technique is explained in the practical guide to serving LLMs with vLLM.
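The slot-freeing behaviour described above can be simulated in a few lines. This sketch is a scheduling toy, not how vLLM or any real server is implemented: the `Request` type, the fixed slot count, and the assumption that every request needs at least one token are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_to_generate: int  # assumed to be at least 1
    generated: int = 0


def run_continuous_batching(requests, max_batch_size):
    """Return which requests shared the accelerator at each decode step."""
    queue = deque(requests)
    active = []
    timeline = []
    while queue or active:
        # Admit waiting requests into free slots (their prefill would run here).
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        timeline.append([r.name for r in active])
        # One decode step: every active request gains one token.
        for r in active:
            r.generated += 1
        # Finished requests free their slots right away.
        active = [r for r in active if r.generated < r.tokens_to_generate]
    return timeline


if __name__ == "__main__":
    jobs = [Request("A", 3), Request("B", 1), Request("C", 2)]
    for step, batch in enumerate(run_continuous_batching(jobs, 2), start=1):
        print(step, batch)
```

With two slots, request C joins as soon as B finishes, so all three requests complete in three steps. Waiting for the whole first batch to finish before admitting C would take five.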

Step 5: the infrastructure layer underneath

The model runs on a layered stack of infrastructure. Each layer has a narrow job.

Edge and routing

CDN edge, API gateway, load balancer. Authenticates, rate-limits, and routes to the nearest region.

Orchestration

Prompt assembly, moderation, retrieval and tools.

Scheduling

Request queue, continuous batching. Packs many users onto each accelerator.

Inference fleet

GPU clusters, KV cache memory, custom chips.

The bottom layer is where the cost lives. Most large model serving today runs on NVIDIA GPUs such as the H200, which carries 141 GB of high-bandwidth memory and lets a single chip serve models that previously needed two. Operators report cloud rental rates in a broad range of roughly 2 to 6 US dollars per hour for that class of chip, per the analysis from Introl. The cost per answer is driven down by batching, by reusing the KV cache, and increasingly by purpose-built silicon.
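A rough back-of-envelope calculation shows why batching and memory dominate the economics. The hourly rates below come from the range quoted above, but the throughput figures and the model shape are hypothetical placeholders, not measurements of any real deployment.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Rental cost spread over every token the chip produces in an hour."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int) -> int:
    """One key and one value vector per KV head, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Assumed throughputs: a lightly batched chip versus a well-batched one.
    for rate in (2.0, 6.0):
        for throughput in (500, 5000):
            cost = cost_per_million_tokens(rate, throughput)
            print(f"${rate}/h at {throughput} tok/s: ${cost:.2f} per million tokens")

    # Hypothetical model shape: 32 layers, 8 KV heads of width 128, 16-bit values.
    per_token = kv_cache_bytes_per_token(32, 8, 128, 2)
    print(per_token, "bytes of KV cache per token")
```

Under these assumptions a tenfold gain in batched throughput cuts the cost per token tenfold, and the per-token cache footprint shows how quickly long conversations consume accelerator memory.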

Custom chips enter the fleet

The dominant chips today are general-purpose GPUs, but the picture is shifting toward hardware built only for inference. On 24 June 2026, OpenAI and Broadcom unveiled Jalapeño, a custom inference chip designed for running models rather than training them. The companies say it offers better performance-per-watt than current alternatives and aim to begin deployment in late 2026.

Two points are worth keeping straight. First, Jalapeño targets inference only. More demanding pre-training work is expected to stay on NVIDIA hardware, and OpenAI’s large NVIDIA commitments remain in place. Second, custom silicon does not replace the GPU fleet overnight. Reporting from VentureBeat and TechCrunch frames it as a way to cut dependence on a single supplier and lower the cost of serving each token. The wiki covers this in more depth in the glossary entry on AI hardware.

Putting it together

The short version is that a chat reply is a pipeline, not a single black box. Your words are tokenized, wrapped in a larger prompt, screened, and scheduled onto shared accelerators. The model runs a fast parallel prefill, then a slower one-token-at-a-time decode, streaming each token back as it goes. Underneath sits a fleet of routing, scheduling, and hardware layers whose entire purpose is to make a large model answer many people at once, cheaply enough to keep the service running.

