Why AI Companies Are Building Their Own Chips

Why OpenAI, Google, and Amazon design custom AI silicon: inference cost, performance per watt, supply limits, and control of the full stack.

Added 25 Jun 2026. Updated 25 Jun 2026.

#ai-hardware #inference #chips #infrastructure

Image: a dark server room on the left and a red-lit processor chip on the right. The split between buying off-the-shelf compute and designing your own chip is now a strategic choice for large AI companies.

On 24 June 2026, OpenAI and Broadcom unveiled Jalapeno, OpenAI’s first custom chip, built for one job: running AI models in response to user requests. It joins a growing set of in-house chips from Google, Amazon, Microsoft, and Meta. The pattern matters to anyone building with AI because the chip a model runs on shapes both the price you pay per request and how much capacity exists to serve you.

What just happened with OpenAI Jalapeno

Jalapeno is an inference chip. Inference is the act of running a trained model to answer a query, as opposed to training, which is the process of building the model in the first place. OpenAI designed Jalapeno around the workload behind ChatGPT and its coding models, and Broadcom handles the silicon implementation, networking, and connectivity.

The chip was co-developed from initial design to manufacturing tape-out in about nine months, which OpenAI and Broadcom describe as one of the fastest cycles for a chip of this class. OpenAI used its own models to speed parts of the design. Early reporting put the cost saving at roughly 50% compared with typical AI graphics processing units, and OpenAI cites better performance per watt than current alternatives. The first Jalapeno servers are due online by the end of 2026, expanding toward gigawatt-scale data centres in the years after.

“We have a deep understanding of the workload. How can we build something that will be able to accelerate what’s possible?” - Greg Brockman, OpenAI president, as quoted by TechCrunch.

ASIC vs GPU in plain words

The two main chip types in this story are the GPU and the ASIC.

A GPU, or graphics processing unit, is a general-purpose accelerator. It was first built for graphics, then adapted for AI. It runs many kinds of model and many kinds of software. Nvidia dominates this market, and its CUDA software is the reason most AI code runs on Nvidia hardware today.

An ASIC, or application-specific integrated circuit, is a chip designed for one narrow task. Google’s TPU, Amazon’s Trainium and Inferentia, and OpenAI’s Jalapeno are all ASICs. Because an ASIC does fewer things, it can do its one thing with less wasted power and silicon. The trade-off is flexibility: change the workload and the chip may no longer fit.

	GPU (e.g. Nvidia)	ASIC (e.g. TPU, Jalapeno)
Designed for	Many AI and graphics tasks	One narrow workload
Flexibility	High, runs most models	Low, tuned to specific models
Performance per watt	Good, general	Higher for the target task
Software ecosystem	Mature (CUDA)	Owner-controlled, narrower
Best for	Research, mixed workloads	High-volume, repeated workloads

Training chips vs inference chips

Not all AI chips do the same job. Training and inference put different demands on hardware.

Step 1: Training. A model learns from data once. Heavy, parallel maths over weeks or months. Needs huge memory and bandwidth.

Step 2: Deployment. The finished model is copied into data centres to serve users.

Step 3: Inference. The model answers each query. This runs billions of times. Cost per request and latency dominate.

Training happens once per model version. Inference happens on every single request, forever. As products like ChatGPT scale to billions of queries, the cost of inference becomes the larger and more predictable bill. That is why several custom chips, including Jalapeno and Google’s Ironwood, target inference first. Amazon splits the two jobs across two chip lines: Trainium for training and Inferentia for inference.
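The gap between a one-time training bill and a recurring inference bill can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and every figure in it are hypothetical, not numbers from OpenAI, Google, or Amazon. It returns how many queries it takes for cumulative inference spend to pass the training spend.

```python
import math


def queries_to_exceed_training(training_cost: float, cost_per_query: float) -> int:
    """Smallest number of queries whose total inference cost exceeds training_cost."""
    if training_cost < 0 or cost_per_query <= 0:
        raise ValueError("training_cost must be >= 0 and cost_per_query must be > 0")
    # At floor(training / per_query) queries the total is at most equal to the
    # training cost, so one more query is needed to exceed it.
    return math.floor(training_cost / cost_per_query) + 1


# Hypothetical figures, chosen only to show the shape of the comparison.
TRAINING_COST = 100_000_000.0  # assumed one-time spend for one model version
COST_PER_QUERY = 0.01          # assumed recurring spend per request

n = queries_to_exceed_training(TRAINING_COST, COST_PER_QUERY)
print(f"Inference spend passes training spend after about {n:,} queries")
```

Under these assumed figures the crossover arrives within days for a product serving billions of queries, which is the intuition behind targeting inference first.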

The four reasons companies design their own chips

1. The recurring cost of inference

Every query a model answers costs money in electricity and hardware time. A custom chip tuned to a known workload can cut that per-query cost sharply. Amazon reports that its first Inferentia chip delivered up to 70% lower cost per inference than comparable GPU-based instances, and that some inference workloads moved from Nvidia GPUs to Inferentia can see cost reductions in the range of 80 to 90%. The roughly 50% saving reported for Jalapeno follows the same logic. When you run a model billions of times, even a small saving on each request adds up fast.
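To see how a per-request saving compounds, here is a minimal sketch. The query volume and per-query cost are made-up assumptions, and the helper names are hypothetical; only the 50% saving fraction echoes the figure reported for Jalapeno.

```python
def annual_inference_cost(queries_per_day: float, cost_per_query: float, days: int = 365) -> float:
    """Total yearly spend on inference at a flat per-query cost."""
    return queries_per_day * cost_per_query * days


def annual_saving(queries_per_day: float, cost_per_query: float,
                  saving_fraction: float, days: int = 365) -> float:
    """Yearly money saved if each query becomes saving_fraction cheaper."""
    if not 0.0 <= saving_fraction <= 1.0:
        raise ValueError("saving_fraction must be between 0 and 1")
    return annual_inference_cost(queries_per_day, cost_per_query, days) * saving_fraction


# Hypothetical workload: the volume and unit cost are assumptions, not reported data.
QUERIES_PER_DAY = 1_000_000_000
COST_PER_QUERY = 0.002
SAVING_FRACTION = 0.5  # the roughly 50% figure reported for Jalapeno

saved = annual_saving(QUERIES_PER_DAY, COST_PER_QUERY, SAVING_FRACTION)
print(f"Assumed yearly saving: {saved:,.0f} currency units")
```

The saving scales linearly with volume, so the same percentage that looks negligible on one request becomes a large line item at production scale.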

2. Performance per watt

Power is a hard limit on how much AI a data centre can run. The bottleneck is increasingly electricity and cooling, not floor space. A chip that does more useful work per watt lets a company serve more users from the same power budget. Google describes its seventh-generation Ironwood TPU as its most energy-efficient custom silicon, with large gains per chip over the prior generation. OpenAI makes performance per watt a central claim for Jalapeno.
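The power-budget argument can be sketched the same way. The example below assumes a fixed site power budget and two hypothetical chips that draw the same power but differ in throughput; all names and numbers are illustrative, and the model ignores cooling and networking overhead.

```python
def servable_queries_per_second(power_budget_watts: int, watts_per_chip: int,
                                queries_per_second_per_chip: int) -> int:
    """Queries per second a site can serve when power, not floor space, is the limit."""
    if watts_per_chip <= 0:
        raise ValueError("watts_per_chip must be positive")
    # Only whole chips can be installed within the budget.
    chips = power_budget_watts // watts_per_chip
    return chips * queries_per_second_per_chip


# Hypothetical site and chips; none of these figures come from a vendor.
POWER_BUDGET_WATTS = 10_000_000  # assumed 10 MW available for accelerators

general_chip = servable_queries_per_second(POWER_BUDGET_WATTS, 1_000, 100)
tuned_chip = servable_queries_per_second(POWER_BUDGET_WATTS, 1_000, 150)

print(f"General-purpose chip: {general_chip:,} queries/s")
print(f"Workload-tuned chip:  {tuned_chip:,} queries/s")
```

With the power budget held constant, any gain in queries per watt translates directly into more users served from the same site.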

3. Supply constraints and dependence on Nvidia

Nvidia holds roughly 80% of the AI accelerator market in 2026, and its GPUs are both costly and supply-limited. Hyperscaler capital spending on AI infrastructure in 2026 runs into the hundreds of billions of euros, which strains the supply of chips, networking, power, and cooling. Designing an in-house chip gives a company a second source of compute that it controls, reducing the risk of waiting in line for Nvidia parts. Notably, a large share of Nvidia’s revenue comes from the same handful of cloud companies now building their own alternatives.

4. Control of the full stack

When a company designs the chip, the networking, and the software together, it can tune each layer to the others. OpenAI co-designed Jalapeno’s hardware with its own model software. Google trains and serves Gemini largely on its own TPUs. This vertical control can improve efficiency and reduce reliance on outside vendors for the most strategic part of the business.

Who is building what

OpenAI: Jalapeno. Inference ASIC. Built with Broadcom.

Google: TPU (Ironwood). Training and inference. Serves Gemini.

Amazon: Trainium, Inferentia. Used by Anthropic.

Others: Microsoft Maia, Meta MTIA. Often via Broadcom.

Amazon’s custom silicon business, which includes Trainium, Inferentia, and its Graviton general-compute line, passed a 20 billion US dollar annualised run rate by early 2026. Broadcom is the common partner behind much of this work, providing design and manufacturing support for several of these chips.

What this means if you build with AI

Custom chips do not replace Nvidia GPUs. The consensus in 2026 reporting is that Nvidia’s lead holds for training and for flexible, fast-changing research, where its CUDA software is hard to match. Custom ASICs win where the workload is high-volume, stable, and cost-sensitive, which describes most production inference.

For you as a builder, the practical effect is downstream pricing. As providers shift inference onto cheaper in-house chips, the cost per token of running a model can fall, and capacity can expand. You rarely choose the chip directly. You choose a model and a provider, and the chip sits underneath. Knowing which chip a workload runs on helps explain why some inference is cheap and abundant while frontier training stays scarce and expensive.
