Groq
Groq builds the LPU, a custom inference chip, and GroqCloud, a fast, OpenAI-compatible API for running open models.

Groq is a hardware and cloud company built around one job: running models that already exist, not training them. It designs the LPU (Language Processing Unit), a chip purpose-built for inference , and offers GroqCloud, an API for calling open foundation models at high speed. The problem it solves is latency. Most inference runs on GPUs designed for training, where memory movement and unpredictable scheduling add delay. Groq rearranges the hardware so tokens come back fast and at a predictable rate.
Jonathan Ross, who earlier worked on Google’s Tensor Processing Unit, founded Groq in 2016. The company frames the LPU with the line “Designed for inference. Not adapted for it.”
Where it sits in the stack
What an LPU is
An LPU is a processor built for one shape of work: the linear algebra that runs a language model forward, token by token. Groq describes several design choices that set it apart from a GPU.
- On-chip memory as primary storage. The LPU holds hundreds of megabytes of SRAM as the main place model weights live, not as a cache. Groq says this keeps the compute units fed at full speed and cuts latency. A GPU, by contrast, moves weights back and forth from separate high-bandwidth memory, which adds delay.
- Deterministic execution. A purpose-built compiler schedules every operation ahead of time. Groq calls this static scheduling and says “every cycle is accounted for,” so the chip runs at a consistent, predictable rate rather than reacting to runtime surprises.
- Direct chip-to-chip links. For large models spread across many chips, Groq connects LPUs directly so hundreds of them “act as a single core,” with the compiler predicting when data arrives instead of relying on switches.
- Air-cooled by design. Groq states the LPU is air-cooled, which avoids the liquid-cooling plumbing that dense GPU racks often need.
The short version: a GPU is a flexible engine that can train and serve many workloads. An LPU narrows the target to inference and trades generality for speed and predictability.
How to access it
You do not buy an LPU. You call GroqCloud, a hosted API. GroqCloud is OpenAI-compatible, so if your code already talks to an OpenAI-style endpoint, you point it at Groq by changing the base URL and API key.
Groq reports that roughly three million developers and teams use its platform, and names customers including Vercel, Canva, and Robinhood. The typical use is any workload where response speed matters: live chat, voice interfaces, and agent loops that make many model calls in sequence.
How it compares
Groq competes with other providers that host open models behind fast APIs. The main difference is that Groq runs custom silicon, while most rivals run GPUs.
| Groq | Fireworks AI | Together AI | Major GPU clouds | |
|---|---|---|---|---|
| Hardware | Custom LPU | GPU | GPU | GPU |
| Main pitch | Very fast, predictable inference | Fast open-model serving | Broad open-model catalog | General compute and inference |
| Own model | No, hosts open models | No, hosts open models | No, hosts open models | Varies |
| API style | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible | Varies by provider |
| Best for | Latency-sensitive apps | Tuned open-model endpoints | Model variety and fine-tuning | Mixed training and serving |
When not to use it
- You need to train or fine-tune models. The LPU targets inference. For training runs, use a GPU cloud.
- You need a specific closed model. Groq hosts open-weight models. If your product depends on a proprietary model such as Claude, use that vendor’s API or a platform like Amazon Bedrock .
- Latency is not your bottleneck. If your workload is batch processing where total cost matters more than speed per token, compare per-token pricing across providers before committing.
- You need a model Groq does not host. Check the current model list first. If your chosen model is absent, a broader catalog provider may fit better.
Further reading
- What is inference? : why running a trained model is a separate problem from training it.
- What are foundation models? : the large open models that GroqCloud serves.
- Fireworks AI : a GPU-based provider for fast open-model serving.
- The LLM landscape in 2026 : where inference providers fit among model makers.
- What is a Language Processing Unit? (Groq) : Groq’s own explanation of the chip.
- LPU architecture (Groq) : the official architecture overview.