A split image of a server room and a red-lit processor, representing custom inference hardware.
Groq designs its own silicon, the LPU, so that running a model is fast and predictable rather than an afterthought on general-purpose chips.

Groq is a hardware and cloud company built around one job: running models that already exist, not training them. It designs the LPU (Language Processing Unit), a chip purpose-built for inference , and offers GroqCloud, an API for calling open foundation models at high speed. The problem it solves is latency. Most inference runs on GPUs designed for training, where memory movement and unpredictable scheduling add delay. Groq rearranges the hardware so tokens come back fast and at a predictable rate.

Jonathan Ross, who earlier worked on Google’s Tensor Processing Unit, founded Groq in 2016. The company frames the LPU with the line “Designed for inference. Not adapted for it.”

Where it sits in the stack

Your app
Chatbot Agent loop RAG backend Calls an OpenAI-compatible endpoint
API layer
GroqCloud API OpenAI-compatible routes Swap the base URL and key, keep your client code
Model layer
Open-weight LLMs Llama family Hosted open models, not Groq's own model
Hardware layer
LPU On-chip SRAM Deterministic compiler Custom silicon instead of GPUs

What an LPU is

An LPU is a processor built for one shape of work: the linear algebra that runs a language model forward, token by token. Groq describes several design choices that set it apart from a GPU.

  • On-chip memory as primary storage. The LPU holds hundreds of megabytes of SRAM as the main place model weights live, not as a cache. Groq says this keeps the compute units fed at full speed and cuts latency. A GPU, by contrast, moves weights back and forth from separate high-bandwidth memory, which adds delay.
  • Deterministic execution. A purpose-built compiler schedules every operation ahead of time. Groq calls this static scheduling and says “every cycle is accounted for,” so the chip runs at a consistent, predictable rate rather than reacting to runtime surprises.
  • Direct chip-to-chip links. For large models spread across many chips, Groq connects LPUs directly so hundreds of them “act as a single core,” with the compiler predicting when data arrives instead of relying on switches.
  • Air-cooled by design. Groq states the LPU is air-cooled, which avoids the liquid-cooling plumbing that dense GPU racks often need.

The short version: a GPU is a flexible engine that can train and serve many workloads. An LPU narrows the target to inference and trades generality for speed and predictability.

How to access it

You do not buy an LPU. You call GroqCloud, a hosted API. GroqCloud is OpenAI-compatible, so if your code already talks to an OpenAI-style endpoint, you point it at Groq by changing the base URL and API key.

Step 1 Get a key Sign up at console.groq.com and create an API key.
Step 2 Pick a model Choose a hosted open model, such as a Llama variant, from the model list.
Step 3 Point your client Set the base URL to Groq and reuse your OpenAI-style client.
Step 4 Stream tokens Send prompts and stream responses back to your app.

Groq reports that roughly three million developers and teams use its platform, and names customers including Vercel, Canva, and Robinhood. The typical use is any workload where response speed matters: live chat, voice interfaces, and agent loops that make many model calls in sequence.

How it compares

Groq competes with other providers that host open models behind fast APIs. The main difference is that Groq runs custom silicon, while most rivals run GPUs.

GroqFireworks AITogether AIMajor GPU clouds
HardwareCustom LPUGPUGPUGPU
Main pitchVery fast, predictable inferenceFast open-model servingBroad open-model catalogGeneral compute and inference
Own modelNo, hosts open modelsNo, hosts open modelsNo, hosts open modelsVaries
API styleOpenAI-compatibleOpenAI-compatibleOpenAI-compatibleVaries by provider
Best forLatency-sensitive appsTuned open-model endpointsModel variety and fine-tuningMixed training and serving

When not to use it

  • You need to train or fine-tune models. The LPU targets inference. For training runs, use a GPU cloud.
  • You need a specific closed model. Groq hosts open-weight models. If your product depends on a proprietary model such as Claude, use that vendor’s API or a platform like Amazon Bedrock .
  • Latency is not your bottleneck. If your workload is batch processing where total cost matters more than speed per token, compare per-token pricing across providers before committing.
  • You need a model Groq does not host. Check the current model list first. If your chosen model is absent, a broader catalog provider may fit better.

Further reading

Sources