Tool

Added 29 Jun 2026 Last updated 29 Jun 2026 Read time 5 min

Groq

Groq builds the LPU, a custom inference chip, and GroqCloud, a fast, OpenAI-compatible API for running open models.

inferencehardwarellmcloud

Connected Inference - Running AI Models in Production Foundation Models Fireworks AI Together AI LLM Landscape 2026: Every Major Model Compared

Learn this your way

Read Guided course

A split image of a server room and a red-lit processor, representing custom inference hardware. — Groq designs its own silicon, the LPU, so that running a model is fast and predictable rather than an afterthought on general-purpose chips.

Groq is a hardware and cloud company built around one job: running models that already exist, not training them. It designs the LPU (Language Processing Unit), a chip purpose-built for inference , and offers GroqCloud, an API for calling open foundation models at high speed. The problem it solves is latency. Most inference runs on GPUs designed for training, where memory movement and unpredictable scheduling add delay. Groq rearranges the hardware so tokens come back fast and at a predictable rate.

Jonathan Ross, who earlier worked on Google’s Tensor Processing Unit, founded Groq in 2016. The company frames the LPU with the line “Designed for inference. Not adapted for it.”

Where it sits in the stack

Your app

Chatbot Agent loop RAG backend Calls an OpenAI-compatible endpoint

API layer

GroqCloud API OpenAI-compatible routes Swap the base URL and key, keep your client code

Model layer

Open-weight LLMs Llama family Hosted open models, not Groq's own model

Hardware layer

LPU On-chip SRAM Deterministic compiler Custom silicon instead of GPUs

What an LPU is

An LPU is a processor built for one shape of work: the linear algebra that runs a language model forward, token by token. Groq describes several design choices that set it apart from a GPU.

On-chip memory as primary storage. The LPU holds hundreds of megabytes of SRAM as the main place model weights live, not as a cache. Groq says this keeps the compute units fed at full speed and cuts latency. A GPU, by contrast, moves weights back and forth from separate high-bandwidth memory, which adds delay.
Deterministic execution. A purpose-built compiler schedules every operation ahead of time. Groq calls this static scheduling and says “every cycle is accounted for,” so the chip runs at a consistent, predictable rate rather than reacting to runtime surprises.
Direct chip-to-chip links. For large models spread across many chips, Groq connects LPUs directly so hundreds of them “act as a single core,” with the compiler predicting when data arrives instead of relying on switches.
Air-cooled by design. Groq states the LPU is air-cooled, which avoids the liquid-cooling plumbing that dense GPU racks often need.

The short version: a GPU is a flexible engine that can train and serve many workloads. An LPU narrows the target to inference and trades generality for speed and predictability.

How to access it

You do not buy an LPU. You call GroqCloud, a hosted API. GroqCloud is OpenAI-compatible, so if your code already talks to an OpenAI-style endpoint, you point it at Groq by changing the base URL and API key.

Step 1 Get a key Sign up at console.groq.com and create an API key.

→

Step 2 Pick a model Choose a hosted open model, such as a Llama variant, from the model list.

→

Step 3 Point your client Set the base URL to Groq and reuse your OpenAI-style client.

→

Step 4 Stream tokens Send prompts and stream responses back to your app.

Groq reports that roughly three million developers and teams use its platform, and names customers including Vercel, Canva, and Robinhood. The typical use is any workload where response speed matters: live chat, voice interfaces, and agent loops that make many model calls in sequence.

How it compares

Groq competes with other providers that host open models behind fast APIs. The main difference is that Groq runs custom silicon, while most rivals run GPUs.

	Groq	Fireworks AI	Together AI	Major GPU clouds
Hardware	Custom LPU	GPU	GPU	GPU
Main pitch	Very fast, predictable inference	Fast open-model serving	Broad open-model catalog	General compute and inference
Own model	No, hosts open models	No, hosts open models	No, hosts open models	Varies
API style	OpenAI-compatible	OpenAI-compatible	OpenAI-compatible	Varies by provider
Best for	Latency-sensitive apps	Tuned open-model endpoints	Model variety and fine-tuning	Mixed training and serving

When not to use it

You need to train or fine-tune models. The LPU targets inference. For training runs, use a GPU cloud.
You need a specific closed model. Groq hosts open-weight models. If your product depends on a proprietary model such as Claude, use that vendor’s API or a platform like Amazon Bedrock .
Latency is not your bottleneck. If your workload is batch processing where total cost matters more than speed per token, compare per-token pricing across providers before committing.
You need a model Groq does not host. Check the current model list first. If your chosen model is absent, a broader catalog provider may fit better.

Sources

Open source projects

Freelancer Templates Contracts, proposals, SOWs

Freelancer Automation Workflow recipes, AI playbooks

Work with Linda

Workshop Series €2,000/mo x 3

1:1 Consulting 60 min session