Ray Serve
Ray Serve is a framework-agnostic model-serving library on Ray that scales single models and multi-model pipelines across a cluster with autoscaling and Python-native composition.

Ray Serve is a scalable model-serving library built on Ray, the distributed computing framework maintained as open source and commercialised by Anyscale. It lets you deploy machine learning models and plain Python logic as online inference APIs, then scale each piece independently across a cluster. Its focus sets it apart from single-model servers: Ray Serve is built for composing several models and steps into one service, not for squeezing maximum throughput out of one large language model on one node.
The problem it solves is orchestration. A real inference service is rarely one model. It is a preprocessing step, a retrieval call, one or more models, and post-processing glue. Wiring these together across machines, and scaling each part to match its own load, is the hard part. Ray Serve exposes that graph as ordinary Python, so calls between models look like function calls rather than network plumbing.
Where it sits
Ray Serve is the serving layer of the Ray ecosystem. It runs on Ray Core and shares a cluster with the rest of the Ray libraries.
How to use it and how it fits
A deployment is the core unit. You decorate a Python class with @serve.deployment, bind its constructor arguments with .bind(), and run the result with serve.run(). Ray Serve then hosts one or more replicas of that class and routes requests to them.
Composition is where Ray Serve earns its place. One deployment holds a DeploymentHandle to another and calls its methods with .remote(), which runs the call asynchronously somewhere in the cluster. The example below chains a preprocessing step into a model, each as its own deployment that scales independently.

    from ray import serve
    from ray.serve.handle import DeploymentHandle


    @serve.deployment(num_replicas=2)
    class Preprocessor:
        def clean(self, text: str) -> str:
            return text.strip().lower()


    @serve.deployment(
        autoscaling_config={"min_replicas": 1, "max_replicas": 8}
    )
    class Classifier:
        def __init__(self, prep: DeploymentHandle):
            # Handle to the Preprocessor deployment, passed in by .bind() below
            self.prep = prep
            # load your PyTorch or scikit-learn model here

        async def __call__(self, request):
            raw = (await request.json())["text"]
            # Call the other deployment; awaiting the response yields its return value
            cleaned = await self.prep.clean.remote(raw)
            return {"label": self.predict(cleaned)}

        def predict(self, text: str) -> str:
            # Placeholder logic standing in for a real model
            return "positive" if "good" in text else "negative"


    # Build the application graph and start serving it
    app = Classifier.bind(Preprocessor.bind())
    serve.run(app)

Two details matter for scale. First, autoscaling_config adjusts the replica count up and down with load, so a heavy model and a cheap preprocessor size themselves separately. Second, Ray Serve supports dynamic request batching, which groups incoming requests to use vectorised operations more efficiently. For large language model workloads, Ray Serve adds response streaming and multi-node, multi-GPU serving. This same composition model suits AI agents, where a request may route through several models and tools before returning.
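The two scaling ideas above can be pictured in plain Python. The sketch below is a conceptual illustration only and does not use or reproduce Ray Serve's implementation. The names `desired_replicas`, `MicroBatcher`, `target_per_replica`, and `wait_s` are hypothetical, and the scaling rule (enough replicas to keep in-flight requests per replica near a target, clamped to the configured bounds) is a simplifying assumption, not something the text above specifies.

```python
import asyncio
import math
from typing import Any, Callable, List, Optional, Tuple


def desired_replicas(in_flight: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Simplified scaling rule (an assumption, not Ray Serve's policy):
    enough replicas to meet the per-replica target, clamped to the bounds."""
    wanted = math.ceil(in_flight / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


class MicroBatcher:
    """Groups individual calls into batches before invoking batch_fn.

    A batch is flushed when it reaches max_batch_size, or when wait_s
    seconds have passed since the first item of the batch arrived.
    """

    def __init__(self, batch_fn: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 4, wait_s: float = 0.01):
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.wait_s = wait_s
        self.pending: List[Tuple[Any, asyncio.Future]] = []
        self.flush_task: Optional[asyncio.Task] = None

    async def submit(self, item: Any) -> Any:
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((item, fut))
        if len(self.pending) >= self.max_batch_size:
            self._flush()  # batch is full: run it now
        elif self.flush_task is None:
            # first item of a new batch: start the wait timer
            self.flush_task = asyncio.create_task(self._flush_later())
        return await fut

    async def _flush_later(self) -> None:
        await asyncio.sleep(self.wait_s)
        self.flush_task = None
        self._flush()

    def _flush(self) -> None:
        if self.flush_task is not None:
            self.flush_task.cancel()  # the batch filled before the timer fired
            self.flush_task = None
        batch, self.pending = self.pending, []
        if not batch:
            return
        # One vectorised call for the whole batch, then fan results back out
        results = self.batch_fn([item for item, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)


async def demo() -> Tuple[List[int], List[int]]:
    batch_sizes: List[int] = []

    def double_all(items: List[int]) -> List[int]:
        batch_sizes.append(len(items))  # record how requests were grouped
        return [2 * x for x in items]

    batcher = MicroBatcher(double_all, max_batch_size=4, wait_s=0.01)
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    return list(results), batch_sizes
```

Running `asyncio.run(demo())` submits ten requests concurrently; with a batch size of four they are grouped into batches of 4, 4, and 2, and each caller still receives only its own result. The trade-off is the same one any batching layer faces: a longer wait fills batches more fully but adds latency to the first request in each batch.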
How it compares
Ray Serve occupies a different niche from dedicated LLM inference servers. Those tools optimise one model on one deployment. Ray Serve orchestrates many pieces across a cluster and can call those servers as parts of a larger graph.
| | Ray Serve | TGI | vLLM | Managed endpoint |
|---|---|---|---|---|
| Primary focus | Multi-model composition | Single LLM serving | Single LLM serving | Hosted single model |
| Framework support | Any Python framework | Transformer LLMs | Transformer LLMs | Provider models |
| Runs where | Your Ray cluster | Your server | Your server | Provider infrastructure |
| Scaling unit | Per-deployment replicas | Model replicas | Model replicas | Provider-managed |
| You operate it | Yes | Yes | Yes | No |
| Best for | Pipelines, mixed models | One high-throughput LLM | One high-throughput LLM | Fastest time to live |
If you need raw throughput for a single model, TGI or vLLM is more direct. If you need to stitch several models and steps into one autoscaling service, Ray Serve is the composition layer, and it can host TGI or vLLM inside individual deployments.
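As a rough illustration of one step in a graph calling a dedicated LLM server, the sketch below builds an HTTP request for an OpenAI-style `/v1/completions` endpoint, which vLLM's OpenAI-compatible server provides. The address `http://llm-server:8000`, the model name `my-model`, and the helper names are hypothetical placeholders, and the code uses only the standard library rather than any Ray Serve API.

```python
import json
import urllib.request

# Hypothetical address of a separately running LLM server
LLM_URL = "http://llm-server:8000/v1/completions"


def build_completion_request(prompt: str, model: str = "my-model",
                             max_tokens: int = 64,
                             url: str = LLM_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for an OpenAI-style completion."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout_s: float = 30.0) -> str:
    """Send the request and return the text of the first completion.

    Requires a server listening at LLM_URL; this call blocks until it replies.
    """
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Inside an async deployment method, a blocking call like `complete` would stall other requests on the same replica, so an async HTTP client or a worker thread is the more likely choice in practice.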
When not to use it
Ray Serve adds a cluster and a programming model. That overhead is not always worth it.
- You serve one model with one endpoint. A single-model server such as TGI or vLLM is simpler to run and tune, and a managed endpoint removes operations entirely.
- You want a fully managed service. Ray Serve is a library you deploy and operate yourself. If you prefer not to run infrastructure, a hosted inference endpoint fits better.
- Your team has no Ray experience. The distributed model, replicas, and handles carry a learning curve. For a first production service, weigh that against a simpler path in the zero to production guide.
- Latency budgets are extremely tight and the graph is trivial. Extra hops between deployments add coordination cost that a single process avoids.
Further reading
- Ray Serve documentation: official reference for deployments, composition, and autoscaling.
- Ray project home: overview of Ray Core, Data, Train, Serve, Tune, and RLlib.
- What is inference?: the serving stage Ray Serve is built to run.
- TGI: a single-model LLM server you can compare against or host inside a deployment.
- AI agents: multi-step services that map naturally onto Ray Serve composition.
- From zero to production: a path to your first deployed service.
Sources
- Ray Serve documentation, Ray project, fetched 2026-06-29.
- Ray project home, Ray project, fetched 2026-06-29.