
Added 29 Jun 2026. Last updated 29 Jun 2026. Read time 5 min.

Ray Serve

Ray Serve is a framework-agnostic model-serving library on Ray that scales single models and multi-model pipelines across a cluster with autoscaling and Python-native composition.

Tags: ray, model-serving, inference, distributed-systems, autoscaling



Figure: Floating interconnected purple nodes, representing a distributed framework for scaling model serving. Ray Serve treats each model and each piece of business logic as an independently scaling node in a connected graph.

Ray Serve is a scalable model-serving library built on Ray, the distributed computing framework maintained as open source and commercialised by Anyscale. It lets you deploy machine learning models and plain Python logic as online inference APIs, then scale each piece independently across a cluster. Its focus sets it apart from single-model servers: Ray Serve is built for composing several models and steps into one service, not for squeezing maximum throughput out of one large language model on one node.

The problem it solves is orchestration. A real inference service is rarely one model. It is a preprocessing step, a retrieval call, one or more models, and post-processing glue. Wiring these together across machines, and scaling each part to match its own load, is the hard part. Ray Serve exposes that graph as ordinary Python, so calls between models look like function calls rather than network plumbing.

Where it sits

Ray Serve is the serving layer of the Ray ecosystem. It runs on Ray Core and shares a cluster with the rest of the Ray libraries.

The stack, from the caller down:

Client: HTTP request, DeploymentHandle. Callers reach the service over HTTP or from other Python code.

Ray Serve: deployments, replicas, autoscaling, request batching. Each deployment scales its replica count on its own.

Models and logic: PyTorch, TensorFlow, scikit-learn, Python business logic. Framework agnostic by design.

Ray Core: cluster scheduler, multi-node, multi-GPU. Places replicas across machines and accelerators.

How to use it and how it fits

A deployment is the core unit. You decorate a Python class with @serve.deployment, bind its constructor arguments with .bind(), and run the result with serve.run(). Ray Serve then hosts one or more replicas of that class and routes requests to them.

Composition is where Ray Serve earns its place. One deployment holds a DeploymentHandle to another and calls its methods with .remote(), which runs the call asynchronously somewhere in the cluster. The example below chains a preprocessing step into a model, each as its own deployment that scales independently.

python

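from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request

# Cheap, stateless step: a fixed pool of two replicas.
@serve.deployment(num_replicas=2)
class Preprocessor:
    def clean(self, text: str) -> str:
        return text.strip().lower()

# Heavier step: the replica count follows load, between 1 and 8.
@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 8}
)
class Classifier:
    def __init__(self, prep: DeploymentHandle):
        self.prep = prep
        # load your PyTorch or scikit-learn model here

    async def __call__(self, request: Request) -> dict:
        raw = (await request.json())["text"]
        # .remote() returns immediately; awaiting it yields the result.
        cleaned = await self.prep.clean.remote(raw)
        return {"label": self.predict(cleaned)}

    def predict(self, text: str) -> str:
        # Placeholder rule standing in for a real model.
        return "positive" if "good" in text else "negative"

# Passing a bound deployment as an argument gives Classifier a handle to it.
app = Classifier.bind(Preprocessor.bind())
serve.run(app)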
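Once the application is running, a client can reach the Classifier over HTTP. The sketch below uses only the standard library and assumes Serve is listening on its default local address, http://127.0.0.1:8000/, with the default route prefix; adjust the URL if your cluster is configured differently. The helper names build_request and classify are hypothetical.

```python
import json
import urllib.request

# Assumption: Serve is reachable on its default local HTTP address and route.
SERVE_URL = "http://127.0.0.1:8000/"


def build_request(text: str, url: str = SERVE_URL) -> urllib.request.Request:
    """Build a POST request whose JSON body has the "text" key the Classifier reads."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text: str, url: str = SERVE_URL) -> dict:
    """Send the request and decode the JSON reply returned by the deployment."""
    with urllib.request.urlopen(build_request(text, url), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling classify("A good film") only works while the Serve application is up; build_request can be inspected without a running cluster.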

Two details matter for scale. First, autoscaling_config adjusts the replica count up and down with load, so a heavy model and a cheap preprocessor size themselves separately. Second, Ray Serve supports dynamic request batching, which groups concurrent requests into a single call so the model can use vectorised operations more efficiently. For large language model workloads, Ray Serve adds response streaming and multi-node, multi-GPU serving. This same composition model suits AI agents, where a request may route through several models and tools before returning.
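To make the batching idea concrete, here is a minimal standard-library sketch of the mechanism: requests arrive one at a time, and the handler receives them as a list once the batch is full or a short wait window expires. This is not Ray Serve's implementation. In Ray Serve the feature is exposed through the serve.batch decorator; MicroBatcher, wait_s, and demo are hypothetical names chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Set, Tuple


class MicroBatcher:
    """Groups individually submitted items into lists for a batch handler."""

    def __init__(
        self,
        handler: Callable[[List[str]], Awaitable[List[str]]],
        max_batch_size: int = 4,
        wait_s: float = 0.01,
    ) -> None:
        self.handler = handler
        self.max_batch_size = max_batch_size
        self.wait_s = wait_s
        self.pending: List[Tuple[str, asyncio.Future]] = []
        self.timer: Optional[asyncio.TimerHandle] = None
        self.tasks: Set[asyncio.Task] = set()  # keep running batches referenced

    async def submit(self, item: str) -> str:
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self.pending.append((item, future))
        if len(self.pending) >= self.max_batch_size:
            self._flush()  # batch is full: run it now
        elif self.timer is None:
            # First item of a new batch: open the wait window.
            self.timer = loop.call_later(self.wait_s, self._flush)
        return await future

    def _flush(self) -> None:
        if self.timer is not None:
            self.timer.cancel()
            self.timer = None
        batch, self.pending = self.pending, []
        if batch:
            task = asyncio.get_running_loop().create_task(self._run(batch))
            self.tasks.add(task)
            task.add_done_callback(self.tasks.discard)

    async def _run(self, batch: List[Tuple[str, asyncio.Future]]) -> None:
        items = [item for item, _ in batch]
        try:
            results = await self.handler(items)  # one call for the whole batch
        except Exception as exc:
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)
            return
        for (_, future), result in zip(batch, results):
            if not future.done():
                future.set_result(result)


async def demo() -> Tuple[List[str], List[int]]:
    batch_sizes: List[int] = []

    async def upper_batch(texts: List[str]) -> List[str]:
        batch_sizes.append(len(texts))  # record how many items arrived together
        return [t.upper() for t in texts]

    batcher = MicroBatcher(upper_batch, max_batch_size=4, wait_s=0.01)
    results = await asyncio.gather(*(batcher.submit(f"req{i}") for i in range(6)))
    return list(results), batch_sizes
```

Running asyncio.run(demo()) submits six requests concurrently. With max_batch_size=4 the handler is called twice, once with four items and once with the remaining two after the wait window closes, while each caller still receives only its own result.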

Step 1: Define deployments. Wrap each model and logic step in a class with @serve.deployment.

Step 2: Bind the graph. Pass one deployment into another with .bind() to build the pipeline.

Step 3: Run on the cluster. serve.run() places replicas across nodes and GPUs.

Step 4: Serve and scale. Autoscaling and batching adjust each deployment to its own load.
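As a rough mental model of what per-deployment autoscaling decides, the sketch below picks a replica count from the number of in-flight requests and a target load per replica, clamped to the min_replicas and max_replicas bounds. It is a simplification under stated assumptions, not Ray Serve's actual policy: a real autoscaler also smooths and delays these decisions. The function name desired_replicas and the parameter target_per_replica are hypothetical.

```python
import math


def desired_replicas(
    ongoing_requests: int,
    target_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Return a replica count that keeps per-replica load near the target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # Enough replicas so that each handles at most the target load.
    wanted = math.ceil(ongoing_requests / target_per_replica)
    # Never go below the floor or above the ceiling.
    return max(min_replicas, min(max_replicas, wanted))
```

With the bounds from the Classifier example (1 and 8) and a target of 2 requests per replica, 10 in-flight requests call for 5 replicas, an idle deployment stays at 1, and a burst of 100 is capped at 8.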

How it compares

Ray Serve occupies a different niche from dedicated LLM inference servers. Those tools optimise one model on one deployment. Ray Serve orchestrates many pieces across a cluster and can call those servers as parts of a larger graph.

Aspect	Ray Serve	TGI	vLLM	Managed endpoint
Primary focus	Multi-model composition	Single LLM serving	Single LLM serving	Hosted single model
Framework support	Any Python framework	Transformer LLMs	Transformer LLMs	Provider models
Runs where	Your Ray cluster	Your server	Your server	Provider infrastructure
Scaling unit	Per-deployment replicas	Model replicas	Model replicas	Provider-managed
You operate it	Yes	Yes	Yes	No
Best for	Pipelines, mixed models	One high-throughput LLM	One high-throughput LLM	Fastest time to live

If you need raw throughput for a single model, TGI or vLLM is the more direct choice. If you need to stitch several models and steps into one autoscaling service, Ray Serve is the composition layer, and it can host TGI or vLLM inside individual deployments.

When not to use it

Ray Serve adds a cluster and a programming model. That overhead is not always worth it.

You serve one model with one endpoint. A single-model server such as TGI or vLLM is simpler to run and tune, and a managed endpoint removes operations entirely.
You want a fully managed service. Ray Serve is a library you deploy and operate yourself. If you prefer not to run infrastructure, a hosted inference endpoint fits better.
Your team has no Ray experience. The distributed model, replicas, and handles carry a learning curve. For a first production service, weigh that against a simpler path in the zero to production guide.
Latency budgets are extremely tight and the graph is trivial. Extra hops between deployments add coordination cost that a single process avoids.

Sources

Ray Serve documentation, Ray project, fetched 2026-06-29.
Ray project home, Ray project, fetched 2026-06-29.
