Ray Serve treats each model and each piece of business logic as an independently scaling node in a connected graph.

Ray Serve is a scalable model-serving library built on Ray, the distributed computing framework maintained as open source and commercialised by Anyscale. It lets you deploy machine learning models and plain Python logic as online inference APIs, then scale each piece independently across a cluster. Its focus sets it apart from single-model servers: Ray Serve is built for composing several models and steps into one service, not for squeezing maximum throughput out of one large language model on one node.

The problem it solves is orchestration. A real inference service is rarely one model. It is a preprocessing step, a retrieval call, one or more models, and post-processing glue. Wiring these together across machines, and scaling each part to match its own load, is the hard part. Ray Serve exposes that graph as ordinary Python, so calls between models look like function calls rather than network plumbing.

Where it sits

Ray Serve is the serving layer of the Ray ecosystem. It runs on Ray Core and shares a cluster with the rest of the Ray libraries.

  • Client: HTTP request, DeploymentHandle. Callers reach the service over HTTP or from other Python code.
  • Ray Serve: deployments, replicas, autoscaling, request batching. Each deployment scales its replica count on its own.
  • Models and logic: PyTorch, TensorFlow, scikit-learn, Python business logic. Framework agnostic by design.
  • Ray Core: cluster scheduler, multi-node, multi-GPU. Places replicas across machines and accelerators.

How to use it and how it fits

A deployment is the core unit. You decorate a Python class with @serve.deployment, bind its constructor arguments with .bind(), and run the result with serve.run(). Ray Serve then hosts one or more replicas of that class and routes requests to them.

Composition is where Ray Serve earns its place. One deployment holds a DeploymentHandle to another and calls its methods with .remote(), which runs the call asynchronously somewhere in the cluster. The example below chains a preprocessing step into a model, each as its own deployment that scales independently.

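```python
from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request


# Cheap, stateless step: a fixed pool of two replicas.
@serve.deployment(num_replicas=2)
class Preprocessor:
    def clean(self, text: str) -> str:
        return text.strip().lower()


# Heavier step: the replica count follows load, between 1 and 8.
@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 8}
)
class Classifier:
    def __init__(self, prep: DeploymentHandle):
        self.prep = prep
        # load your PyTorch or scikit-learn model here

    async def __call__(self, request: Request) -> dict:
        raw = (await request.json())["text"]
        # Runs on a Preprocessor replica; awaiting yields the result.
        cleaned = await self.prep.clean.remote(raw)
        return {"label": self.predict(cleaned)}

    def predict(self, text: str) -> str:
        # Placeholder rule standing in for a real model.
        return "positive" if "good" in text else "negative"


# Bind the Preprocessor into the Classifier's constructor, then deploy.
app = Classifier.bind(Preprocessor.bind())
serve.run(app)
```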
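
Once the application is running, callers reach it over HTTP. A minimal client sketch using only the standard library might look like the following. It assumes Serve is listening on its default local address, `http://127.0.0.1:8000/`, with the application mounted at the root route, and it sends the `{"text": ...}` body that `Classifier.__call__` reads. The helper names `build_request` and `classify` are hypothetical, chosen for this illustration.

```python
import json
import urllib.request

# Assumption: Serve is on its default local address and the application
# is mounted at the root route. Adjust for your cluster.
SERVE_URL = "http://127.0.0.1:8000/"


def build_request(text: str, url: str = SERVE_URL) -> urllib.request.Request:
    """Build a POST request carrying the JSON body that Classifier expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text: str, url: str = SERVE_URL) -> dict:
    """Send the request and decode the JSON reply from the service."""
    with urllib.request.urlopen(build_request(text, url), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the example application running, `classify("  This is GOOD  ")` sends the raw text. The service lower-cases and strips it in `Preprocessor` before `Classifier.predict` sees it.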

Two details matter for scale. First, autoscaling_config adjusts the replica count up and down with load, so a heavy model and a cheap preprocessor size themselves separately. Second, Ray Serve supports dynamic request batching, which groups incoming requests to use vectorised operations more efficiently. For large language model workloads, Ray Serve adds response streaming and multi-node, multi-GPU serving. This same composition model suits AI agents, where a request may route through several models and tools before returning.
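
To make the idea of dynamic batching concrete, here is a small standard-library sketch of the mechanism: requests arrive one at a time, wait briefly, and are handed to the model as a single list. This is not Ray Serve's implementation. In Ray Serve you would use its `serve.batch` decorator instead. The `MicroBatcher` class, its parameters, and the `demo` helper are hypothetical names for illustration only.

```python
import asyncio
from typing import Callable, List, Optional, Tuple


class MicroBatcher:
    """Collects single requests and passes them to batch_fn as one list."""

    def __init__(
        self,
        batch_fn: Callable[[List[str]], List[str]],
        max_batch_size: int = 4,
        wait_s: float = 0.01,
    ) -> None:
        self._batch_fn = batch_fn
        self._max = max_batch_size
        self._wait_s = wait_s
        self._pending: List[Tuple[str, asyncio.Future]] = []
        self._timer: Optional[asyncio.Task] = None

    async def submit(self, item: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((item, fut))
        if len(self._pending) >= self._max:
            self._flush_now()  # batch is full: run it immediately
        elif self._timer is None:
            # first item of a new batch: start the wait timer
            self._timer = asyncio.create_task(self._flush_later())
        return await fut

    async def _flush_later(self) -> None:
        await asyncio.sleep(self._wait_s)
        self._timer = None
        self._flush_now()

    def _flush_now(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._pending = self._pending, []
        if not batch:
            return
        try:
            results = self._batch_fn([item for item, _ in batch])
        except Exception as exc:  # propagate the failure to every caller
            for _, fut in batch:
                fut.set_exception(exc)
            return
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)


async def demo() -> Tuple[List[str], List[int]]:
    batch_sizes: List[int] = []

    def fake_model(texts: List[str]) -> List[str]:
        # Stand-in for a vectorised model call over the whole batch.
        batch_sizes.append(len(texts))
        return [t.upper() for t in texts]

    batcher = MicroBatcher(fake_model, max_batch_size=4, wait_s=0.05)
    results = await asyncio.gather(
        *(batcher.submit(f"req{i}") for i in range(6))
    )
    return results, batch_sizes
```

Running `asyncio.run(demo())` submits six concurrent requests. With `max_batch_size=4`, the first four are flushed together as soon as the limit is reached, and the remaining two are flushed when the wait timer fires. The trade-off is the usual one: a longer wait yields fuller batches at the cost of added latency for the earliest request in each batch.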

Step 1: Define deployments. Wrap each model and logic step in a class with @serve.deployment.
Step 2: Bind the graph. Pass one deployment into another with .bind() to build the pipeline.
Step 3: Run on the cluster. serve.run() places replicas across nodes and GPUs.
Step 4: Serve and scale. Autoscaling and batching adjust each deployment to its own load.
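
Step 4 can be pictured with a deliberately simplified sizing rule: divide the number of in-flight requests by a per-replica target, round up, and clamp to the configured bounds. The sketch below is an assumption-laden illustration, not Ray Serve's autoscaling policy. `AutoscalingPolicy` and `target_requests_per_replica` are hypothetical names. Only `min_replicas` and `max_replicas` come from the example above. Production autoscalers typically add delays and smoothing to avoid flapping, which this sketch omits.

```python
import math
from dataclasses import dataclass


@dataclass
class AutoscalingPolicy:
    min_replicas: int = 1
    max_replicas: int = 8
    # Hypothetical knob: in-flight requests one replica should carry.
    target_requests_per_replica: float = 2.0

    def desired_replicas(self, ongoing_requests: int) -> int:
        """Replicas needed for the current load, clamped to the bounds."""
        raw = math.ceil(ongoing_requests / self.target_requests_per_replica)
        return max(self.min_replicas, min(self.max_replicas, raw))


# Each deployment gets its own policy, so a heavy model and a cheap
# preprocessor are sized separately under the same traffic.
model_policy = AutoscalingPolicy(min_replicas=1, max_replicas=8,
                                 target_requests_per_replica=2.0)
prep_policy = AutoscalingPolicy(min_replicas=1, max_replicas=2,
                                target_requests_per_replica=50.0)
```

Under these assumed targets, the same burst of in-flight requests leads `model_policy` to ask for far more replicas than `prep_policy`, which is the per-deployment behaviour the steps describe.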

How it compares

Ray Serve occupies a different niche from dedicated LLM inference servers. Those tools optimise one model on one deployment. Ray Serve orchestrates many pieces across a cluster and can call those servers as parts of a larger graph.

| | Ray Serve | TGI | vLLM | Managed endpoint |
|---|---|---|---|---|
| Primary focus | Multi-model composition | Single LLM serving | Single LLM serving | Hosted single model |
| Framework support | Any Python framework | Transformer LLMs | Transformer LLMs | Provider models |
| Runs where | Your Ray cluster | Your server | Your server | Provider infrastructure |
| Scaling unit | Per-deployment replicas | Model replicas | Model replicas | Provider-managed |
| You operate it | Yes | Yes | Yes | No |
| Best for | Pipelines, mixed models | One high-throughput LLM | One high-throughput LLM | Fastest time to live |

If you need raw throughput for a single model, TGI or vLLM is the more direct choice. If you need to stitch several models and steps into one autoscaling service, Ray Serve is the composition layer, and it can host TGI or vLLM inside individual deployments.

When not to use it

Ray Serve adds a cluster and a programming model. That overhead is not always worth it.

  • You serve one model with one endpoint. A single-model server such as TGI or vLLM is simpler to run and tune, and a managed endpoint removes operations entirely.
  • You want a fully managed service. Ray Serve is a library you deploy and operate yourself. If you prefer not to run infrastructure, a hosted inference endpoint fits better.
  • Your team has no Ray experience. The distributed model, replicas, and handles carry a learning curve. For a first production service, weigh that against a simpler path in the zero to production guide.
  • Latency budgets are extremely tight and the graph is trivial. Extra hops between deployments add coordination cost that a single process avoids.
