Sending every query to one frontier model wastes money. Most prompts are easy, and a cheaper model answers them at a fraction of the price. Multi-model routing picks the right model per query, so you pay frontier prices only when a request needs frontier reasoning. Routing also removes a single point of failure: when one provider is down, traffic shifts to another.

[Figure: a central mechanical hub with copper arms radiating to several connected points.]
A router behaves like this hub: each incoming query travels down the one arm that reaches the model best suited to answer it.

The routing decision

Routing is a classification problem. You inspect a query, judge how hard it is, and send it to a model that fits. Easy queries go to a small, cheap model. Hard queries go to a strong, expensive model. A fallback path catches failures.

Step 1: Classify query. Read the prompt and estimate its difficulty, topic, and required capability.
Step 2: Route by complexity. Decide whether a weak model can answer or a strong model is needed.
Step 3: Cheap or strong model. Call the small model for easy work, the frontier model for hard work.
Step 4: Fallback. On error or low confidence, retry on a second model or escalate.
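
The four steps can be sketched in a few lines. This is a minimal illustration, not a production router: `small_model` and `frontier_model` are hypothetical stand-ins for real API clients, and the keyword and length heuristic in `classify` is an assumption chosen only to make the flow visible.

```python
from typing import Callable

# Hypothetical model callables: in practice these would wrap real API clients.
def small_model(prompt: str) -> str:
    return f"[small] {prompt}"

def frontier_model(prompt: str) -> str:
    return f"[frontier] {prompt}"

HARD_HINTS = ("prove", "debug", "refactor", "step by step")  # illustrative only

def classify(prompt: str) -> str:
    """Step 1: estimate difficulty with a crude keyword and length heuristic."""
    text = prompt.lower()
    if len(text.split()) > 200 or any(hint in text for hint in HARD_HINTS):
        return "hard"
    return "easy"

def route(prompt: str,
          cheap: Callable[[str], str] = small_model,
          strong: Callable[[str], str] = frontier_model) -> str:
    """Steps 2 to 4: pick a model by difficulty, fall back to the other on error."""
    if classify(prompt) == "hard":
        primary, backup = strong, cheap
    else:
        primary, backup = cheap, strong
    try:
        return primary(prompt)
    except Exception:
        # Step 4: the primary failed, so retry once on the other model.
        return backup(prompt)

print(route("Convert 3 km to miles."))
print(route("Debug this deadlock step by step."))
```

Passing the models in as callables keeps the routing logic testable without network access; a real classifier would replace the heuristic.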

The problem: one model for everything

One frontier model for every request fails on two fronts.

First, cost. Most production traffic is repetitive and easy: short answers, simple classification, format conversions. A frontier model can answer those, but you pay a premium for capability you do not use. Per-token API prices can differ by two orders of magnitude across models and providers, so the gap between the cheapest and most expensive option is large. [4]

Second, reliability. One model means one provider, one API key, one rate limit, and one outage that stops your product. A single dependency is a single point of failure. Spreading traffic across models and providers removes that risk.

Routing answers both problems at once. You match each query to the cheapest model that can handle it, and you keep a backup ready when the primary model fails.

Routing strategies

Four strategies dominate. They differ in how they make the decision and in how much the decision itself costs to run.

Predictive and preference routing

A predictive router learns to choose between a stronger and a weaker model before either runs. RouteLLM trains router models on human preference data, then picks the right model per query. The paper reports that routing can reduce costs by over 2x in some cases without compromising response quality. [1]

The numbers are concrete. Evaluated on MT Bench, MMLU, and GSM8K, RouteLLM achieved cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K while reaching 95% of GPT-4 performance. [2][3] On MT Bench, it reached 95% of GPT-4 performance while calling GPT-4 only 26% of the time. [2] The other 74% of queries went to a cheaper model.
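
To see how the share of strong-model calls drives the bill, here is a small arithmetic sketch. The per-query prices are hypothetical and chosen only for illustration; the 26% share is the MT Bench figure quoted above.

```python
def blended_cost(strong_share: float, strong_price: float, weak_price: float) -> float:
    """Average cost per query when a router sends strong_share of traffic to the strong model."""
    if not 0.0 <= strong_share <= 1.0:
        raise ValueError("strong_share must be between 0 and 1")
    return strong_share * strong_price + (1.0 - strong_share) * weak_price

# Hypothetical per-query prices in dollars, not taken from any provider's price list.
STRONG_PRICE = 0.030
WEAK_PRICE = 0.001

routed = blended_cost(0.26, STRONG_PRICE, WEAK_PRICE)
saving = 1.0 - routed / STRONG_PRICE
print(f"routed cost per query: ${routed:.5f}")
print(f"saving vs always-strong: {saving:.1%}")
```

With these made-up prices the saving comes out near 71%. It does not reproduce the reported figures, which depend on the actual models, prices, and router used.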

Use predictive routing when you have two tiers (one strong, one weak) and want to maximise how much traffic the weak model absorbs.

Cascades

A cascade tries the cheap model first, then escalates only if the answer looks weak. FrugalGPT names three cost-reduction strategies: prompt adaptation, LLM approximation, and LLM cascade. [4] The cascade calls models in sequence and stops at the first acceptable answer.

FrugalGPT reports it can match the best individual LLM (for example GPT-4) with up to 98% cost reduction, or improve accuracy by 4% at the same cost. [4] The trade-off is latency: an escalated query waits for two model calls in sequence instead of one. Use a cascade when you can score answer quality cheaply and most queries pass on the first try.
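
The "most queries pass on the first try" condition can be made precise. In a two-stage cascade every query pays for the cheap model, and only the queries that fail the quality check also pay for the strong one. The sketch below assumes the scorer itself is free and uses hypothetical prices.

```python
def cascade_expected_cost(pass_rate: float, cheap_price: float, strong_price: float) -> float:
    """Expected cost per query for a two-stage cascade.

    Every query pays for the cheap model; the (1 - pass_rate) share that
    fails the quality check also pays for the strong model.
    """
    return cheap_price + (1.0 - pass_rate) * strong_price

def break_even_pass_rate(cheap_price: float, strong_price: float) -> float:
    """Pass rate above which the cascade beats always calling the strong model."""
    return cheap_price / strong_price

# Hypothetical per-query prices in dollars.
CHEAP, STRONG = 0.001, 0.030
print(cascade_expected_cost(0.8, CHEAP, STRONG))
print(break_even_pass_rate(CHEAP, STRONG))
```

The break-even follows from requiring cheap + (1 - p) * strong < strong, which simplifies to p > cheap / strong. A non-trivial scorer cost would raise that threshold.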

Semantic and embedding routing

Semantic routing makes the decision in embedding space, without calling an LLM to classify. The semantic-router library from Aurelio Labs maps a query to vectors and matches it against predefined routes. This cuts routing latency from about 5000ms to about 100ms compared to using an LLM as the classifier. [11]

Embedding routing scales to large workloads. The vLLM semantic router paper reports a 10.2 percentage-point accuracy gain on MMLU-Pro while cutting latency 47.1% and token use 48.5% versus direct inference. [5] The router decides which queries need step-by-step reasoning and which do not, so you spend reasoning tokens only where they help.

Use semantic routing when routing latency matters and your routes map to clear topics or intents.
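
The idea can be shown without any library. The sketch below is not the semantic-router API: it uses word counts as a toy stand-in for a real embedding model, and the routes, utterances, and threshold are all hypothetical. What it does share with real embedding routers is the mechanism: embed the query, compare it to example utterances per route, and pick the closest route if it clears a threshold.

```python
import math
from collections import Counter
from typing import Optional

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real router would use a sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical routes: a name plus a few example utterances each.
ROUTES = {
    "billing": ["how do i update my card", "why was i charged twice"],
    "coding": ["fix this python error", "write a function to sort a list"],
}
# Route vectors are computed once, so each query costs one embedding plus a few dot products.
ROUTE_VECTORS = {name: [embed(u) for u in utterances] for name, utterances in ROUTES.items()}

def pick_route(query: str, threshold: float = 0.3) -> Optional[str]:
    """Return the best-matching route, or None when nothing is similar enough."""
    query_vec = embed(query)
    best_name, best_score = None, 0.0
    for name, vectors in ROUTE_VECTORS.items():
        score = max(cosine(query_vec, vec) for vec in vectors)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

print(pick_route("why was i charged twice this month"))
print(pick_route("what is the weather today"))
```

Returning `None` for low-similarity queries gives the caller a natural place to fall back to a default model or a slower classifier.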

Task-complexity routing

Task-complexity routing groups queries by the capability they demand. Code generation, long-context analysis, and multi-step math go to strong models. Summaries, extraction, and classification go to small models. You can implement this with simple rules, with a classifier, or with the embedding approach above. It pairs well with the comparison in small vs large language models.
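
The simple-rules variant might look like the sketch below. The tier names, regular expressions, and length cutoff are assumptions for illustration; real rules would be tuned against your own traffic.

```python
import re

# Hypothetical tier names; map them to real model identifiers in your own config.
STRONG, SMALL = "strong-model", "small-model"

# Illustrative patterns for work the text assigns to strong models.
STRONG_PATTERNS = [
    # Code generation: a coding verb followed by a code noun.
    re.compile(r"\b(write|implement|refactor|debug)\b.*\b(code|function|class|script)\b", re.I),
    # Multi-step math.
    re.compile(r"\b(prove|derive|solve)\b", re.I),
]
LONG_CONTEXT_CHARS = 20_000  # assumed cutoff for "long-context" input

def pick_tier(prompt: str) -> str:
    """Send long or capability-heavy prompts to the strong tier, everything else to the small one."""
    if len(prompt) > LONG_CONTEXT_CHARS:
        return STRONG
    if any(pattern.search(prompt) for pattern in STRONG_PATTERNS):
        return STRONG
    return SMALL

print(pick_tier("Summarise this email in one line."))
print(pick_tier("Implement a function that merges two sorted lists."))
```

Rules like these are cheap and transparent, but brittle. They are a reasonable starting point before investing in a trained classifier.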

LLM gateways and what they add

A gateway sits between your code and the model providers. It gives you one API for many models, plus the operational features you would otherwise build yourself.

LiteLLM gives one OpenAI-format interface to 100+ LLM APIs with cost tracking, load balancing, and logging. [6] You write OpenAI-style calls and swap the model name to switch providers. Portkey’s AI Gateway routes to 1,600+ LLMs with fallback on failed requests, weighted load balancing, and conditional routing. [7] Cloudflare AI Gateway offers caching, rate limiting, request retry and fallback, and dynamic routing. [8] OpenRouter exposes one OpenAI-compatible endpoint to 500+ models from 60+ providers, and a models array enables fallback on context-length errors, moderation, rate limits, and downtime. [9]

These gateways share the same core value: one API, automatic fallback, load balancing, and caching. You get reliability features without writing retry loops and provider adapters by hand. For a deeper architectural view, see the LLM gateway architecture guide.

Your app: chat endpoint, agent loop. Speaks one API format.
Gateway: routing, fallback, caching, load balancing. One place for cost and reliability.
Providers: small model, frontier model, local model. Swapped by name, not by rewrite.

Recommenders versus proxies

Two designs route traffic, and they sit in different places.

A proxy stands in the request path. Your call goes to the proxy, the proxy calls the model, and the response comes back through the proxy. LiteLLM, Portkey, Cloudflare AI Gateway, and OpenRouter work this way. The proxy sees your prompt, so it can cache, log, and fail over. The cost is that your traffic flows through another service.

A recommender stays out of the path. It predicts the best model for an input, you call that model yourself, and your prompts never touch the recommender. Not Diamond is a recommender, not a proxy. It predicts the best model per input and supports cost or latency trade-offs, and prompts and keys do not pass through its servers. [10] Choose a recommender when you want routing intelligence but must keep prompts and keys on your own infrastructure.

Reliability and failover

Routing is also a reliability tool. The same gateways that cut cost give you automatic failover.

Cloudflare AI Gateway offers caching, rate limiting, request retry and fallback, and dynamic routing, so a failed call retries or shifts to another model without app changes. [8] OpenRouter’s models array enables fallback on context-length errors, moderation, rate limits, and downtime. [9] You list models in priority order, and the gateway walks the list until one succeeds.
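
As a rough sketch of the priority-list idea, the snippet below builds an OpenRouter chat request with a `models` array. The model identifiers are illustrative, and the field names should be checked against OpenRouter's current API reference before use. The `send` helper is defined but not called, since it needs a real API key.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(prompt: str, models: list) -> dict:
    """Build a chat request whose `models` array lists fallbacks in priority order."""
    return {
        "models": models,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload: dict, api_key: str) -> dict:
    """POST the payload to OpenRouter. Requires a valid API key and network access."""
    request = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)

# Illustrative identifiers: primary first, backups after.
payload = build_payload(
    "Summarise this support ticket in one line.",
    ["openai/gpt-4o-mini", "anthropic/claude-3.5-haiku"],
)
print(json.dumps(payload, indent=2))
```

The failover logic lives entirely in the gateway here: the application sends one request and never sees the retries.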

Build failover into the default path, not as an afterthought. List a primary model, a same-capability backup from a second provider, and a small model as a last resort. When the primary fails, traffic moves on its own.

Code: a gateway call with fallbacks

This example uses LiteLLM. It tries a cheap model first and falls back through a list when a call fails. LiteLLM speaks the OpenAI format, so the same code targets any supported provider by name. [6]

```python
from litellm import completion

# Models in priority order: cheap first, strong backups after.
model_list = [
    "openai/gpt-4o-mini",          # cheap default
    "anthropic/claude-haiku-4-5",  # backup, different provider
    "openai/gpt-4o",               # strong last resort
]

def route_with_fallback(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    last_error = None
    for model in model_list:
        try:
            response = completion(model=model, messages=messages)
            return response["choices"][0]["message"]["content"]
        except Exception as error:  # rate limit, outage, context length
            last_error = error
            continue
    raise RuntimeError(f"All models failed. Last error: {last_error}")

print(route_with_fallback("Summarise this support ticket in one line."))
```

For a cascade, you escalate by quality instead of by error. Call the cheap model, score its answer, and call the strong model only when the score is low.

```python
from litellm import completion

def confidence(answer: str) -> float:
    # Replace with a real scorer: a judge model, a heuristic, or logprobs.
    return 0.9 if len(answer) > 20 else 0.3

def cascade(prompt: str, threshold: float = 0.5) -> str:
    messages = [{"role": "user", "content": prompt}]

    cheap = completion(model="openai/gpt-4o-mini", messages=messages)
    cheap_answer = cheap["choices"][0]["message"]["content"]
    if confidence(cheap_answer) >= threshold:
        return cheap_answer  # most queries stop here

    strong = completion(model="openai/gpt-4o", messages=messages)
    return strong["choices"][0]["message"]["content"]

print(cascade("Explain the trade-offs of optimistic locking."))
```

Comparison: gateways and routers

| Tool | Type | Key feature | Prompts pass through |
| --- | --- | --- | --- |
| LiteLLM | Proxy / library | OpenAI format for 100+ APIs, cost tracking | Yes |
| OpenRouter | Proxy | One endpoint to 500+ models, models fallback array | Yes |
| Portkey | Proxy | 1,600+ LLMs, conditional routing, weighted balancing | Yes |
| Cloudflare AI Gateway | Proxy | Caching, rate limiting, retry, dynamic routing | Yes |
| Not Diamond | Recommender | Predicts best model per input, cost or latency tuning | No |
| semantic-router | Library | Embedding-space routing, about 100ms decisions | No |

Further reading