🟣 Technical 11 min read

LLM Routing: How to Pick the Right Model for Every Request

Not every prompt needs your biggest model. LLM routing lets you dynamically select the right model per request — balancing quality, latency, and cost. Here's how to build a routing layer.

View all llms depths →

Running every user query through your most capable (and expensive) model is like driving a semi truck to pick up groceries. It works, but you’re burning fuel you don’t need to burn.

LLM routing is the practice of analyzing incoming requests and directing them to the most appropriate model. Simple questions go to fast, cheap models. Complex reasoning tasks go to heavy hitters. The result: lower costs, lower latency, and often better user experience because responses come back faster when they can.

Why routing matters

Consider a typical AI application. Maybe 60-70% of requests are straightforward: summarize this paragraph, extract these fields, answer a factual question. These don’t need a frontier model with 400B parameters and chain-of-thought reasoning. A smaller model handles them perfectly, at 10-50x lower cost and 3-5x lower latency.

The remaining 30-40% — multi-step reasoning, nuanced writing, complex code generation — genuinely benefit from your best model. Routing lets you invest your compute budget where it actually matters.

Routing strategies

Classifier-based routing

Train a lightweight classifier on labeled examples of “easy” vs “hard” requests. This can be a simple logistic regression on embedding features, a small fine-tuned model, or even a rule-based system.

def route_request(prompt: str) -> str:
    features = embed(prompt)
    difficulty = classifier.predict(features)
    
    if difficulty == "simple":
        return "gpt-4o-mini"
    elif difficulty == "moderate":
        return "claude-3.5-sonnet"
    else:
        return "claude-opus-4"

The classifier itself is cheap — a few milliseconds of overhead. The savings compound with every request.

LLM-as-judge routing

Use a small, fast model to assess the complexity of the incoming request, then route accordingly. This is more flexible than a classifier because it can reason about novel request types.

routing_prompt = """Classify this user request as SIMPLE, MODERATE, or COMPLEX.
SIMPLE: factual questions, basic formatting, simple extraction
MODERATE: multi-step tasks, moderate reasoning, standard writing
COMPLEX: nuanced analysis, creative work, complex code, multi-domain reasoning

Request: {user_prompt}
Classification:"""

Cascading (try cheap first)

Start with the cheapest model. If the response quality is below a threshold (measured by confidence scores, self-consistency checks, or a validator), escalate to a more capable model.

This works well when you can quickly assess output quality. For structured outputs (JSON, code that compiles, math with verifiable answers), cascading is particularly effective.

Semantic routing

Embed the request and compare it to cluster centroids that represent different task types. Each cluster maps to a preferred model. This is fast, doesn’t require labeled data (just representative examples per category), and adapts as you add new models.

Building a practical routing layer

A production routing layer needs a few components:

1. Request analysis — Assess the incoming request. Consider: token count, presence of code/math, question complexity signals, required output format.

2. Model registry — Maintain a catalog of available models with their capabilities, costs, latency profiles, and rate limits.

3. Selection logic — Apply your routing strategy. Start simple (rules-based), graduate to ML when you have enough data.

4. Fallback handling — If the selected model is down or rate-limited, have a fallback chain. Don’t let routing failures become user-facing errors.

5. Feedback loop — Log routing decisions alongside quality metrics. Use this data to improve your router over time.

What to measure

  • Cost per request — broken down by route taken
  • Latency percentiles — p50 and p95 by route
  • Quality scores — human ratings or automated evals, by route
  • Routing accuracy — how often did the cheap model produce good-enough results?
  • Escalation rate — for cascading systems, how often do you need the expensive model?

Common pitfalls

Over-routing to cheap models. Saving money is great until quality drops and users leave. Set quality thresholds conservatively at first, then relax as you gather data.

Ignoring latency in routing decisions. A model that’s technically capable but slow may be wrong for interactive use cases. Factor in the user’s latency tolerance.

Static routing rules. Models change. New ones appear. Your traffic patterns shift. Review and update routing logic regularly.

Not accounting for rate limits. If your cheap model has aggressive rate limits, routing too much traffic there causes failures. Include capacity awareness in your router.

When not to route

Sometimes routing adds complexity without enough benefit:

  • Low traffic applications where cost isn’t a concern
  • Safety-critical tasks where you always want your best model
  • Latency-insensitive batch processing where you can afford the best model for everything
  • Early-stage products where you’re still figuring out what quality means

The future of routing

Routing is becoming a first-class concern in LLM infrastructure. Cloud providers are starting to offer built-in routing capabilities. Open-source routing frameworks are maturing. And as the model ecosystem fragments further — with specialized models for code, math, creative writing, and more — intelligent routing becomes not just a cost optimization, but a quality optimization too.

The best routing system is one you barely think about. It runs in the background, saving you money on easy requests and ensuring quality on hard ones. Start with simple rules, measure everything, and iterate.

Simplify

← Reasoning Models: How LLMs Learned to Think Before They Answer

Go deeper

LLM Scaling Laws Explained: Why Bigger Models Aren't Always Better →

Related reads

llmsroutingmodel-selectionoptimizationcost

Stay ahead of the AI curve

Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.

No spam. Unsubscribe anytime. We respect your inbox.