The dirty secret of most AI apps in 2026 is that they're paying GPT-5 prices to do GPT-5-nano work. Summarisation, classification, entity extraction, intent detection, profanity checks - these are not frontier-reasoning problems. They are pattern-matching problems with a flat ceiling, and the cheapest model on the market hits that ceiling for a fraction of the price.
Model routing is the engineering discipline of sending each request to the smallest model that can answer it correctly. Done well, it can cut inference bills by 60-80% without users noticing. Done badly, it produces quietly wrong answers and a long tail of weird bugs.
Start by classifying your traffic
Before you route anything, log a representative week of prompts and tag each one with the smallest tier that could plausibly answer it. We use four buckets:
- Nano - classification, formatting, simple extraction, tone detection.
- Mini - short answers, summaries under 500 tokens, lightweight rewriting, intent routing.
- Standard - multi-turn chat, long-form generation, tool use without deep reasoning.
- Frontier - code generation, multi-step reasoning, agentic workflows, anything you'd be embarrassed to get wrong.
If you discover that 80% of your traffic is nano or mini work routed to a frontier model, congratulations - you've found the cheque.
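To make the tagging exercise concrete, here is a minimal sketch of the tally. It assumes each logged request has been hand-tagged as a pair of (tier it was served by, smallest tier that could have handled it); the function names and the sample data are hypothetical, not taken from any real log.

```python
from collections import Counter

# The four buckets, ordered from smallest to largest.
TIERS = ["nano", "mini", "standard", "frontier"]


def tier_shares(tagged):
    """Fraction of requests whose smallest adequate tier is each tier."""
    if not tagged:
        raise ValueError("need at least one tagged request")
    counts = Counter(needed for _, needed in tagged)
    return {tier: counts.get(tier, 0) / len(tagged) for tier in TIERS}


def overserved_fraction(tagged):
    """Fraction of requests served by a larger tier than they needed."""
    if not tagged:
        raise ValueError("need at least one tagged request")
    rank = {tier: i for i, tier in enumerate(TIERS)}
    over = sum(1 for served, needed in tagged if rank[served] > rank[needed])
    return over / len(tagged)


# Made-up sample: (served_by, smallest_adequate_tier)
sample = [
    ("frontier", "nano"),
    ("frontier", "nano"),
    ("frontier", "mini"),
    ("frontier", "mini"),
    ("frontier", "frontier"),
]

print(tier_shares(sample))
print(overserved_fraction(sample))
```

The second number is the one to watch: it is the share of traffic that is a candidate for a cheaper tier.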
The three routing patterns that actually work
1. Static routing by endpoint
The simplest pattern. Each API endpoint in your app maps to a fixed model tier. /classify-intent is always nano. /generate-essay is always frontier. No runtime decisions, no surprises, easy to reason about. This alone gets most teams 70% of the savings with 10% of the complexity. Start here.
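A static router can be as small as two lookup tables. The sketch below is illustrative only: /classify-intent and /generate-essay come from the text above, while the other endpoints and every model identifier are placeholders you would swap for your own.

```python
# Endpoint -> tier. Only the first and last entries come from the article;
# the middle two are hypothetical examples.
ENDPOINT_TIER = {
    "/classify-intent": "nano",
    "/summarise": "mini",
    "/chat": "standard",
    "/generate-essay": "frontier",
}

# Tier -> concrete model identifier. These names are placeholders, not real
# model IDs; keeping them in one table means a model upgrade is a one-line change.
TIER_MODEL = {
    "nano": "example-nano-model",
    "mini": "example-mini-model",
    "standard": "example-standard-model",
    "frontier": "example-frontier-model",
}


def model_for(endpoint: str) -> str:
    """Return the model identifier configured for an endpoint."""
    if endpoint not in ENDPOINT_TIER:
        # Fail loudly rather than silently defaulting to an expensive tier.
        raise ValueError(f"no tier configured for {endpoint!r}")
    return TIER_MODEL[ENDPOINT_TIER[endpoint]]


print(model_for("/classify-intent"))
print(model_for("/generate-essay"))
```

Raising on an unknown endpoint is a design choice made here for illustration; it forces every new endpoint to be given a tier deliberately.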
2. Cascade routing
Send every request to the cheapest model first. If the response fails a confidence check - schema mismatch, low logprobs, low rubric score, or the model literally says "I don't know" - fall through to a stronger model. This is where you pick up the long tail of hard requests without paying for them on easy ones. Cap the cascade at two levels, a cheap first attempt and one stronger fallback; three-level cascades almost always cost more than just calling frontier directly.
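Here is a rough sketch of a two-level cascade. It assumes the models are plain callables that take a prompt and return a string, and it implements only two of the confidence checks mentioned above (schema mismatch and an explicit "I don't know"); logprob and rubric checks are left out. All names are hypothetical.

```python
import json


def passes_check(raw: str, required_keys) -> bool:
    """Cheap confidence check: a JSON object with the required keys and no refusal."""
    if "i don't know" in raw.lower():
        return False
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys).issubset(data)


def cascade(prompt, cheap, strong, required_keys):
    """Try the cheap model first; fall through to the strong model once at most."""
    first = cheap(prompt)
    if passes_check(first, required_keys):
        return first, "cheap"
    return strong(prompt), "strong"


# Stub models standing in for real API calls.
def cheap_stub(prompt):
    return "I don't know" if "refund" in prompt else '{"intent": "greeting"}'


def strong_stub(prompt):
    return '{"intent": "refund_request"}'


print(cascade("hello there", cheap_stub, strong_stub, ["intent"]))
print(cascade("where is my refund", cheap_stub, strong_stub, ["intent"]))
```

Returning the level that answered alongside the response makes it easy to log how often the fallback fires.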
3. Classifier-led routing
Use a nano-tier model as a router. It reads the user's request, decides which downstream model should handle it, and forwards. This is powerful but adds latency and a failure mode (the classifier itself can be wrong). Only worth it if you have genuinely heterogeneous traffic and clear performance data showing static routing is leaving money on the table.
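A minimal sketch of the classifier-led pattern follows. It assumes the router model returns a tier name as plain text and that each tier is a callable handler; both are stand-ins for real API calls. Falling back to a mid-strength default when the router returns something unrecognised is an assumption made here, one way to soften the "classifier is wrong" failure mode.

```python
def route(request, classify, handlers, default_tier="standard"):
    """Ask the router model for a tier, then forward the request to that tier."""
    label = classify(request).strip().lower()
    # Guard against the router returning an unknown or malformed label.
    tier = label if label in handlers else default_tier
    return tier, handlers[tier](request)


# Stub router: in practice this would be a call to a nano-tier model.
def classify_stub(request):
    if "write code" in request:
        return "Frontier\n"
    if "translate" in request:
        return "sure, happy to help!"  # malformed router output
    return "nano"


handlers = {
    "nano": lambda r: f"[nano] {r}",
    "standard": lambda r: f"[standard] {r}",
    "frontier": lambda r: f"[frontier] {r}",
}

print(route("is this spam?", classify_stub, handlers))
print(route("write code for a parser", classify_stub, handlers))
print(route("translate this", classify_stub, handlers))
```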
What you must measure
- Per-tier success rate - does the cheap model actually solve the task?
- Cascade rate - what percentage of nano calls fall through to a stronger model?
- Cost per successful response, not cost per call.
- P95 latency - cascades add round trips; users feel them.
- User-visible regressions on a held-out evaluation set you re-run on every routing change.
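Three of these numbers fall straight out of request logs. The sketch below assumes one log record per user request, with cost summed across every hop and latency measured end to end; the record shape and field names are hypothetical. The percentile uses the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestLog:
    first_tier: str    # tier tried first
    final_tier: str    # tier that produced the returned answer
    cost_usd: float    # total cost across all hops
    latency_ms: float  # end-to-end latency
    success: bool      # whatever success signal you trust


def cascade_rate(logs, first_tier="nano"):
    """Fraction of requests starting at first_tier that fell through."""
    started = [r for r in logs if r.first_tier == first_tier]
    if not started:
        return 0.0
    return sum(r.final_tier != r.first_tier for r in started) / len(started)


def cost_per_success(logs):
    """Total spend divided by successful responses (failed calls still cost money)."""
    successes = sum(r.success for r in logs)
    if successes == 0:
        return math.inf
    return sum(r.cost_usd for r in logs) / successes


def p95_latency(logs):
    """Nearest-rank 95th percentile of end-to-end latency."""
    ordered = sorted(r.latency_ms for r in logs)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


# Made-up records for illustration.
logs = [
    RequestLog("nano", "nano", 0.001, 100.0, True),
    RequestLog("nano", "nano", 0.001, 120.0, True),
    RequestLog("nano", "frontier", 0.021, 900.0, True),
    RequestLog("nano", "nano", 0.001, 110.0, False),
]

print(cascade_rate(logs), cost_per_success(logs), p95_latency(logs))
```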
Where teams blow themselves up
Three traps. First, routing on prompt length - long prompts are not the same as hard prompts, and short prompts can be devastatingly hard. Second, hard-coding model names everywhere instead of routing through a single abstraction; when the next-generation model lands you'll regret it. Third, skipping evals; the only way to know a routing change is safe is to run a held-out test set before and after and compare scores, not vibes.
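The before-and-after comparison in the third trap can be automated with very little code. This sketch assumes each eval case yields a score between 0 and 1 keyed by a case ID, and that a mean drop of more than one point is unacceptable; that threshold and all names are illustrative assumptions.

```python
def compare_evals(before, after, max_drop=0.01):
    """Compare per-case scores from the same held-out set before and after a change."""
    if before.keys() != after.keys():
        raise ValueError("both runs must cover the same eval cases")
    if not before:
        raise ValueError("eval set is empty")
    mean_before = sum(before.values()) / len(before)
    mean_after = sum(after.values()) / len(after)
    # Individual cases that got worse, even if the mean held steady.
    regressed = sorted(case for case in before if after[case] < before[case])
    return {
        "mean_before": mean_before,
        "mean_after": mean_after,
        "regressed": regressed,
        "safe": (mean_before - mean_after) <= max_drop,
    }


# Made-up scores: case "b" breaks after the routing change.
before = {"a": 1.0, "b": 1.0, "c": 0.0, "d": 1.0}
after = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 1.0}

print(compare_evals(before, after))
```

Listing the regressed cases, not just the mean, is deliberate: an average can stay flat while a specific kind of request quietly breaks.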
A reasonable starting recipe
Wrap every model call behind one function called callAi(task, input). Inside, route by task name to a fixed tier. Log the tier, latency, cost and a basic success signal for every call. Once you have a week of data, pick the single endpoint with the highest cost and the lowest difficulty and drop it one tier. Repeat. Most teams find their bill halved within a month, without their users noticing.
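As a sketch of that wrapper in Python (so callAi becomes call_ai): the task names, the backends table, and the success signal below are all hypothetical. Each backend is assumed to be a callable that returns the output text and the cost of the call; a real one would wrap your provider's client.

```python
import time

# Task name -> fixed tier. Task names are illustrative.
TASK_TIER = {
    "classify_intent": "nano",
    "generate_essay": "frontier",
}

# In-memory log for the sketch; a real system would ship this to a metrics store.
CALL_LOG = []


def call_ai(task, input_text, backends, is_success=lambda out: bool(out.strip())):
    """Route by task name to a fixed tier, call it, and log the outcome."""
    tier = TASK_TIER[task]  # raises KeyError for unconfigured tasks
    start = time.perf_counter()
    output, cost_usd = backends[tier](input_text)
    latency_ms = (time.perf_counter() - start) * 1000
    CALL_LOG.append({
        "task": task,
        "tier": tier,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "success": is_success(output),  # default signal: non-empty output
    })
    return output


# Stub backends returning (output, cost); stand-ins for real model calls.
backends = {
    "nano": lambda text: ("refund_request", 0.0001),
    "frontier": lambda text: ("A long essay...", 0.02),
}

print(call_ai("classify_intent", "I want my money back", backends))
print(CALL_LOG[-1]["tier"])
```

Dropping a task one tier is then a one-line change to TASK_TIER, and the log gives you the week of data to justify it.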
"Frontier models are amazing. They are also a tax you pay every time you use one for a task a smaller model could have done."