How to choose the right model for the right task

Most model-selection decisions are made the same way: someone reads a launch post, someone else sees a benchmark on Twitter, a third person says "GPT just works," and a model is picked. Six months later the team discovers the model is wrong for the job - too slow, too expensive, too cautious, too chatty, or quietly bad at the one thing the product actually needs.

Choosing a model is a product decision, not a benchmark decision. Here is the framework we use.

Step 1: Write the job description

Before looking at any model, write down what the model has to do, in the same way you'd write a job description for a human. What is the input? What is the output? What does "good" look like? What does "unacceptable" look like? How fast does the answer need to come back? Who is the user, and what age are they? Most failed model decisions skip this step.
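One way to make the job description hard to skip is to write it as a small structured record. The sketch below is only an illustration: the `JobSpec` class and all of its field names are our own assumptions, not a standard schema, and the example values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    """A written 'job description' for the model (field names are illustrative)."""
    input_description: str
    output_description: str
    good_looks_like: list[str]
    unacceptable_looks_like: list[str]
    max_latency_seconds: float
    user_description: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields left empty, so gaps show up early."""
        return [name for name, value in vars(self).items()
                if value in ("", [], None)]


# Hypothetical example: a bedtime-story generator.
bedtime_stories = JobSpec(
    input_description="A child's name and a theme, such as 'dragons'",
    output_description="A 300-word story in plain language",
    good_looks_like=["gentle tone", "ends calmly"],
    unacceptable_looks_like=[],  # deliberately left empty to show the check
    max_latency_seconds=5.0,
    user_description="Parents reading to children aged 4 to 8",
)

print(bedtime_stories.missing_fields())
```

Here the check reports that nobody has yet written down what "unacceptable" looks like, which is exactly the kind of gap this step is meant to surface.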

Step 2: Pick the constraints that actually matter

Every model is a trade-off across roughly seven axes:

Reasoning depth - can it handle multi-step problems without losing the thread?
Latency - first-token and full-response time at your traffic volume.
Cost - price per million input and output tokens, weighted by your expected input/output mix.
Context window - how much can you stuff into a single call?
Instruction-following discipline - does it follow your system prompt under pressure?
Safety behaviour - refusal patterns, tone, and stance toward sensitive topics.
Multimodality - does it natively handle images, audio, and files, and can it call tools?

Rank these for your job. Almost every team finds that only two or three matter, and the rest are noise. A bedtime-story generator does not care about reasoning depth. A code assistant does not care about audio. A safety-critical kids' chatbot trades a little reasoning for a lot of refusal discipline.
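If cost is one of the axes that matters for you, it helps to work it out at your own token mix rather than comparing list prices. The sketch below is a minimal calculation; the function name, token counts, prices, and traffic volume are all made-up assumptions for illustration.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_million: float,
                     output_price_per_million: float) -> float:
    """Cost of one call, given token counts and per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Hypothetical mix: 2,000 input tokens and 500 output tokens per call,
# at invented prices of $0.50 (input) and $2.00 (output) per million tokens.
per_call = cost_per_request(2_000, 500, 0.50, 2.00)
per_month = per_call * 1_000_000  # assuming one million calls a month

print(f"per call: ${per_call:.4f}, per month: ${per_month:,.2f}")
```

With these invented numbers, the 500 output tokens cost as much as the 2,000 input tokens, which is why the mix matters more than either price on its own.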

Step 3: Map task to tier - not to lab

Stop thinking in lab names. Start thinking in tiers:

Nano-tier - classification, extraction, tone detection, formatting.
Mini-tier - short answers, summarisation, light rewriting, routing decisions.
Standard-tier - multi-turn chat, long-form generation, tool use, RAG answers.
Frontier-tier - agentic reasoning, code generation, hard reasoning chains.
Multimodal-tier - image/audio/video understanding or generation.

Once you know the tier, pick the cheapest model in that tier that passes your evaluations. The cheapest model that passes is almost always good enough, and using it leaves headroom in your budget for the requests that genuinely need frontier reasoning.
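The selection rule above can be sketched in a few lines. Everything here is hypothetical: the `Candidate` record, the model names, the costs, and the pass rates are placeholders for whatever your own evaluations produce.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str                # hypothetical model identifier
    tier: str                # e.g. "nano", "mini", "standard", "frontier"
    cost_per_request: float  # from your own cost estimate
    eval_pass_rate: float    # fraction of your eval set passed, 0.0 to 1.0


def cheapest_passing(candidates: list[Candidate], tier: str,
                     min_pass_rate: float) -> Optional[Candidate]:
    """Return the cheapest candidate in the tier that meets the pass bar."""
    eligible = [c for c in candidates
                if c.tier == tier and c.eval_pass_rate >= min_pass_rate]
    return min(eligible, key=lambda c: c.cost_per_request, default=None)


# Invented candidates and scores, for illustration only.
candidates = [
    Candidate("mini-a", "mini", 0.0004, 0.88),
    Candidate("mini-b", "mini", 0.0009, 0.95),
    Candidate("mini-c", "mini", 0.0015, 0.97),
    Candidate("frontier-x", "frontier", 0.0200, 0.99),
]

choice = cheapest_passing(candidates, tier="mini", min_pass_rate=0.90)
print(choice.name if choice else "no model in this tier passes")
```

In this example the cheapest model fails the bar and the best-scoring one costs more than it needs to, so the rule picks the one in between. If nothing passes, the function returns `None`, which is a signal to move up a tier rather than lower the bar.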

Step 4: Evaluate on your own data

Public benchmarks measure averages on tasks that aren't yours. Build an evaluation set of 50 to 200 examples from your actual product. Score each candidate model against it - automatically if you can, by hand if you can't. The cheapest model that passes your eval set is the right model. Nothing else matters.
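A minimal automatic scorer might look like the sketch below. It assumes each model can be wrapped as a plain function from prompt to text, and it uses a simple exact-match grader; both are assumptions, and the two "models" here are stand-in stubs rather than real API clients. Real products often need a task-specific grader instead.

```python
from typing import Callable

Model = Callable[[str], str]    # prompt in, text out
EvalExample = tuple[str, str]   # (prompt, expected output)


def exact_match(output: str, expected: str) -> bool:
    """Case- and whitespace-insensitive comparison."""
    return output.strip().lower() == expected.strip().lower()


def pass_rate(model: Model, examples: list[EvalExample],
              grader: Callable[[str, str], bool] = exact_match) -> float:
    """Fraction of examples for which the grader accepts the model's output."""
    if not examples:
        raise ValueError("evaluation set is empty")
    passed = sum(grader(model(prompt), expected) for prompt, expected in examples)
    return passed / len(examples)


# A tiny invented eval set; a real one would have 50 to 200 examples.
examples = [
    ("I love it", "positive"),
    ("Terrible", "negative"),
    ("Great value", "positive"),
    ("Broke in a day", "negative"),
]


def stub_model_a(prompt: str) -> str:
    return "positive"  # always answers the same thing


def stub_model_b(prompt: str) -> str:
    bad_words = ("terrible", "broke")
    return "negative" if any(w in prompt.lower() for w in bad_words) else "positive"


for name, model in [("stub-a", stub_model_a), ("stub-b", stub_model_b)]:
    print(name, pass_rate(model, examples))
```

Swapping a stub for a real client only requires a function with the same prompt-to-text shape, so the same scorer can be rerun on every new candidate.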

Step 5: Plan to swap it

The right model in May 2026 will not be the right model in May 2027. New releases will be cheaper, faster, or smarter - usually all three. Build your app so that swapping models is a one-line change behind a routing abstraction. Run your evaluation set on every new candidate. Switch when the numbers say so, not when the launch post says so.
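One possible shape for that routing abstraction is sketched below. The task names, model names, and stub functions are all hypothetical; in a real app the registry entries would wrap whichever provider clients you use.

```python
from typing import Callable

ModelFn = Callable[[str], str]  # prompt in, text out


# Placeholder model clients; real ones would call a provider's API.
def _model_a(prompt: str) -> str:
    return f"[model-a] {prompt}"


def _model_b(prompt: str) -> str:
    return f"[model-b] {prompt}"


REGISTRY: dict[str, ModelFn] = {
    "model-a": _model_a,
    "model-b": _model_b,
}

# The only place model names appear. Swapping a model is a one-line change here.
ROUTES: dict[str, str] = {
    "summarise": "model-a",
    "classify": "model-b",
}


def complete(task: str, prompt: str) -> str:
    """Product code calls this with a task name, never with a model name."""
    return REGISTRY[ROUTES[task]](prompt)


print(complete("summarise", "Quarterly report text"))
```

Because product code only ever names the task, moving "summarise" to a different model means editing one entry in `ROUTES` and rerunning the evaluation set.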

Common mistakes

Picking the biggest model "to be safe" - you're not being safe, you're being slow and expensive.
Picking the cheapest model "to save money" - without evals, you don't know what you're shipping.
Trusting public leaderboards instead of your own data.
Locking model names directly into product code with no abstraction layer.
Treating safety as a model property instead of a system property - see our other writing on this.

"The best model is the cheapest one that passes your eval set. Everything else is marketing."