Build a Token Router with Embeddings and Prompt Templates

Skip the training pipeline and the GPU — embeddings, cosine similarity, and structured prompts are enough to cut your LLM bill by 80%.

The idea: every query has a shape — topic, complexity, expected output format. You can detect that shape in <5ms using embeddings, then:

- Pick a prompt template — a pre-built system prompt with format constraints, cached by the provider
- Pick a model — cheap for easy queries, strong for hard ones
- Cap output tokens — templates define the expected length

All of this works with pure geometry in embedding space — no model training, no preference data required. ...
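The geometry behind the pick-a-template step can be sketched in a few lines: embed the query, compare it against a precomputed centroid per template via cosine similarity, and take the closest match. The template names, models, and tiny 3-d vectors below are illustrative assumptions, not the article's actual data — real embeddings come from an embedding model and have hundreds of dimensions.

```python
import math

# Hypothetical template table: centroid embedding, model tier, output-token cap.
# The 3-d vectors are toy values chosen only to illustrate the geometry.
TEMPLATES = {
    "faq_lookup":  ([0.9, 0.1, 0.0], "cheap-model",  200),
    "code_design": ([0.1, 0.9, 0.2], "strong-model", 2000),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route(query_embedding):
    """Pick the template whose centroid is closest to the query embedding."""
    name, (centroid, model, cap) = max(
        TEMPLATES.items(),
        key=lambda kv: cosine(query_embedding, kv[1][0]),
    )
    return name, model, cap

# A query whose embedding sits near the FAQ centroid routes to the cheap tier.
print(route([0.8, 0.2, 0.1]))  # ('faq_lookup', 'cheap-model', 200)
```

Because the lookup is just a handful of dot products, it runs in microseconds — the <5ms budget is dominated by the embedding call, not the comparison.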

March 23, 2026 · 7 min · Minh-Nhut Nguyen

Your LLM Bill Is 80% Waste. Here Are 7 Fixes.

You’re sending every question to your most expensive model. That’s like routing every patient to the head surgeon — stitches or heart transplant, same price. Seven levers, applied in order. Together: 80–95% cost reduction.

Lever 1: Route by Difficulty

65% of queries don’t need your best model. A router classifies each query in <5ms using embeddings (mathematical fingerprints) — no AI call needed.

              ┌─────────┐
    Query ──► │ Router  │ ──► "What's our refund policy?" ──► Cheap model ($0.80/MTok)
              │ (<5ms)  │ ──► "Design a caching layer"    ──► Strong model ($15/MTok)
              └─────────┘

- Category matching — compare the query fingerprint to example phrases
- Keyword overrides — “OWASP” → always route to code review
- Complexity scoring — a 5-line function → cheap; a 200-line system → strong

Watch your delegation cost. In agentic architectures, an orchestrator dispatches tasks to sub-agents. The orchestrator’s prompt to a sub-agent is output tokens for the orchestrator — 3–5x more expensive than input. A verbose 500-token delegation prompt costs the same as 1,500–2,500 input tokens. ...
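The three routing signals compose naturally as a priority chain: overrides first, then category matching, then a complexity fallback. A minimal sketch, assuming made-up keyword lists and a crude word-count threshold (the article's actual categories, keywords, and thresholds are not specified here):

```python
# Hypothetical routing tables — stand-ins for the article's real categories.
KEYWORD_OVERRIDES = {"owasp"}                 # security review: always the strong model
EASY_PHRASES = {"refund", "policy", "hours"}  # example phrases for easy categories

def score_complexity(query: str) -> int:
    # Crude proxy for complexity: longer queries tend to be harder.
    # A real router would use the embedding-based classifier instead.
    return len(query.split())

def route(query: str) -> str:
    q = query.lower()
    # 1. Keyword overrides win outright.
    if any(k in q for k in KEYWORD_OVERRIDES):
        return "strong"
    # 2. Category matching: overlap with known easy phrases.
    if any(p in q for p in EASY_PHRASES):
        return "cheap"
    # 3. Complexity scoring: fall back to a length threshold.
    return "strong" if score_complexity(query) > 40 else "cheap"

print(route("What's our refund policy?"))      # cheap
print(route("Audit this code against OWASP"))  # strong
```

At $0.80 vs $15 per million tokens, routing the 65% of easy queries to the cheap tier already removes most of the bill before the other six levers apply.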

March 23, 2026 · 8 min · Minh-Nhut Nguyen