Tech Blog

AI engineering, LLM optimization, and agent architecture.

Build a Token Router with Embeddings and Prompt Templates

Skip the training pipeline and the GPU — embeddings, cosine similarity, and structured prompts are enough to cut your LLM bill by 80%.

The idea

Every query has a shape — topic, complexity, expected output format. You can detect that shape in <5ms using embeddings, then:

- Pick a prompt template — pre-built system prompt with format constraints, cached by the provider
- Pick a model — cheap for easy queries, strong for hard ones
- Cap output tokens — templates define expected length

All of this works with pure geometry in embedding space — no model training, no preference data required. ...
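The nearest-template lookup described above can be sketched in a few lines. This is a minimal illustration, not the post's actual implementation: the embeddings are tiny made-up vectors, and the template names, model names, and token caps are placeholders. A real router would embed each query with a small embedding model first.

```python
import math

# Hypothetical pre-computed template centroids; real embeddings would have
# hundreds of dimensions and come from an embedding model.
TEMPLATES = {
    "faq":    {"embedding": [0.9, 0.1, 0.0], "model": "cheap-model",  "max_tokens": 256},
    "design": {"embedding": [0.1, 0.9, 0.2], "model": "strong-model", "max_tokens": 2048},
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def route(query_embedding):
    """Pick the template whose centroid is nearest in embedding space."""
    name, cfg = max(TEMPLATES.items(),
                    key=lambda kv: cosine(query_embedding, kv[1]["embedding"]))
    return name, cfg["model"], cfg["max_tokens"]

# A query embedding close to the "faq" centroid routes cheap:
# route([0.85, 0.2, 0.05]) → ("faq", "cheap-model", 256)
```

One `max` over cosine similarities is the whole decision — pure geometry, no model call on the routing path.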

March 23, 2026 · 7 min · Minh-Nhut Nguyen

Two Roads to AI Agents: Code or Markdown?

The same task, two radically different approaches. One says “write code to orchestrate the LLM.” The other says “write markdown to teach it.” Both produce agents that reason, use tools, and complete complex work. Knowing when to reach for which is the skill that matters in 2026.

The SDK Way: Agents as Code

Agent SDKs — OpenAI Agents SDK, Claude Agent SDK, LangGraph, CrewAI — let you build agents programmatically. You register tools as functions, write instructions, and the SDK runs the loop: prompt the LLM, execute tool calls, feed results back, repeat. ...
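The loop those SDKs run can be sketched generically. This is a stand-in, not any particular SDK's API: `call_llm`, the reply dict shape, and the tool registry are all hypothetical names for illustration.

```python
# Minimal sketch of the agent loop an SDK runs for you. `call_llm` returns a
# dict with "content" and optional "tool_calls"; this shape is assumed, not
# taken from any specific SDK.
def run_agent(call_llm, tools, messages, max_turns=10):
    for _ in range(max_turns):
        reply = call_llm(messages)                  # 1. prompt the LLM
        if not reply.get("tool_calls"):
            return reply["content"]                 # no tool request: done
        for call in reply["tool_calls"]:            # 2. execute tool calls
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool",        # 3. feed results back
                             "name": call["name"],
                             "content": str(result)})
    raise RuntimeError("agent did not finish within max_turns")
```

Every SDK listed above is, at its core, a more robust version of this loop plus schema handling, retries, and state management.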

March 23, 2026 · 4 min · Minh-Nhut Nguyen

Workshop: Build an AI Video Pipeline — Skills vs SDK in Practice

Part 2 of Two Roads to AI Agents. This time we apply the framework to something real. In Part 1, we drew the line between Agent SDKs (code orchestration) and Agent Skills (markdown knowledge). Now let’s see where that line falls in practice. We’ll walk through a pipeline that turns any article into a narrated MP4 — and at each step, I’ll label whether it belongs in SDK code or Skill knowledge, and why. ...

March 23, 2026 · 4 min · Minh-Nhut Nguyen

Your LLM Bill Is 80% Waste. Here Are 7 Fixes.

You’re sending every question to your most expensive model. That’s like routing every patient to the head surgeon — stitches or heart transplant, same price. Seven levers, applied in order. Together: 80–95% cost reduction.

Lever 1: Route by Difficulty

65% of queries don’t need your best model. A router classifies each query in <5ms using embeddings (mathematical fingerprints) — no AI call needed.

          ┌─────────┐
Query ──► │ Router  │ ──► "What's our refund policy?" ──► Cheap model ($0.80/MTok)
          │ (<5ms)  │ ──► "Design a caching layer"    ──► Strong model ($15/MTok)
          └─────────┘

- Category matching — compare query fingerprint to example phrases
- Keyword overrides — “OWASP” → always route to code review
- Complexity scoring — 5-line function → cheap; 200-line system → strong

Result: 65% of traffic costs 3.75x less. Hard queries get better models. ...
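The three routing signals above compose into a simple decision cascade. Everything here is illustrative: the keyword list, the 50-line complexity threshold, the 0.8 similarity cutoff, and the model names are made-up placeholders, and the similarity score is assumed to be computed elsewhere from embeddings.

```python
# Illustrative cascade only — thresholds and names are assumptions.
KEYWORD_OVERRIDES = {"owasp": "strong-model"}  # security reviews always escalate

def route(query: str, similarity_to_easy: float) -> str:
    q = query.lower()
    for keyword, model in KEYWORD_OVERRIDES.items():
        if keyword in q:                        # keyword overrides win first
            return model
    if len(query.splitlines()) > 50:            # complexity scoring: big inputs escalate
        return "strong-model"
    if similarity_to_easy > 0.8:                # category matching via embedding similarity
        return "cheap-model"
    return "strong-model"                       # default: don't degrade hard queries
```

The ordering matters: overrides beat similarity, and the default falls through to the strong model so uncertain queries never get a degraded answer.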

March 23, 2026 · 6 min · Minh-Nhut Nguyen

Building an AI Video Pipeline: From Text to Narrated MP4 with Remotion and ElevenLabs

I wanted a way to turn my blog posts into narrated videos without spending hours in video editors. What I ended up building was a full pipeline: give it an article, a URL, or a PowerPoint file — get back a 1080p MP4 with animated slides, syntax-highlighted code blocks, and an AI voiceover in any language. The whole thing is open source: github.com/nmnhut-it/educational-video-pipeline. This post walks through how it works and how you can build your own. ...

March 7, 2026 · 7 min · Minh-Nhut Nguyen

From Raw Text to Agent Teams: How LLM Tooling Evolved

LLMs Don’t Actually “Call” Tools

Here is a fact that surprises most people: LLMs cannot run code. They cannot call APIs. They cannot read files. All they do is predict the next token — they output raw text, one piece at a time. So how do they “use tools”?

The trick: structured wishes

During training, models are fine-tuned on augmented datasets that include tool call examples. The training pipeline works roughly like this: ...
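A “structured wish” is just text the model emits and the runtime fulfils. The sketch below assumes a JSON wish format and a `get_weather` tool invented for illustration — vendors each define their own wire format, but the division of labor is the same: the model writes, the runtime executes.

```python
import json

# The model only ever emits text like this; it never runs anything itself.
raw_model_output = '{"tool": "get_weather", "args": {"city": "Hanoi"}}'

def execute_structured_wish(text, tools):
    """Parse the model's text and fulfil the 'wish' on its behalf."""
    wish = json.loads(text)
    return tools[wish["tool"]](**wish["args"])

# A hypothetical tool registry maintained by the runtime, not the model.
tools = {"get_weather": lambda city: f"31°C in {city}"}
```

The result string is then appended to the conversation, and the model reads it as ordinary context on the next turn — which is the whole illusion of tool use.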

March 3, 2026 · 17 min · Minh-Nhut Nguyen