The Real Cost of LLMs in Production: Token Economics for Enterprises

Prabhakar Gupta · Principal AI Architect · 06 May 2026 · 7 min read

Nobody cancels an AI project because the demo was wrong. They cancel it because the pilot that cost ₹2,000 a day quietly became ₹2,00,000 a day at rollout — and nobody could explain which feature was burning it.

Agent cost has become a first-class architectural concern for a simple reason: agents multiply tokens. One user request fans into planning calls, tool calls, retries, reflection steps — a 12-step agent run can consume 50–100× the tokens of the chat completion your finance model assumed. If you priced the project on chatbot economics, your unit economics are fiction.
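To see where the multiplier comes from, here is a minimal sketch of the arithmetic. It assumes each agent step re-sends the whole accumulated context; every number in it (context sizes, tool-result sizes, output lengths) is a hypothetical placeholder, not a measurement from a real deployment.

```python
def agent_run_tokens(steps, base_context, tool_result_tokens, output_tokens):
    """Total tokens billed for one agent run.

    Assumes every step re-sends the full context so far, and that each
    step's tool result and model output are appended to that context.
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context + output_tokens               # input + output billed this step
        context += tool_result_tokens + output_tokens  # context grows for the next step
    return total


# Hypothetical single chat completion: 1,000-token prompt, 300-token answer.
chat_tokens = 1_000 + 300

# Hypothetical 12-step agent run: 3,000-token system prompt and tool schemas,
# 800 tokens of tool output and 300 tokens of model output added per step.
agent_tokens = agent_run_tokens(
    steps=12, base_context=3_000, tool_result_tokens=800, output_tokens=300
)

multiplier = agent_tokens / chat_tokens
print(f"agent run: {agent_tokens} tokens, {multiplier:.0f}x the chat completion")
```

The growth is worse than linear because the context is paid for again on every step. That is why step count and context size matter more to the bill than the per-token price.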

01 Measure before you optimise

You cannot cut what you cannot attribute. Instrument every call with: feature, user, agent, step, model, and token counts — then build the one dashboard that matters: cost per completed task, not cost per token. In one deployment, that dashboard revealed a single retry loop in a low-value formatting step consuming 31% of total spend. The fix took an afternoon. Finding it without attribution would have been impossible.
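A minimal sketch of that attribution, assuming you log one record per model call. The `CallRecord` fields mirror the dimensions listed above; the model names and prices are hypothetical placeholders, so substitute your provider's actual rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in rupees per 1,000 tokens: (input, output).
PRICE_PER_1K = {"small": (0.02, 0.08), "frontier": (0.40, 1.60)}


@dataclass(frozen=True)
class CallRecord:
    task_id: str
    feature: str
    user: str
    agent: str
    step: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(rec):
    """Cost of a single model call in rupees."""
    in_price, out_price = PRICE_PER_1K[rec.model]
    return rec.input_tokens / 1000 * in_price + rec.output_tokens / 1000 * out_price


def spend_share_by_step(records):
    """Fraction of total spend attributable to each agent step."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec.step] += call_cost(rec)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {step: cost / grand_total for step, cost in totals.items()}


def cost_per_completed_task(records, completed_task_ids):
    """All spend, including failed tasks, divided by tasks that completed."""
    if not completed_task_ids:
        return float("inf")
    return sum(call_cost(r) for r in records) / len(completed_task_ids)


# Hypothetical log: task t1 completed, task t2 failed after planning.
records = [
    CallRecord("t1", "support", "u1", "triage", "plan", "frontier", 2000, 500),
    CallRecord("t1", "support", "u1", "triage", "format", "small", 1000, 1000),
    CallRecord("t2", "support", "u2", "triage", "plan", "frontier", 2000, 500),
]
per_task = cost_per_completed_task(records, {"t1"})
shares = spend_share_by_step(records)
```

Dividing by completed tasks rather than all tasks is deliberate: spend on failed or abandoned runs still lands on the bill, so it should inflate the metric.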

02 The optimisation stack, in order of leverage

(1) Model routing: most agent steps (extraction, classification, formatting) don't need a frontier model. Route by step difficulty; reserve the expensive model for planning and final synthesis. Typical saving: 40–70%.
(2) Caching: prompt-prefix caching for your stable system prompts and tool schemas, plus semantic caching for repeated questions; enterprise query distributions are brutally long-tailed.
(3) Context discipline: every token you retrieve is a token you pay to send, often on every step. A reranked top-5 beats a stuffed top-30 on cost and accuracy.
(4) Output budgets: structured outputs with tight schemas instead of essays.
(5) Batching for anything that isn't interactive: overnight document processing has no business paying real-time prices.
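As an illustration of the first lever, here is a minimal sketch of routing by step type, with a cost comparison against sending everything to the frontier model. The step names, tier labels, blended prices, and token counts are all hypothetical; the saving it computes depends entirely on those assumptions and is not a benchmark.

```python
# Hypothetical mapping from agent step type to model tier.
STEP_TIER = {
    "extraction": "small",
    "classification": "small",
    "formatting": "small",
    "planning": "frontier",
    "synthesis": "frontier",
}

# Hypothetical blended prices in rupees per 1,000 tokens.
PRICE_PER_1K_TOKENS = {"small": 0.05, "frontier": 1.00}


def route(step):
    """Pick a model tier for a step; unknown steps fall back to the frontier model."""
    return STEP_TIER.get(step, "frontier")


def run_cost(steps, router):
    """Cost of one agent run, given (step, tokens) pairs and a routing function."""
    return sum(
        tokens / 1000 * PRICE_PER_1K_TOKENS[router(step)] for step, tokens in steps
    )


# Hypothetical agent run: (step type, total tokens used by that step).
run = [
    ("planning", 4000),
    ("extraction", 6000),
    ("classification", 2000),
    ("formatting", 3000),
    ("synthesis", 5000),
]

all_frontier = run_cost(run, lambda step: "frontier")
routed = run_cost(run, route)
saving = 1 - routed / all_frontier
print(f"all-frontier: ₹{all_frontier:.2f}, routed: ₹{routed:.2f}, saving: {saving:.0%}")
```

Defaulting unknown steps to the frontier tier trades some cost for safety: a new step type gets the strong model until someone decides it can be downgraded.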

The governing metric

Track cost-per-completed-task against the human baseline it replaces. ₹14 per resolved ticket against a ₹220 human baseline is a business; ₹260 against ₹220 is a hobby with a dashboard.
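The comparison is simple enough to put in a few lines. This sketch uses the two figures from the paragraph above; the function name and return shape are illustrative assumptions.

```python
def compare_to_baseline(cost_per_task, human_baseline):
    """Compare AI cost per completed task with the human cost it replaces."""
    return {
        "ratio": cost_per_task / human_baseline,
        "saving_per_task": human_baseline - cost_per_task,
        "viable": cost_per_task < human_baseline,
    }


business = compare_to_baseline(14, 220)   # ₹14 against a ₹220 baseline
hobby = compare_to_baseline(260, 220)     # ₹260 against a ₹220 baseline
```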

Bottom line: design the economics with the architecture, not after it. The teams treating tokens like cloud spend in 2014 — unmetered, unowned, surprising — are the ones whose AI programs die in the budget review, not the demo.
