Runaway LLM Costs in a .NET API: Root Cause and Fix

LLM APIs charge per token - and in production, that cost accumulates far faster than most teams expect. We have seen services go from a few dollars a day to hundreds overnight, with no obvious change in traffic. The bill arrives before the alert does.
The complete implementation - with token-budget middleware, model tiering logic, and response caching wired into a real ASP.NET Core API - is on Patreon, where members get production-ready source code alongside every article.
If this is the problem you are solving right now, Chapter 15 of the AI-Powered .NET APIs course covers cost control end to end: token budgets per user and per endpoint, model tiering, caching LLM calls and embeddings, and rate limiting AI endpoints - all inside one running ASP.NET Core support API with source code you can run immediately.
Why LLM Costs Spiral in Production
The per-token cost model feels negligible in development. A few test calls at fractions of a cent apiece - it barely registers. Then you ship to production and the cost curve bends sharply upward. Several patterns drive this, and most of them are invisible until you start measuring.
The Problem Is Usually Not Traffic
In the incidents I have worked through, runaway cost is rarely caused by a genuine traffic spike. The real culprits are almost always one of these:
1. Unbounded context windows. Every message in a multi-turn conversation is sent back to the model on each call. A support session that starts at 200 tokens can reach 8,000 tokens by message 10, and 40,000 by message 40 if the conversation is never trimmed. Multiply that by concurrent users and cost explodes without any change in request count.
2. Missing MaxOutputTokens on completions. Without an explicit limit, a response can run up to the model's maximum output length (or whatever default the provider applies). A poorly phrased prompt that triggers a verbose model response can cost 10x what a bounded one would. In production I have seen MaxOutputTokens missing on 60% of calls in codebases written by teams who tested only on short prompts.
3. Model over-allocation. Using gpt-4o or a comparable frontier model for tasks that a smaller model handles just as well - classification, extraction, yes/no routing decisions - is the fastest way to overbill. A triage step that costs $0.015 per call on a frontier model can cost $0.00015 on a small model. At 100,000 daily classifications that is $1,500 versus $15, a difference of about $1,485 per day.
4. No caching on repeated prompts. FAQ lookups, system prompt variations, and templated queries often repeat with identical or near-identical inputs. Without a cache layer, each repetition hits the model and bills at full token rate.
5. No per-user or per-endpoint budget. There is no natural circuit breaker. One user running 500 requests, or one endpoint left without a rate limit on AI calls, can consume what the entire application budget was designed to cover.
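To make the first pattern concrete, here is a minimal sketch of the arithmetic behind unbounded context growth. It assumes every turn adds a fixed number of tokens and that the full history is resent on each call; the `ContextGrowth` class and its method names are illustrative, not part of any library.

```csharp
using System;
using System.Linq;

public static class ContextGrowth
{
    // Tokens sent on the Nth call when all prior turns are resent.
    // Assumption: each turn adds the same number of tokens.
    public static long TokensSentOnCall(int callNumber, int tokensPerTurn) =>
        (long)callNumber * tokensPerTurn;

    // Total input tokens billed across a session of N calls.
    public static long TotalInputTokens(int calls, int tokensPerTurn) =>
        Enumerable.Range(1, calls).Sum(n => TokensSentOnCall(n, tokensPerTurn));
}
```

With an assumed 1,000 tokens per turn, call 40 sends 40,000 tokens, and the whole 40-call session bills 820,000 input tokens. If each call sent only its own turn, the session would bill 40,000. Request count is identical in both cases, which is why this never shows up as a traffic spike.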
Diagnosing the Root Cause
Before fixing anything, instrument first. Flying blind - patching suspected causes without data - wastes time and often addresses the wrong problem.
With Microsoft.Extensions.AI in .NET 10, you can attach telemetry middleware directly to IChatClient:
// Wrap your IChatClient registration with a logging decorator
services.AddChatClient(innerClient)
    .UseLogging()
    .UseOpenTelemetry();
The OpenTelemetry GenAI semantic conventions emit per-request token counts, model name, and latency as spans. Wire those into your existing OTel pipeline - Jaeger, Grafana, or Azure Monitor - and you will see cost per endpoint within minutes.
Look for:
Which endpoints consume the most tokens per request
Whether input or output tokens dominate (uncontrolled output = missing MaxOutputTokens)
Conversation length distribution (a long tail of 40+ turn sessions is a trimming problem)
Model breakdown (are expensive models handling cheap tasks?)
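If you export the token counts into your own store, the first two checks reduce to a small aggregation. The sketch below assumes you already have per-request samples in memory; `UsageSample`, `EndpointSummary`, and `UsageReport` are hypothetical types for illustration, not part of Microsoft.Extensions.AI or OpenTelemetry.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One row per LLM call, e.g. copied from exported span attributes (assumed shape).
public sealed record UsageSample(string Endpoint, string Model, long InputTokens, long OutputTokens);

public sealed record EndpointSummary(string Endpoint, int Requests, double AvgInputTokens, double AvgOutputTokens)
{
    // True when responses, not prompts, are the larger share of tokens.
    public bool OutputDominates => AvgOutputTokens > AvgInputTokens;
}

public static class UsageReport
{
    // Groups samples by endpoint and sorts by average tokens per request, worst first.
    public static IReadOnlyList<EndpointSummary> ByEndpoint(IEnumerable<UsageSample> samples) =>
        samples
            .GroupBy(s => s.Endpoint)
            .Select(g => new EndpointSummary(
                g.Key,
                g.Count(),
                g.Average(s => s.InputTokens),
                g.Average(s => s.OutputTokens)))
            .OrderByDescending(e => e.AvgInputTokens + e.AvgOutputTokens)
            .ToList();
}
```

An endpoint at the top of this list with `OutputDominates` set points toward a missing output limit; one where input dominates points toward history trimming or oversized system prompts.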
The Fixes
Fix 1: Set MaxOutputTokens on Every Completion
This is the single highest-return fix and takes two minutes to apply across a codebase. Every ChatOptions object must have an explicit MaxOutputTokens value:
var options = new ChatOptions
{
    MaxOutputTokens = 512, // set per endpoint; classification needs 50, summarization needs 300
    Temperature = 0.2f
};
var response = await chatClient.GetResponseAsync(messages, options, cancellationToken);
The right value depends on the task. Classification and routing: 50-100 tokens. Summarization: 200-400. Open-ended support answers: 500-800. Never leave it at the provider default.
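One way to keep those limits from drifting is to define them once per task type instead of scattering literals across call sites. This is a sketch under assumptions: the `LlmTask` enum, the `OutputLimits` class, and the fallback value of 256 are hypothetical, and the numbers are simply the upper ends of the ranges above.

```csharp
using System;
using System.Collections.Generic;

public enum LlmTask { Classification, Summarization, SupportAnswer, Other }

public static class OutputLimits
{
    // Upper bounds taken from the ranges in this article; tune per endpoint.
    private static readonly IReadOnlyDictionary<LlmTask, int> Limits = new Dictionary<LlmTask, int>
    {
        [LlmTask.Classification] = 100,
        [LlmTask.Summarization] = 400,
        [LlmTask.SupportAnswer] = 800,
    };

    // Assumed conservative fallback so an unmapped task never runs unbounded.
    public const int Fallback = 256;

    public static int For(LlmTask task) =>
        Limits.TryGetValue(task, out var limit) ? limit : Fallback;
}
```

A call site would then read `new ChatOptions { MaxOutputTokens = OutputLimits.For(LlmTask.Classification) }`, which makes a missing limit easy to spot in code review.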
Fix 2: Trim Conversation History
For multi-turn conversations, apply a trimming strategy before sending the message list to the model. The simplest production-safe approach is a sliding window:
// Requires: using System.Linq; using Microsoft.Extensions.AI;
private static IList<ChatMessage> TrimHistory(IList<ChatMessage> messages, int maxTokenBudget)
{
    // Always keep system messages; slide the window over the remaining turns.
    // Note: system messages are not counted against maxTokenBudget here.
    var systemMessages = messages.Where(m => m.Role == ChatRole.System).ToList();
    var conversationMessages = messages.Where(m => m.Role != ChatRole.System).ToList();

    // Rough estimate: 1 token ~ 4 characters
    int estimatedTokens = conversationMessages.Sum(m => (m.Text?.Length ?? 0) / 4);

    // Stop once the budget is met or only the most recent messages remain
    while (estimatedTokens > maxTokenBudget && conversationMessages.Count > 2)
    {
        // Drop the oldest two messages; assumes alternating user/assistant turns
        conversationMessages.RemoveRange(0, 2);
        estimatedTokens = conversationMessages.Sum(m => (m.Text?.Length ?? 0) / 4);
    }

    return systemMessages.Concat(conversationMessages).ToList();
}
This is the shape of the pattern - the full version on Patreon includes proper token counting using the model's actual tokenizer rather than the character approximation, which matters at the edges of the budget window.
Fix 3: Model Tiering - Cheap Model First
Route tasks by complexity. The IChatClient abstraction in Microsoft.Extensions.AI makes provider swapping a one-line config change, which means you can register multiple clients and dispatch based on task type:
// Register tiered clients
services.AddKeyedChatClient("fast", fastModelInner); // small/cheap model
services.AddKeyedChatClient("capable", capableModelInner); // frontier model
In your service layer, resolve the right client based on the operation:
Classification, intent detection, routing decisions - use the fast client
Initial summarization - use the fast client
Complex reasoning, synthesis, multi-step answers - use the capable client
A pattern we ship in AI-powered APIs: run the cheap model first. If its confidence score (via structured output) is below a threshold, escalate to the capable model. This keeps the 80% of easy cases cheap while reserving spend for the genuinely hard ones.
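The escalation step can be sketched without committing to a provider. In the example below the two models are passed in as delegates so the logic stays testable; `ScoredAnswer`, `TieredDispatch`, and the 0.8 threshold are assumptions for illustration, and in a real service the delegates would wrap the keyed IChatClient instances and parse the confidence from structured output.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Answer text plus a self-reported confidence between 0 and 1 (assumed shape).
public sealed record ScoredAnswer(string Text, double Confidence);

public static class TieredDispatch
{
    // Tries the cheap model first; escalates only when confidence is below the threshold.
    public static async Task<(ScoredAnswer Answer, bool Escalated)> AskAsync(
        string prompt,
        Func<string, CancellationToken, Task<ScoredAnswer>> fastModel,
        Func<string, CancellationToken, Task<ScoredAnswer>> capableModel,
        double minConfidence = 0.8,
        CancellationToken ct = default)
    {
        var first = await fastModel(prompt, ct);
        if (first.Confidence >= minConfidence)
        {
            return (first, false);
        }

        // Low confidence: pay for the capable model on this request only.
        var second = await capableModel(prompt, ct);
        return (second, true);
    }
}
```

Returning the `Escalated` flag alongside the answer lets you track the escalation rate per endpoint. If it climbs well above the share of genuinely hard cases, the threshold or the cheap model's prompt probably needs tuning. Keep in mind that an escalated request pays for both calls.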
Fix 4: Cache LLM Calls
For endpoints where the same-or-similar prompt repeats, a cache layer cuts cost dramatically. HybridCache (introduced with .NET 9 in the Microsoft.Extensions.Caching.Hybrid package) gives you an L1 in-memory cache plus an optional L2 distributed cache such as Redis, with built-in stampede protection:
var cacheKey = $"llm:{HashPrompt(systemPrompt, userMessage)}";
var result = await hybridCache.GetOrCreateAsync(
    cacheKey,
    // Use the token HybridCache passes to the factory rather than the caller's token
    async ct => await chatClient.GetResponseAsync(messages, options, ct),
    new HybridCacheEntryOptions { Expiration = TimeSpan.FromHours(1) },
    cancellationToken: cancellationToken);
The trade-off to watch: caching is appropriate for deterministic lookups (FAQs, structured extractions) but wrong for conversational flows where each turn depends on context. Caching the wrong things produces stale or misleading responses. Measure the cache hit rate per endpoint - if it is below 20%, the content is probably too dynamic and the cache is adding latency without saving cost.
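The HashPrompt call in the snippet above is not defined in this article. One possible shape for it, as a sketch: hash the system prompt and a normalized user message with SHA-256. The `PromptCacheKeys` class is hypothetical, and trimming plus lowercasing the user message is an assumption that suits FAQ-style lookups but would be wrong for case-sensitive inputs.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class PromptCacheKeys
{
    // Builds a stable cache key fragment from the prompt parts.
    public static string HashPrompt(string systemPrompt, string userMessage)
    {
        // Assumption: whitespace and casing differences should share a cache entry.
        var normalized = userMessage.Trim().ToLowerInvariant();

        // Length-prefix the system prompt so ("ab", "c") and ("a", "bc") cannot collide.
        var payload = $"{systemPrompt.Length}:{systemPrompt}|{normalized}";

        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(payload));
        return Convert.ToHexString(hash); // 64 hex characters
    }
}
```

If the same prompt can be sent to different models or with different options, it is probably worth folding the model name and the relevant options into the payload as well, so a cheap-model answer is never served for a capable-model request.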
For a deeper look at caching strategies inside an AI-powered .NET API, the RAG Pattern in ASP.NET Core post covers how retrieval and caching interact when grounding responses in your own knowledge base.
Fix 5: Per-User and Per-Endpoint Token Budgets
Rate limiting AI endpoints is not about requests per second - it is about tokens per time window. A middleware layer that tracks token consumption and enforces a budget is the circuit breaker the system is missing.
The key integration point: capture the usage from the ChatResponse:
// Check the budget before spending more tokens
if (await tokenBudgetService.IsOverBudgetAsync(userId, endpoint))
{
    // return 429 with Retry-After, or fall back to static response
}
var response = await chatClient.GetResponseAsync(messages, options, ct);
var tokensUsed = response.Usage?.TotalTokenCount ?? 0;
await tokenBudgetService.RecordUsageAsync(userId, endpoint, tokensUsed);
Keep daily and per-minute windows. Daily budget exhaustion warrants a graceful degradation (static canned answer, redirect to human support). Per-minute exhaustion warrants a 429 with a Retry-After header. Both need observability - track budget exhaustion events in your OTel pipeline alongside token counts. This is directly tied to the security guardrails covered in the prompt injection post - an attacker who crafts a prompt that triggers a very long model response is both a security issue and a cost issue simultaneously.
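The tokenBudgetService used above is not shown in this article. As a rough sketch of what it might look like, here is an in-memory version with fixed per-minute and daily windows. The class name, constructor parameters, and key format are all assumptions; a multi-instance deployment would need a shared store such as Redis, and this sketch never evicts old window counters.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class InMemoryTokenBudgetService
{
    private readonly long _perMinuteLimit;
    private readonly long _dailyLimit;
    private readonly Func<DateTimeOffset> _clock;
    private readonly ConcurrentDictionary<string, long> _counters = new();

    // The clock is injectable so window rollover can be tested deterministically.
    public InMemoryTokenBudgetService(long perMinuteLimit, long dailyLimit, Func<DateTimeOffset>? clock = null)
    {
        _perMinuteLimit = perMinuteLimit;
        _dailyLimit = dailyLimit;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    // Adds the tokens to both the current minute window and the current day window.
    public Task RecordUsageAsync(string userId, string endpoint, long tokens)
    {
        var now = _clock();
        _counters.AddOrUpdate(MinuteKey(userId, endpoint, now), tokens, (_, current) => current + tokens);
        _counters.AddOrUpdate(DayKey(userId, endpoint, now), tokens, (_, current) => current + tokens);
        return Task.CompletedTask;
    }

    // True when either window has reached its limit.
    public Task<bool> IsOverBudgetAsync(string userId, string endpoint)
    {
        var now = _clock();
        _counters.TryGetValue(MinuteKey(userId, endpoint, now), out var minuteTotal);
        _counters.TryGetValue(DayKey(userId, endpoint, now), out var dayTotal);
        return Task.FromResult(minuteTotal >= _perMinuteLimit || dayTotal >= _dailyLimit);
    }

    private static string MinuteKey(string userId, string endpoint, DateTimeOffset now) =>
        $"{userId}|{endpoint}|m|{now.UtcDateTime:yyyyMMddHHmm}";

    private static string DayKey(string userId, string endpoint, DateTimeOffset now) =>
        $"{userId}|{endpoint}|d|{now.UtcDateTime:yyyyMMdd}";
}
```

Fixed windows are the simplest option and allow a burst at each window boundary. A sliding window or token bucket would smooth that out at the cost of more state per user.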
What to Do First
If your LLM costs are already running hot, apply these in order:
1. Instrument first - wire OpenTelemetry GenAI tracing and identify the worst-offending endpoints by tokens per request before touching any code
2. Set MaxOutputTokens on every call - immediate cost reduction, and low functional risk as long as the limit is sized per task
3. Add conversation trimming - prevents the unbounded context growth that causes the steepest cost curves
4. Review model allocation - audit whether frontier models are doing work a smaller model handles equivalently
5. Add caching - only after you understand which prompts repeat (measure first, cache second)
6. Add token budgets - the circuit breaker that caps downside if everything else fails
The trade-off that bit us hardest: teams that skip instrumentation and go straight to "reduce MaxOutputTokens on everything" tend to degrade response quality on the tasks that actually need depth. Measure first, then cut precisely.
FAQ
What is the most common cause of runaway LLM costs in a .NET API?
Unbounded conversation history is the most frequent root cause. Without a trimming strategy, every turn in a multi-turn session adds to the token count sent on subsequent calls. A 40-turn conversation can easily send 30,000-50,000 input tokens per request, and those tokens are billed again on every call, even on models where input is priced lower than output.
How do I see which endpoints are consuming the most tokens in ASP.NET Core?
Wire the UseOpenTelemetry() middleware from Microsoft.Extensions.AI into your IChatClient registration. This emits token counts per request using the OpenTelemetry GenAI semantic conventions. Route those spans to Grafana, Jaeger, or Azure Monitor and group by gen_ai.request.model and endpoint route.
Should I always cache LLM responses?
No. Caching is appropriate for deterministic or near-deterministic prompts - FAQs, classification, extraction with fixed templates. For conversational or context-dependent calls, caching produces stale responses. Measure cache hit rates per endpoint; below 20% suggests the prompt is too dynamic to benefit.
What is model tiering in a .NET AI API?
Model tiering means registering multiple IChatClient instances - a cheap/fast model and a capable/frontier model - and dispatching based on task complexity. Classification and routing use the cheap model; complex reasoning uses the capable one. The Microsoft.Extensions.AI keyed DI support makes this straightforward to wire in ASP.NET Core.
How does MaxOutputTokens affect LLM cost?
Most providers bill for both input and output tokens. Without an explicit MaxOutputTokens limit in ChatOptions, the model can generate responses up to its maximum output length. For open-ended prompts, this can produce responses 10-20x longer than a bounded call. Setting a realistic per-task limit is one of the fastest cost-reduction levers available.
What is a safe per-user daily token budget for an ASP.NET Core AI API?
It depends entirely on your model pricing and use case. A reasonable starting point: measure the 95th percentile of daily token usage per active user over a one-week period, then set the hard budget at 3x that value. This accommodates legitimate heavy users while blocking runaway loops. Adjust based on cost data after the first two weeks in production.
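The starting-point rule above is easy to compute from a week of usage data. This sketch uses the nearest-rank method for the 95th percentile; the `BudgetSizing` class is hypothetical, and the input is assumed to be one daily token total per active user per day.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BudgetSizing
{
    // Returns multiplier x the 95th percentile (nearest-rank) of the observed daily totals.
    public static long SuggestDailyBudget(IEnumerable<long> dailyTokensPerUser, double multiplier = 3.0)
    {
        var sorted = dailyTokensPerUser.OrderBy(x => x).ToArray();
        if (sorted.Length == 0)
        {
            throw new ArgumentException("At least one sample is required.", nameof(dailyTokensPerUser));
        }

        // Nearest rank: ceil(0.95 * n), computed with integers to avoid rounding surprises.
        int rank = (95 * sorted.Length + 99) / 100;
        long p95 = sorted[rank - 1];

        return (long)Math.Ceiling(p95 * multiplier);
    }
}
```

For example, if the observed daily totals are 1,000, 2,000, and so on up to 100,000 tokens, the 95th percentile is 95,000 and the suggested hard budget is 285,000 tokens per user per day.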
About the Author
I'm Celin Daniel, Co-founder of Coding Droplets. I've been building .NET and ASP.NET Core systems in production for 13+ years - APIs, distributed backends, enterprise platforms. Everything I write here comes from real shipping experience: patterns that held up, trade-offs that bit us, and lessons learned the hard way.
GitHub: codingdroplets
YouTube: Coding Droplets
Website: codingdroplets.com






