# AI-Powered .NET API Production Readiness Checklist (2026)

Wiring an LLM into an ASP.NET Core endpoint takes an afternoon. Making that endpoint survive real traffic, real users, and a real invoice at the end of the month is the part nobody shows you. This **AI-powered .NET API production readiness checklist** is the list I run through before any `IChatClient`\-backed endpoint goes live, because in production I've watched every one of these items fail at least once: a single unbounded prompt that burned through a monthly token budget overnight, a "summarize this document" feature that turned into a data-exfiltration path, and a chat endpoint that looked fine until the model provider had a slow afternoon and took the whole API down with it. If you want the complete, runnable version of these controls wired into one working support-desk API, the annotated source on [Patreon](https://www.patreon.com/CodingDroplets) goes well past the snippets here and shows how the pieces connect end to end.

Most AI tutorials stop at the demo and leave the production hardening as an exercise. Getting cost control, security, and observability right at the same time - not bolted on afterwards - is exactly what the [AI-Powered .NET APIs course](https://aiapis.codingdroplets.com/) walks through in Chapter 19, where the same production-readiness checklist is applied against a real codebase you can run.

[![AI-Powered .NET APIs](https://newsletter.codingdroplets.com/images/ai-api-course-banner-1.jpg align="center")](https://aiapis.codingdroplets.com/)

Treat the items below as gates, not suggestions. Each one is cheap to add before launch and expensive to retrofit after an incident.

## What Does Production Readiness Mean for an AI-Powered .NET API?

Production readiness for an AI endpoint means three things are true at once: it cannot run away on cost, it cannot be tricked into acting against you, and you can see exactly what it did when something goes wrong. A traditional API has bounded, predictable work per request. An LLM call does not - cost, latency, and output are all variable and partly controlled by whoever wrote the prompt. That is the gap this checklist closes.

At a glance, the 12 controls are:

1.  Cap tokens and cost on every request
    
2.  Tier your models
    
3.  Cache responses and embeddings
    
4.  Rate-limit AI endpoints on their own budget
    
5.  Add timeouts, retries, and a fallback model
    
6.  Treat every model response as untrusted input
    
7.  Ground answers and allow an "I don't know" exit
    
8.  Defend against prompt injection
    
9.  Give tools and agents least privilege
    
10.  Handle PII and data residency
     
11.  Instrument every call with OpenTelemetry
     
12.  Gate quality with an evaluation suite in CI
     

## 1\. Cap Tokens and Cost on Every Request

The most common and most avoidable AI incident is a runaway bill. Every call to `GetResponseAsync` should carry an explicit output cap so no single request can balloon, and you should track a per-user or per-tenant token budget on top of that.

```csharp
var options = new ChatOptions
{
    MaxOutputTokens = 500,   // hard ceiling per request
    Temperature = 0.2f
};
```

`ChatOptions` is part of [`Microsoft.Extensions.AI`](https://learn.microsoft.com/en-us/dotnet/ai/microsoft-extensions-ai) (10.x, .NET 10). A cap here is the difference between a bad prompt costing a fraction of a cent and costing a fortune. I cover the full pattern in [Runaway LLM Costs in a .NET API](https://codingdroplets.com/runaway-llm-costs-dotnet-api).

## 2\. Tier Your Models

Not every request needs your most expensive model. Route the cheap, high-volume work (classification, short extraction, routing) to a small fast model and reserve the premium model for the requests that genuinely need it. With `Microsoft.Extensions.AI` this is a registration concern - you can resolve different `IChatClient` pipelines per use case rather than hardcoding one model everywhere. Model tiering is usually the single biggest lever on your monthly cost without any visible quality loss.

## 3\. Cache Responses and Embeddings

Identical inputs should not pay for identical model calls twice. The built-in caching client layers a distributed cache around your pipeline, and embeddings in particular are worth caching because the same source text always produces the same vector.

```csharp
builder.Services.AddChatClient(innerClient)
    .UseDistributedCache()      // skip the model on a cache hit
    .UseFunctionInvocation()
    .UseOpenTelemetry();
```

A word of caution: cache deterministic calls, not personalized or high-temperature ones. Caching a creative, user-specific response and replaying it for someone else is a correctness bug, not an optimization.

## 4\. Rate-Limit AI Endpoints on Their Own Budget

Your AI endpoints need a tighter limit than the rest of your API, because each request is far more expensive. You can apply ASP.NET Core's built-in `AddRateLimiter` at the endpoint, and for limits scoped to the model itself a small `DelegatingChatClient` slots straight into the pipeline.

```csharp
client = client.AsBuilder()
    .UseRateLimiting(perUserLimiter)   // custom DelegatingChatClient
    .Build();
```

Partition the limit by user, API key, or tenant - never a single global bucket that one noisy client can drain for everyone else.

## 5\. Add Timeouts, Retries, and a Fallback Model

Model providers have slow minutes and bad afternoons. Without a timeout, a hanging upstream call holds your thread and cascades into the rest of the API. Wrap provider calls with a sensible timeout, retry transient failures with backoff, and define a fallback model so a primary outage degrades gracefully instead of failing hard. While you are here, make sure every call flows a `CancellationToken` end to end - when a user closes a streamed `GetStreamingResponseAsync` connection, you want that work to stop and free the resource rather than keep paying for tokens nobody will read.

## 6\. Treat Every Model Response as Untrusted Input

A model can return malformed JSON, hallucinated fields, or values outside any range you expected. Use structured outputs to get a typed result, then validate it exactly as you would validate a request body from the public internet.

```csharp
var response = await client.GetResponseAsync<TicketTriage>(prompt);
if (!validator.Validate(response.Result).IsValid)
{
    // reject, retry, or fall back - never trust it blindly
}
```

Deserializing into a C# record is not validation. The model is an untrusted producer, and the gap between "it parsed" and "it is correct" is where bugs live. See [Model Returns Invalid JSON in .NET](https://codingdroplets.com/structured-output-invalid-json-dotnet-fixes) for the failure modes.

## 7\. Ground Answers and Allow an "I Don't Know" Exit

If your endpoint answers questions about your own data, ground it with retrieval (RAG) instead of relying on the model's training. Just as important, give the model an explicit path to refuse: a grounded prompt that says "if the context does not contain the answer, say you don't know" prevents confident fabrication. A refusal is a correct answer when the data isn't there - a plausible-sounding invention is a production incident waiting for a screenshot.

## 8\. Defend Against Prompt Injection

Prompt injection is the SQL injection of the AI era, and it comes in two forms: direct (a user types "ignore your instructions") and indirect (a poisoned document or web page the model reads as part of RAG). Keep system instructions separate from user content, filter and constrain inputs, and never let model output flow into a privileged action without a check. This is the item teams skip most and regret most. The defensive patterns are detailed in [Preventing Prompt Injection in ASP.NET Core AI APIs](https://codingdroplets.com/preventing-prompt-injection-aspnet-core-ai-apis).

## 9\. Give Tools and Agents Least Privilege

The moment you enable tool calling or an agent, the model can trigger real code in your system. Allow-list exactly which tools are exposed, validate every argument the model supplies, and require explicit human confirmation for anything destructive. `AIFunctionFactory` makes wiring a .NET method as a tool trivial - which is precisely why the access boundary has to be deliberate. Treat the model as an unprivileged caller, never as a trusted service.

## 10\. Handle PII and Data Residency

Decide, before launch, what data is allowed to leave your boundary and reach a hosted model. Strip or mask PII you don't need to send, log responsibly (turn off sensitive-data capture in production telemetry), and where regulation demands it, route sensitive workloads to a local model instead of a cloud provider. The flexibility of `Microsoft.Extensions.AI` is that swapping a cloud client for a local one is a one-line registration change, so data residency becomes a configuration decision rather than a rewrite.

## 11\. Instrument Every Call with OpenTelemetry

You cannot operate what you cannot see. The OpenTelemetry chat client emits traces, token counts, and latency per request following the [OpenTelemetry Semantic Conventions for Generative AI](https://opentelemetry.io/docs/specs/semconv/gen-ai/), so spend per endpoint, cache-hit rate, and slow calls show up on a dashboard instead of in a surprise invoice.

```csharp
.UseOpenTelemetry(configure: o => o.EnableSensitiveData = false);
```

Keep `EnableSensitiveData` off in production so prompts and responses are not written into your traces. Token counts and latency are what you want in the telemetry, not user content.

## 12\. Gate Quality with an Evaluation Suite in CI

A change to a prompt, a model version, or a temperature can silently degrade quality with zero compile errors and zero failing unit tests. `Microsoft.Extensions.AI.Evaluation` lets you score responses for relevance and groundedness and wire those checks into your test run, so a regression fails the build instead of reaching users. You can unit test that prompts build correctly and that tools validate their input - but only an evaluation suite catches "the answers got worse."

## Putting the Checklist to Work

You do not need all twelve on day one, but you do need to know which gaps you are shipping with. In practice I treat 1, 5, 6, 8, and 11 - cost cap, resilience, output validation, injection defense, and observability - as non-negotiable for any public AI endpoint, and layer the rest in as the surface grows. The pattern across every item is the same one that makes any ASP.NET Core API production-grade: bound the cost, distrust the input, secure the boundary, and make the system observable. AI just raises the stakes on each.

## Frequently Asked Questions

### What Should Be on an AI API Production Readiness Checklist for .NET?

At minimum: a hard token and cost cap per request, model tiering, caching, dedicated rate limits, timeouts with a fallback model, structured-output validation, prompt-injection defenses, least-privilege tools, PII and data-residency handling, OpenTelemetry instrumentation, and an evaluation suite in CI. The first five protect availability and cost, the middle group protects security and correctness, and the last two keep the system observable and stable over time.

### How Do I Control LLM Costs in a .NET API?

Set `MaxOutputTokens` on every `ChatOptions`, tier your models so cheap requests use a small model, cache deterministic calls and embeddings, and rate-limit AI endpoints separately from the rest of your API. Then instrument cost per request with OpenTelemetry so you can see spend per endpoint before it becomes a surprise rather than after.

### How Do I Stop Prompt Injection in an ASP.NET Core AI API?

Separate system instructions from user content, validate and constrain inputs, and treat any text the model reads (including RAG documents) as untrusted. Critically, never let model output trigger a privileged action without an explicit check, and require human confirmation for destructive tool calls. Defense in depth matters here because no single filter catches every injection variant.

### Do I Need OpenTelemetry for AI Endpoints in .NET?

Yes. AI calls have variable cost and latency you cannot predict from the code alone, so per-request token counts, latency, and spend are operational essentials, not nice-to-haves. The `Microsoft.Extensions.AI` OpenTelemetry client follows the GenAI semantic conventions and layers onto any existing OpenTelemetry pipeline with one builder call.

### How Do I Test or Evaluate an LLM-Powered .NET API Before Production?

Unit test the deterministic parts - that prompts assemble correctly and tools validate their arguments - then use `Microsoft.Extensions.AI.Evaluation` to score model responses for relevance and groundedness. Wire those evaluations into CI so a prompt or model change that lowers quality fails the build instead of silently shipping.

* * *

## About the Author

I'm Celin Daniel, Co-founder of [Coding Droplets](https://codingdroplets.com/). I've been building .NET and ASP.NET Core systems in production for 13+ years - APIs, distributed backends, enterprise platforms. Everything I write here comes from real shipping experience: patterns that held up, trade-offs that bit us, and lessons learned the hard way.

*   GitHub: [codingdroplets](http://github.com/codingdroplets/)
    
*   YouTube: [Coding Droplets](https://www.youtube.com/@CodingDroplets)
    
*   Website: [codingdroplets.com](https://codingdroplets.com/)
