# How to Build a Multimodal AI Endpoint in ASP.NET Core: Image to Structured Data

A support agent uploads a screenshot of an error dialog. A finance team drops a photo of a paper receipt into a form. A field engineer snaps a picture of a serial-number plate. In every case the business does not want the image - it wants the structured data trapped inside it. Building multimodal AI in ASP.NET Core is how you turn those pixels into a typed C# object your API can validate, store, and act on. The good news in 2026 is that this no longer needs a bespoke vision service or a Python sidecar: the same `IChatClient` you already use for chat handles images, and it does it against OpenAI, Azure OpenAI, GitHub Models, or a local Ollama model with a one-line provider swap.

In production I have shipped versions of this exact endpoint, and the part that bites teams is never the "send an image" call - it is everything around it: file size limits, choosing a vision-capable model, forcing the model to return JSON you can trust, and handling the image that turns out to be a blurry mess. The patterns that survived real traffic, along with annotated end-to-end source code and the edge cases left out of this article, live on [Patreon](https://www.patreon.com/CodingDroplets) - ready to clone and adapt to your own domain.

Wiring a vision model into a single endpoint is one thing. Making it reliable alongside cost control, structured-output validation, and prompt-safety is what the [AI-Powered .NET APIs course](https://aiapis.codingdroplets.com/) walks through end to end - Chapter 7 builds this exact "image in, typed C# object out" feature inside one running ASP.NET Core support API, so the full context is always in front of you.

[![AI-Powered .NET APIs](https://newsletter.codingdroplets.com/images/ai-api-course-banner-1.jpg)](https://aiapis.codingdroplets.com/)

## The Business Problem: Data Is Locked Inside Images

Most line-of-business data still arrives as pictures. Receipts, ID cards, whiteboard notes, error screenshots, packing slips, meter readings - all of it carries information a human reads in a second but a traditional API cannot parse at all. The old answer was a dedicated OCR pipeline plus a pile of regex and per-document-type parsers that broke the moment a vendor changed their receipt layout.

A modern vision-capable large language model collapses that pipeline. It reads the image the way a person does, then - crucially for an API - it can return the result as a schema you define, not as free-form prose. The job of your ASP.NET Core endpoint is to accept the upload, hand the bytes to the model with a precise instruction, and marshal the answer into a strongly typed record. That record is the contract the rest of your system depends on.

The concrete example I will use is a receipt-analysis endpoint: a user posts an image, and the API returns the vendor, total, date, and line items as typed fields. The same shape works for any "extract structured data from an image" use case - only the target record and the instruction change.

## How Do You Send an Image to an LLM in ASP.NET Core?

You send an image the same way you send text: through a `ChatMessage`. The difference is that a message's `Contents` collection can hold more than one part. A multimodal message carries a `TextContent` (your instruction) and a `DataContent` (the image bytes plus its media type) side by side, and the model sees both together.

```csharp
// Requires a vision-capable model (for example gpt-4o, or a local
// vision model such as llava / gemma3 via Ollama).
var message = new ChatMessage(ChatRole.User,
[
    new TextContent("What error is shown in this screenshot?"),
    new DataContent(imageBytes, "image/png")
]);

ChatResponse response = await chatClient.GetResponseAsync(message);
```

That is the entire mechanism. `DataContent` takes the raw bytes and a media type such as `image/png` or `image/jpeg`; the provider adapter base64-encodes it and formats it for whichever model you configured. No special vision client, no separate SDK. The `IChatClient` abstraction is documented in the official [Microsoft.Extensions.AI guidance](https://learn.microsoft.com/en-us/dotnet/ai/microsoft-extensions-ai), and it is the same interface across every supported provider.
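
Because `DataContent` needs a media type alongside the bytes, one option is to derive it from the file signature rather than from whatever the client claims. The sketch below is a minimal illustration under that assumption: `ImageSniffer` and `DetectMediaType` are hypothetical names, not part of `Microsoft.Extensions.AI`, and only four common formats are checked.

```csharp
using System;

// Hypothetical helper (not part of Microsoft.Extensions.AI): inspects the
// leading "magic bytes" of an upload to pick the media type for DataContent.
public static class ImageSniffer
{
    // The fixed 8-byte signature that every PNG file starts with.
    private static ReadOnlySpan<byte> PngSignature =>
        new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

    // Returns a media type such as "image/png", or null when the bytes do not
    // match any of the formats checked here.
    public static string? DetectMediaType(ReadOnlySpan<byte> data)
    {
        if (data.StartsWith(PngSignature))
            return "image/png";

        // JPEG files start with FF D8 FF.
        if (data.Length >= 3 && data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF)
            return "image/jpeg";

        // GIF files start with the ASCII text "GIF87a" or "GIF89a".
        if (data.StartsWith("GIF87a"u8) || data.StartsWith("GIF89a"u8))
            return "image/gif";

        // WebP is a RIFF container: "RIFF", four size bytes, then "WEBP".
        if (data.Length >= 12
            && data[..4].SequenceEqual("RIFF"u8)
            && data.Slice(8, 4).SequenceEqual("WEBP"u8))
            return "image/webp";

        return null;
    }
}
```

The returned string can be passed as the second argument to `new DataContent(imageBytes, mediaType)`, and a `null` result is a reasonable signal to reject the upload before the model is called.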

> Version note: `DataContent`, `TextContent`, and the multi-part `ChatMessage` constructor shown here are part of `Microsoft.Extensions.AI` 10.x (current stable is 10.7.0 at the time of writing) on .NET 10. Vision only works if the configured model actually supports image input - a text-only model will error or ignore the image.

## Design Decisions Before You Write the Endpoint

Three decisions shape everything downstream, and getting them wrong is what causes the 2 a.m. incident rather than the compile error.

**Which model, and where it runs.** Vision-capable cloud models (GPT-4o and friends) are accurate and need zero infrastructure, but every image is billed by tokens and image tokens are not cheap - a full-resolution photo can cost far more than a paragraph of text. A local Ollama vision model keeps images on your own hardware, which matters for receipts and IDs that carry personal data. The point of `Microsoft.Extensions.AI` is that this is a registration detail, not an architectural one.

**How large an image you accept.** Never let an unbounded upload reach the model. Cap the request body, cap the image dimensions, and downscale before you send. Larger images cost more tokens and rarely improve accuracy past a point - a receipt photographed at 4000px wide gives no better extraction than the same receipt at 1200px, but it can cost several times as much.

**What "structured" means for your domain.** Define the C# record first. The record is your schema, and the model is instructed to fill it. If a field can be missing on a bad image, model it as nullable rather than pretending the model always finds it.
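
The second decision, capping dimensions and downscaling, comes down to a small calculation. As a rough sketch, assuming you cap the longest edge, the hypothetical `ImageSizing.FitLongEdge` helper below computes the target size while preserving aspect ratio. The pixel resize itself is left to whichever imaging library you already use.

```csharp
using System;

// Hypothetical helper: computes the dimensions to downscale to so that the
// longest edge is at most maxLongEdge pixels. It never upscales.
public static class ImageSizing
{
    public static (int Width, int Height) FitLongEdge(int width, int height, int maxLongEdge)
    {
        if (width <= 0 || height <= 0)
            throw new ArgumentOutOfRangeException(nameof(width), "Dimensions must be positive.");
        if (maxLongEdge <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxLongEdge));

        int longEdge = Math.Max(width, height);
        if (longEdge <= maxLongEdge)
            return (width, height); // already within the cap

        // Apply the same scale factor to both edges to keep the aspect ratio.
        double scale = (double)maxLongEdge / longEdge;
        int newWidth = Math.Max(1, (int)Math.Round(width * scale));
        int newHeight = Math.Max(1, (int)Math.Round(height * scale));
        return (newWidth, newHeight);
    }
}
```

For a 4000x3000 photo and a 1200px cap this returns 1200x900, while an 800x600 image comes back unchanged.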

## The Implementation Walkthrough

### Register the Chat Client

Registration is one line, and it is the only line that changes when you switch providers. This is the single strongest reason to go through `IChatClient` rather than a provider SDK directly - the interface contract is covered in the Microsoft docs for [the IChatClient interface](https://learn.microsoft.com/en-us/dotnet/ai/ichatclient).

```csharp
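// Cloud: OpenAI vision-capable model
// (AsIChatClient comes from the Microsoft.Extensions.AI.OpenAI package).
builder.Services.AddChatClient(
    new OpenAIClient(apiKey).GetChatClient("gpt-4o").AsIChatClient());

// Local swap (same endpoint code): an Ollama vision model.
// OllamaApiClient comes from the OllamaSharp package and implements IChatClient;
// the older Microsoft.Extensions.AI.Ollama preview package is deprecated.
// builder.Services.AddChatClient(
//     new OllamaApiClient(new Uri("http://localhost:11434"), "llava"));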
```

### Define the Target Record

The record is the contract. Model optional fields as nullable so a smudged total does not become a silent zero.

```csharp
public record ReceiptData(
    string Vendor,
    decimal? Total,
    DateOnly? PurchaseDate,
    string[] LineItems);
```

### Return Typed Data, Not a String

This is where multimodal gets genuinely useful for an API. Instead of `GetResponseAsync`, call the generic overload with your record type. The library generates a JSON schema from `ReceiptData`, instructs the model to conform to it, and deserializes the reply into a real object.

```csharp
var messages = new List<ChatMessage>
{
    new(ChatRole.System,
        "You extract data from receipt images. If a field is not " +
        "clearly visible, leave it null. Never guess."),
    new(ChatRole.User,
    [
        new TextContent("Extract the vendor, total, purchase date, and line items."),
        new DataContent(imageBytes, mediaType)
    ])
};

ChatResponse<ReceiptData> response =
    await chatClient.GetResponseAsync<ReceiptData>(messages);

ReceiptData receipt = response.Result;
```

The vision read and the structured output happen in a single round trip: image in, typed C# object out. That combination - not either feature alone - is what makes this worth shipping.
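
To make the "deserializes the reply into a real object" step more concrete, here is a rough stand-in that maps a JSON reply onto the same record with `System.Text.Json`. It illustrates the idea and is not the library's internal code: the `ReceiptJson` class, its `TryParse` method, and the choice of web serializer defaults are all assumptions made for this sketch.

```csharp
using System;
using System.Text.Json;

// Same record as defined earlier, repeated so the sketch compiles on its own.
public record ReceiptData(
    string Vendor,
    decimal? Total,
    DateOnly? PurchaseDate,
    string[] LineItems);

// Hypothetical helper: turns a JSON reply into a ReceiptData without throwing.
public static class ReceiptJson
{
    // Web defaults: camelCase property names and case-insensitive matching.
    private static readonly JsonSerializerOptions Options =
        new(JsonSerializerDefaults.Web);

    // Returns false when the reply is not valid JSON for the record.
    public static bool TryParse(string json, out ReceiptData? receipt)
    {
        try
        {
            receipt = JsonSerializer.Deserialize<ReceiptData>(json, Options);
            return receipt is not null;
        }
        catch (JsonException)
        {
            receipt = null;
            return false;
        }
    }
}
```

A reply that sets `"total": null` ends up as a `null` `Total` on the record rather than a zero, which is the behavior the nullable fields were chosen for.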

### Expose It as a Minimal API Endpoint

Now wrap it in an endpoint that accepts a file upload. Keep the model call thin and push validation to the edges.

```csharp
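app.MapPost("/receipts/analyze",
    async (IFormFile file, IChatClient chat, CancellationToken ct) =>
{
    if (file.Length == 0 || file.Length > 5_000_000)
        return Results.BadRequest("Image must be between 1 byte and 5 MB.");

    // ContentType is client-supplied, so treat this as a first filter only.
    if (file.ContentType is not ("image/png" or "image/jpeg"))
        return Results.BadRequest("Only PNG and JPEG images are accepted.");

    using var ms = new MemoryStream();
    await file.CopyToAsync(ms, ct);

    // BuildReceiptPrompt wraps the system and user messages from the previous section.
    var messages = BuildReceiptPrompt(ms.ToArray(), file.ContentType);
    var response = await chat.GetResponseAsync<ReceiptData>(
        messages, cancellationToken: ct);

    return Results.Ok(response.Result);
})
// Form-bound upload endpoint. Disabling antiforgery suits token-authenticated
// API clients; keep it enabled for cookie-authenticated browser forms.
.DisableAntiforgery();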
```

Notice what is deliberately small: the endpoint validates size and type, calls the model, and returns the result. Everything hard - retries, downscaling, model tiering, telemetry - belongs in the pipeline around this, not inside the handler.
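
As one example of what "the pipeline around this" can look like, the sketch below wraps any async operation in a per-attempt timeout and a bounded retry with exponential backoff. `Resilience.RetryAsync` is a hypothetical helper written for this article, and it retries on every exception for brevity; a production pipeline would more likely retry only transient failures, or use an established resilience library such as Polly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical pipeline helper: runs an operation with a per-attempt timeout
// and a bounded number of retries with exponential backoff.
public static class Resilience
{
    public static async Task<T> RetryAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts,
        TimeSpan perAttemptTimeout,
        TimeSpan baseDelay,
        CancellationToken ct = default)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            // Link the caller's token with a timeout that covers this attempt only.
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
            cts.CancelAfter(perAttemptTimeout);
            try
            {
                return await operation(cts.Token);
            }
            catch (Exception) when (attempt < maxAttempts && !ct.IsCancellationRequested)
            {
                // Covers thrown errors and per-attempt timeouts. The final attempt
                // and caller-initiated cancellation are not caught here.
                await Task.Delay(baseDelay * Math.Pow(2, attempt - 1), ct);
            }
        }
    }
}
```

In the handler above, the model call could be passed in as the operation, for example `token => chat.GetResponseAsync<ReceiptData>(messages, cancellationToken: token)`, with the request's `CancellationToken` supplied as `ct`.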

## Treat Model Output as Untrusted Input

Here is the trade-off that bit us the first time: a strongly typed return value feels safe, and it is not. The model can hallucinate a `Total` on a blurry image, misread a date, or invent a line item that was never there. Worse, an image can carry text designed to hijack your instruction - the multimodal equivalent of a classic injection, where the "prompt" arrives painted into the picture.

So the typed record is where validation begins, not where it ends. Run the same domain rules you would run on any client-supplied payload: is the total within a sane range, is the date not in the future, do the line items sum to something close to the total? If the model returns malformed JSON under load, you want a deliberate fallback rather than a 500 - I cover that failure mode in detail in [Model Returns Invalid JSON in .NET: Structured Output Fixes](https://codingdroplets.com/structured-output-invalid-json-dotnet-fixes). And because the image itself is attacker-controllable, the guardrails in [Preventing Prompt Injection in ASP.NET Core AI APIs](https://codingdroplets.com/preventing-prompt-injection-aspnet-core-ai-apis) apply just as much to a picture as to a text field.
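
The domain rules described above might look something like the following. This is a minimal sketch, not a complete rule set: `ReceiptValidator`, its `Validate` method, and the `MaxPlausibleTotal` limit are hypothetical and should be replaced with whatever your domain actually requires.

```csharp
using System;
using System.Collections.Generic;

// Same record as defined earlier, repeated so the sketch compiles on its own.
public record ReceiptData(
    string Vendor,
    decimal? Total,
    DateOnly? PurchaseDate,
    string[] LineItems);

// Hypothetical validator: treats the model's output like any client payload.
public static class ReceiptValidator
{
    // Assumed business limit; pick one that fits your domain.
    private const decimal MaxPlausibleTotal = 10_000m;

    // Returns a list of problems; an empty list means the receipt passed.
    public static IReadOnlyList<string> Validate(ReceiptData receipt, DateOnly today)
    {
        var problems = new List<string>();

        if (string.IsNullOrWhiteSpace(receipt.Vendor))
            problems.Add("Vendor is missing.");

        if (receipt.Total is null)
            problems.Add("Total could not be read.");
        else if (receipt.Total <= 0 || receipt.Total > MaxPlausibleTotal)
            problems.Add("Total is outside the plausible range.");

        if (receipt.PurchaseDate is { } date && date > today)
            problems.Add("Purchase date is in the future.");

        if (receipt.LineItems is null || receipt.LineItems.Length == 0)
            problems.Add("No line items were extracted.");

        return problems;
    }
}
```

Passing `today` in as a parameter keeps the check deterministic in tests. The "line items sum close to the total" rule is not shown because `LineItems` is a plain `string[]` here; it would need line items that carry an amount.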

## Cost and Performance Trade-offs

Vision endpoints have a cost profile that surprises teams used to text-only chat. A few things I have measured hold up as rules of thumb:

*   **Downscale before you send.** Resizing a photo to roughly 1000-1500px on the long edge before the call routinely cuts image-token cost by more than half with no measurable accuracy loss for document extraction.
    
*   **Tier your models.** Not every image needs your best model. A cheap model can handle clean, high-contrast receipts; reserve the expensive vision model for the images the cheap one flags as low-confidence.
    
*   **Set aggressive timeouts and a fallback.** Vision calls are slower than text calls. A user waiting on an upload will abandon a spinner faster than they abandon a chat, so cap latency and degrade gracefully.
    
*   **Cache when the same image repeats.** In workflows where the same document is submitted more than once (resubmissions, retries), a content-hash cache avoids paying twice for an identical extraction.
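
The content-hash cache from the last point can be sketched in a few lines. This is an in-memory illustration only: `ExtractionCache<T>` is a hypothetical class, it has no expiry or size limit, and a real deployment would more likely sit on `IMemoryCache` or a distributed cache.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Hypothetical in-memory cache keyed by the SHA-256 hash of the image bytes.
public sealed class ExtractionCache<T>
{
    private readonly ConcurrentDictionary<string, Lazy<Task<T>>> _entries = new();

    // Identical bytes always produce the same 64-character hex key.
    public static string KeyFor(ReadOnlySpan<byte> imageBytes) =>
        Convert.ToHexString(SHA256.HashData(imageBytes));

    public async Task<T> GetOrAddAsync(byte[] imageBytes, Func<byte[], Task<T>> extract)
    {
        string key = KeyFor(imageBytes);

        // Lazy ensures concurrent requests for the same image share one extraction.
        var entry = _entries.GetOrAdd(
            key, _ => new Lazy<Task<T>>(() => extract(imageBytes)));
        try
        {
            return await entry.Value;
        }
        catch
        {
            // Do not keep failed extractions around; let the next request retry.
            _entries.TryRemove(new KeyValuePair<string, Lazy<Task<T>>>(key, entry));
            throw;
        }
    }
}
```

The `extract` delegate is where the model call would go, so a resubmitted image returns the stored result without a second billed request.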
    

## What to Do Next

You now have the shape of a multimodal endpoint: accept an upload, send `TextContent` plus `DataContent` through `IChatClient`, and deserialize the reply into a typed record you validate like any other untrusted input. From here the natural next steps are streaming partial results back to the client for long extractions - the approach in [How to Stream LLM Responses in ASP.NET Core with IChatClient](https://codingdroplets.com/stream-llm-responses-aspnet-core-ichatclient) carries over directly - and wiring the same vision call into a background job when you are processing documents in bulk rather than interactively.

The core lesson from shipping this in production: the model call is the easy 10 percent. The durable 90 percent is the validation, sizing, cost control, and failure handling that turns a clever demo into an endpoint you can trust with real user uploads.

## Frequently Asked Questions

### What model do I need for multimodal AI in ASP.NET Core?

You need a vision-capable model - one that accepts image input, not just text. GPT-4o is the common cloud choice; for a local, data-stays-on-your-hardware option you can pull a vision model such as llava or gemma3 through Ollama. Because everything goes through `IChatClient`, switching between them is a one-line registration change, and the rest of your endpoint code stays identical.

### How do I send an image to an LLM with Microsoft.Extensions.AI?

Create a `ChatMessage` whose `Contents` collection holds both a `TextContent` (your instruction) and a `DataContent` built from the image bytes and its media type, for example `new DataContent(bytes, "image/png")`. Pass that message to `GetResponseAsync`. The provider adapter handles the base64 encoding and formatting for the specific model, so you never touch the wire format yourself.

### How do I get structured JSON back from an image instead of plain text?

Call the generic `GetResponseAsync<T>()` overload with your target record type. The library generates a JSON schema from your C# type, instructs the model to conform to it, and deserializes the response into an instance of that type. Combined with a `DataContent` image part, this gives you the full "image in, typed object out" flow in a single round trip.

### How large can the uploaded image be, and should I resize it?

Always cap it - both the HTTP request body and the image dimensions. Beyond a certain resolution, larger images cost more tokens without improving extraction accuracy, so downscaling to roughly 1000-1500px on the long edge before the call is usually the right default. Reject empty and oversized uploads at the endpoint before the model is ever called.

### Is it safe to trust the data the vision model returns?

No - treat it as untrusted input, exactly like a request body from any client. The model can misread values on a poor image, and the image can even contain text crafted to manipulate your instruction. Validate every field against your domain rules, model uncertain fields as nullable, and apply the same prompt-injection guardrails you would use for text-based AI endpoints.

* * *

## About the Author

I'm Celin Daniel, Co-founder of [Coding Droplets](https://codingdroplets.com/). I've been building .NET and ASP.NET Core systems in production for 13+ years - APIs, distributed backends, enterprise platforms. Everything I write here comes from real shipping experience: patterns that held up, trade-offs that bit us, and lessons learned the hard way.

*   GitHub: [codingdroplets](http://github.com/codingdroplets/)
    
*   YouTube: [Coding Droplets](https://www.youtube.com/@CodingDroplets)
    
*   Website: [codingdroplets.com](https://codingdroplets.com/)
