Deploying, evaluating, and calling models in Microsoft Foundry: a production guide for architects

I have spent the last few programmes wiring Microsoft Foundry models into real workloads, and the same confusions keep surfacing in design reviews. People conflate the model with the deployment, pick a deployment type by habit, and discover the cost implications only when the first invoice lands. This post is the guide I wish my teams had read first.

I have deliberately left out resource and project creation, RBAC setup, and quota-increase mechanics, because those sit in the getting-started post. Here I focus on the four decisions that actually move cost and behaviour: what a deployment is, which deployment type to choose, how to evaluate before you commit, and how to call the model cleanly from the SDK.

One caveat up front. Foundry naming and feature status move quickly, and pricing on Azure renders dynamically. Treat every number below as something to verify on the Azure pricing calculator before you budget, and check the (preview) tag on the specific Learn page before you depend on a feature.

Resource, project, model, deployment. The deployment name, not the model name, is the addressable unit for inference.

What “deploying a model” actually means

The mental model that saved my teams the most time is to separate four things. The resource is the Microsoft.CognitiveServices account of kind AIServices: it is the governance, quota, and networking boundary. The project is the RBAC and isolation boundary inside it, and it owns agents, connections, evaluations, and threads.

The model is an item in the catalogue, whether an Azure OpenAI model, a Microsoft model, or a partner model from Anthropic, Meta, Mistral, Cohere, DeepSeek, xAI, and others. The deployment is what you get when you deploy a model into the resource with a chosen deployment type (the SKU) and quota. It has a deployment name that you choose, and that name, not the underlying model name, is the addressable unit for inference.

This matters in code. Microsoft’s quickstart is explicit: the model parameter requires the model deployment name, and if your deployment name differs from the underlying model name you adjust your code accordingly. An agent definition references the same deployment name through its model field.

There is a useful exception since early 2026. Instant models (preview) let you call a supported model by name with no deployment at all, drawing on a separate global quota pool. They route to the latest evergreen version by default (pin a version by appending a date suffix), and during preview they are available only in West US 3 projects. Microsoft frames deployments as something you level up to, not a gate you must pass first. I reach for a deployment when I need reserved throughput, custom content filters, data residency, or enterprise configuration.

Deployment types: the decision that fixes everything else

The deployment type is the single biggest choice. It fixes data residency, latency variance, the quota model, and how you pay. Microsoft groups the types into standard (pay-per-token), provisioned (reserved PTU), and batch (async, 50% off), each available at global, data-zone, or regional scope. Data stored at rest always remains in the designated Azure geography: the differences below are about where inference is processed and how throughput is guaranteed.

Deployment type	SKU code	Data processing scope	Billing model	SLA / latency	Best for
Instant (preview)	N/A (no deployment)	Any Azure region	Pay-per-token (global quota pool)	Best-effort, no SLA	Getting started, prototyping
Global Standard	GlobalStandard	Any Azure region	Pay-per-token	Best-effort, highest default quota	General workloads, highest quota
Data Zone Standard	DataZoneStandard	Within US or EU data zone	Pay-per-token	Best-effort, higher quota than regional	EU/US data-zone compliance
Standard (regional)	Standard	Single deployment region	Pay-per-token	Best-effort, limited regional capacity	Regional compliance, low to medium volume
Global Provisioned	GlobalProvisionedManaged	Any Azure region	Reserved PTU (hourly or reservation)	Guaranteed throughput, low latency variance	Predictable high throughput
Data Zone Provisioned	DataZoneProvisionedManaged	Within US or EU data zone	Reserved PTU	Guaranteed throughput plus data-zone	Data-zone plus predictable throughput
Regional Provisioned	ProvisionedManaged	Single deployment region	Reserved PTU	Guaranteed throughput, strict residency	Regional compliance plus throughput
Global Batch	GlobalBatch	Any Azure region	50% off Global Standard	No real-time SLA, 24-hour target	Large async jobs
Data Zone Batch	DataZoneBatch	Within US or EU data zone	50% off	No real-time SLA, 24-hour target	Large async jobs with data-zone
Developer	DeveloperTier	Any Azure region	Pay-per-token	No SLA, no residency guarantee, 24-hour lifetime then auto-deleted	Fine-tuned model evaluation only

A few notes from Learn that bite teams in production. Not all models support all types, so check “Foundry Models sold by Azure” for availability. With Global Standard and Data Zone Standard, a primary-region interruption affects all traffic initially routed there. Developer deployments self-delete after 24 hours, so they are for evaluating fine-tuned models, not for anything that needs to persist.

Choosing a deployment type: residency requirement, then traffic pattern, then volume.

Choosing by requirement is usually faster than reading the full matrix.

If you need	Use
No residency restriction	Global Standard or Global Provisioned
EU or US data-zone compliance	Data Zone Standard / Data Zone Provisioned
Single-region residency	Standard or Regional Provisioned
Quick start or prototype	Instant models (preview)
Variable, bursty traffic	Standard or Global Standard (pay-per-token)
Consistent high volume	Provisioned types
Large, non-time-sensitive jobs	Global Batch or Data Zone Batch
Low latency variance	Provisioned types
Fine-tuned model evaluation	Developer

Three platform features are worth knowing before you commit. Spillover (GA, 2026) routes overflow from a provisioned deployment (a 429 when PTUs are exhausted, for example) to a matching Standard deployment in the same resource, billed at the standard per-token rate. The data-processing level must match (global provisioned to global standard), and it works with the Foundry Agent Service but not the Responses API, so plan gateway-level fallback there. Priority processing (GA, 2026) is a pay-per-call fast lane for latency-sensitive Standard workloads at a premium over Standard. Model router is a deployable chat model that picks an underlying model per prompt, and it now supports the GPT-5 series.

Spillover: a provisioned deployment hits a 429, overflow routes to a matching Standard deployment in the same resource, billed per token. Does not cover the Responses API.

Cost: the three levers, and the free money teams forget

Per-token rates on Azure match OpenAI’s direct API. The premium you pay buys compliance, private networking, Entra authentication, support, and a single invoice. The deployment type changes how you pay, not the underlying token rate for a given scope.

Indicative pay-as-you-go rates follow, for Global Standard, per 1M tokens. Verify these on the Azure pricing calculator: they are correct as of June 2026 per third-party aggregators (PricePerToken.com, last updated 14 June 2026) consistent with OpenAI list prices, not guaranteed Microsoft figures.

Model	Input / 1M	Output / 1M	Notes
GPT-4.1	\$2.00	\$8.00	1M-token context
GPT-4.1-mini	\$0.40	\$1.60	strong cost/quality for routers
GPT-5	\$1.25	\$10.00	flagship reasoning, ~272,000-token context
GPT-5-mini	\$0.25	\$2.00
GPT-5-nano	\$0.05	\$0.40	cheapest

Pay-as-you-go vs PTU vs Batch across volume, with the 150 to 200M tokens/month break-even band marked. Verify on the Azure pricing calculator.

The three cost levers are pay-as-you-go, Provisioned Throughput Units (PTU), and Batch. Pay-as-you-go wins for variable traffic. Batch runs at 50% of Global Standard with a 24-hour target turnaround and a separate enqueued-token quota, so async jobs do not disrupt online traffic. Input is JSONL, one request per line with a unique custom_id, and you pay only for completed work.

PTU is reserved capacity, billed hourly per deployed unit regardless of tokens consumed. The GPT-4o-class Global provisioned rate is roughly \$1/hour per PTU. Reservations give large term discounts: per Microsoft Learn’s onboarding page, a 1-month reservation is around 64% off and a 1-year around 70% off for GPT-4o-class, with example rates stamped “Azure pricing as of January 1, 2025”. Minimums matter too: Global and Data Zone Provisioned require 15 PTU, Regional Provisioned requires 25 PTU for mini/nano-class and 50 PTU for larger models.

Two scope and discount rules round this out. For a given model, Data Zone is roughly +10% over Global and Regional is roughly +10% to +25%. Cached input tokens receive an automatic discount (roughly 50% to 90% off the input rate on repeated prefixes), so keep system prompts byte-identical across requests to trigger it.

On the PTU break-even, do not commit on a calculator estimate. Third-party analysis (AZ365.ai, 2026) puts break-even at roughly 150 to 200 million tokens per month for GPT-5, but that assumes 100% sustained utilisation. I run pay-as-you-go for 30 to 60 days, measure P95 hourly throughput, then size PTU against real telemetry. One more trap: rate limiting estimates max processed tokens at request time including max_tokens, so an over-large max_tokens can self-throttle you.

Evaluating a model before you commit

Before deployment I shortlist with the Foundry model leaderboard (preview), which ranks catalogue models on quality, safety, cost, and throughput with trade-off charts and side-by-side comparison of up to three models. Cost benchmarks assume a 3:1 input-to-output ratio. This narrows the field cheaply before I spend on my own evaluation.

Then I evaluate on my own data. Model and dataset evaluation is GA (agent evaluation remains preview), runnable from the portal or via the azure-ai-evaluation SDK. AI-assisted evaluators need an Azure OpenAI deployment as the judge, and Microsoft recommends gpt-5-mini for a good cost/quality balance.

For a RAG system, GroundednessEvaluator is the first one I set up, because it is the leading indicator of hallucination risk. I pair it with RelevanceEvaluator and the safety evaluators (ViolenceEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, and similar). Quality evaluators such as CoherenceEvaluator and FluencyEvaluator use a 1 to 5 Likert scale with a default pass threshold of 3. Different evaluators have different data needs: groundedness needs the source context, ROUGE-style evaluators need ground-truth references, and tool-call accuracy needs the full agent message trace. Results publish to Azure Monitor and Application Insights, so I alert on groundedness regressions. The decision rule is simple: shortlist with the leaderboard, evaluate the top two or three on your own data, and pick the cheapest model that clears your thresholds.

Using a deployed model in an agent

An agent references the deployment by name through its model field. In Azure AI Projects 2.x:

			
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import PromptAgentDefinition
project = AIProjectClient(
    endpoint="https://<resource>.services.ai.azure.com/api/projects/<project>",
    credential=DefaultAzureCredential(),
)
agent = project.agents.create_version(
    agent_name="my-agent",
    definition=PromptAgentDefinition(
        model="gpt-5-mini",   # the DEPLOYMENT name (or an instant-model name)
        instructions="You are a helpful assistant that answers general questions",
    ),
)

		

To converse, create an OpenAI-compatible client and use the Responses API with an agent reference:

			
openai = project.get_openai_client()
conversation = openai.conversations.create()
response = openai.responses.create(
    conversation=conversation.id,
    extra_body={"agent_reference": {"name": "my-agent"}},
    input="...",
)

		

The Foundry Agent Service went GA in March 2026. The teaching point worth repeating in reviews: the agent’s model is the deployment name, which you find under Models + Endpoints in the portal.

Calling the model via the Foundry SDK

The SDK consolidated. azure-ai-projects 2.x is now the single Foundry SDK, covering agents, inference, evaluations, and memory, with the standalone azure-ai-agents dependency folded in. The 2.0.0 stable release shipped on 6 March 2026 and current PyPI is 2.2.0. Code written for 2.x is incompatible with 1.x: the old from_connection_string and .inference.get_chat_completions_client() patterns were removed, so budget a refactor sprint if you built against the beta.

Authenticate with DefaultAzureCredential from azure-identity. The recommended current pattern is to get an OpenAI-compatible client from the project:

			
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
project = AIProjectClient(
    endpoint="https://<resource>.services.ai.azure.com/api/projects/<project>",
    credential=DefaultAzureCredential(),
)
with project.get_openai_client() as client:
    response = client.responses.create(
        model="gpt-5-mini",            # deployment name
        input="What is the size of France in square miles?",
    )
    print(response.output_text)

		

For OpenAI-style chat completions specifically:

			
with project.get_openai_client() as client:
    resp = client.chat.completions.create(
        model="gpt-4.1",               # deployment name
        messages=[{"role": "user", "content": "How many feet are in a mile?"}],
        temperature=0.7,
        max_tokens=500,
    )
    print(resp.choices[0].message.content)

		

Use the project endpoint and Responses API for Foundry features (agents, evaluations, tracing, content filters). Use the direct /openai/v1 endpoint for maximum OpenAI compatibility, lowest latency, or embeddings, which the project endpoint does not currently route. On .NET the equivalents are Azure.AI.Projects, Azure.AI.Extensions.OpenAI, and Azure.Identity, with the same pattern: construct AIProjectClient, get a chat or responses client, and pass the deployment name. One gotcha: do not install the preview Azure.AI.Projects.OpenAI alongside the GA Azure.AI.Extensions.OpenAI, because duplicate types cause ambiguous references.

Inference parameters, and why reasoning models break the rules

These are standard OpenAI parameters on the chat-completions and responses calls.

Parameter	What it does	Range / default	Notes
`temperature`	Scales the whole distribution	0 to 2, default 1.0	Tune this or `top_p`, not both
`top_p`	Nucleus sampling	0 to 1, default 1.0	0.9 a common safety net; no `top_k` exposed
`max_tokens` / `max_completion_tokens`	Caps output tokens	set conservatively	Reasoning models require `max_completion_tokens`
`frequency_penalty`	Penalises repeated tokens	-2.0 to 2.0, default 0	Leave at 0 for code/JSON
`presence_penalty`	Encourages new topics	-2.0 to 2.0, default 0	Harmful for structured output
`stop`	Stop sequences	list of strings
`seed`	Best-effort reproducibility	integer	Not guaranteed, pin model version too
`response_format`	text, json_object, or JSON schema
`reasoning_effort`	Reasoning models only	low / medium / high	Higher means more tokens, latency, cost

The guidance on temperature versus top_p is consistent across Microsoft and OpenAI: alter one or the other, not both. My default is to leave top_p at its default and tune temperature, reaching for top_p only when I want to keep temperature fixed for style but trim the occasional weird token. A common enterprise default for GPT-style chat is temperature 0.2 to 0.3 with top_p 0.8 to 0.95.

Reasoning models behave differently and this catches teams out. The GPT-5 series and o-series (o1, o3, o3-mini, o4-mini) reject temperature, top_p, presence_penalty, frequency_penalty, logprobs, logit_bias, and max_tokens. Sending temperature typically returns a 400 “Unsupported parameter”. Instead you use reasoning_effort (low/medium/high, with newer models adding none/minimal/xhigh) and max_completion_tokens on chat completions or max_output_tokens on the Responses API. A wrinkle to watch: gpt-5.1 defaults reasoning_effort to none, so migrating from an earlier reasoning model may require you to pass an effort level explicitly to get any reasoning at all. System messages are treated as developer messages on the o-series.

Reasoning models reject temperature, top_p, and the penalties, and substitute reasoning_effort and max_completion_tokens.

The practical fix is a shared wrapper that branches on model family and strips unsupported parameters before the call. That one piece of plumbing prevents most of the 400 errors that bite teams moving to GPT-5.

What I would actually do

Default to Global Standard, then narrow only for a reason: Data Zone when EU or US residency is required (accept roughly +10%), Regional only for strict single-region residency (accept +10% to +25% and smaller capacity). Do not buy PTU on a guess: run pay-as-you-go for 30 to 60 days, measure P95 hourly throughput, and commit to a 1-year reservation only once sustained volume sits in the 150 to 200M tokens/month range for GPT-5-class.

Turn on the free discounts. Route anything async to Batch, make system prompts byte-identical for cached-input savings, and send routing, extraction, and classification to a mini or nano model while reserving flagships for the hard cases. Add spillover to any provisioned customer-facing deployment, but remember it does not cover the Responses API.

Gate model choice on evaluation, not vibes, and make parameter handling model-aware. Those two habits, plus standardising on azure-ai-projects 2.x with DefaultAzureCredential, have removed most of the surprises my teams used to hit in production.

References

Image credits

All diagrams in this post are my own. They illustrate concepts documented on Microsoft Learn (linked in the References above); pricing figures shown are indicative and should be verified on the Azure pricing calculator.

Deploying, evaluating, and calling models in Microsoft Foundry: a production guide for architects

What “deploying a model” actually means

Deployment types: the decision that fixes everything else

Cost: the three levers, and the free money teams forget

Evaluating a model before you commit

Using a deployed model in an agent

Calling the model via the Foundry SDK

Inference parameters, and why reasoning models break the rules

What I would actually do

References

Image credits

Share this:

Comments

Leave a comment Cancel reply

More posts

Deploying, evaluating, and calling models in Microsoft Foundry: a production guide for architects

The case of the 6,000 orphaned contacts: debugging GAB dual-write in Dynamics 365

Copilot Cowork: the agent that does the work — and the extensibility model architects should actually study

Microsoft IQ: the intelligence layer your agents inherit — and what it actually changes for enterprise AI builders