This article provides a quick reference and detailed description of the quotas and limits for Foundry Models sold by Azure. For quotas and limits specific to Azure OpenAI in Foundry Models, see Quotas and limits in Azure OpenAI.
Microsoft Foundry is introducing an update to quota management to bring consistency and predictability to how quota is managed across deployments. Starting with Realtime Translate and Realtime Whisper, quota for deployments is tracked at the subscription level—shared across all resources and regions—rather than being allocated separately per resource or per region.
This change consolidates quota into shared pools:
For models that are onboarded to the new quota management system:
This consolidation allows Microsoft Foundry to offer supported models consistently across all Foundry regions, regardless of how quota is distributed across resources or regions.
Important
The updated quota management currently applies only to Realtime Translate and Realtime Whisper. For all other Foundry Models covered in this article, quotas and limits are managed per region, per subscription, and per model or deployment type. In the future, these quota guidelines will also apply to some existing models and to new Foundry Model launches.
The following sections provide a quick guide to the default quotas and limits that apply to Foundry Models. Quotas and limits aren't enforced at the tenant level. Instead, the highest level of quota restrictions is scoped at the Azure subscription level. Tokens per minute (TPM) and requests per minute (RPM) limits are defined per region, per subscription, and per model or deployment type.
| Limit name | Limit value |
| --- | --- |
| Foundry resources per region per Azure subscription | 100 |
| Max projects per resource | 250 |
| Max deployments per resource (model deployments within a Foundry resource) | 32 |
The following table lists the tokens per minute, requests per minute, and concurrent request limits for Foundry Models:

| Models | Tokens per minute | Requests per minute | Concurrent requests |
| --- | --- | --- | --- |
| Azure OpenAI models | Varies per model and SKU. See limits for Azure OpenAI. | Varies per model and SKU. See limits for Azure OpenAI. | Varies. See Azure OpenAI limits. |
| DeepSeek-R1, DeepSeek-V3-0324 | 5,000,000 | 5,000 | 300 |
| Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct-FP8, Grok 3, Grok 3 mini | 400,000 | 1,000 | 300 |
| Flux.2-Pro | not applicable | Low (Default): 15; Medium: 30; High (Enterprise): 100 | not applicable |
| Flux-Pro 1.1, Flux.1-Kontext Pro | not applicable | 2 capacity units (6 requests per minute) | not applicable |
| Rest of models | 400,000 | 1,000 | 300 |
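
To stay under the per-minute limits from the client side, one option is to track recent requests in a sliding 60-second window. The following is a minimal sketch, not a description of how the service itself meters usage: the `MinuteBudget` class and its method names are hypothetical, and it assumes you can estimate the token count of each request before sending it.

```python
import time
from collections import deque


class MinuteBudget:
    """Client-side sliding 60-second window over requests and tokens.

    Hypothetical helper: the service's own accounting may differ.
    """

    def __init__(self, tpm_limit, rpm_limit, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.rpm_limit = rpm_limit
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for each admitted request
        self.tokens_in_window = 0

    def _evict(self, now):
        # Forget requests that were admitted 60 or more seconds ago.
        while self.events and now - self.events[0][0] >= 60.0:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both budgets."""
        now = self.clock()
        self._evict(now)
        if len(self.events) + 1 > self.rpm_limit:
            return False
        if self.tokens_in_window + tokens > self.tpm_limit:
            return False
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return True
```

For a model in the "Rest of models" row, you might create `MinuteBudget(tpm_limit=400_000, rpm_limit=1_000)` and call `try_acquire(estimated_tokens)` before each request, delaying the request when it returns `False`.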
To increase your quota:
Due to high demand, limit increase requests are evaluated individually.
| Limit name | Limit value |
| --- | --- |
| Max number of custom headers in API requests<sup>1</sup> | 10 |

<sup>1</sup> Current APIs allow up to 10 custom headers, which the pipeline passes through and returns. If you exceed this header count, your request results in an HTTP 431 error. To resolve this error, reduce the header volume. Future API versions won't pass through custom headers. Don't depend on custom headers in future system architectures.
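
A client can catch this condition before a request leaves the process. The sketch below is illustrative only: `build_headers` is a hypothetical helper, and it assumes the caller already keeps its custom headers separate from the headers the API requires.

```python
MAX_CUSTOM_HEADERS = 10  # limit stated in the table above


def build_headers(required, custom, max_custom=MAX_CUSTOM_HEADERS):
    """Merge required and custom headers, rejecting too many custom ones."""
    if len(custom) > max_custom:
        raise ValueError(
            f"{len(custom)} custom headers exceeds the limit of {max_custom}"
        )
    merged = dict(required)
    merged.update(custom)
    return merged
```

Failing locally with a clear message is usually easier to diagnose than an HTTP 431 response from the service.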
Global Standard deployments use Azure's global infrastructure to dynamically route customer traffic to the data center with the best availability for the customer's inference requests. This infrastructure enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage might see more variability in response latency.
The Usage Limit determines the level of usage above which customers might see larger variability in response latency. A customer's usage is defined per model and is the total tokens consumed across all deployments in all subscriptions in all regions for a given tenant.
Submit the quota increase request form to request quota increases for Foundry Models sold by Azure, Azure OpenAI models, and Anthropic models. Other models from partners and community don't support quota increases.
Quota increase requests are processed in the order they're received, and priority goes to customers who actively use their existing quota allocation. Requests that don't meet this condition might be denied.
To minimize issues related to rate limits, use the following techniques:
Set the client-side timeout explicitly based on the following guidance.
Note
If you don't set it explicitly, the client-side timeout defaults to whatever the client library uses, which might differ from the limits above.
A 29-minute limit doesn't mean that every request takes 29 minutes. Depending on context tokens, generated tokens, and cache hit rates, a request can take up to 29 minutes.
Set a timeout that's less than these values, tuned to your traffic patterns.
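
One way to tune the timeout to your traffic is to derive it from observed request latencies and cap it below the service-side limit. The following sketch is a heuristic under stated assumptions, not official guidance: `choose_timeout`, the 99th percentile, and the 1.5x headroom factor are all hypothetical choices, and the 29-minute figure is used only as the example upper bound.

```python
import math

SERVICE_LIMIT_SECONDS = 29 * 60  # example upper bound: 29 minutes


def choose_timeout(latencies_s, service_limit_s=SERVICE_LIMIT_SECONDS,
                   percentile=0.99, headroom=1.5):
    """Pick a client timeout from observed latencies, capped below the limit."""
    if not latencies_s:
        raise ValueError("need at least one observed latency")
    ordered = sorted(latencies_s)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered)))
    candidate = ordered[rank - 1] * headroom
    # Stay strictly below the service-side limit.
    return min(candidate, service_limit_s - 1)
```

Pass the result to whatever timeout setting your HTTP client exposes, for example the `timeout` argument of `urllib.request.urlopen` in the Python standard library.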
For reasoning models, including on streaming requests, all reasoning tokens are generated and then summarized before the first response token is sent back to the user.
You can modify the reasoning effort parameter to control the number of reasoning tokens generated in the process.
| Issue | Cause | Resolution |
| --- | --- | --- |
| HTTP 429 Too Many Requests | Token-per-minute or request-per-minute limit exceeded | Implement retry logic with exponential backoff. Use the Retry-After header value. |
| HTTP 431 Request Header Fields Too Large | More than 10 custom headers sent | Reduce custom headers to 10 or fewer. |
| Quota page shows 0 available | Subscription or regional quota fully allocated | Move unused quota from another deployment. To increase your limit, request a quota increase. |
| Model not available in region | Model isn't deployed or supported in the selected region | Check model availability and choose an available region. |
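
The HTTP 429 resolution above can be sketched as a small retry loop. This is a minimal illustration under assumptions, not a definitive implementation: `call_with_retry` and the `send` callable (which returns a `(status, headers, body)` tuple) are hypothetical, the default delays are arbitrary, and the code treats `Retry-After` as a number of seconds, falling back to exponential backoff with jitter when the header is missing or in another format.

```python
import random
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                    sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # out of attempts: return the last 429 to the caller
        # Header names are case-insensitive, so normalize before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        try:
            # Prefer the server's hint when it's a number of seconds.
            delay = float(lowered.get("retry-after"))
        except (TypeError, ValueError):
            # No usable hint: exponential backoff with full jitter.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            delay = random.uniform(0.0, ceiling)
        sleep(max(0.0, delay))
    return status, headers, body
```

The `sleep` parameter exists so that tests can inject a recorder instead of actually waiting. In production code, you would typically leave it at its default.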