What is rate limiting?
Rate limiting is the practice of restricting how many requests or how much data a client can send to an API within a given time period. Every major AI API provider implements rate limits to protect their infrastructure, ensure fair access across users, and manage capacity. Understanding rate limits and how to work within them is essential for building reliable AI applications.
How Rate Limiting Works
When you make requests to an AI API, the provider tracks your usage across multiple dimensions. If you exceed a limit, the API returns an error (typically HTTP 429 "Too Many Requests") instead of processing your request. The error response usually includes headers telling you when you can retry.
Rate limits are applied per account or per API key, meaning all requests from your application count toward the same limits.
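As a sketch, a 429 handler might first read these response headers before deciding how long to wait. The `x-ratelimit-remaining-*` names below follow OpenAI's documented convention; other providers use different header names, and the helper function itself is illustrative:

```python
def parse_rate_limit_headers(headers: dict) -> dict:
    """Pull retry guidance out of a 429 response's headers."""
    info = {}
    # Retry-After is standard HTTP; this assumes the delay-in-seconds form.
    if "retry-after" in headers:
        info["retry_after_s"] = float(headers["retry-after"])
    # These names follow OpenAI's convention; other providers differ.
    for key in ("x-ratelimit-remaining-requests", "x-ratelimit-remaining-tokens"):
        if key in headers:
            info[key] = int(headers[key])
    return info

# With a mocked 429 response:
print(parse_rate_limit_headers({"retry-after": "2", "x-ratelimit-remaining-requests": "0"}))
# prints {'retry_after_s': 2.0, 'x-ratelimit-remaining-requests': 0}
```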
Types of Rate Limits
AI providers typically enforce several types of rate limits simultaneously:
[Requests per minute (RPM)]: The maximum number of API calls you can make in a 60-second window. For example, a tier might allow 500 requests per minute. This prevents any single user from flooding the system with rapid-fire calls.
[Tokens per minute (TPM)]: The maximum number of tokens (input plus output) you can process per minute. A limit of 200,000 TPM means your combined input and output tokens across all requests in a minute cannot exceed that amount. This is important because a single request with a very long prompt consumes far more resources than a short one.
[Tokens per day (TPD)]: Some providers also impose daily token limits, providing an additional ceiling on total usage over a 24-hour period.
[Concurrent requests]: The maximum number of requests being processed at the same time. Even if you are under your RPM limit, the provider may reject requests when too many are in flight at once.
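To make the RPM/TPM distinction concrete, here is a rough sketch of tracking spend against a per-minute token budget. The roughly-4-characters-per-token heuristic and the `TokenBudget` class are illustrative assumptions, not any provider's actual accounting:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

class TokenBudget:
    """Tracks token spend against a TPM limit within the current minute."""
    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.used = 0  # reset this counter at each minute boundary

    def can_send(self, prompt: str, max_output_tokens: int) -> bool:
        # Budget both the input and the worst-case output of the request.
        needed = estimate_tokens(prompt) + max_output_tokens
        return self.used + needed <= self.tpm_limit

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.used += input_tokens + output_tokens
```

A single long prompt shows why TPM matters: one request with a 150,000-token input can exhaust most of a 200,000 TPM budget on its own, even though it is only one call against the RPM limit.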
[Images or other resource-specific limits]: For multimodal models or image generation APIs, there may be separate limits on images per minute or other resource types.
Why Rate Limits Exist
Rate limits serve several important purposes:
[Infrastructure protection]: AI inference is computationally expensive. Without limits, a single client with a bug or aggressive usage pattern could consume enough GPU resources to degrade service for everyone else.
[Fair access]: Rate limits ensure that all users get a reasonable share of available capacity. During periods of high demand, limits prevent a few heavy users from crowding out everyone else.
[Cost management]: For providers, rate limits help manage infrastructure costs and capacity planning. They also protect users from accidentally running up enormous bills due to buggy code.
[Service stability]: Controlled request flow is easier to manage and monitor. It allows providers to maintain consistent latency and availability.
Rate Limit Tiers
Most providers use a tiered system where your rate limits increase as you build a track record:
[OpenAI] uses tiers numbered 1 through 5, with limits increasing based on account age, payment history, and total spend. A new account might start with 500 RPM and 200,000 TPM, while a Tier 5 account could have 10,000 RPM and 10,000,000 TPM.
[Anthropic] similarly assigns rate limit tiers based on usage and spend. Higher tiers unlock significantly higher RPM and TPM limits.
[Google] manages quotas through the Google Cloud console, with default limits that can be increased through quota requests.
[Cohere], [Mistral], and other providers each have their own tier and limit structures, usually documented in their API reference.
To increase your limits, you typically need to: spend more on the platform (which automatically upgrades your tier), request a limit increase through the provider's dashboard, or contact sales for enterprise-level limits.
Handling Rate Limits in Code
Robust applications must handle rate limit errors gracefully. Here are the key patterns:
[Exponential backoff]: When you receive a 429 error, wait before retrying, and increase the wait time with each consecutive failure. A common pattern is to wait 1 second after the first failure, 2 seconds after the second, 4 seconds after the third, and so on, up to a maximum wait time.
[Retry-After headers]: Many providers include a Retry-After header in 429 responses that tells you exactly how many seconds to wait. Always check for and respect this header before falling back to exponential backoff.
[Client-side rate limiting]: Rather than hitting the limit and handling errors, proactively limit your request rate on the client side. Use a token bucket or leaky bucket algorithm to ensure you never exceed known limits.
[Request queuing]: Maintain a queue of pending requests and process them at a controlled rate. This smooths out bursts and keeps your request rate steady.
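A token bucket, mentioned above for client-side limiting, can be sketched in a few lines. This illustrative version allows short bursts up to `capacity` while enforcing a steady average rate; a caller that gets `False` back would queue or delay the request:

```python
import time

class TokenBucket:
    """Client-side limiter: `rate` requests/sec on average, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start with a full bucket
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```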
Most popular AI SDKs handle basic retry logic automatically. The OpenAI Python and Node libraries, for example, include built-in retry with exponential backoff for rate limit errors. Anthropic's SDK behaves similarly.
Rate Limits Across Providers
Understanding how limits compare across providers helps with architecture decisions:
Provider limits vary widely and change frequently, so always check current documentation. However, some general observations hold true:
- Higher-capability models (like GPT-4 or Claude Opus) typically have lower rate limits than smaller models (like GPT-4o-mini or Claude Haiku)
- Image generation endpoints usually have much lower RPM limits than text generation
- Embedding endpoints often have higher throughput limits than completion endpoints
- Enterprise and committed-spend plans generally offer significantly higher limits
Strategies for Working Within Limits
Beyond basic error handling, there are architectural strategies for managing rate limits effectively:
[Use appropriate model sizes]: If your task does not require the most capable model, use a smaller one. Smaller models typically have higher rate limits and lower costs.
[Batch where possible]: Use batch APIs for workloads that do not require real-time responses. Batch requests often have separate, more generous limits.
[Distribute across keys]: For large organizations, using multiple API keys (associated with different projects or teams) can provide separate rate limit pools. Some providers explicitly support this.
[Cache responses]: If you might send identical requests, cache the responses locally. This reduces API calls and avoids wasting rate limit budget on duplicate work.
[Implement request prioritization]: When operating near limits, prioritize user-facing requests over background processing. Queue lower-priority requests for quieter periods.
[Load balance across providers]: For applications that can use multiple AI providers, distributing requests across providers effectively multiplies your available capacity. If you hit OpenAI's limits, route to Anthropic or Google temporarily.
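The routing logic for such a fallback can be very small. In this sketch the `RateLimited` exception and provider callables are stand-ins for whatever your SDKs raise and expose:

```python
class RateLimited(Exception):
    """Stand-in for a provider SDK's rate limit error."""

def complete_with_fallback(prompt: str, providers: list) -> tuple:
    """Try each (name, call) pair in order until one is not rate limited."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited:
            continue  # this provider is saturated; try the next one
    raise RuntimeError("all providers are rate limited")

# Demo with fake providers: the first is saturated, the second answers.
def primary(prompt):
    raise RateLimited()
def secondary(prompt):
    return f"ok:{prompt}"

print(complete_with_fallback("hi", [("primary", primary), ("secondary", secondary)]))
# prints ('secondary', 'ok:hi')
```

Note that different providers return differently formatted responses and behave differently on the same prompt, so a real router also needs response normalization and per-provider prompt tuning.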
[Monitor and alert]: Track your rate limit utilization in real time. Set up alerts when you approach limits so you can take action before errors affect users.
Common Pitfalls
Watch out for these common issues:
[Ignoring token-based limits]: Developers often focus on RPM but forget about TPM. A single request with a massive prompt can consume a large portion of your token budget.
[Not handling retries correctly]: Retrying immediately without backoff makes rate limit problems worse, not better. Always add delay between retries.
[Burst traffic]: Applications that send many requests simultaneously (for example, when a batch job kicks off) can blow through rate limits instantly. Spread requests over time.
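A simple defense against bursts is to compute a pacing interval and sleep between requests instead of firing them all at once. The helper and its safety factor below are illustrative:

```python
def pacing_interval(rpm_limit: int, safety: float = 0.8) -> float:
    """Seconds to sleep between requests to stay safely under `rpm_limit`."""
    # Target only a fraction of the limit, leaving headroom for other traffic.
    return 60.0 / (rpm_limit * safety)

print(pacing_interval(500))  # prints 0.15 — pause 150 ms between requests
```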
Rate limiting is a fact of life when working with AI APIs. Building your application to handle limits gracefully from the start saves you from painful debugging and outages later.