LLM API Rate Limiting & Retry Strategies: Complete Guide (2026)
Master LLM API rate limiting, exponential backoff retry, concurrent request management for OpenAI, DeepSeek V4, Claude 4, Gemini. Code examples in Python, Node.js, and curl.
LLM API Rate Limiting & Retry Strategies: Complete Guide (2026)
Published: June 29, 2026 · 15 min read
Introduction
Every LLM API — from OpenAI's GPT-5 to DeepSeek V4, Claude 4, and Gemini 2.5 — enforces rate limits. Hit them, and you get 429 Too Many Requests or 503 Service Unavailable. In production, how you handle these errors determines whether your app feels robust or flaky.
In 2026, the landscape is more fragmented than ever. Different providers use different limit models (tokens per minute vs requests per minute vs concurrent connections vs cost-based tiers). A strategy that works for one might waste budget on another.
This guide covers:
- How each major provider enforces limits (OpenAI, DeepSeek, Anthropic, Google)
- Industry-standard retry patterns with exponential backoff
- Concurrent request management in Python and Node.js
- Cost-aware rate limiting for budget-sensitive workloads
Getting started with these APIs? See our LLM API Pricing Comparison 2026 for cost data, and the Best LLM APIs 2026 guide for model selection.
How Major Providers Enforce Rate Limits
OpenAI (GPT-5, GPT-4o)
OpenAI uses a tiered RPM/TPM (Requests Per Minute / Tokens Per Minute) model:
| Tier | RPM Limit | TPM Limit | Cost |
|---|---|---|---|
| Free | 3 | 40K | $0 |
| Tier 1 | 500 | 60K | $5 spent |
| Tier 2 | 5,000 | 300K | $50 spent |
| Tier 3 | 10,000 | 500K | $250 spent |
| Tier 5 | Custom | Custom | $1K+ spent |
OpenAI returns 429 with a Retry-After header. The error body includes "type": "rate_limit_error" and a "message" like "Rate limit exceeded for gpt-5".
DeepSeek V4
DeepSeek V4 uses a dual-limit system:
- RPM limit: 3,000 requests/min for standard tier (6,000 for priority)
- TPD limit: Tokens per day (soft cap, can be raised)
- Cache-hit streaming: Each cache-hit request counts as 0.5 toward RPM, effectively doubling throughput
DeepSeek returns 429 with a granular {"error": {"code": "rate_limit", "remaining": 0, "reset": 123456789}} body, which includes the exact Unix timestamp when the limit resets.
Anthropic (Claude 4)
Anthropic uses a request-based model with workspace-level limits:
- Default: 50 requests/min for API Keys, 1,000/min for Workspace
- Extended thinking: Each request consumes more capacity
- Response: Returns
529(not429!) with"error": {"type": "overloaded_error"}
Anthropic's use of 529 is unique — you must handle that status code specifically.
Google Gemini 2.5
Google uses per-model tiers with generations per minute (GPM):
- Free: 10 GPM, 1,500 generations/day
- Pay-as-you-go: 2,000 GPM (standard), 5,000 (high-throughput tier)
- Response: Returns
429with{"error": {"code": 429, "status": "RESOURCE_EXHAUSTED"}}
Universal Retry Pattern: Exponential Backoff with Jitter
The gold standard across all providers. Here's the pattern:
import time, random, requests
class LlmApiClient:
def __init__(self, base_url, api_key, max_retries=5):
self.base_url = base_url
self.headers = {"Authorization": f"Bearer {api_key}"}
self.max_retries = max_retries
def request(self, payload):
for attempt in range(self.max_retries):
resp = requests.post(
self.base_url,
headers=self.headers,
json=payload
)
if resp.status_code in (200, 201):
return resp.json()
if resp.status_code in (429, 503, 529):
wait = self._backoff(attempt, resp)
print(f"Rate limited (attempt {attempt+1}). Waiting {wait:.1f}s...")
time.sleep(wait)
continue
# Non-retryable error
resp.raise_for_status()
raise Exception(f"Max retries ({self.max_retries}) exceeded")
def _backoff(self, attempt, response):
# 1) Use Retry-After header if present
retry_after = response.headers.get("Retry-After")
if retry_after:
return float(retry_after)
# 2) Use DeepSeek-style reset timestamp
try:
body = response.json()
reset_ts = body.get("error", {}).get("reset")
if reset_ts:
return max(0, reset_ts - time.time()) + random.uniform(0, 1)
except Exception:
pass
# 3) Exponential backoff with jitter
base = 2 ** attempt
jitter = random.uniform(0, min(base, 60)) # cap jitter at 60s
return min(base + jitter, 120) # cap at 120sNode.js Version
async function llmRequest(url, apiKey, payload, maxRetries = 5) {
for (let attempt = 0; attempt < maxRetries; attempt++) {
const resp = await fetch(url, {
method: 'POST',
headers: { 'Authorization': `Bearer ${apiKey}` },
body: JSON.stringify(payload),
});
if (resp.ok) return await resp.json();
if ([429, 503, 529].includes(resp.status)) {
const wait = backoff(attempt, resp);
console.log(`Rate limited (attempt ${attempt+1}). Waiting ${wait}ms...`);
await new Promise(r => setTimeout(r, wait));
continue;
}
throw new Error(`API error: ${resp.status} ${await resp.text()}`);
}
throw new Error('Max retries exceeded');
}
function backoff(attempt, resp) {
const retryAfter = resp.headers.get('Retry-After');
if (retryAfter) return parseInt(retryAfter) * 1000;
const base = Math.pow(2, attempt) * 1000;
const jitter = Math.random() * Math.min(base, 60000);
return Math.min(base + jitter, 120000);
}Concurrent Request Management
When sending many requests simultaneously, rate limiting becomes probabilistic. You need a token bucket or semaphore to stay under limits.
Python: Async Semaphore
import asyncio, httpx
class RateLimitedClient:
def __init__(self, api_key, max_concurrent=10):
self.semaphore = asyncio.Semaphore(max_concurrent)
self.client = httpx.AsyncClient(
headers={"Authorization": f"Bearer {api_key}"}
)
async def request(self, payload):
async with self.semaphore:
for _ in range(3): # retry
resp = await self.client.post(
"https://api.openai.com/v1/chat/completions",
json=payload
)
if resp.status_code == 200:
return resp.json()
if resp.status_code in (429, 503):
await asyncio.sleep(2)
continue
resp.raise_for_status()Node.js: Bottleneck
import Bottleneck from 'bottleneck';
const limiter = new Bottleneck({
minTime: 200, // max 5 requests per second
maxConcurrent: 10, // max 10 concurrent
reservoir: 3000, // token bucket: 3000 tokens/hour
reservoirRefreshInterval: 60 * 60 * 1000,
reservoirRefreshAmount: 3000,
});
async function rateLimitedRequest(url, payload) {
return limiter.schedule(() => fetch(url, {
method: 'POST',
headers: { 'Authorization': `Bearer ${process.env.API_KEY}` },
body: JSON.stringify(payload),
}).then(r => r.json()));
}Provider-Specific Edge Cases
OpenAI: Tier Upgrades
OpenAI limits are a function of total spend. If you're on Tier 2 and need higher limits, pre-fund your account. The upgrade is automatic at each spend threshold but can take hours to reflect.
Tip: Use tokenpapa's OpenAI proxy for Tier 5-level limits without the $1K spend commitment.
DeepSeek V4: Cache-Hit Multiplier
DeepSeek V4's cache-hit streaming is a game-changer for rate management. Each cache-hit request counts as half an RPM unit. If your prompt reuse ratio is high (e.g., system prompt-based apps), your effective throughput doubles.
To maximize cache hits:
- Use identical system prompts across requests
- Keep conversation prefixes consistent
- Batch similar inputs together
See our DeepSeek Cache Hit Optimization Guide for detailed tuning.
Anthropic: 529 Overloaded
Anthropic returns 529 (not 429) when overloaded. Many generic HTTP clients don't handle this by default. Always add 529 to your retry status list.
Anthropic also has a claude-4-thinking endpoint where "extended thinking" tokens consume 2x the capacity of normal tokens. If you're getting unexpected 529s, try reducing thinking_tokens in your request.
Gemini: Daily Quota Reset
Gemini's per-day generation limit resets at midnight Pacific Time. If you hit RESOURCE_EXHAUSTED late in the day, either:
- Switch to the high-throughput tier (2x cost)
- Stagger workloads across multiple API keys
- Cache responses aggressively
Cost-Aware Rate Limiting
Rate limiting isn't just about avoiding errors — it's about managing budget. Here's a practical heuristic:
| Scenario | Strategy |
|---|---|
| Batch processing (non-urgent) | Low concurrency (5-10), long retry delays |
| Real-time chat | Moderate concurrency (20-50), short retry delays |
| Cost-sensitive throughput | Token bucket with daily budget cap |
| Benchmarking/comparison | Sequential requests, no concurrency |
Budget bucket example:
class BudgetAwareLimiter:
def __init__(self, daily_budget=10.0): # $10/day cap
self.daily_budget = daily_budget
self.spent_today = 0.0
def can_afford(self, estimated_cost):
return self.spent_today + estimated_cost <= self.daily_budget
def record_spend(self, tokens, cost_per_token):
self.spent_today += tokens * cost_per_tokenCombine this with the tokenpapa API dashboard to monitor real-time spend across models.
Monitoring Rate Limit Health
Production systems need observability. Track these metrics:
- 429/529 rate — percentage of requests that hit limits (threshold: < 1%)
- Average retry wait time — how long your app stalls on limits
- Effective throughput — actual completed requests per minute vs. theoretical max
- Cache hit ratio — for DeepSeek, aim for > 40%
Most providers expose these through their dashboards. For multi-provider setups, tokenpapa's unified API provides a single dashboard across OpenAI, DeepSeek, Claude, and Gemini.
Conclusion
Rate limiting is an unavoidable reality of LLM APIs in 2026, but it's manageable with the right patterns:
- Exponential backoff with jitter handles all providers
- Token bucket + semaphores enable safe concurrent requests
- Provider-specific edge cases (529 from Anthropic, cache-hit from DeepSeek) need custom handling
- Budget-aware limiting prevents cost surprises
For production workloads, consider using tokenpapa as a unified gateway — it handles rate-limit normalization, automatic retries, and cost tracking across all major providers out of the box.
Ready to build? Sign up at tokenpapa.ai and get started with $5 in free credits.
How is this guide?
Mistral AI API Complete Guide for Developers (2026)
Complete guide to Mistral AI API in 2026. Mistral Large 2, Small, and Embed models pricing ($0.20-$2/1M input), features like function calling, JSON mode, and how to access from overseas via TokenPAPA.
How to Fine-Tune LLMs via API in 2026: DeepSeek, GPT-5, Claude 4 & More
Complete guide to fine-tuning LLMs via API in 2026. Covers DeepSeek V4 fine-tuning, OpenAI GPT-5 fine-tuning, Claude 4 custom models, Qwen fine-tuning, dataset preparation, cost comparison, and production deployment.
