Master LLM API rate limiting, exponential backoff retries, and concurrent request management for OpenAI, DeepSeek V4, Claude 4, and Gemini. Code examples in Python and Node.js.

LLM API Rate Limiting & Retry Strategies: Complete Guide (2026)

Published: June 29, 2026 · 15 min read

Introduction

Every LLM API — from OpenAI's GPT-5 to DeepSeek V4, Claude 4, and Gemini 2.5 — enforces rate limits. Hit them, and you get 429 Too Many Requests or 503 Service Unavailable. In production, how you handle these errors determines whether your app feels robust or flaky.

In 2026, the landscape is more fragmented than ever. Different providers use different limit models (tokens per minute vs requests per minute vs concurrent connections vs cost-based tiers). A strategy that works for one might waste budget on another.

This guide covers:

How each major provider enforces limits (OpenAI, DeepSeek, Anthropic, Google)
Industry-standard retry patterns with exponential backoff
Concurrent request management in Python and Node.js
Cost-aware rate limiting for budget-sensitive workloads

Getting started with these APIs? See our LLM API Pricing Comparison 2026 for cost data, and the Best LLM APIs 2026 guide for model selection.

How Major Providers Enforce Rate Limits

OpenAI (GPT-5, GPT-4o)

OpenAI uses a tiered RPM/TPM (Requests Per Minute / Tokens Per Minute) model:

Tier	RPM Limit	TPM Limit	Qualifying Spend
Free	3	40K	$0
Tier 1	500	60K	$5 spent
Tier 2	5,000	300K	$50 spent
Tier 3	10,000	500K	$250 spent
Tier 5	Custom	Custom	$1K+ spent

OpenAI returns 429 with a Retry-After header. The error body includes "type": "rate_limit_error" and a "message" like "Rate limit exceeded for gpt-5".

DeepSeek V4

DeepSeek V4 uses a dual-limit system:

RPM limit: 3,000 requests/min for standard tier (6,000 for priority)
TPD limit: Tokens per day (soft cap, can be raised)
Cache-hit streaming: Each cache-hit request counts as 0.5 toward RPM, so throughput can approach double the nominal limit when most requests hit the cache

DeepSeek returns 429 with a granular {"error": {"code": "rate_limit", "remaining": 0, "reset": 123456789}} body, which includes the exact Unix timestamp when the limit resets.

Anthropic (Claude 4)

Anthropic uses a request-based model with workspace-level limits:

Default: 50 requests/min for API Keys, 1,000/min for Workspace
Extended thinking: Each request consumes more capacity
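Response: Returns 429 with "error": {"type": "rate_limit_error"} when you exceed your limits, and 529 with "error": {"type": "overloaded_error"} when the API itself is overloaded

Anthropic's 529 is unusual among these providers, so make sure your retry logic handles that status code alongside 429.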

Google Gemini 2.5

Google uses per-model tiers with generations per minute (GPM):

Free: 10 GPM, 1,500 generations/day
Pay-as-you-go: 2,000 GPM (standard), 5,000 (high-throughput tier)
Response: Returns 429 with {"error": {"code": 429, "status": "RESOURCE_EXHAUSTED"}}
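
The status codes and error bodies listed in this section can be folded into one classification step before any retry logic runs. The following is a minimal sketch, assuming the response shapes described above. `classify_response` and its category names are hypothetical helpers for illustration, not part of any provider SDK.

```python
from __future__ import annotations

from typing import Any, Mapping

# Status codes this guide treats as retryable; 529 is Anthropic's "overloaded".
RETRYABLE_STATUSES = frozenset({429, 503, 529})


def classify_response(status: int, body: Mapping[str, Any] | None = None) -> str:
    """Return 'ok', 'rate_limited', 'overloaded', or 'fatal' for a response."""
    error = (body or {}).get("error")
    if not isinstance(error, dict):
        error = {}

    if 200 <= status < 300:
        return "ok"
    if status == 529 or error.get("type") == "overloaded_error":
        return "overloaded"
    if status == 429 or error.get("status") == "RESOURCE_EXHAUSTED":
        return "rate_limited"
    if status == 503:
        return "overloaded"
    return "fatal"


def is_retryable(status: int) -> bool:
    """True when the status code alone says a retry is worth attempting."""
    return status in RETRYABLE_STATUSES
```

Separating "rate_limited" from "overloaded" can be useful for monitoring: the first suggests your own traffic is too high, the second that the provider is struggling regardless of your volume.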

Universal Retry Pattern: Exponential Backoff with Jitter

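Exponential backoff with jitter is the standard retry pattern and works across all providers. The client below honors a server-provided wait when one is available and falls back to exponential backoff otherwise: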

import time, random, requests

class LlmApiClient:
    def __init__(self, base_url, api_key, max_retries=5):
        self.base_url = base_url
        self.headers = {"Authorization": f"Bearer {api_key}"}
        self.max_retries = max_retries

    def request(self, payload):
        for attempt in range(self.max_retries):
            resp = requests.post(
                self.base_url,
                headers=self.headers,
                json=payload,
                timeout=60,  # fail instead of hanging on a stalled connection
            )

            if resp.status_code in (200, 201):
                return resp.json()

            if resp.status_code in (429, 503, 529):
                wait = self._backoff(attempt, resp)
                print(f"Rate limited (attempt {attempt+1}). Waiting {wait:.1f}s...")
                time.sleep(wait)
                continue

            # Non-retryable error
            resp.raise_for_status()

        raise RuntimeError(f"Max retries ({self.max_retries}) exceeded")

    def _backoff(self, attempt, response):
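        # 1) Use Retry-After header if present (delay in seconds)
        retry_after = response.headers.get("Retry-After")
        if retry_after:
            try:
                return max(0.0, float(retry_after))
            except ValueError:
                pass  # HTTP-date form; fall through to the other strategies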

        # 2) Use DeepSeek-style reset timestamp
        try:
            body = response.json()
            reset_ts = body.get("error", {}).get("reset")
            if reset_ts:
                return max(0, reset_ts - time.time()) + random.uniform(0, 1)
        except Exception:
            pass

        # 3) Exponential backoff with jitter
        base = 2 ** attempt
        jitter = random.uniform(0, min(base, 60))  # cap jitter at 60s
        return min(base + jitter, 120)  # cap at 120s
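
To see what the fallback branch of `_backoff` produces, it can help to tabulate the shortest and longest possible wait per attempt. This is a small sketch that mirrors the same arithmetic (base of `2 ** attempt`, jitter capped at 60s, total capped at 120s). `backoff_bounds` is a hypothetical helper written only for this illustration.

```python
def backoff_bounds(
    attempt: int, jitter_cap: float = 60.0, max_wait: float = 120.0
) -> tuple[float, float]:
    """Return (shortest, longest) wait in seconds for a zero-based attempt."""
    base = 2 ** attempt
    low = min(base, max_wait)                          # jitter draws 0
    high = min(base + min(base, jitter_cap), max_wait)  # jitter draws its maximum
    return float(low), float(high)


for attempt in range(5):
    low, high = backoff_bounds(attempt)
    print(f"attempt {attempt}: {low:.0f}s to {high:.0f}s")
# attempt 0: 1s to 2s
# attempt 1: 2s to 4s
# attempt 2: 4s to 8s
# attempt 3: 8s to 16s
# attempt 4: 16s to 32s
```

With the default of five attempts, the worst case adds up to 62 seconds of sleeping, which is worth comparing against any request timeout your callers enforce.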

Node.js Version

async function llmRequest(url, apiKey, payload, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const resp = await fetch(url, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(payload),
    });

    if (resp.ok) return await resp.json();

    if ([429, 503, 529].includes(resp.status)) {
      const wait = backoff(attempt, resp);
      console.log(`Rate limited (attempt ${attempt+1}). Waiting ${wait}ms...`);
      await new Promise(r => setTimeout(r, wait));
      continue;
    }

    throw new Error(`API error: ${resp.status} ${await resp.text()}`);
  }
  throw new Error('Max retries exceeded');
}

function backoff(attempt, resp) {
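  const retryAfter = Number(resp.headers.get('Retry-After'));
  // Retry-After is in seconds; ignore it when absent or in HTTP-date form
  if (Number.isFinite(retryAfter) && retryAfter > 0) return retryAfter * 1000;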

  const base = Math.pow(2, attempt) * 1000;
  const jitter = Math.random() * Math.min(base, 60000);
  return Math.min(base + jitter, 120000);
}

Concurrent Request Management

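When many requests are in flight at once, you can no longer predict which of them will be rejected. Two tools keep you under the limits: a semaphore caps how many requests run concurrently, and a token bucket caps how many start per unit of time.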
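
A token bucket can be sketched in a few lines of standard-library Python. The version below is illustrative only, not any provider's or library's implementation. The `TokenBucket` class, its method names, and the injectable `clock` parameter (included so the behavior can be tested without real waiting) are all assumptions made for this example.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full so an initial burst is allowed
        self.updated = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False without blocking otherwise."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def wait_time(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens will be available (0.0 if available now)."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)
```

For a 3,000 RPM limit, `TokenBucket(rate=50, capacity=50)` would allow 50 requests per second with a burst of up to 50. A caller can sleep for `wait_time()` before sending when `try_acquire()` returns False.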

Python: Async Semaphore

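import asyncio, random, httpx

class RateLimitedClient:
    def __init__(self, api_key, max_concurrent=10):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.client = httpx.AsyncClient(
            headers={"Authorization": f"Bearer {api_key}"}
        )

    async def request(self, payload, max_retries=3):
        # The semaphore caps in-flight requests; it does not cap requests per minute
        async with self.semaphore:
            for attempt in range(max_retries):
                resp = await self.client.post(
                    "https://api.openai.com/v1/chat/completions",
                    json=payload
                )
                if resp.status_code == 200:
                    return resp.json()
                if resp.status_code in (429, 503, 529):
                    # Exponential backoff with jitter: 1-2s, 2-4s, 4-8s
                    await asyncio.sleep(2 ** attempt + random.uniform(0, 2 ** attempt))
                    continue
                resp.raise_for_status()
            # Fail loudly instead of silently returning None
            raise RuntimeError(f"Max retries ({max_retries}) exceeded")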
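
The effect of the semaphore is easiest to see with a stand-in for the network call. The sketch below uses only `asyncio`. `run_bounded`, `demo`, and `fake_call` are hypothetical names for this illustration, and the 10 ms sleep simply simulates request latency.

```python
import asyncio


async def run_bounded(jobs, max_concurrent: int):
    """Run zero-argument coroutine functions with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather preserves the order of `jobs` in its result list
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_call():
        # Stand-in for an API call; records how many calls overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return "ok"

    results = await run_bounded([fake_call] * 20, max_concurrent=5)
    return len(results), peak


if __name__ == "__main__":
    completed, peak = asyncio.run(demo())
    print(f"{completed} calls completed, peak concurrency {peak}")
```

All 20 calls complete, but no more than five ever overlap. Note that this bounds concurrency only. If each call is fast, the requests-per-minute figure can still exceed a provider's limit, which is where a token bucket comes in.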

Node.js: Bottleneck

import Bottleneck from 'bottleneck';

const limiter = new Bottleneck({
  minTime: 200,        // max 5 requests per second
  maxConcurrent: 10,   // max 10 concurrent
  reservoir: 3000,     // budget of 3000 requests (not tokens) per refresh interval
  reservoirRefreshInterval: 60 * 60 * 1000,
  reservoirRefreshAmount: 3000,
});

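async function rateLimitedRequest(url, payload) {
  return limiter.schedule(async () => {
    const resp = await fetch(url, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(payload),
    });
    // Surface HTTP errors instead of parsing an error body as a result
    if (!resp.ok) throw new Error(`API error: ${resp.status}`);
    return resp.json();
  });
}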

Provider-Specific Edge Cases

OpenAI: Tier Upgrades

OpenAI limits are a function of total spend. If you're on Tier 2 and need higher limits, pre-fund your account. The upgrade is automatic at each spend threshold but can take hours to reflect.

Tip: Use tokenpapa's OpenAI proxy for Tier 5-level limits without the $1K spend commitment.

DeepSeek V4: Cache-Hit Multiplier

DeepSeek V4's cache-hit streaming directly affects rate management: each cache-hit request counts as half an RPM unit. If your prompt reuse ratio is high (e.g., apps built around a shared system prompt), your effective throughput rises toward double the nominal limit, reaching 2x only when every request is a cache hit.

To maximize cache hits:

Use identical system prompts across requests
Keep conversation prefixes consistent
Batch similar inputs together

See our DeepSeek Cache Hit Optimization Guide for detailed tuning.
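
The half-unit accounting described above translates into a simple formula: if a fraction h of requests are cache hits, the average request costs (1 - h) + 0.5h RPM units, so the effective limit is the nominal limit divided by that cost. The sketch below assumes exactly that accounting. `effective_rpm` is a hypothetical helper, and the 0.5 weight is the figure quoted in this guide.

```python
def effective_rpm(rpm_limit: float, cache_hit_ratio: float, hit_weight: float = 0.5) -> float:
    """Requests per minute achievable when cache hits cost `hit_weight` RPM units."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    # Average RPM units consumed per request: misses cost 1, hits cost hit_weight.
    avg_cost = (1.0 - cache_hit_ratio) + cache_hit_ratio * hit_weight
    return rpm_limit / avg_cost


for ratio in (0.0, 0.4, 1.0):
    print(f"hit ratio {ratio:.0%}: {round(effective_rpm(3000, ratio))} RPM")
# hit ratio 0%: 3000 RPM
# hit ratio 40%: 3750 RPM
# hit ratio 100%: 6000 RPM
```

At a 40% hit ratio the gain is 25%, not 100%, so the returns from cache tuning are real but grow slowly until the hit ratio is quite high.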

Anthropic: 529 Overloaded

Anthropic returns 529 (not 429) when overloaded. Many generic HTTP clients don't handle this by default. Always add 529 to your retry status list.

Extended thinking also raises per-request load, with thinking tokens consuming 2x the capacity of normal tokens. If you're getting unexpected 529s, try reducing the thinking budget (budget_tokens) in your request.

Gemini: Daily Quota Reset

Gemini's per-day generation limit resets at midnight Pacific Time. If you hit RESOURCE_EXHAUSTED late in the day, you can:

Switch to the high-throughput tier (2x cost)
Stagger workloads across multiple API keys
Cache responses aggressively
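
A fourth option for non-urgent work is to pause until the quota resets. The sketch below computes how long that is, assuming the midnight Pacific Time reset described above and a Python build with time zone data available. `seconds_until_pacific_midnight` is a hypothetical helper name.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def seconds_until_pacific_midnight(now: datetime | None = None) -> float:
    """Seconds from `now` (an aware datetime) until the next midnight in Pacific Time."""
    local_now = (now or datetime.now(tz=PACIFIC)).astimezone(PACIFIC)
    tomorrow = local_now.date() + timedelta(days=1)
    next_midnight = datetime(tomorrow.year, tomorrow.month, tomorrow.day, tzinfo=PACIFIC)
    # Subtract in UTC so a daylight-saving change overnight is counted correctly.
    delta = next_midnight.astimezone(timezone.utc) - local_now.astimezone(timezone.utc)
    return delta.total_seconds()


late_evening = datetime(2026, 6, 29, 23, 0, tzinfo=PACIFIC)
print(seconds_until_pacific_midnight(late_evening))  # 3600.0
```

A batch job could sleep for this long, or route traffic to another provider in the meantime, rather than retrying against an exhausted daily quota.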

Cost-Aware Rate Limiting

Rate limiting isn't just about avoiding errors — it's about managing budget. Here's a practical heuristic:

Scenario	Strategy
Batch processing (non-urgent)	Low concurrency (5-10), long retry delays
Real-time chat	Moderate concurrency (20-50), short retry delays
Cost-sensitive throughput	Token bucket with daily budget cap
Benchmarking/comparison	Sequential requests, no concurrency

Budget bucket example:

class BudgetAwareLimiter:
    def __init__(self, daily_budget=10.0):  # $10/day cap
        self.daily_budget = daily_budget
        self.spent_today = 0.0

    def can_afford(self, estimated_cost):
        return self.spent_today + estimated_cost <= self.daily_budget

    def record_spend(self, tokens, cost_per_token):
        self.spent_today += tokens * cost_per_token
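
A daily cap only works if the counter resets each day and if the cost is estimated before the request is sent. The following sketch adds both pieces. It is one possible design rather than a definitive one: `DailyBudget`, `estimate_cost`, and the injectable `today` function are hypothetical names, and the per-million-token prices in the usage lines are placeholder values, not real provider pricing.

```python
from datetime import date


class DailyBudget:
    """Track spend against a daily cap that resets when the calendar date changes."""

    def __init__(self, daily_budget: float, today=date.today):
        self.daily_budget = daily_budget
        self.today = today            # injectable so the reset can be tested
        self.day = today()
        self.spent_today = 0.0

    def _roll_over(self) -> None:
        current = self.today()
        if current != self.day:
            self.day = current
            self.spent_today = 0.0

    def try_spend(self, estimated_cost: float) -> bool:
        """Reserve `estimated_cost` dollars; return False if it would exceed the cap."""
        self._roll_over()
        if self.spent_today + estimated_cost > self.daily_budget:
            return False
        self.spent_today += estimated_cost
        return True


def estimate_cost(input_tokens: int, max_output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Upper-bound dollar cost of one request, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + max_output_tokens * output_price_per_m) / 1_000_000


budget = DailyBudget(daily_budget=10.0)
cost = estimate_cost(1_000, 500, input_price_per_m=2.0, output_price_per_m=8.0)  # placeholder prices
if budget.try_spend(cost):
    print(f"sending request, estimated cost ${cost:.4f}")
```

Estimating with the maximum output length is deliberately pessimistic. The actual spend can be reconciled afterwards from the token usage reported in the response.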

Combine this with the tokenpapa API dashboard to monitor real-time spend across models.

Monitoring Rate Limit Health

Production systems need observability. Track these metrics:

429/529 rate — percentage of requests that hit limits (threshold: < 1%)
Average retry wait time — how long your app stalls on limits
Effective throughput — actual completed requests per minute vs. theoretical max
Cache hit ratio — for DeepSeek, aim for > 40%
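
If you want these numbers inside your own application rather than only in a provider dashboard, a small in-process counter is enough to start with. This is a minimal sketch: `RateLimitStats` and its field names are hypothetical, and a production system would more likely export the same counters to an existing metrics library.

```python
from dataclasses import dataclass


@dataclass
class RateLimitStats:
    requests: int = 0          # every attempt, including retries
    limited: int = 0           # attempts answered with 429 or 529
    completed: int = 0         # attempts answered with a 2xx status
    total_wait_s: float = 0.0  # time spent sleeping after limited attempts

    def record(self, status: int, waited_s: float = 0.0) -> None:
        """Record one attempt and, for limited attempts, the wait that followed."""
        self.requests += 1
        if status in (429, 529):
            self.limited += 1
            self.total_wait_s += waited_s
        elif 200 <= status < 300:
            self.completed += 1

    @property
    def limited_rate(self) -> float:
        """Fraction of attempts that hit a limit (target from this guide: below 0.01)."""
        return self.limited / self.requests if self.requests else 0.0

    @property
    def avg_retry_wait_s(self) -> float:
        return self.total_wait_s / self.limited if self.limited else 0.0

    def throughput_per_min(self, window_s: float) -> float:
        """Completed requests per minute over an observation window in seconds."""
        return self.completed * 60.0 / window_s
```

Calling `record(resp.status_code, wait)` from inside the retry loop is enough to populate it. Comparing `throughput_per_min` with the provider's nominal RPM shows how much headroom remains.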

Most providers expose these through their dashboards. For multi-provider setups, tokenpapa's unified API provides a single dashboard across OpenAI, DeepSeek, Claude, and Gemini.

Conclusion

Rate limiting is an unavoidable reality of LLM APIs in 2026, but it's manageable with the right patterns:

Exponential backoff with jitter handles all providers
Token bucket + semaphores enable safe concurrent requests
Provider-specific edge cases (529 from Anthropic, cache-hit from DeepSeek) need custom handling
Budget-aware limiting prevents cost surprises

For production workloads, consider using tokenpapa as a unified gateway — it handles rate-limit normalization, automatic retries, and cost tracking across all major providers out of the box.

Ready to build? Sign up at tokenpapa.ai and get started with $5 in free credits.
