1. Introduction
APIs are like the blood vessels of modern software. They help different systems communicate with each other smoothly and reliably. But when thousands (or even millions) of developers start using the same API — like OpenAI’s API — some kind of control is necessary to keep everything stable and fair.
That’s where API rate limits come in.
In simple words, rate limits define how much and how fast you can send data or make requests to an API. For example, OpenAI allows every user to make only a certain number of requests per minute (RPM) and consume a limited number of tokens per minute (TPM). If you go above that limit, you’ll get a 429 Too Many Requests error.
Why OpenAI Has Rate Limits
Imagine a crowded railway ticket counter. If everyone rushes in without order, the system crashes — people get stuck, tickets aren’t issued, and chaos follows. Similarly, OpenAI’s API serves thousands of developers worldwide. To keep things fair and avoid system overload, OpenAI enforces rate limits.
Rate limits ensure:
- Fair usage: Every developer gets a fair share of system resources.
- Stability: The API stays fast and reliable for everyone.
- Security: It helps prevent abuse or accidental infinite loops.
- Predictable cost: Developers can estimate usage and cost more accurately.
Why Understanding Rate Limits Is Crucial for Developers
Whether you are a fresher learning API integration or an experienced engineer building production-level AI applications, understanding rate limits is critical.
If you ignore them:
- Your app might stop responding suddenly.
- You may hit throttling during traffic spikes.
- You may lose user trust due to frequent “please try again later” messages.
By understanding OpenAI API rate limits properly, you can:
- Design scalable backend systems
- Avoid unnecessary errors and downtime
- Optimize your token usage
- Improve performance and reduce costs
What You’ll Learn from This Blog
By the end of this detailed tutorial, you’ll have complete clarity on:
- What OpenAI API rate limits are
- Difference between TPM (Tokens per Minute) and RPM (Requests per Minute)
- How to calculate your throughput
- How to handle rate-limit errors gracefully using Python code
- And how to build a small throughput calculator to plan your API usage
We’ll go step-by-step in plain, easy Indian English — with examples and formulas — so that even a fresher developer can understand and implement these concepts confidently.
Example Scenario: Why This Matters
Let’s say you’re building a chatbot that uses GPT-4 via the OpenAI API.
You have 1000 users, and each user sends one message per minute. Each message uses roughly 200 tokens (prompt + response).
That’s 1000 × 200 = 200,000 tokens per minute.
If your OpenAI TPM limit is 100,000 tokens/minute — your chatbot will crash or delay replies once traffic crosses that mark.
If you had known your rate limits in advance and planned throughput, you could:
- Queue or stagger requests, or
- Request a higher quota from OpenAI.
This is exactly what this blog helps you understand and solve.
2. Understanding OpenAI API Rate Limits
When you call any OpenAI model (like GPT-4 or GPT-4o) through an API, OpenAI must balance billions of requests from developers across the world.
To maintain this balance, the platform sets rate limits — that is, boundaries on how much traffic your application can send within a fixed time window.
What Is a Rate Limit in an API?
A rate limit defines how many requests or how much data a single user (or API key) can send to the API within a fixed period — usually a minute.
For OpenAI, the limits are mainly based on two key metrics:
- Requests Per Minute (RPM)
- Tokens Per Minute (TPM)
You need to keep both within the allowed quota.
What Is RPM (Requests Per Minute)?
RPM means how many separate API calls you can make in one minute.
- Example: If your plan allows 60 RPM, that means you can make one request every second on average.
- If you exceed it, OpenAI will return an HTTP status of 429 Too Many Requests.
Think of RPM as a traffic police limit — you cannot send too many cars (requests) on the road (API) within a minute.
Example
If your app sends multiple prompts rapidly to GPT-4 (say, 80 requests in 10 seconds) while your RPM limit is 60, you’ll immediately hit a throttle even if the rest of the minute is idle.
So, it’s not only how many per minute but how fast you send them that matters.
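One simple way to avoid such bursts is to space your calls evenly across the minute instead of firing them all at once. Below is a minimal Python sketch; send_request is just a placeholder for whatever function actually calls the API.
import time

def send_evenly(prompts, send_request, rpm_limit=60):
    # Space calls so we never exceed rpm_limit, e.g. one call per second at 60 RPM
    interval = 60.0 / rpm_limit
    for prompt in prompts:
        send_request(prompt)      # your real API call goes here
        time.sleep(interval)      # spread calls evenly instead of bursting

# Example: prints five "requests" roughly one second apart
send_evenly(["q1", "q2", "q3", "q4", "q5"], send_request=print)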
What Is TPM (Tokens Per Minute)?
TPM is the total number of tokens (input + output) your API calls can consume in one minute.
A token is roughly four English characters, or about ¾ of a word.
Every request consumes:
- Input tokens → the text you send in your prompt.
- Output tokens → the text OpenAI’s model generates in response.
So if you send a 200-token prompt and ask for 300 output tokens, that one request uses up to 500 tokens in total.
Example
Let’s say your account has a limit of 90,000 tokens per minute.
If each request uses 900 tokens, you can safely send about 90,000 ÷ 900 = 100 requests per minute.
If you try 120 requests, you’ll exceed your TPM cap and get a 429 error.
RPM vs TPM — How They Work Together
Both limits apply simultaneously, and whichever is hit first will throttle your app.
| Scenario | Request Pattern | Token Pattern | What You Hit First | Why |
|---|---|---|---|---|
| Many tiny requests | Very frequent calls | Small prompts (~50 tokens each) | RPM | Too many calls, even though each uses few tokens |
| Few heavy requests | Low call frequency | Heavy prompts (> 1 000 tokens each) | TPM | Each request consumes lots of tokens |
| Balanced usage | Moderate calls | Moderate tokens | Depends | Both limits share the load |
How OpenAI Counts Tokens
OpenAI uses its internal tokenizer (the tiktoken library) to break text into small pieces called tokens.
Example text:
“Hello India, how are you?”
This may break into tokens like:
[ "Hello", " India", ",", " how", " are", " you", "?" ]
That’s 7 tokens.
Approximate guideline
| Language / Type | Tokens per word (approx.) | Example |
|---|---|---|
| English | ~1.3 | “Hello world” = 2 tokens |
| Indian languages (Hindi / Tamil / Marathi) | Typically higher | Unicode scripts often split into more tokens per word |
| Code (Python / JavaScript) | 1.2 – 1.5 | Keywords, spaces, punctuation count as tokens |
You can use Python’s tiktoken library to check your exact token count before sending requests.
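For example, here is a minimal sketch using tiktoken. It assumes a reasonably recent version of the library; if your version does not recognise the model name, it falls back to a generic encoding.
import tiktoken

def count_tokens(text, model="gpt-4o"):
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")   # generic fallback
    return len(encoding.encode(text))

print(count_tokens("Hello India, how are you?"))   # prints the token count (around 7)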
Why Tokens Matter More Than You Think
For most GPT-based applications, token usage directly impacts cost and performance:
- Every token is billed — so excessive tokens = higher cost.
- Every token consumes quota — so large prompts slow down throughput.
- Fewer tokens = faster response time.
If your prompt is 1000 tokens long, GPT-4 must read and process those tokens before generating an answer. That’s why planning your token budget is crucial.
| Real-World Analogy | Meaning in OpenAI API |
|---|---|
| Number of letters you send in a post | Tokens you consume |
| Number of envelopes you post per minute | Requests per minute |
| Post office limit (per minute) | API rate limit |
| If you send too many letters (tokens) or envelopes (requests) at once | You’ll be throttled (429 error) |
How Limits Differ by Model and Account Tier
OpenAI offers different rate limits depending on:
- Model type: GPT-4o, GPT-4-Turbo, GPT-3.5-Turbo, etc.
- Plan: Free tier, Pay-as-you-go, or Enterprise.
- Usage history: Older accounts with consistent billing often receive higher quotas automatically.
Approximate examples (subject to change):
| Model | RPM | TPM | Intended Use |
|---|---|---|---|
| GPT-3.5-Turbo | 3 500 RPM | 350 000 TPM | General apps, chatbots |
| GPT-4-Turbo | 500 RPM | 80 000 TPM | Complex apps |
| GPT-4o | 1 000 RPM | 250 000 TPM | High throughput apps / production |
(These numbers vary — always check your account’s dashboard.)
How the Limit Resets
Rate limits work in rolling windows, not fixed one-minute clocks.
That means if you hit your TPM limit at 12:00:30, you don’t have to wait till 12:01:00. Your tokens start freeing up gradually as time passes.
Think of it as a conveyor belt — as older requests move out of the last minute, you gain space for new ones.
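If you want to mirror this behaviour in your own code, you can track usage over a rolling 60-second window. Here is a minimal sketch (the class name and structure are illustrative, not an official API):
import time
from collections import deque

class RollingTokenWindow:
    def __init__(self, window_sec=60):
        self.window_sec = window_sec
        self.events = deque()              # (timestamp, tokens) pairs

    def add(self, tokens):
        self.events.append((time.time(), tokens))

    def used_last_minute(self):
        cutoff = time.time() - self.window_sec
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()          # drop usage older than the window
        return sum(tokens for _, tokens in self.events)

window = RollingTokenWindow()
window.add(900)
print(window.used_last_minute())           # 900 (until this entry ages out of the window)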
What Happens When You Exceed the Limit
When you exceed either RPM or TPM, the API responds with:
HTTP/1.1 429 Too Many Requests
{
"error": {
"message": "Rate limit reached for requests or tokens. Please try again later.",
"type": "rate_limit_error",
"param": null,
"code": null
}
}
You can read response headers like:
- x-ratelimit-limit-requests
- x-ratelimit-remaining-requests
- x-ratelimit-reset-requests
- x-ratelimit-limit-tokens
- x-ratelimit-remaining-tokens
- x-ratelimit-reset-tokens
These tell you how much quota remains and when it resets — very helpful for automatic retries.
| Concept | Description | Developer Tip |
|---|---|---|
| RPM | Number of API calls allowed per minute | Batch requests or queue them |
| TPM | Total tokens (prompt + response) per minute | Optimise prompts to reduce token usage |
| Token | 4 characters ≈ ¾ word | Use tiktoken to measure |
| Throttling Error | 429 Too Many Requests | Implement retry with exponential backoff |
| Limit Reset | Rolling window (~60 s) | Plan traffic smoothing |
3. Real-World Example & How to Read Rate-Limit Headers
Understanding theory is one thing.
But to make this concept real, let’s walk through a practical scenario that every developer integrating OpenAI’s API faces sooner or later.
Scenario: A Chatbot with Growing Traffic
Imagine you’ve built a customer support chatbot using GPT-4o via OpenAI’s API.
At the start, you have:
- 10 active users
- Each sending about 2 messages per minute
- Each message (prompt + reply) uses around 500 tokens
So total usage per minute =
10 users × 2 messages × 500 tokens = 10 000 tokens per minute
This is well within most rate limits.
Now, suppose your app grows, and you suddenly get:
- 500 users
- Each sending 3 messages per minute
That’s 500 × 3 × 500 = 750 000 tokens per minute
If your TPM limit is 250 000, your app will start failing intermittently.
You’ll notice delayed responses or see this familiar error:
{
"error": {
"message": "Rate limit reached for tokens per min. Please try again later.",
"type": "rate_limit_error",
"code": null
}
}
This is OpenAI protecting its infrastructure from overload — not a bug in your code.
Error 429 – What It Really Means
The 429 Too Many Requests response means you’re exceeding either:
- The number of allowed requests per minute (RPM), or
- The total tokens per minute (TPM).
OpenAI doesn’t always specify which limit you crossed — but the response headers can tell you exactly what’s happening.
Rate-Limit Headers Explained
When you make an API call, OpenAI includes special headers in its response.
Here’s what they look like (values shown are examples):
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 14
x-ratelimit-reset-requests: 2025-10-22T09:34:52Z
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 12000
x-ratelimit-reset-tokens: 2025-10-22T09:34:56Z
Let’s decode these step-by-step:
| Header Name | Meaning | Example Value | Explanation |
|---|---|---|---|
| x-ratelimit-limit-requests | Your maximum RPM quota | 60 | You can make up to 60 API calls per minute |
| x-ratelimit-remaining-requests | How many API calls you have left in the current window | 14 | You’ve already made 46 requests |
| x-ratelimit-reset-requests | When your request quota resets | 09:34:52 | You can send more calls after this time |
| x-ratelimit-limit-tokens | Your total TPM quota | 90,000 | You can consume up to 90,000 tokens per minute |
| x-ratelimit-remaining-tokens | Tokens left for the current minute | 12,000 | You’ve used 78,000 tokens so far |
| x-ratelimit-reset-tokens | When your token quota resets | 09:34:56 | You can send more tokens after this time |
Python Example – Monitoring Rate-Limit Headers
Let’s see how you can monitor these headers automatically in Python.
This is very helpful when you want your application to detect when you’re close to the limit and back off before hitting it.
import time

from openai import OpenAI, RateLimitError

# Requires the openai Python SDK v1.x
client = OpenAI(api_key="YOUR_API_KEY")

def call_openai(prompt, retries=3):
    try:
        # with_raw_response exposes the HTTP headers along with the parsed body
        raw = client.chat.completions.with_raw_response.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100
        )
        response = raw.parse()

        # Print model response
        print("Response:", response.choices[0].message.content.strip())

        # Print rate limit info from headers
        print("\n--- Rate Limit Info ---")
        for key, value in raw.headers.items():
            if "ratelimit" in key.lower():
                print(f"{key}: {value}")

    except RateLimitError:
        if retries > 0:
            print("⚠️ Rate limit reached. Waiting before retry...")
            time.sleep(5)
            return call_openai(prompt, retries - 1)
        print("Still rate limited after several retries.")
    except Exception as e:
        print("Error:", e)

call_openai("Explain token limits in OpenAI API in one paragraph.")
Explanation:
- The function sends a request to GPT-4o with a small prompt.
- After receiving the response, it prints out all rate-limit headers.
- If it hits the limit, it catches the RateLimitError, waits for 5 seconds, and retries (up to 3 times).
You can expand this logic to:
- Store header data in a log file or database.
- Monitor how close you are to your limit.
- Automatically reduce request frequency when near the cap.
Tip: Exponential Backoff for Safety
A smart way to handle 429 errors is to retry after a delay, increasing the delay each time.
This is called exponential backoff — a widely used technique in distributed systems.
Here’s a simple pattern:
import time
import random
def exponential_backoff(retry_count):
wait = min(60, (2 ** retry_count) + random.random())
print(f"Retrying after {wait:.2f} seconds...")
time.sleep(wait)
This avoids hammering the API again immediately, which could make things worse.
Instead, it waits for a gradually longer time after each retry, up to 60 seconds.
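For example, here is a minimal sketch that wires the exponential_backoff helper above into a retry loop. It assumes the openai v1 SDK (where throttling raises openai.RateLimitError); call is any function that performs one API request.
import openai

def call_with_retries(call, max_retries=5):
    for attempt in range(max_retries):
        try:
            return call()
        except openai.RateLimitError:
            exponential_backoff(attempt)   # wait 1s, 2s, 4s, ... (capped at 60s) plus jitter
    raise RuntimeError("Still rate-limited after all retries")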
Example Log Output (Real API Monitoring)
Here’s what a sample output may look like in your logs:
Response: Tokens are small pieces of text processed by the model...
--- Rate Limit Info ---
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 52
x-ratelimit-reset-requests: 2025-10-22T09:44:21Z
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 83500
x-ratelimit-reset-tokens: 2025-10-22T09:44:23Z
This tells you:
- You’ve made 8 of 60 requests in this minute.
- You’ve used ~6 500 tokens of your 90 000 TPM.
- You still have plenty of quota left.
Such real-time visibility helps you predict throttling before it happens.
Simple Dashboard Idea
If you’re building a production app, you can display these metrics visually:
| Metric | Description | Status |
|---|---|---|
| Requests Used | Number of API calls this minute | 8 / 60 |
| Tokens Used | Total tokens used this minute | 6 500 / 90 000 |
| Remaining Time | When limits reset | 35 seconds |
| Throttling Risk | High / Medium / Low | Low |
This dashboard helps you or your DevOps team maintain a healthy flow of API traffic without sudden failures.
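As a starting point, the “Throttling Risk” column can be computed from the rate-limit headers with a tiny helper like this (the thresholds are illustrative, not official values):
def throttling_risk(remaining, limit):
    ratio = remaining / limit if limit else 0.0
    if ratio < 0.10:
        return "High"        # less than 10% of quota left
    if ratio < 0.30:
        return "Medium"
    return "Low"

print(throttling_risk(83500, 90000))   # "Low": plenty of token quota left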
Why Reading Headers Is Crucial
If you’re working in a real-world application (like chatbots, data analysis tools, or automation systems), reading rate-limit headers helps you:
- Avoid random API failures (by adjusting request rate).
- Estimate real-time throughput.
- Plan scaling — know when you need a higher quota.
- Debug faster — find out whether a failure is due to your code or OpenAI’s limits.
| Topic | What You Learned |
|---|---|
| Error 429 | It means you exceeded RPM or TPM |
| Rate-limit headers | Show remaining quota and reset time |
| Python monitoring | You can log headers to avoid throttling |
| Exponential backoff | Safely retry without flooding the API |
| Dashboard metrics | Useful for production monitoring |
4. RPM vs TPM: Deep Comparison and Interaction
When you use OpenAI’s API, both Requests Per Minute (RPM) and Tokens Per Minute (TPM) limits apply simultaneously.
You can think of them as two gates guarding the same road:
- One limits how many vehicles enter (RPM).
- The other limits how much total weight those vehicles carry (TPM).
If you break either rule, you’ll get throttled — even if the other gate was still open.
The Core Difference Between RPM and TPM
| Feature | RPM (Requests Per Minute) | TPM (Tokens Per Minute) |
|---|---|---|
| Definition | Number of API calls you can send in one minute | Total number of tokens (input + output) you can send and receive per minute |
| Unit | Requests (calls) | Tokens (prompt + response) |
| Who hits it first | Apps sending many small requests rapidly | Apps sending few large prompts or responses |
| Typical use case | Real-time chatbots, web UIs, small prompts | Large document summarizers, coding assistants, translators |
| Throttling reason | Too many hits per minute | Too many tokens processed per minute |
| Optimization approach | Add batching, throttling, or queues | Reduce prompt size or output tokens |
Relationship Between RPM and TPM
Although both are separate, they’re mathematically related through average token usage per request.
Let’s call:
- T_request = Average tokens used per request (input + output)
- TPM_limit = Your total allowed tokens per minute
- RPM_limit = Your total allowed requests per minute
Then your maximum feasible requests per minute (based on tokens) is:
Requests_by_TPM = TPM_limit ÷ T_request
And your actual allowed requests per minute is:
Actual_RPM = min(RPM_limit, Requests_by_TPM)
This formula helps you decide whether your app will be limited by RPM or TPM first.
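In Python, the same formula is a small helper; the sample numbers below match Examples 1 and 2 that follow.
def actual_rpm(rpm_limit, tpm_limit, tokens_per_request):
    requests_by_tpm = tpm_limit // tokens_per_request
    return min(rpm_limit, requests_by_tpm)

print(actual_rpm(500, 100_000, 200))   # 500 -> balanced (Example 1)
print(actual_rpm(500, 100_000, 800))   # 125 -> TPM-bound (Example 2)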
Example 1 – Small Prompts, High Traffic
Let’s say:
- TPM limit = 100 000
- RPM limit = 500
- Average request = 200 tokens
Requests_by_TPM = 100,000 ÷ 200 = 500
So here, both RPM and TPM allow roughly the same throughput — 500 requests per minute.
You’re balanced.
Example 2 – Large Prompts, Fewer Requests
Now imagine:
- TPM limit = 100 000
- RPM limit = 500
- Average request = 800 tokens
Requests_by_TPM = 100,000 ÷ 800 = 125
Even though your RPM limit is 500, your token usage caps you at 125 requests per minute.
You’ll hit the TPM limit before you ever reach your RPM cap.
That’s why optimizing tokens is often more impactful than optimizing the number of requests.
Example 3 – Heavy Output Models (GPT-4o)
For GPT-4o, many developers hit TPM early because outputs are richer and longer.
Suppose you send only 50 requests per minute, but each uses 3 000 tokens (large code or document generation):
50 × 3,000 = 150,000 tokens per minute
If your TPM limit is 120 000, you’ll get throttled, even though your RPM (50 ≪ 500) looks fine.
Moral: You can hit your token limit without ever hitting your request limit.
How to Find Out Which Limit You’re Hitting
You can inspect the headers from your API response (as we saw earlier):
| Header | What It Tells You | If Value Is Small |
|---|---|---|
| x-ratelimit-remaining-requests | Requests left before hitting RPM | You’re close to request limit |
| x-ratelimit-remaining-tokens | Tokens left before hitting TPM | You’re close to token limit |
You can log these values in your code or monitoring dashboard.
Once you see either of them dropping rapidly, you know where the bottleneck is.
Analogy: Cars on a Highway
Think of OpenAI’s servers like a toll plaza:
- Each car = one API request.
- Each car’s weight = number of tokens used in that request.
- There’s a limit on how many cars can pass (RPM) and how much total weight can pass (TPM).
If you send many small cars, you might hit the car limit (RPM).
If you send fewer but very heavy trucks, you’ll hit the weight limit (TPM).
Either way, the toll gate (OpenAI API) won’t let more through until some time passes.
Quick Throughput Calculator Table
Here’s a simple ready-to-use table for developers to estimate how many requests per minute are possible for different prompt sizes and token limits (assuming an RPM limit of 500).
| TPM Limit | Avg Tokens per Request | Possible RPM (approx) | Bottleneck |
|---|---|---|---|
| 60 000 | 200 | 300 | TPM |
| 60 000 | 500 | 120 | TPM |
| 60 000 | 1 000 | 60 | TPM |
| 120 000 | 200 | 500 | RPM |
| 120 000 | 800 | 150 | TPM |
| 250 000 | 500 | 500 | RPM balanced |
| 250 000 | 1 500 | 166 | TPM |
How to use this table:
Find your TPM limit (from your account dashboard) and estimate your average token use per request (using the tiktoken library).
Then look up your possible RPM. That’s the safe request rate you should aim for.
Key Developer Takeaways
| Principle | Explanation | Practical Tip |
|---|---|---|
| Both RPM and TPM matter | You can hit either limit first | Monitor both in your logs |
| Tokens dominate cost | Large prompts = large cost and throttling | Minimise unnecessary context |
| Balance requests & tokens | The lighter your prompt, the more requests you can send | Compress or summarise input |
| Always plan for bursts | Short spikes may exceed limits even if your average is safe | Implement a queue system |
| Monitor real time | Track rate-limit headers and token usage | Use logging or metrics dashboard |
Pro Developer Tip: Estimate Safe Concurrency
If your TPM limit is 250 000 tokens/min and each request consumes 500 tokens,
you can handle ≈ 500 requests/minute safely.
If each request takes ~2 seconds, your safe concurrent requests ≈ (500 requests ÷ 60 seconds) × 2 seconds ≈ 16 parallel calls.
So, about 16 concurrent threads can run safely without hitting limits.
More than that, you risk short-term throttling.
Real-World Use Cases
| Use Case | Typical RPM Behavior | Typical TPM Behavior | Common Issue |
|---|---|---|---|
| Chatbot / Assistant | Many small requests | Few tokens per call | Hits RPM first |
| Content Generator | Fewer requests | Large output tokens | Hits TPM first |
| Code Assistant | Moderate RPM | Moderate tokens (500–1 500) | Balanced |
| Batch Processor | Large batch inputs | Many tokens per call | TPM limit often reached |
By mapping your application type to this table, you can predict where your limitation lies and plan accordingly.
Debugging When Limits Are Hit
When your API suddenly throws a RateLimitError, you can quickly identify which limit was exceeded:
- Check the error response headers.
- If remaining-tokens = 0 → TPM exceeded.
- If remaining-requests = 0 → RPM exceeded.
- If both are still high but you see throttling → you’re likely bursting too fast (too many requests within a few seconds).
You can handle bursts with rate-limiting libraries in your backend (like asyncio.Semaphore, tenacity, or ratelimit in Python).
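For instance, here is a minimal sketch using asyncio.Semaphore with the async client from the openai v1 SDK. The concurrency value of 16 comes from the earlier estimate; adjust it to your own numbers.
import asyncio
from openai import AsyncOpenAI

SAFE_CONCURRENCY = 16
client = AsyncOpenAI(api_key="YOUR_API_KEY")
semaphore = asyncio.Semaphore(SAFE_CONCURRENCY)

async def ask(prompt):
    async with semaphore:                      # at most SAFE_CONCURRENCY calls in flight
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Question {i}" for i in range(50)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(len(answers), "answers received")

# asyncio.run(main())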
When to Request Higher Limits
If your application consistently operates near the top of your limits:
- Check the usage dashboard on OpenAI’s account page.
- Collect logs showing average RPM/TPM usage.
- Contact OpenAI support with your use case and current usage patterns.
Usually, consistent, responsible usage over time automatically earns higher quotas.
| Topic | Key Insight | Developer Action |
|---|---|---|
| RPM | Number of requests per minute | Control request frequency |
| TPM | Total tokens per minute | Optimise prompt and output |
| Interaction | Whichever limit is hit first throttles you | Monitor both with headers |
| Formula | Actual_RPM = min(RPM_limit, TPM_limit / T_request) | Use this to plan throughput |
| Throughput table | Helps forecast capacity | Use before scaling |
| Safe concurrency | (Actual_RPM / 60) × request_duration | Decide number of threads/workers |
5. Throughput Calculator: Estimation & Implementation
In this section, we’ll convert the ideas from Sections 2–4 into a practical calculator you can use every day.
We’ll first define the inputs, then write the formulas, and finally build a small Python utility that prints a clean table and tells you how many requests you can safely send per minute and how many workers (concurrent calls) you can run without hitting limits.
We’ll keep the language simple and the maths light, so freshers can follow easily, and experienced developers can plug this into CI/ops right away.
What Inputs Do We Need?
To estimate safe throughput, collect these 6 values:
1. TPM_limit – Your OpenAI Tokens Per Minute limit.
2. RPM_limit – Your OpenAI Requests Per Minute limit.
3. T_prompt – Average input tokens per request.
4. T_output – Average output tokens per request (or your max_tokens if you keep it fixed).
5. latency_sec – Average time a single request takes (seconds).
6. burst_factor (optional) – Extra safety margin for short spikes. Default: 1.0 (no extra safety). Use 0.8 for 20% headroom.
From (3) and (4) we compute T_request = T_prompt + T_output.
Core Formulas

This tells you roughly how many simultaneous calls you can keep in flight, without triggering per-second/burst throttles.
Tip: Use a small queue + token bucket/throttle so workers don’t fire all at once.
Worked Examples (Step-by-Step)
Example A — Small Prompts (Chatbot)
- TPM_limit = 120,000
- RPM_limit = 500
- T_prompt = 150, T_output = 150 ⇒ T_request = 300
- latency_sec = 2.0, burst_factor = 0.9
Calculations:
- Requests_by_TPM = 120,000 ÷ 300 = 400
- Actual_RPM = min(500, 400) = 400
- Safe_RPM = floor(400 × 0.9) = 360
- Tokens_per_min = 360 × 300 = 108,000
- Safe_Concurrency = floor((360/60) × 2.0) = floor(6 × 2) = 12
Conclusion: Target ~360 RPM with ~12 concurrent calls.
Example B — Large Outputs (Content Generator)
- TPM_limit = 250,000
- RPM_limit = 500
- T_prompt = 400, T_output = 1200 ⇒ T_request = 1600
- latency_sec = 6.0, burst_factor = 0.85
Calculations:
- Requests_by_TPM = 250,000 ÷ 1600 = 156
- Actual_RPM = min(500, 156) = 156
- Safe_RPM = floor(156 × 0.85) = 132
- Tokens_per_min = 132 × 1600 = 211,200
- Safe_Concurrency = floor((132/60) × 6.0) = floor(2.2 × 6) = 13
Conclusion: Target ~132 RPM with ~13 concurrent calls.
Ready-to-Use Throughput Calculator Table
Pick the row closest to your usage pattern. (You can adjust numbers later with the Python tool.)
Assumptions per block are listed; each block shows how Avg Tokens/Req changes the safe RPM under different TPM limits. We keep burst_factor = 0.9 for safety.
Block 1 — RPM_limit = 500
| TPM Limit | Avg Tokens/Req | Requests_by_TPM | Actual_RPM | Safe_RPM (×0.9) | Tokens/min @ Safe |
|---|---|---|---|---|---|
| 60,000 | 200 | 300 | 300 | 270 | 54,000 |
| 60,000 | 500 | 120 | 120 | 108 | 54,000 |
| 60,000 | 1,000 | 60 | 60 | 54 | 54,000 |
| 120,000 | 200 | 600 | 500 | 450 | 90,000 |
| 120,000 | 800 | 150 | 150 | 135 | 108,000 |
| 250,000 | 500 | 500 | 500 | 450 | 225,000 |
| 250,000 | 1,500 | 166 | 166 | 149 | 223,500 |
Note: When Requests_by_TPM exceeds RPM_limit, Actual_RPM is capped by RPM_limit.
Block 2 — RPM_limit = 1000
| TPM Limit | Avg Tokens/Req | Requests_by_TPM | Actual_RPM | Safe_RPM (×0.9) | Tokens/min @ Safe |
|---|---|---|---|---|---|
| 120,000 | 200 | 600 | 600 | 540 | 108,000 |
| 120,000 | 500 | 240 | 240 | 216 | 108,000 |
| 250,000 | 300 | 833 | 833 | 749 | 224,700 |
| 250,000 | 1,000 | 250 | 250 | 225 | 225,000 |
Use these tables to sanity-check your plan before load testing.
A Practical Python Calculator (CLI-Style)
The script below:
- Asks for your TPM/RPM limits, prompt/output tokens, avg latency, and burst factor.
- Prints your safe RPM, expected tokens/min, and recommended concurrency.
- Also prints a small scenario table for multiple token sizes (nice for quick “what-if” checks).
You can paste this into a file like throughput_calc.py and run it with Python 3.10+.
import math
def calc_throughput(TPM_limit, RPM_limit, T_prompt, T_output, latency_sec=2.0, burst_factor=0.9):
T_request = T_prompt + T_output
if T_request <= 0:
raise ValueError("Average tokens per request must be > 0")
requests_by_tpm = TPM_limit // T_request
actual_rpm = min(RPM_limit, requests_by_tpm)
safe_rpm = math.floor(actual_rpm * burst_factor)
tokens_per_min = safe_rpm * T_request
safe_concurrency = math.floor((safe_rpm / 60.0) * latency_sec)
return {
"T_request": T_request,
"Requests_by_TPM": requests_by_tpm,
"Actual_RPM": actual_rpm,
"Safe_RPM": safe_rpm,
"Tokens_per_min": tokens_per_min,
"Safe_Concurrency": safe_concurrency
}
def pretty(n): # simple thousands formatting
return f"{n:,}"
def scenario_table(TPM_limit, RPM_limit, lat, burst, token_sizes=(200, 300, 500, 800, 1000, 1500)):
print("\n=== Scenario Table (vary Avg Tokens/Req) ===")
print(f"TPM_limit={pretty(TPM_limit)}, RPM_limit={pretty(RPM_limit)}, latency={lat}s, burst_factor={burst}")
print(f"{'AvgTokens':>10} | {'ReqByTPM':>9} | {'ActualRPM':>9} | {'SafeRPM':>7} | {'Tok/min':>10}")
print("-" * 60)
for t in token_sizes:
res = calc_throughput(TPM_limit, RPM_limit, T_prompt=t//2, T_output=t - t//2, latency_sec=lat, burst_factor=burst)
print(f"{t:>10} | {pretty(res['Requests_by_TPM']):>9} | {pretty(res['Actual_RPM']):>9} | {pretty(res['Safe_RPM']):>7} | {pretty(res['Tokens_per_min']):>10}")
if __name__ == "__main__":
# === Collect inputs ===
try:
TPM_limit = int(input("Enter TPM_limit (e.g., 250000): ").strip())
RPM_limit = int(input("Enter RPM_limit (e.g., 500): ").strip())
T_prompt = int(input("Avg prompt tokens per request (e.g., 200): ").strip())
T_output = int(input("Avg output tokens per request (e.g., 300): ").strip())
latency_sec = float(input("Avg latency per request in seconds (e.g., 2.0): ").strip() or "2.0")
burst_factor = float(input("Burst safety factor (0.7–1.0, e.g., 0.9): ").strip() or "0.9")
res = calc_throughput(TPM_limit, RPM_limit, T_prompt, T_output, latency_sec, burst_factor)
print("\n=== Throughput Result ===")
print(f"Avg tokens/request (T_request) : {pretty(res['T_request'])}")
print(f"Requests limited by TPM : {pretty(res['Requests_by_TPM'])} rpm")
print(f\"Actual RPM (respect RPM/TPM) : {pretty(res['Actual_RPM'])} rpm\")
print(f\"Safe RPM (after burst factor): {pretty(res['Safe_RPM'])} rpm\")
print(f\"Tokens per minute at Safe RPM: {pretty(res['Tokens_per_min'])}\")
print(f\"Recommended Safe Concurrency : {pretty(res['Safe_Concurrency'])} workers\")
# quick what-if table
scenario_table(TPM_limit, RPM_limit, latency_sec, burst_factor)
print("\nTips:")
print("• Lower Avg tokens/request to increase possible RPM under the same TPM.")
print("• If Safe_RPM equals RPM_limit often, consider asking for higher RPM or batch requests.")
print("• If Tokens_per_min ~= TPM_limit, reduce T_output or prompt size, or request higher TPM.")
except Exception as e:
print("Input/Error:", e)
How to use:
- Run the script and enter your values (from your OpenAI dashboard and logs).
- The script prints your Safe RPM and Safe Concurrency.
- Use the scenario table to see how changing the average token usage changes your capacity (you can also import calc_throughput directly, as shown below).
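For a quick non-interactive check, this sketch reuses the numbers from Example B; it assumes you saved the script above as throughput_calc.py.
from throughput_calc import calc_throughput

res = calc_throughput(
    TPM_limit=250_000, RPM_limit=500,
    T_prompt=400, T_output=1200,
    latency_sec=6.0, burst_factor=0.85,
)
print(res["Safe_RPM"], res["Safe_Concurrency"])   # 132 13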
Interpreting the Calculator’s Output
- If Actual_RPM = RPM_limit and Tokens_per_min is much lower than TPM_limit, then RPM is your bottleneck. Consider:
  - Batching multiple small prompts in one request (if your workflow allows), or
  - Asking for a higher RPM quota.
- If Tokens_per_min ≈ TPM_limit, then TPM is your bottleneck. Consider:
  - Reducing max_tokens,
  - Compressing context (few-shot > many-shot),
  - Using summaries/embeddings to shrink inputs.
- If Safe_Concurrency is very low (e.g., 2–3) but your service needs higher parallelism, then:
  - Work with a request queue + token bucket (dispatch slowly but consistently),
  - Spread traffic more evenly (no spikes).
Optional: Simple Token Bucket (Python Sketch)
If you want a tiny in-process throttle that respects a target RPM, this small token-bucket loop helps (a minimal sketch; dispatch_request is whatever function sends one API call):
import time

def token_bucket_dispatch(dispatch_request, safe_rpm):
    bucket_capacity = safe_rpm              # e.g., 360
    refill_rate = safe_rpm / 60.0           # permits added back per second
    bucket = float(bucket_capacity)
    last = time.time()
    while True:
        now = time.time()
        bucket = min(bucket_capacity, bucket + refill_rate * (now - last))
        last = now
        if bucket >= 1:
            bucket -= 1
            dispatch_request()              # send one API call
        else:
            time.sleep(1.0 / refill_rate)   # wait until roughly one permit refills
This keeps your per-second rate smooth so you don’t accidentally burst and get 429s.
6. Best Practices to Handle Rate Limits
Once you understand rate limits, the next step is handling them smartly in production.
Here’s a concise checklist of best developer practices:
Technical Practices
- Implement Exponential Backoff: Retry after increasing delays (e.g., 2s → 4s → 8s).
- Add Random Jitter: Add a small random time to avoid synchronized retries.
- Use Queues: Maintain a small request queue to control bursts.
- Batch Requests: Combine multiple small prompts into a single call if possible.
- Cache Responses: If the same prompt repeats, reuse the last response (a minimal caching sketch follows this list).
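Here is a minimal in-memory caching sketch for the last point above. In a real app the body of cached_generate would call the OpenAI API; the names here are illustrative.
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=1024)
def cached_generate(prompt):
    global call_count
    call_count += 1
    # In a real app, this is where you would call the OpenAI API
    return f"(model reply to: {prompt})"

cached_generate("What is TPM?")
cached_generate("What is TPM?")   # served from the cache, no second "API call"
print(call_count)                  # 1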
Application Design Practices
- Track Usage: Log and visualize rate-limit headers.
- Optimize Prompts: Keep prompts shorter and remove redundant context.
- Limit max_tokens: Don’t set huge output sizes when not needed.
- Graceful Errors: Show friendly “Please wait” messages to users.
- Request Higher Quota: If consistently maxing out, request higher RPM/TPM from OpenAI.
7. How to Monitor and Debug Rate Limits
Key Things to Track
- x-ratelimit-remaining-requests
- x-ratelimit-remaining-tokens
- Response latency trends (to detect throttling).
Monitoring Setup
- Use Prometheus + Grafana or a simple CSV log to track usage.
- Add alerts when remaining tokens fall below 10 % of quota.
- Log timestamp, RPM, TPM, and error-429 frequency (a minimal CSV sketch follows this list).
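A minimal sketch of such a CSV log, using the x-ratelimit-* headers discussed earlier (the file name and columns are just an example):
import csv
import os
from datetime import datetime, timezone

def log_usage(headers, status_code, path="rate_limit_log.csv"):
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "remaining_requests": headers.get("x-ratelimit-remaining-requests", ""),
        "remaining_tokens": headers.get("x-ratelimit-remaining-tokens", ""),
        "status_code": status_code,
    }
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if new_file:
            writer.writeheader()       # write the header row once for a new file
        writer.writerow(row)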
Debug Steps
- If you see 429 errors → print header values.
- If remaining-tokens = 0 → reduce output size.
- If remaining-requests = 0 → slow down request rate.
- If both fine → likely burst issue → add throttling.
8. Common Mistakes Developers Make
| Mistake | Impact | Quick Fix |
|---|---|---|
| Ignoring token usage | Sudden throttling | Measure with tiktoken |
| Setting huge max_tokens | Wasted TPM | Tune output length |
| Sending many short bursts | 429 errors | Queue + Backoff |
| Not reading headers | No visibility | Always log them |
| Assuming fixed minute reset | Unpredictable retry | Remember: rolling window |
9. FAQs
Q 1. Does the limit reset exactly every minute?
No — it’s a rolling window. Tokens free up gradually within 60 seconds.
Q 2. Are limits the same for all models?
No. GPT-4 and GPT-4o have tighter TPM limits than GPT-3.5. Check your dashboard.
Q 3. What happens if I have multiple API keys?
Each key has its own quota, but total usage per account still matters.
Q 4. Can I increase limits manually?
Yes. Submit a request via OpenAI dashboard or upgrade your billing tier.
Q 5. Why am I throttled even when I’m under my limits?
Most likely you’re sending short bursts; the API enforces sub-minute smoothing.
10. Final Tips for Developers
- Plan for scale — design your API calls to respect both limits.
- Simulate load using your throughput calculator before launch.
- Log everything — headers, tokens, timestamps, latency.
- Distribute traffic evenly through seconds, not in bursts.
- Stay modular — keep token counting, backoff, and logging in separate utility functions.
11. Real-World Workflow Example
Example: AI Writing App
A content-generation web app sends large prompts to GPT-4o.
- Each request ≈ 1200 tokens (input + output).
- TPM = 250 000, RPM = 500.
- From the calculator → safe ≈ 150 requests/min.
- Add a small queue (size = 20) + retry with jitter.
- Monitor headers in logs.
- If more than 10 % of the token quota remains, increase throughput slightly.
Result → stable app, no 429 errors, predictable latency ≈ 2.2 s per call.
12. Key Takeaways
| Focus Area | Why It Matters | Quick Action |
|---|---|---|
| TPM | Controls total work done | Optimise prompts |
| RPM | Controls call frequency | Smooth requests |
| Backoff | Prevents repeated throttling | Implement exponential retry |
| Monitoring | Ensures visibility | Log headers + alerts |
| Throughput Planning | Avoids overload | Use calculator before deployment |
Understanding both TPM and RPM helps you:
- Predict performance
- Control cost
- Scale your app reliably
13. Conclusion
OpenAI’s API is extremely powerful, but power always comes with control.
Rate limits are not roadblocks — they’re traffic rules to keep the system stable for everyone.
By mastering the balance between Requests Per Minute (RPM) and Tokens Per Minute (TPM), you can:
- Prevent unexpected 429 errors,
- Optimise your costs, and
- Build production-grade AI applications confidently.
You now have:
- A deep understanding of how limits work,
- A Python tool to calculate throughput,
- Real-world handling techniques, and
- A clear strategy for scaling OpenAI APIs safely.
Next Step:
Run your throughput calculator, log your first few thousand calls, and watch how smoothly your app scales without hitting a single rate-limit wall.

Comments