1. Introduction
APIs are like the blood vessels of modern software. They help different systems communicate with each other smoothly and reliably. But when thousands (or even millions) of developers start using the same API — like OpenAI’s API — some kind of control is necessary to keep everything stable and fair.
That’s where API rate limits come in.
In simple words, rate limits define how much and how fast you can send data or make requests to an API. For example, OpenAI allows every user to make only a certain number of requests per minute (RPM) and consume a limited number of tokens per minute (TPM). If you go above that limit, you’ll get a 429 Too Many Requests error.
Why OpenAI Has Rate Limits
Imagine a crowded railway ticket counter. If everyone rushes in without order, the system crashes — people get stuck, tickets aren’t issued, and chaos follows. Similarly, OpenAI’s API serves thousands of developers worldwide. To keep things fair and avoid system overload, OpenAI enforces rate limits.
Rate limits ensure:
- Fair usage: Every developer gets a fair share of system resources.
- Stability: The API stays fast and reliable for everyone.
- Security: It helps prevent abuse or accidental infinite loops.
- Predictable cost: Developers can estimate usage and cost more accurately.
Why Understanding Rate Limits Is Crucial for Developers
Whether you are a fresher learning API integration or an experienced engineer building production-level AI applications, understanding rate limits is critical.
If you ignore them:
- Your app might stop responding suddenly.
- You may hit throttling during traffic spikes.
- You may lose user trust due to frequent “please try again later” messages.
By understanding OpenAI API rate limits properly, you can:
- Design scalable backend systems
- Avoid unnecessary errors and downtime
- Optimize your token usage
- Improve performance and reduce costs
What You’ll Learn from This Blog
By the end of this detailed tutorial, you’ll have complete clarity on:
- What OpenAI API rate limits are
- Difference between TPM (Tokens per Minute) and RPM (Requests per Minute)
- How to calculate your throughput
- How to handle rate-limit errors gracefully using Python code
- And how to build a small throughput calculator to plan your API usage
We’ll go step-by-step in plain, easy Indian English — with examples and formulas — so that even a fresher developer can understand and implement these concepts confidently.
Example Scenario: Why This Matters
Let’s say you’re building a chatbot that uses GPT-4 via the OpenAI API.
You have 1000 users, and each user sends one message per minute. Each message uses roughly 200 tokens (prompt + response).
That’s 1000 × 200 = 200,000 tokens per minute.
If your OpenAI TPM limit is 100,000 tokens/minute — your chatbot will crash or delay replies once traffic crosses that mark.
If you had known your rate limits in advance and planned throughput, you could:
- Queue or stagger requests, or
- Request a higher quota from OpenAI.
This is exactly what this blog helps you understand and solve.
2. Understanding OpenAI API Rate Limits
When you call any OpenAI model (like GPT-4 or GPT-4o) through an API, OpenAI must balance billions of requests from developers across the world.
To maintain this balance, the platform sets rate limits — that is, boundaries on how much traffic your application can send within a fixed time window.
What Is a Rate Limit in an API?
A rate limit defines how many requests or how much data a single user (or API key) can send to the API within a fixed period — usually a minute.
For OpenAI, the limits are mainly based on two key metrics:
- Requests Per Minute (RPM)
- Tokens Per Minute (TPM)
You need to keep both within the allowed quota.
What Is RPM (Requests Per Minute)?
RPM means how many separate API calls you can make in one minute.
- Example: If your plan allows 60 RPM, that means you can make one request every second on average.
- If you exceed it, OpenAI will return an HTTP status of 429 Too Many Requests.
Think of RPM as a traffic police limit — you cannot send too many cars (requests) on the road (API) within a minute.
Example
If your app sends multiple prompts rapidly to GPT-4 (say, 80 requests in 10 seconds) while your RPM limit is 60, you’ll immediately hit a throttle even if the rest of the minute is idle.
So, it’s not only how many per minute but how fast you send them that matters.
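One simple way to avoid such bursts is to space your calls evenly across the minute instead of firing them all at once. Below is a minimal Python sketch; send_request is just a placeholder for whatever function actually calls the API.
import time

def send_evenly(prompts, send_request, rpm_limit=60):
    # Space calls so we never exceed rpm_limit, e.g. one call per second at 60 RPM
    interval = 60.0 / rpm_limit
    for prompt in prompts:
        send_request(prompt)      # your real API call goes here
        time.sleep(interval)      # spread calls evenly instead of bursting

# Example: prints five "requests" roughly one second apart
send_evenly(["q1", "q2", "q3", "q4", "q5"], send_request=print)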
What Is TPM (Tokens Per Minute)?
TPM is the total number of tokens (input + output) your API calls can consume in one minute.
A token is roughly four English characters, or about ¾ of a word.
Every request consumes:
- Input tokens → the text you send in your prompt.
- Output tokens → the text OpenAI’s model generates in response.
So if you send a 200-token prompt and ask for 300 output tokens, that one request uses up to 500 tokens in total.
Example
Let’s say your account has a limit of 90,000 tokens per minute.
If each request uses 900 tokens, you can safely send about 90,000 ÷ 900 = 100 requests per minute.
If you try 120 requests, you’ll exceed your TPM cap and get a 429 error.
RPM vs TPM — How They Work Together
Both limits apply simultaneously, and whichever is hit first will throttle your app.
| Scenario | Request Pattern | Token Pattern | What You Hit First | Why |
|---|---|---|---|---|
| Many tiny requests | Very frequent calls | Small prompts (~50 tokens each) | RPM | Too many calls, even though each uses few tokens |
| Few heavy requests | Low call frequency | Heavy prompts (> 1 000 tokens each) | TPM | Each request consumes lots of tokens |
| Balanced usage | Moderate calls | Moderate tokens | Depends | Both limits share the load |
How OpenAI Counts Tokens
OpenAI uses its internal tokenizer (the tiktoken library) to break text into small pieces called tokens.
Example text:
“Hello India, how are you?”
This may break into tokens like:
[ "Hello", " India", ",", " how", " are", " you", "?" ]
That’s 7 tokens.
Approximate guideline
| Language / Type | Tokens per word (approx.) | Example |
|---|---|---|
| English | ~1.3 | “Hello world” = 2 tokens |
| Indian languages (Hindi / Tamil / Marathi) | Typically higher | Unicode scripts often split into more tokens per word |
| Code (Python / JavaScript) | 1.2 – 1.5 | Keywords, spaces, punctuation count as tokens |
You can use Python’s tiktoken library to check your exact token count before sending requests.
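For example, here is a minimal sketch using tiktoken. It assumes a reasonably recent version of the library; if your version does not recognise the model name, it falls back to a generic encoding.
import tiktoken

def count_tokens(text, model="gpt-4o"):
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")   # generic fallback
    return len(encoding.encode(text))

print(count_tokens("Hello India, how are you?"))   # prints the token count (around 7)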
Why Tokens Matter More Than You Think
For most GPT-based applications, token usage directly impacts cost and performance:
- Every token is billed — so excessive tokens = higher cost.
- Every token consumes quota — so large prompts slow down throughput.
- Fewer tokens = faster response time.
If your prompt is 1000 tokens long, GPT-4 must read and process those tokens before generating an answer. That’s why planning your token budget is crucial.
| Real-World Analogy | Meaning in OpenAI API |
|---|---|
| Number of letters you send in a post | Tokens you consume |
| Number of envelopes you post per minute | Requests per minute |
| Post office limit (per minute) | API rate limit |
| If you send too many letters (tokens) or envelopes (requests) at once | You’ll be throttled (429 error) |
How Limits Differ by Model and Account Tier
OpenAI offers different rate limits depending on:
- Model type: GPT-4o, GPT-4-Turbo, GPT-3.5-Turbo, etc.
- Plan: Free tier, Pay-as-you-go, or Enterprise.
- Usage history: Older accounts with consistent billing often receive higher quotas automatically.
Approximate examples (subject to change):
| Model | RPM | TPM | Intended Use |
|---|---|---|---|
| GPT-3.5-Turbo | 3 500 RPM | 350 000 TPM | General apps, chatbots |
| GPT-4-Turbo | 500 RPM | 80 000 TPM | Complex apps |
| GPT-4o | 1 000 RPM | 250 000 TPM | High throughput apps / production |
(These numbers vary — always check your account’s dashboard.)
How the Limit Resets
Rate limits work in rolling windows, not fixed one-minute clocks.
That means if you hit your TPM limit at 12:00:30, you don’t have to wait till 12:01:00. Your tokens start freeing up gradually as time passes.
Think of it as a conveyor belt — as older requests move out of the last minute, you gain space for new ones.
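If you want to mirror this behaviour in your own code, you can track usage over a rolling 60-second window. Here is a minimal sketch (the class name and structure are illustrative, not an official API):
import time
from collections import deque

class RollingTokenWindow:
    def __init__(self, window_sec=60):
        self.window_sec = window_sec
        self.events = deque()              # (timestamp, tokens) pairs

    def add(self, tokens):
        self.events.append((time.time(), tokens))

    def used_last_minute(self):
        cutoff = time.time() - self.window_sec
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()          # drop usage older than the window
        return sum(tokens for _, tokens in self.events)

window = RollingTokenWindow()
window.add(900)
print(window.used_last_minute())           # 900 (until this entry ages out of the window)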
What Happens When You Exceed the Limit
When you exceed either RPM or TPM, the API responds with:
HTTP/1.1 429 Too Many Requests
{
"error": {
"message": "Rate limit reached for requests or tokens. Please try again later.",
"type": "rate_limit_error",
"param": null,
"code": null
}
}
You can read response headers like:
- x-ratelimit-limit-requests
- x-ratelimit-remaining-requests
- x-ratelimit-reset-requests
- x-ratelimit-limit-tokens
- x-ratelimit-remaining-tokens
- x-ratelimit-reset-tokens
These tell you how much quota remains and when it resets — very helpful for automatic retries.
| Concept | Description | Developer Tip |
|---|---|---|
| RPM | Number of API calls allowed per minute | Batch requests or queue them |
| TPM | Total tokens (prompt + response) per minute | Optimise prompts to reduce token usage |
| Token | 4 characters ≈ ¾ word | Use tiktoken to measure |
| Throttling Error | 429 Too Many Requests | Implement retry with exponential backoff |
| Limit Reset | Rolling window (~60 s) | Plan traffic smoothing |
3. Real-World Example & How to Read Rate-Limit Headers
Understanding theory is one thing.
But to make this concept real, let’s walk through a practical scenario that every developer integrating OpenAI’s API faces sooner or later.
Scenario: A Chatbot with Growing Traffic
Imagine you’ve built a customer support chatbot using GPT-4o via OpenAI’s API.
At the start, you have:
- 10 active users
- Each sending about 2 messages per minute
- Each message (prompt + reply) uses around 500 tokens
So total usage per minute =
10 users × 2 messages × 500 tokens = 10 000 tokens per minute
This is well within most rate limits.
Now, suppose your app grows, and you suddenly get:
- 500 users
- Each sending 3 messages per minute
That’s 500 × 3 × 500 = 750 000 tokens per minute
If your TPM limit is 250 000, your app will start failing intermittently.
You’ll notice delayed responses or see this familiar error:
{
"error": {
"message": "Rate limit reached for tokens per min. Please try again later.",
"type": "rate_limit_error",
"code": null
}
}
This is OpenAI protecting its infrastructure from overload — not a bug in your code.
Error 429 – What It Really Means
The 429 Too Many Requests response means you’re exceeding either:
- The number of allowed requests per minute (RPM), or
- The total tokens per minute (TPM).
OpenAI doesn’t always specify which limit you crossed — but the response headers can tell you exactly what’s happening.
Rate-Limit Headers Explained
When you make an API call, OpenAI includes special headers in its response.
Here’s what they look like (values shown are examples):
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 14
x-ratelimit-reset-requests: 2025-10-22T09:34:52Z
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 12000
x-ratelimit-reset-tokens: 2025-10-22T09:34:56Z
Let’s decode these step-by-step:
| Header Name | Meaning | Example Value | Explanation |
|---|---|---|---|
| x-ratelimit-limit-requests | Your maximum RPM quota | 60 | You can make up to 60 API calls per minute |
| x-ratelimit-remaining-requests | How many API calls you have left in the current window | 14 | You’ve already made 46 requests |
| x-ratelimit-reset-requests | When your request quota resets | 09:34:52 | You can send more calls after this time |
| x-ratelimit-limit-tokens | Your total TPM quota | 90,000 | You can consume up to 90,000 tokens per minute |
| x-ratelimit-remaining-tokens | Tokens left for the current minute | 12,000 | You’ve used 78,000 tokens so far |
| x-ratelimit-reset-tokens | When your token quota resets | 09:34:56 | You can send more tokens after this time |
Python Example – Monitoring Rate-Limit Headers
Let’s see how you can monitor these headers automatically in Python.
This is very helpful when you want your application to detect when you’re close to the limit and back off before hitting it.
import time

from openai import OpenAI, RateLimitError

# Requires the openai Python SDK v1.x
client = OpenAI(api_key="YOUR_API_KEY")

def call_openai(prompt, retries=3):
    try:
        # with_raw_response exposes the HTTP headers along with the parsed body
        raw = client.chat.completions.with_raw_response.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100
        )
        response = raw.parse()

        # Print model response
        print("Response:", response.choices[0].message.content.strip())

        # Print rate limit info from headers
        print("\n--- Rate Limit Info ---")
        for key, value in raw.headers.items():
            if "ratelimit" in key.lower():
                print(f"{key}: {value}")

    except RateLimitError:
        if retries > 0:
            print("⚠️ Rate limit reached. Waiting before retry...")
            time.sleep(5)
            return call_openai(prompt, retries - 1)
        print("Still rate limited after several retries.")
    except Exception as e:
        print("Error:", e)

call_openai("Explain token limits in OpenAI API in one paragraph.")
Explanation:
- The function sends a request to GPT-4o with a small prompt.
- After receiving the response, it prints out all rate-limit headers.
- If it hits the limit, it catches the RateLimitError, waits for 5 seconds, and retries (up to 3 times).
You can expand this logic to:
- Store header data in a log file or database.
- Monitor how close you are to your limit.
- Automatically reduce request frequency when near the cap.
Tip: Exponential Backoff for Safety
A smart way to handle 429 errors is to retry after a delay, increasing the delay each time.
This is called exponential backoff — a widely used technique in distributed systems.
Here’s a simple pattern:
import time
import random
def exponential_backoff(retry_count):
wait = min(60, (2 ** retry_count) + random.random())
print(f"Retrying after {wait:.2f} seconds...")
time.sleep(wait)
This avoids hammering the API again immediately, which could make things worse.
Instead, it waits for a gradually longer time after each retry, up to 60 seconds.
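For example, here is a minimal sketch that wires the exponential_backoff helper above into a retry loop. It assumes the openai v1 SDK (where throttling raises openai.RateLimitError); call is any function that performs one API request.
import openai

def call_with_retries(call, max_retries=5):
    for attempt in range(max_retries):
        try:
            return call()
        except openai.RateLimitError:
            exponential_backoff(attempt)   # wait 1s, 2s, 4s, ... (capped at 60s) plus jitter
    raise RuntimeError("Still rate-limited after all retries")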
Example Log Output (Real API Monitoring)
Here’s what a sample output may look like in your logs:
Response: Tokens are small pieces of text processed by the model...
--- Rate Limit Info ---
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 52
x-ratelimit-reset-requests: 2025-10-22T09:44:21Z
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 83500
x-ratelimit-reset-tokens: 2025-10-22T09:44:23Z
This tells you:
- You’ve made 8 of 60 requests in this minute.
- You’ve used ~6 500 tokens of your 90 000 TPM.
- You still have plenty of quota left.
Such real-time visibility helps you predict throttling before it happens.
Simple Dashboard Idea
If you’re building a production app, you can display these metrics visually:
| Metric | Description | Status |
|---|---|---|
| Requests Used | Number of API calls this minute | 8 / 60 |
| Tokens Used | Total tokens used this minute | 6 500 / 90 000 |
| Remaining Time | When limits reset | 35 seconds |
| Throttling Risk | High / Medium / Low | Low |
This dashboard helps you or your DevOps team maintain a healthy flow of API traffic without sudden failures.
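As a starting point, the “Throttling Risk” column can be computed from the rate-limit headers with a tiny helper like this (the thresholds are illustrative, not official values):
def throttling_risk(remaining, limit):
    ratio = remaining / limit if limit else 0.0
    if ratio < 0.10:
        return "High"        # less than 10% of quota left
    if ratio < 0.30:
        return "Medium"
    return "Low"

print(throttling_risk(83500, 90000))   # "Low": plenty of token quota left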
Why Reading Headers Is Crucial
If you’re working in a real-world application (like chatbots, data analysis tools, or automation systems), reading rate-limit headers helps you:
- Avoid random API failures (by adjusting request rate).
- Estimate real-time throughput.
- Plan scaling — know when you need a higher quota.
- Debug faster — find out whether a failure is due to your code or OpenAI’s limits.
| Topic | What You Learned |
|---|---|
| Error 429 | It means you exceeded RPM or TPM |
| Rate-limit headers | Show remaining quota and reset time |
| Python monitoring | You can log headers to avoid throttling |
| Exponential backoff | Safely retry without flooding the API |
| Dashboard metrics | Useful for production monitoring |
4. RPM vs TPM: Deep Comparison and Interaction
When you use OpenAI’s API, both Requests Per Minute (RPM) and Tokens Per Minute (TPM) limits apply simultaneously.
You can think of them as two gates guarding the same road:
- One limits how many vehicles enter (RPM).
- The other limits how much total weight those vehicles carry (TPM).
If you break either rule, you’ll get throttled — even if the other gate was still open.
The Core Difference Between RPM and TPM
| Feature | RPM (Requests Per Minute) | TPM (Tokens Per Minute) |
|---|---|---|
| Definition | Number of API calls you can send in one minute | Total number of tokens (input + output) you can send and receive per minute |
| Unit | Requests (calls) | Tokens (prompt + response) |
| Who hits it first | Apps sending many small requests rapidly | Apps sending few large prompts or responses |
| Typical use case | Real-time chatbots, web UIs, small prompts | Large document summarizers, coding assistants, translators |
| Throttling reason | Too many hits per minute | Too many tokens processed per minute |
| Optimization approach | Add batching, throttling, or queues | Reduce prompt size or output tokens |
Relationship Between RPM and TPM
Although both are separate, they’re mathematically related through average token usage per request.
Let’s call:
- T_request = Average tokens used per request (input + output)
- TPM_limit = Your total allowed tokens per minute
- RPM_limit = Your total allowed requests per minute
Then your maximum feasible requests per minute (based on tokens) is:
Requests_by_TPM = TPM_limit ÷ T_request
And your actual allowed requests per minute is:
Actual_RPM = min(RPM_limit, Requests_by_TPM)
This formula helps you decide whether your app will be limited by RPM or TPM first.
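In Python, the same formula is a small helper; the sample numbers below match Examples 1 and 2 that follow.
def actual_rpm(rpm_limit, tpm_limit, tokens_per_request):
    requests_by_tpm = tpm_limit // tokens_per_request
    return min(rpm_limit, requests_by_tpm)

print(actual_rpm(500, 100_000, 200))   # 500 -> balanced (Example 1)
print(actual_rpm(500, 100_000, 800))   # 125 -> TPM-bound (Example 2)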
Example 1 – Small Prompts, High Traffic
Let’s say:
- TPM limit = 100 000
- RPM limit = 500
- Average request = 200 tokens
Requests_by_TPM = 100,000 ÷ 200 = 500
So here, both RPM and TPM allow roughly the same throughput — 500 requests per minute.
You’re balanced.
Example 2 – Large Prompts, Fewer Requests
Now imagine:
- TPM limit = 100 000
- RPM limit = 500
- Average request = 800 tokens
Requests_by_TPM = 100,000 ÷ 800 = 125
Even though your RPM limit is 500, your token usage caps you at 125 requests per minute.
You’ll hit the TPM limit before you ever reach your RPM cap.
That’s why optimizing tokens is often more impactful than optimizing the number of requests.
Example 3 – Heavy Output Models (GPT-4o)
For GPT-4o, many developers hit TPM early because outputs are richer and longer.
Suppose you send only 50 requests per minute, but each uses 3 000 tokens (large code or document generation):
50 × 3,000 = 150,000 tokens per minute
If your TPM limit is 120 000, you’ll get throttled, even though your RPM (50 ≪ 500) looks fine.
Moral: You can hit your token limit without ever hitting your request limit.
How to Find Out Which Limit You’re Hitting
You can inspect the headers from your API response (as we saw earlier):
| Header | What It Tells You | If Value Is Small |
|---|---|---|
| x-ratelimit-remaining-requests | Requests left before hitting RPM | You’re close to request limit |
| x-ratelimit-remaining-tokens | Tokens left before hitting TPM | You’re close to token limit |
You can log these values in your code or monitoring dashboard.
Once you see either of them dropping rapidly, you know where the bottleneck is.
Analogy: Cars on a Highway
Think of OpenAI’s servers like a toll plaza:
- Each car = one API request.
- Each car’s weight = number of tokens used in that request.
- There’s a limit on how many cars can pass (RPM) and how much total weight can pass (TPM).
If you send many small cars, you might hit the car limit (RPM).
If you send fewer but very heavy trucks, you’ll hit the weight limit (TPM).
Either way, the toll gate (OpenAI API) won’t let more through until some time passes.
Quick Throughput Calculator Table
Here’s a simple ready-to-use table for developers to estimate how many requests per minute are possible for different prompt sizes and token limits (assuming an RPM limit of 500).
| TPM Limit | Avg Tokens per Request | Possible RPM (approx) | Bottleneck |
|---|---|---|---|
| 60 000 | 200 | 300 | TPM |
| 60 000 | 500 | 120 | TPM |
| 60 000 | 1 000 | 60 | TPM |
| 120 000 | 200 | 500 | RPM |
| 120 000 | 800 | 150 | TPM |
| 250 000 | 500 | 500 | RPM balanced |
| 250 000 | 1 500 | 166 | TPM |
How to use this table:
Find your TPM limit (from your account dashboard) and estimate your average token use per request (using the tiktoken library).
Then look up your possible RPM. That’s the safe request rate you should aim for.
Key Developer Takeaways
| Principle | Explanation | Practical Tip |
|---|---|---|
| Both RPM and TPM matter | You can hit either limit first | Monitor both in your logs |
| Tokens dominate cost | Large prompts = large cost and throttling | Minimise unnecessary context |
| Balance requests & tokens | The lighter your prompt, the more requests you can send | Compress or summarise input |
| Always plan for bursts | Short spikes may exceed limits even if your average is safe | Implement a queue system |
| Monitor real time | Track rate-limit headers and token usage | Use logging or metrics dashboard |
Pro Developer Tip: Estimate Safe Concurrency
If your TPM limit is 250 000 tokens/min and each request consumes 500 tokens,
you can handle ≈ 500 requests/minute safely.
If each request takes ~2 seconds, your safe concurrent requests ≈ (500 requests ÷ 60 seconds) × 2 seconds ≈ 16 parallel calls.
So, about 16 concurrent threads can run safely without hitting limits.
More than that, you risk short-term throttling.
Real-World Use Cases
| Use Case | Typical RPM Behavior | Typical TPM Behavior | Common Issue |
|---|---|---|---|
| Chatbot / Assistant | Many small requests | Few tokens per call | Hits RPM first |
| Content Generator | Fewer requests | Large output tokens | Hits TPM first |
| Code Assistant | Moderate RPM | Moderate tokens (500–1 500) | Balanced |
| Batch Processor | Large batch inputs | Many tokens per call | TPM limit often reached |
By mapping your application type to this table, you can predict where your limitation lies and plan accordingly.
Debugging When Limits Are Hit
When your API suddenly throws a RateLimitError, you can quickly identify which limit was exceeded:
- Check the error response headers.
- If remaining-tokens = 0 → TPM exceeded.
- If remaining-requests = 0 → RPM exceeded.
- If both are still high but you see throttling → you’re likely bursting too fast (too many requests within a few seconds).
You can handle bursts with rate-limiting libraries in your backend (like asyncio.Semaphore, tenacity, or ratelimit in Python).
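For instance, here is a minimal sketch using asyncio.Semaphore with the async client from the openai v1 SDK. The concurrency value of 16 comes from the earlier estimate; adjust it to your own numbers.
import asyncio
from openai import AsyncOpenAI

SAFE_CONCURRENCY = 16
client = AsyncOpenAI(api_key="YOUR_API_KEY")
semaphore = asyncio.Semaphore(SAFE_CONCURRENCY)

async def ask(prompt):
    async with semaphore:                      # at most SAFE_CONCURRENCY calls in flight
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Question {i}" for i in range(50)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(len(answers), "answers received")

# asyncio.run(main())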
When to Request Higher Limits
If your application consistently operates near the top of your limits:
- Check the usage dashboard on OpenAI’s account page.
- Collect logs showing average RPM/TPM usage.
- Contact OpenAI support with your use case and current usage patterns.
Usually, consistent, responsible usage over time automatically earns higher quotas.
| Topic | Key Insight | Developer Action |
|---|---|---|
| RPM | Number of requests per minute | Control request frequency |
| TPM | Total tokens per minute | Optimise prompt and output |
| Interaction | Whichever limit is hit first throttles you | Monitor both with headers |
| Formula | Actual_RPM = min(RPM_limit, TPM_limit / T_request) | Use this to plan throughput |
| Throughput table | Helps forecast capacity | Use before scaling |
| Safe concurrency | (Actual_RPM / 60) × request_duration | Decide number of threads/workers |
5. Throughput Calculator: Estimation & Implementation
In this section, we’ll convert the ideas from Sections 2–4 into a practical calculator you can use every day.
We’ll first define the inputs, then write the formulas, and finally build a small Python utility that prints a clean table and tells you how many requests you can safely send per minute and how many workers (concurrent calls) you can run without hitting limits.
We’ll keep the language simple and the maths light, so freshers can follow easily, and experienced developers can plug this into CI/ops right away.
What Inputs Do We Need?
To estimate safe throughput, collect these 6 values:
1. TPM_limit – Your OpenAI Tokens Per Minute limit.
2. RPM_limit – Your OpenAI Requests Per Minute limit.
3. T_prompt – Average input tokens per request.
4. T_output – Average output tokens per request (or your max_tokens if you keep it fixed).
5. latency_sec – Average time a single request takes (seconds).
6. burst_factor (optional) – Extra safety margin for short spikes. Default: 1.0 (no extra safety). Use 0.8 for 20% headroom.
From (3) and (4) we compute T_request = T_prompt + T_output.
Core Formulas

This tells you roughly how many simultaneous calls you can keep in flight, without triggering per-second/burst throttles.
Tip: Use a small queue + token bucket/throttle so workers don’t fire all at once.
Worked Examples (Step-by-Step)
Example A — Small Prompts (Chatbot)
- TPM_limit = 120,000
- RPM_limit = 500
- T_prompt = 150, T_output = 150 ⇒ T_request = 300
- latency_sec = 2.0, burst_factor = 0.9
Calculations:
- Requests_by_TPM = 120,000 ÷ 300 = 400
- Actual_RPM = min(500, 400) = 400
- Safe_RPM = floor(400 × 0.9) = 360
- Tokens_per_min = 360 × 300 = 108,000
- Safe_Concurrency = floor((360/60) × 2.0) = floor(6 × 2) = 12
Conclusion: Target ~360 RPM with ~12 concurrent calls.
Example B — Large Outputs (Content Generator)
- TPM_limit = 250,000
- RPM_limit = 500
- T_prompt = 400, T_output = 1200 ⇒ T_request = 1600
- latency_sec = 6.0, burst_factor = 0.85
Calculations:
- Requests_by_TPM = 250,000 ÷ 1600 = 156
- Actual_RPM = min(500, 156) = 156
- Safe_RPM = floor(156 × 0.85) = 132
- Tokens_per_min = 132 × 1600 = 211,200
- Safe_Concurrency = floor((132/60) × 6.0) = floor(2.2 × 6) = 13
Conclusion: Target ~132 RPM with ~13 concurrent calls.
Ready-to-Use Throughput Calculator Table
Pick the row closest to your usage pattern. (You can adjust numbers later with the Python tool.)
Assumptions per block are listed; each block shows how Avg Tokens/Req changes the safe RPM under different TPM limits. We keep burst_factor = 0.9 for safety.
Block 1 — RPM_limit = 500
| TPM Limit | Avg Tokens/Req | Requests_by_TPM | Actual_RPM | Safe_RPM (×0.9) | Tokens/min @ Safe |
|---|---|---|---|---|---|
| 60,000 | 200 | 300 | 300 | 270 | 54,000 |
| 60,000 | 500 | 120 | 120 | 108 | 54,000 |
| 60,000 | 1,000 | 60 | 60 | 54 | 54,000 |
| 120,000 | 200 | 600 | 500 | 450 | 90,000 |
| 120,000 | 800 | 150 | 150 | 135 | 108,000 |
| 250,000 | 500 | 500 | 500 | 450 | 225,000 |
| 250,000 | 1,500 | 166 | 166 | 149 | 223,500 |
Note: When Requests_by_TPM exceeds RPM_limit, Actual_RPM is capped by RPM_limit.
Block 2 — RPM_limit = 1000
| TPM Limit | Avg Tokens/Req | Requests_by_TPM | Actual_RPM | Safe_RPM (×0.9) | Tokens/min @ Safe |
|---|---|---|---|---|---|
| 120,000 | 200 | 600 | 600 | 540 | 108,000 |
| 120,000 | 500 | 240 | 240 | 216 | 108,000 |
| 250,000 | 300 | 833 | 833 | 749 | 224,700 |
| 250,000 | 1,000 | 250 | 250 | 225 | 225,000 |
Use these tables to sanity-check your plan before load testing.
A Practical Python Calculator (CLI-Style)
The script below:
- Asks for your TPM/RPM limits, prompt/output tokens, avg latency, and burst factor.
- Prints your safe RPM, expected tokens/min, and recommended concurrency.
- Also prints a small scenario table for multiple token sizes (nice for quick “what-if” checks).
You can paste this into a file like throughput_calc.py and run it with Python 3.10+.
import math
def calc_throughput(TPM_limit, RPM_limit, T_prompt, T_output, latency_sec=2.0, burst_factor=0.9):
T_request = T_prompt + T_output
if T_request <= 0:
raise ValueError("Average tokens per request must be > 0")
requests_by_tpm = TPM_limit // T_request
actual_rpm = min(RPM_limit, requests_by_tpm)
safe_rpm = math.floor(actual_rpm * burst_factor)
tokens_per_min = safe_rpm * T_request
safe_concurrency = math.floor((safe_rpm / 60.0) * latency_sec)
return {
"T_request": T_request,
"Requests_by_TPM": requests_by_tpm,
"Actual_RPM": actual_rpm,
"Safe_RPM": safe_rpm,
"Tokens_per_min": tokens_per_min,
"Safe_Concurrency": safe_concurrency
}
def pretty(n): # simple thousands formatting
return f"{n:,}"
def scenario_table(TPM_limit, RPM_limit, lat, burst, token_sizes=(200, 300, 500, 800, 1000, 1500)):
print("\n=== Scenario Table (vary Avg Tokens/Req) ===")
print(f"TPM_limit={pretty(TPM_limit)}, RPM_limit={pretty(RPM_limit)}, latency={lat}s, burst_factor={burst}")
print(f"{'AvgTokens':>10} | {'ReqByTPM':>9} | {'ActualRPM':>9} | {'SafeRPM':>7} | {'Tok/min':>10}")
print("-" * 60)
for t in token_sizes:
res = calc_throughput(TPM_limit, RPM_limit, T_prompt=t//2, T_output=t - t//2, latency_sec=lat, burst_factor=burst)
print(f"{t:>10} | {pretty(res['Requests_by_TPM']):>9} | {pretty(res['Actual_RPM']):>9} | {pretty(res['Safe_RPM']):>7} | {pretty(res['Tokens_per_min']):>10}")
if __name__ == "__main__":
# === Collect inputs ===
try:
TPM_limit = int(input("Enter TPM_limit (e.g., 250000): ").strip())
RPM_limit = int(input("Enter RPM_limit (e.g., 500): ").strip())
T_prompt = int(input("Avg prompt tokens per request (e.g., 200): ").strip())
T_output = int(input("Avg output tokens per request (e.g., 300): ").strip())
latency_sec = float(input("Avg latency per request in seconds (e.g., 2.0): ").strip() or "2.0")
burst_factor = float(input("Burst safety factor (0.7–1.0, e.g., 0.9): ").strip() or "0.9")
res = calc_throughput(TPM_limit, RPM_limit, T_prompt, T_output, latency_sec, burst_factor)
print("\n=== Throughput Result ===")
print(f"Avg tokens/request (T_request) : {pretty(res['T_request'])}")
print(f"Requests limited by TPM : {pretty(res['Requests_by_TPM'])} rpm")
print(f\"Actual RPM (respect RPM/TPM) : {pretty(res['Actual_RPM'])} rpm\")
print(f\"Safe RPM (after burst factor): {pretty(res['Safe_RPM'])} rpm\")
print(f\"Tokens per minute at Safe RPM: {pretty(res['Tokens_per_min'])}\")
print(f\"Recommended Safe Concurrency : {pretty(res['Safe_Concurrency'])} workers\")
# quick what-if table
scenario_table(TPM_limit, RPM_limit, latency_sec, burst_factor)
print("\nTips:")
print("• Lower Avg tokens/request to increase possible RPM under the same TPM.")
print("• If Safe_RPM equals RPM_limit often, consider asking for higher RPM or batch requests.")
print("• If Tokens_per_min ~= TPM_limit, reduce T_output or prompt size, or request higher TPM.")
except Exception as e:
print("Input/Error:", e)
How to use:
- Run the script and enter your values (from your OpenAI dashboard and logs).
- The script prints your Safe RPM and Safe Concurrency.
- Use the scenario table to see how changing the average token usage changes your capacity (you can also import calc_throughput directly, as shown below).
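For a quick non-interactive check, this sketch reuses the numbers from Example B; it assumes you saved the script above as throughput_calc.py.
from throughput_calc import calc_throughput

res = calc_throughput(
    TPM_limit=250_000, RPM_limit=500,
    T_prompt=400, T_output=1200,
    latency_sec=6.0, burst_factor=0.85,
)
print(res["Safe_RPM"], res["Safe_Concurrency"])   # 132 13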
Interpreting the Calculator’s Output
- If Actual_RPM = RPM_limit and Tokens_per_min is much lower than TPM_limit, then RPM is your bottleneck. Consider:
  - Batching multiple small prompts in one request (if your workflow allows), or
  - Asking for a higher RPM quota.
- If Tokens_per_min ≈ TPM_limit, then TPM is your bottleneck. Consider:
  - Reducing max_tokens,
  - Compressing context (few-shot > many-shot),
  - Using summaries/embeddings to shrink inputs.
- If Safe_Concurrency is very low (e.g., 2–3) but your service needs higher parallelism, then:
  - Work with a request queue + token bucket (dispatch slowly but consistently),
  - Spread traffic more evenly (no spikes).
Optional: Simple Token Bucket (Python Sketch)
If you want a tiny in-process throttle that respects a target RPM, this small token-bucket loop helps (a minimal sketch; dispatch_request is whatever function sends one API call):
import time

def token_bucket_dispatch(dispatch_request, safe_rpm):
    bucket_capacity = safe_rpm              # e.g., 360
    refill_rate = safe_rpm / 60.0           # permits added back per second
    bucket = float(bucket_capacity)
    last = time.time()
    while True:
        now = time.time()
        bucket = min(bucket_capacity, bucket + refill_rate * (now - last))
        last = now
        if bucket >= 1:
            bucket -= 1
            dispatch_request()              # send one API call
        else:
            time.sleep(1.0 / refill_rate)   # wait until roughly one permit refills
This keeps your per-second rate smooth so you don’t accidentally burst and get 429s.
6. Best Practices to Handle Rate Limits
Once you understand rate limits, the next step is handling them smartly in production.
Here’s a concise checklist of best developer practices:
Technical Practices
- Implement Exponential Backoff: Retry after increasing delays (e.g., 2s → 4s → 8s).
- Add Random Jitter: Add a small random time to avoid synchronized retries.
- Use Queues: Maintain a small request queue to control bursts.
- Batch Requests: Combine multiple small prompts into a single call if possible.
- Cache Responses: If the same prompt repeats, reuse the last response (a minimal caching sketch follows this list).
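Here is a minimal in-memory caching sketch for the last point above. In a real app the body of cached_generate would call the OpenAI API; the names here are illustrative.
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=1024)
def cached_generate(prompt):
    global call_count
    call_count += 1
    # In a real app, this is where you would call the OpenAI API
    return f"(model reply to: {prompt})"

cached_generate("What is TPM?")
cached_generate("What is TPM?")   # served from the cache, no second "API call"
print(call_count)                  # 1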
Application Design Practices
- Track Usage: Log and visualize rate-limit headers.
- Optimize Prompts: Keep prompts shorter and remove redundant context.
- Limit max_tokens: Don’t set huge output sizes when not needed.
- Graceful Errors: Show friendly “Please wait” messages to users.
- Request Higher Quota: If consistently maxing out, request higher RPM/TPM from OpenAI.
7. How to Monitor and Debug Rate Limits
Key Things to Track
- x-ratelimit-remaining-requests
- x-ratelimit-remaining-tokens
- Response latency trends (to detect throttling).
Monitoring Setup
- Use Prometheus + Grafana or a simple CSV log to track usage.
- Add alerts when remaining tokens fall below 10 % of quota.
- Log timestamp, RPM, TPM, and error-429 frequency (a minimal CSV sketch follows this list).
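A minimal sketch of such a CSV log, using the x-ratelimit-* headers discussed earlier (the file name and columns are just an example):
import csv
import os
from datetime import datetime, timezone

def log_usage(headers, status_code, path="rate_limit_log.csv"):
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "remaining_requests": headers.get("x-ratelimit-remaining-requests", ""),
        "remaining_tokens": headers.get("x-ratelimit-remaining-tokens", ""),
        "status_code": status_code,
    }
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if new_file:
            writer.writeheader()       # write the header row once for a new file
        writer.writerow(row)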
Debug Steps
- If you see 429 errors → print header values.
- If remaining-tokens = 0 → reduce output size.
- If remaining-requests = 0 → slow down request rate.
- If both fine → likely burst issue → add throttling.
8. Common Mistakes Developers Make
| Mistake | Impact | Quick Fix |
|---|---|---|
| Ignoring token usage | Sudden throttling | Measure with tiktoken |
| Setting huge max_tokens | Wasted TPM | Tune output length |
| Sending many short bursts | 429 errors | Queue + Backoff |
| Not reading headers | No visibility | Always log them |
| Assuming fixed minute reset | Unpredictable retry | Remember: rolling window |
9. FAQs
Q 1. Does the limit reset exactly every minute?
No — it’s a rolling window. Tokens free up gradually within 60 seconds.
Q 2. Are limits the same for all models?
No. GPT-4 and GPT-4o have tighter TPM limits than GPT-3.5. Check your dashboard.
Q 3. What happens if I have multiple API keys?
Each key has its own quota, but total usage per account still matters.
Q 4. Can I increase limits manually?
Yes. Submit a request via OpenAI dashboard or upgrade your billing tier.
Q 5. Why am I throttled even when I’m under my limits?
Most likely you’re sending short bursts; the API enforces sub-minute smoothing.
10. Final Tips for Developers
- Plan for scale — design your API calls to respect both limits.
- Simulate load using your throughput calculator before launch.
- Log everything — headers, tokens, timestamps, latency.
- Distribute traffic evenly through seconds, not in bursts.
- Stay modular — keep token counting, backoff, and logging in separate utility functions.
11. Real-World Workflow Example
Example: AI Writing App
A content-generation web app sends large prompts to GPT-4o.
- Each request ≈ 1200 tokens (input + output).
- TPM = 250 000, RPM = 500.
- From the calculator → safe ≈ 150 requests/min.
- Add a small queue (size = 20) + retry with jitter.
- Monitor headers in logs.
- If more than 10 % of the token quota remains, increase throughput slightly.
Result → stable app, no 429 errors, predictable latency ≈ 2.2 s per call.
12. Key Takeaways
| Focus Area | Why It Matters | Quick Action |
|---|---|---|
| TPM | Controls total work done | Optimise prompts |
| RPM | Controls call frequency | Smooth requests |
| Backoff | Prevents repeated throttling | Implement exponential retry |
| Monitoring | Ensures visibility | Log headers + alerts |
| Throughput Planning | Avoids overload | Use calculator before deployment |
Understanding both TPM and RPM helps you:
- Predict performance
- Control cost
- Scale your app reliably
13. Conclusion
OpenAI’s API is extremely powerful, but power always comes with control.
Rate limits are not roadblocks — they’re traffic rules to keep the system stable for everyone.
By mastering the balance between Requests Per Minute (RPM) and Tokens Per Minute (TPM), you can:
- Prevent unexpected 429 errors,
- Optimise your costs, and
- Build production-grade AI applications confidently.
You now have:
- A deep understanding of how limits work,
- A Python tool to calculate throughput,
- Real-world handling techniques, and
- A clear strategy for scaling OpenAI APIs safely.
Next Step:
Run your throughput calculator, log your first few thousand calls, and watch how smoothly your app scales without hitting a single rate-limit wall.

Comments