1. Introduction

APIs are like the blood vessels of modern software. They help different systems communicate with each other smoothly and reliably. But when thousands (or even millions) of developers start using the same API — like OpenAI’s API — some kind of control is necessary to keep everything stable and fair.
That’s where API rate limits come in.

In simple words, rate limits define how much and how fast you can send data or make requests to an API. For example, OpenAI allows every user to make only a certain number of requests per minute (RPM) and consume a limited number of tokens per minute (TPM). If you go above that limit, you’ll get a 429 Too Many Requests error.


Why OpenAI Has Rate Limits

Imagine a crowded railway ticket counter. If everyone rushes in without order, the system crashes — people get stuck, tickets aren’t issued, and chaos follows. Similarly, OpenAI’s API serves thousands of developers worldwide. To keep things fair and avoid system overload, OpenAI enforces rate limits.

Rate limits ensure:

  • Fair usage: Every developer gets a fair share of system resources.
  • Stability: The API stays fast and reliable for everyone.
  • Security: It helps prevent abuse or accidental infinite loops.
  • Predictable cost: Developers can estimate usage and cost more accurately.

Why Understanding Rate Limits Is Crucial for Developers

Whether you are a fresher learning API integration or an experienced engineer building production-level AI applications, understanding rate limits is critical.
If you ignore them:

  • Your app might stop responding suddenly.
  • You may hit throttling during traffic spikes.
  • You may lose user trust due to frequent “please try again later” messages.

By understanding OpenAI API rate limits properly, you can:
– Design scalable backend systems
– Avoid unnecessary errors and downtime
– Optimize your token usage
– Improve performance and reduce costs


What You’ll Learn from This Blog

By the end of this detailed tutorial, you’ll have complete clarity on:

  • What OpenAI API rate limits are
  • Difference between TPM (Tokens per Minute) and RPM (Requests per Minute)
  • How to calculate your throughput
  • How to handle rate-limit errors gracefully using Python code
  • And how to build a small throughput calculator to plan your API usage

We’ll go step-by-step in plain, easy Indian English — with examples and formulas — so that even a fresher developer can understand and implement these concepts confidently.


Example Scenario: Why This Matters

Let’s say you’re building a chatbot that uses GPT-4 via the OpenAI API.
You have 1000 users, and each user sends one message per minute. Each message uses roughly 200 tokens (prompt + response).
That’s 1000 × 200 = 200,000 tokens per minute.

If your OpenAI TPM limit is 100,000 tokens/minute — your chatbot will crash or delay replies once traffic crosses that mark.

If you had known your rate limits in advance and planned throughput, you could:

  • Queue or stagger requests, or
  • Request a higher quota from OpenAI.

This is exactly what this blog helps you understand and solve.

2. Understanding OpenAI API Rate Limits

When you call any OpenAI model (like GPT-4 or GPT-4o) through an API, OpenAI must balance billions of requests from developers across the world.
To maintain this balance, the platform sets rate limits — that is, boundaries on how much traffic your application can send within a fixed time window.


What Is a Rate Limit in an API?

A rate limit defines how many requests or how much data a single user (or API key) can send to the API within a fixed period — usually a minute.

For OpenAI, the limits are mainly based on two key metrics:

  1. Requests Per Minute (RPM)
  2. Tokens Per Minute (TPM)

You need to keep both within the allowed quota.


What Is RPM (Requests Per Minute)?

RPM means how many separate API calls you can make in one minute.

  • Example: If your plan allows 60 RPM, that means you can make one request every second on average.
  • If you exceed it, OpenAI will return an HTTP status 429 Too Many Requests.

Think of RPM as a traffic police limit — you cannot send too many cars (requests) on the road (API) within a minute.

Example

If your app sends multiple prompts rapidly to GPT-4 (say, 80 requests in 10 seconds) while your RPM limit is 60, you’ll immediately hit a throttle even if the rest of the minute is idle.

So, it’s not only how many per minute but how fast you send them that matters.
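As a rough illustration, one simple way to avoid such bursts is to pace calls evenly across the minute. The sketch below is a minimal example assuming a 60 RPM plan; send_request and prompts are placeholders for your own API call and workload.

import time

RPM_LIMIT = 60
INTERVAL = 60 / RPM_LIMIT          # one request per second on average

def send_request(prompt):
    # Placeholder for your real API call
    print("sending:", prompt)

prompts = ["Hi", "What is a token?", "Explain RPM"]   # illustrative workload

for prompt in prompts:
    send_request(prompt)
    time.sleep(INTERVAL)           # even spacing keeps you under the RPM cap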


What Is TPM (Tokens Per Minute)?

TPM is the total number of tokens (input + output) your API calls can consume in one minute.
A token is roughly four English characters, or about ¾ of a word.

Every request consumes:

  • Input tokens → the text you send in your prompt.
  • Output tokens → the text OpenAI’s model generates in response.

So if you send a 200-token prompt and ask for 300 output tokens, that one request uses up to 500 tokens in total.

Example

Let’s say your account has a limit of 90 000 tokens per minute.
If each request uses 900 tokens, you can safely send about 90,000 ÷ 900 = 100 requests per minute.

If you try 120 requests, you’ll exceed your TPM cap and get a 429 error.


RPM vs TPM — How They Work Together

Both limits apply simultaneously, and whichever is hit first will throttle your app.

| Scenario | Request Pattern | Token Pattern | What You Hit First | Why |
|---|---|---|---|---|
| Many tiny requests | Very frequent calls | Small prompts (~50 tokens each) | RPM | Too many calls, even though tokens are small |
| Few heavy requests | Low call frequency | Heavy prompts (> 1,000 tokens each) | TPM | Each request consumes lots of tokens |
| Balanced usage | Moderate calls | Moderate tokens | Depends | Both limits share the load |

How OpenAI Counts Tokens

OpenAI uses its internal tokenizer (the tiktoken library) to break text into small pieces called tokens.

Example text:

“Hello India, how are you?”

This may break into tokens like:

[ "Hello", " India", ",", " how", " are", " you", "?" ]
That’s 7 tokens.

Approximate guideline

| Language / Type | Tokens per Word (approx.) | Example |
|---|---|---|
| English | 0.75 | “Hello world” = 2 tokens |
| Indian languages (Hindi / Tamil / Marathi) | 1.0 – 1.2 | Varies slightly due to Unicode |
| Code (Python / JavaScript) | 1.2 – 1.4 | Keywords, spaces, punctuation count more |

You can use Python’s tiktoken library to check your exact token count before sending requests.
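For example, a small helper like the one below counts tokens before you send a request. This is a minimal sketch assuming the tiktoken package is installed; the fallback encoding is a common general-purpose choice, not an official mapping for every model.

import tiktoken

def count_tokens(text, model="gpt-4o"):
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

print(count_tokens("Hello India, how are you?"))   # roughly 7 tokens for this sentence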


Why Tokens Matter More Than You Think

For most GPT-based applications, token usage directly impacts cost and performance:

  • Every token is billed — so excessive tokens = higher cost.
  • Every token consumes quota — so large prompts slow down throughput.
  • Fewer tokens = faster response time.

If your prompt is 1000 tokens long, GPT-4 must read and process those tokens before generating an answer. That’s why planning your token budget is crucial.

| Real-World Analogy | Meaning in OpenAI API |
|---|---|
| Number of letters you send in a post | Tokens you consume |
| Number of envelopes you post per minute | Requests per minute |
| Post office limit (per minute) | API rate limit |
| Sending too many letters (tokens) or envelopes (requests) at once | You’ll be throttled (429 error) |

How Limits Differ by Model and Account Tier

OpenAI offers different rate limits depending on:

  1. Model type: GPT-4o, GPT-4-Turbo, GPT-3.5-Turbo, etc.
  2. Plan: Free tier, Pay-as-you-go, or Enterprise.
  3. Usage history: Older accounts with consistent billing often receive higher quotas automatically.

Approximate examples (subject to change):

| Model | RPM | TPM | Intended Use |
|---|---|---|---|
| GPT-3.5-Turbo | 3,500 | 350,000 | General apps, chatbots |
| GPT-4-Turbo | 500 | 80,000 | Complex apps |
| GPT-4o | 1,000 | 250,000 | High-throughput apps / production |

(These numbers vary — always check your account’s dashboard.)


How the Limit Resets

Rate limits work in rolling windows, not fixed one-minute clocks.
That means if you hit your TPM limit at 12:00:30, you don’t have to wait till 12:01:00. Your tokens start freeing up gradually as time passes.

Think of it as a conveyor belt — as older requests move out of the last minute, you gain space for new ones.


What Happens When You Exceed the Limit

When you exceed either RPM or TPM, the API responds with:

HTTP/1.1 429 Too Many Requests
{
  "error": {
    "message": "Rate limit reached for requests or tokens. Please try again later.",
    "type": "rate_limit_error",
    "param": null,
    "code": null
  }
}

You can read response headers like:

  • x-ratelimit-limit-requests
  • x-ratelimit-remaining-requests
  • x-ratelimit-reset-requests
  • x-ratelimit-limit-tokens
  • x-ratelimit-remaining-tokens
  • x-ratelimit-reset-tokens

These tell you how much quota remains and when it resets — very helpful for automatic retries.

| Concept | Description | Developer Tip |
|---|---|---|
| RPM | Number of API calls allowed per minute | Batch requests or queue them |
| TPM | Total tokens (prompt + response) per minute | Optimise prompts to reduce token usage |
| Token | ≈ 4 characters ≈ ¾ of a word | Use tiktoken to measure |
| Throttling error | 429 Too Many Requests | Implement retry with exponential backoff |
| Limit reset | Rolling window (~60 s) | Plan traffic smoothing |

3. Real-World Example & How to Read Rate-Limit Headers

Understanding theory is one thing.
But to make this concept real, let’s walk through a practical scenario that every developer integrating OpenAI’s API faces sooner or later.


Scenario: A Chatbot with Growing Traffic

Imagine you’ve built a customer support chatbot using GPT-4o via OpenAI’s API.

At the start, you have:

  • 10 active users
  • Each sending about 2 messages per minute
  • Each message (prompt + reply) uses around 500 tokens

So total usage per minute =
10 users × 2 messages × 500 tokens = 10 000 tokens per minute

This is well within most rate limits.

Now, suppose your app grows, and you suddenly get:

  • 500 users
  • Each sending 3 messages per minute
    That’s 500 × 3 × 500 = 750 000 tokens per minute

If your TPM limit is 250 000, your app will start failing intermittently.
You’ll notice delayed responses or see this familiar error:

{
  "error": {
    "message": "Rate limit reached for tokens per min. Please try again later.",
    "type": "rate_limit_error",
    "code": null
  }
}

This is OpenAI protecting its infrastructure from overload — not a bug in your code.


Error 429 – What It Really Means

The 429 Too Many Requests response means you’re exceeding either:

  • The number of allowed requests per minute (RPM), or
  • The total tokens per minute (TPM).

OpenAI doesn’t always specify which limit you crossed — but the response headers can tell you exactly what’s happening.


Rate-Limit Headers Explained

When you make an API call, OpenAI includes special headers in its response.
Here’s what they look like (values shown are examples):

x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 14
x-ratelimit-reset-requests: 2025-10-22T09:34:52Z

x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 12000
x-ratelimit-reset-tokens: 2025-10-22T09:34:56Z

Let’s decode these step-by-step:

| Header Name | Meaning | Example Value | Explanation |
|---|---|---|---|
| x-ratelimit-limit-requests | Your maximum RPM quota | 60 | You can make up to 60 API calls per minute |
| x-ratelimit-remaining-requests | How many API calls you have left in the current window | 14 | You’ve already made 46 requests |
| x-ratelimit-reset-requests | When your request quota resets | 09:34:52 | You can send more calls after this time |
| x-ratelimit-limit-tokens | Your total TPM quota | 90,000 | You can consume up to 90,000 tokens per minute |
| x-ratelimit-remaining-tokens | Tokens left for the current minute | 12,000 | You’ve used 78,000 tokens so far |
| x-ratelimit-reset-tokens | When your token quota resets | 09:34:56 | You can send more tokens after this time |

Python Example – Monitoring Rate-Limit Headers

Let’s see how you can monitor these headers automatically in Python.
This is very helpful when you want your application to detect when you’re close to the limit and back off before hitting it.

import time

from openai import OpenAI, RateLimitError

# Requires the openai Python SDK v1.x (pip install --upgrade openai)
client = OpenAI(api_key="YOUR_API_KEY")

def call_openai(prompt, retries=3):
    try:
        # with_raw_response exposes the HTTP headers along with the parsed body
        raw = client.chat.completions.with_raw_response.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100
        )
        response = raw.parse()

        # Print model response
        print("Response:", response.choices[0].message.content.strip())

        # Print rate limit info from headers
        print("\n--- Rate Limit Info ---")
        for key, value in raw.headers.items():
            if "ratelimit" in key.lower():
                print(f"{key}: {value}")

    except RateLimitError:
        if retries > 0:
            print("⚠️ Rate limit reached. Waiting before retry...")
            time.sleep(5)
            return call_openai(prompt, retries - 1)
        raise
    except Exception as e:
        print("Error:", e)

call_openai("Explain token limits in OpenAI API in one paragraph.")

Explanation:

  • The function sends a request to GPT-4o with a small prompt.
  • After receiving the response, it prints out all rate-limit headers.
  • If it hits the limit, it catches the RateLimitError, waits for 5 seconds, and retries (up to 3 times in this version).

You can expand this logic to:

  • Store header data in a log file or database.
  • Monitor how close you are to your limit.
  • Automatically reduce request frequency when near the cap.

Tip: Exponential Backoff for Safety

A smart way to handle 429 errors is to retry after a delay, increasing the delay each time.
This is called exponential backoff — a widely used technique in distributed systems.

Here’s a simple pattern:

import time
import random

def exponential_backoff(retry_count):
    wait = min(60, (2 ** retry_count) + random.random())
    print(f"Retrying after {wait:.2f} seconds...")
    time.sleep(wait)

This avoids hammering the API again immediately, which could make things worse.
Instead, it waits for a gradually longer time after each retry, up to 60 seconds.
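For instance, you can wrap any request function with this helper in a simple retry loop. The sketch below assumes the openai v1 SDK, where a 429 raises openai.RateLimitError; make_request, the model name, and the retry count are illustrative.

from openai import OpenAI, RateLimitError

client = OpenAI(api_key="YOUR_API_KEY")
MAX_RETRIES = 5

def make_request(prompt):
    # One plain API call; raises RateLimitError on a 429
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    return response.choices[0].message.content

def call_with_backoff(prompt):
    for attempt in range(MAX_RETRIES):
        try:
            return make_request(prompt)
        except RateLimitError:
            exponential_backoff(attempt)   # helper defined above: waits ~1s, 2s, 4s, ... capped at 60 s
    raise RuntimeError("Still rate limited after several retries")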


Example Log Output (Real API Monitoring)

Here’s what a sample output may look like in your logs:

Response: Tokens are small pieces of text processed by the model...

--- Rate Limit Info ---
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 52
x-ratelimit-reset-requests: 2025-10-22T09:44:21Z
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 83500
x-ratelimit-reset-tokens: 2025-10-22T09:44:23Z

This tells you:

  • You’ve made 8 of 60 requests in this minute.
  • You’ve used ~6 500 tokens of your 90 000 TPM.
  • You still have plenty of quota left.

Such real-time visibility helps you predict throttling before it happens.


Simple Dashboard Idea

If you’re building a production app, you can display these metrics visually:

| Metric | Description | Status |
|---|---|---|
| Requests Used | Number of API calls this minute | 8 / 60 |
| Tokens Used | Total tokens used this minute | 6,500 / 90,000 |
| Remaining Time | When limits reset | 35 seconds |
| Throttling Risk | High / Medium / Low | Low |

This dashboard helps you or your DevOps team maintain a healthy flow of API traffic without sudden failures.
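If you want to derive the “Throttling Risk” column yourself, a rough sketch based on the remaining/limit ratio from the headers could look like this; the thresholds are arbitrary examples, not OpenAI recommendations.

def throttling_risk(remaining, limit):
    used_fraction = 1 - (remaining / limit)
    if used_fraction > 0.9:
        return "High"
    if used_fraction > 0.7:
        return "Medium"
    return "Low"

print(throttling_risk(remaining=83_500, limit=90_000))   # -> "Low"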


Why Reading Headers Is Crucial

If you’re working in a real-world application (like chatbots, data analysis tools, or automation systems), reading rate-limit headers helps you:

  • Avoid random API failures (by adjusting request rate).
  • Estimate real-time throughput.
  • Plan scaling — know when you need a higher quota.
  • Debug faster — find out whether a failure is due to your code or OpenAI’s limits.

| Topic | What You Learned |
|---|---|
| Error 429 | It means you exceeded RPM or TPM |
| Rate-limit headers | Show remaining quota and reset time |
| Python monitoring | You can log headers to avoid throttling |
| Exponential backoff | Safely retry without flooding the API |
| Dashboard metrics | Useful for production monitoring |

4. RPM vs TPM: Deep Comparison and Interaction

When you use OpenAI’s API, both Requests Per Minute (RPM) and Tokens Per Minute (TPM) limits apply simultaneously.
You can think of them as two gates guarding the same road:

  • One limits how many vehicles enter (RPM).
  • The other limits how much total weight those vehicles carry (TPM).

If you break either rule, you’ll get throttled — even if the other gate was still open.


The Core Difference Between RPM and TPM

| Feature | RPM (Requests Per Minute) | TPM (Tokens Per Minute) |
|---|---|---|
| Definition | Number of API calls you can send in one minute | Total number of tokens (input + output) you can send and receive per minute |
| Unit | Requests (calls) | Tokens (prompt + response) |
| Who hits it first | Apps sending many small requests rapidly | Apps sending few large prompts or responses |
| Typical use case | Real-time chatbots, web UIs, small prompts | Large document summarizers, coding assistants, translators |
| Throttling reason | Too many hits per minute | Too many tokens processed per minute |
| Optimization approach | Add batching, throttling, or queues | Reduce prompt size or output tokens |

Relationship Between RPM and TPM

Although both are separate, they’re mathematically related through average token usage per request.

Let’s call:

  • T_request = Average tokens used per request (input + output)
  • TPM_limit = Your total allowed tokens per minute
  • RPM_limit = Your total allowed requests per minute

Then your maximum feasible requests per minute (based on tokens) is:

Requests_by_TPM = TPM_limit ÷ T_request

And your actual allowed requests per minute is:

Actual_RPM = min(RPM_limit, Requests_by_TPM)

This formula helps you decide whether your app will be limited by RPM or TPM first.
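As a quick sanity check, here is the same idea in Python; the numbers are purely illustrative and match Example 2 below.

def max_feasible_rpm(rpm_limit, tpm_limit, avg_tokens_per_request):
    requests_by_tpm = tpm_limit // avg_tokens_per_request
    return min(rpm_limit, requests_by_tpm)

# TPM 100,000, RPM 500, 800 tokens per request -> capped at 125 RPM by TPM
print(max_feasible_rpm(rpm_limit=500, tpm_limit=100_000, avg_tokens_per_request=800))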


Example 1 – Small Prompts, High Traffic

Let’s say:

  • TPM limit = 100 000
  • RPM limit = 500
  • Average request = 200 tokens

Requests_by_TPM = 100,000 ÷ 200 = 500

So here, both RPM and TPM allow roughly the same throughput — 500 requests per minute.
You’re balanced.


Example 2 – Large Prompts, Fewer Requests

Now imagine:

  • TPM limit = 100 000
  • RPM limit = 500
  • Average request = 800 tokens

Requests_by_TPM = 100,000 ÷ 800 = 125

Even though your RPM limit is 500, your token usage caps you at 125 requests per minute.
You’ll hit the TPM limit before you ever reach your RPM cap.

That’s why optimizing tokens is often more impactful than optimizing the number of requests.

Example 3 – Heavy Output Models (GPT-4o)

For GPT-4o, many developers hit TPM early because outputs are richer and longer.
Suppose you send only 50 requests per minute, but each uses 3 000 tokens (large code or document generation):

50 × 3,000 = 150,000 tokens per minute

If your TPM limit is 120 000, you’ll get throttled, even though your RPM (50 ≪ 500) looks fine.

Moral: You can hit your token limit without ever hitting your request limit.

How to Find Out Which Limit You’re Hitting

You can inspect the headers from your API response (as we saw earlier):

| Header | What It Tells You | If the Value Is Small |
|---|---|---|
| x-ratelimit-remaining-requests | Requests left before hitting RPM | You’re close to the request limit |
| x-ratelimit-remaining-tokens | Tokens left before hitting TPM | You’re close to the token limit |

You can log these values in your code or monitoring dashboard.
Once you see either of them dropping rapidly, you know where the bottleneck is.


Analogy: Cars on a Highway

Think of OpenAI’s servers like a toll plaza:

  • Each car = one API request.
  • Each car’s weight = number of tokens used in that request.
  • There’s a limit on how many cars can pass (RPM) and how much total weight can pass (TPM).

If you send many small cars, you might hit the car limit (RPM).
If you send fewer but very heavy trucks, you’ll hit the weight limit (TPM).

Either way, the toll gate (OpenAI API) won’t let more through until some time passes.


Quick Throughput Calculator Table

Here’s a simple ready-to-use table for developers to estimate how many requests per minute are possible for different prompt sizes and token limits.

| TPM Limit | Avg Tokens per Request | Possible RPM (approx.) | Bottleneck |
|---|---|---|---|
| 60,000 | 200 | 300 | RPM / TPM balanced |
| 60,000 | 500 | 120 | TPM |
| 60,000 | 1,000 | 60 | TPM |
| 120,000 | 200 | 500 | RPM |
| 120,000 | 800 | 150 | TPM |
| 250,000 | 500 | 500 | RPM balanced |
| 250,000 | 1,500 | 166 | TPM |

How to use this table:
Find your TPM limit (from your account dashboard) and estimate your average token use per request (using the tiktoken library).
Then look up your possible RPM. That’s the safe request rate you should aim for.


Key Developer Takeaways

| Principle | Explanation | Practical Tip |
|---|---|---|
| Both RPM and TPM matter | You can hit either limit first | Monitor both in your logs |
| Tokens dominate cost | Large prompts = large cost and throttling | Minimise unnecessary context |
| Balance requests & tokens | The lighter your prompt, the more requests you can send | Compress or summarise input |
| Always plan for bursts | Short spikes may exceed limits even if your average is safe | Implement a queue system |
| Monitor in real time | Track rate-limit headers and token usage | Use logging or a metrics dashboard |

Pro Developer Tip: Estimate Safe Concurrency

If your TPM limit is 250 000 tokens/min and each request consumes 500 tokens,
you can handle ≈ 500 requests/minute safely.
If each request takes ~2 seconds, your safe concurrent requests ≈ (500 requests ÷ 60 seconds) × 2 seconds ≈ 16 parallel calls.

So, about 16 concurrent threads can run safely without hitting limits.
More than that, you risk short-term throttling.
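One common way to enforce such a cap is asyncio.Semaphore. The sketch below assumes a limit of 16 and uses a dummy coroutine in place of a real API call, so the numbers are illustrative rather than prescriptive.

import asyncio

SAFE_CONCURRENCY = 16
semaphore = asyncio.Semaphore(SAFE_CONCURRENCY)

async def fake_api_call(i):
    await asyncio.sleep(2)              # stand-in for a ~2 s API request
    return f"response {i}"

async def limited_call(i):
    async with semaphore:               # at most 16 requests in flight at once
        return await fake_api_call(i)

async def main():
    results = await asyncio.gather(*(limited_call(i) for i in range(100)))
    print(len(results), "requests completed")

asyncio.run(main())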


Real-World Use Cases

| Use Case | Typical RPM Behavior | Typical TPM Behavior | Common Issue |
|---|---|---|---|
| Chatbot / Assistant | Many small requests | Few tokens per call | Hits RPM first |
| Content Generator | Fewer requests | Large output tokens | Hits TPM first |
| Code Assistant | Moderate RPM | Moderate tokens (500–1,500) | Balanced |
| Batch Processor | Large batch inputs | Many tokens per call | TPM limit often reached |

By mapping your application type to this table, you can predict where your limitation lies and plan accordingly.


Debugging When Limits Are Hit

When your API suddenly throws a RateLimitError, you can quickly identify which limit was exceeded:

  1. Check the error response headers.
  2. If remaining-tokens = 0 → TPM exceeded.
  3. If remaining-requests = 0 → RPM exceeded.
  4. If both still high but you see throttling → likely bursting too fast (too many requests in a few seconds).

You can handle bursts with rate-limiting libraries in your backend (like asyncio.Semaphore, tenacity, or ratelimit in Python).
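For example, with the tenacity library the whole retry policy becomes a decorator. This is a sketch assuming the openai v1 SDK and tenacity are installed; the model name and retry settings are illustrative.

from openai import OpenAI, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

client = OpenAI(api_key="YOUR_API_KEY")

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_random_exponential(multiplier=1, max=60),   # exponential backoff with jitter, capped at 60 s
    stop=stop_after_attempt(6),
)
def chat(prompt):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    return response.choices[0].message.content

print(chat("Give me one sentence about rate limits."))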


When to Request Higher Limits

If your application consistently operates near the top of your limits:

  • Check the usage dashboard on OpenAI’s account page.
  • Collect logs showing average RPM/TPM usage.
  • Contact OpenAI support with your use case and current usage patterns.

Usually, consistent, responsible usage over time automatically earns higher quotas.

| Topic | Key Insight | Developer Action |
|---|---|---|
| RPM | Number of requests per minute | Control request frequency |
| TPM | Total tokens per minute | Optimise prompt and output |
| Interaction | Whichever limit is hit first throttles you | Monitor both with headers |
| Formula | Actual_RPM = min(RPM_limit, TPM_limit / T_request) | Use this to plan throughput |
| Throughput table | Helps forecast capacity | Use before scaling |
| Safe concurrency | (Actual_RPM / 60) × request_duration | Decide number of threads/workers |

5. Throughput Calculator: Estimation & Implementation

In this section, we’ll convert the ideas from Sections 2–4 into a practical calculator you can use every day.
We’ll first define the inputs, then write the formulas, and finally build a small Python utility that prints a clean table and tells you how many requests you can safely send per minute and how many workers (concurrent calls) you can run without hitting limits.

We’ll keep the language simple and the maths light, so freshers can follow easily, and experienced developers can plug this into CI/ops right away.


What Inputs Do We Need?

To estimate safe throughput, collect these 6 values:

  1. TPM_limit – Your OpenAI Tokens Per Minute limit.
  2. RPM_limit – Your OpenAI Requests Per Minute limit.
  3. T_prompt – Average input tokens per request.
  4. T_output – Average output tokens per request (or your max_tokens if you keep it fixed).
  5. latency_sec – Average time a single request takes (seconds).
  6. burst_factor (optional) – Extra safety margin for short spikes. Default: 1.0 (no extra safety). Use 0.8 for 20% headroom.

From (3) and (4) we compute T_request = T_prompt + T_output.

Core Formulas

From the inputs above, compute:

  1. T_request = T_prompt + T_output
  2. Requests_by_TPM = TPM_limit ÷ T_request
  3. Actual_RPM = min(RPM_limit, Requests_by_TPM)
  4. Safe_RPM = floor(Actual_RPM × burst_factor)
  5. Tokens_per_min = Safe_RPM × T_request
  6. Safe_Concurrency = floor((Safe_RPM ÷ 60) × latency_sec)

The last formula tells you roughly how many simultaneous calls you can keep in flight, without triggering per-second/burst throttles.

Tip: Use a small queue + token bucket/throttle so workers don’t fire all at once.

Worked Examples (Step-by-Step)

Example A — Small Prompts (Chatbot)

  • TPM_limit = 120,000
  • RPM_limit = 500
  • T_prompt = 150, T_output = 150 → T_request = 300
  • latency_sec = 2.0, burst_factor = 0.9

Calculations:

  • Requests_by_TPM = 120,000 ÷ 300 = 400
  • Actual_RPM = min(500, 400) = 400
  • Safe_RPM = floor(400 × 0.9) = 360
  • Tokens_per_min = 360 × 300 = 108,000
  • Safe_Concurrency = floor((360/60) × 2.0) = floor(6 × 2) = 12

Conclusion: Target ~360 RPM with ~12 concurrent calls.


Example B — Large Outputs (Content Generator)

  • TPM_limit = 250,000
  • RPM_limit = 500
  • T_prompt = 400, T_output = 1200 → T_request = 1600
  • latency_sec = 6.0, burst_factor = 0.85

Calculations:

  • Requests_by_TPM = 250,000 ÷ 1600 = 156
  • Actual_RPM = min(500, 156) = 156
  • Safe_RPM = floor(156 × 0.85) = 132
  • Tokens_per_min = 132 × 1600 = 211,200
  • Safe_Concurrency = floor((132/60) × 6.0) = floor(2.2 × 6) = 13

Conclusion: Target ~132 RPM with ~13 concurrent calls.

Ready-to-Use Throughput Calculator Table

Pick the row closest to your usage pattern. (You can adjust numbers later with the Python tool.)

Assumptions per block are listed; each block shows how Avg Tokens/Req changes the safe RPM under different TPM limits. We keep burst_factor = 0.9 for safety.

Block 1 — RPM_limit = 500

| TPM Limit | Avg Tokens/Req | Requests_by_TPM | Actual_RPM | Safe_RPM (×0.9) | Tokens/min @ Safe |
|---|---|---|---|---|---|
| 60,000 | 200 | 300 | 300 | 270 | 54,000 |
| 60,000 | 500 | 120 | 120 | 108 | 54,000 |
| 60,000 | 1,000 | 60 | 60 | 54 | 54,000 |
| 120,000 | 200 | 600 | 500 | 450 | 90,000 |
| 120,000 | 800 | 150 | 150 | 135 | 108,000 |
| 250,000 | 500 | 500 | 500 | 450 | 225,000 |
| 250,000 | 1,500 | 166 | 166 | 149 | 223,500 |

Note: When Requests_by_TPM exceeds RPM_limit, Actual_RPM is capped by RPM_limit.

Block 2 — RPM_limit = 1000

| TPM Limit | Avg Tokens/Req | Requests_by_TPM | Actual_RPM | Safe_RPM (×0.9) | Tokens/min @ Safe |
|---|---|---|---|---|---|
| 120,000 | 200 | 600 | 600 | 540 | 108,000 |
| 120,000 | 500 | 240 | 240 | 216 | 108,000 |
| 250,000 | 300 | 833 | 833 | 749 | 224,700 |
| 250,000 | 1,000 | 250 | 250 | 225 | 225,000 |

Use these tables to sanity-check your plan before load testing.

A Practical Python Calculator (CLI-Style)

The script below:

  • Asks for your TPM/RPM limits, prompt/output tokens, avg latency, and burst factor.
  • Prints your safe RPM, expected tokens/min, and recommended concurrency.
  • Also prints a small scenario table for multiple token sizes (nice for quick “what-if” checks).

You can paste this into a file like throughput_calc.py and run with Python 3.10+.

import math

def calc_throughput(TPM_limit, RPM_limit, T_prompt, T_output, latency_sec=2.0, burst_factor=0.9):
    T_request = T_prompt + T_output
    if T_request <= 0:
        raise ValueError("Average tokens per request must be > 0")

    requests_by_tpm = TPM_limit // T_request
    actual_rpm = min(RPM_limit, requests_by_tpm)
    safe_rpm = math.floor(actual_rpm * burst_factor)
    tokens_per_min = safe_rpm * T_request
    safe_concurrency = math.floor((safe_rpm / 60.0) * latency_sec)

    return {
        "T_request": T_request,
        "Requests_by_TPM": requests_by_tpm,
        "Actual_RPM": actual_rpm,
        "Safe_RPM": safe_rpm,
        "Tokens_per_min": tokens_per_min,
        "Safe_Concurrency": safe_concurrency
    }

def pretty(n):  # simple thousands formatting
    return f"{n:,}"

def scenario_table(TPM_limit, RPM_limit, lat, burst, token_sizes=(200, 300, 500, 800, 1000, 1500)):
    print("\n=== Scenario Table (vary Avg Tokens/Req) ===")
    print(f"TPM_limit={pretty(TPM_limit)}, RPM_limit={pretty(RPM_limit)}, latency={lat}s, burst_factor={burst}")
    print(f"{'AvgTokens':>10} | {'ReqByTPM':>9} | {'ActualRPM':>9} | {'SafeRPM':>7} | {'Tok/min':>10}")
    print("-" * 60)
    for t in token_sizes:
        res = calc_throughput(TPM_limit, RPM_limit, T_prompt=t//2, T_output=t - t//2, latency_sec=lat, burst_factor=burst)
        print(f"{t:>10} | {pretty(res['Requests_by_TPM']):>9} | {pretty(res['Actual_RPM']):>9} | {pretty(res['Safe_RPM']):>7} | {pretty(res['Tokens_per_min']):>10}")

if __name__ == "__main__":
    # === Collect inputs ===
    try:
        TPM_limit = int(input("Enter TPM_limit (e.g., 250000): ").strip())
        RPM_limit = int(input("Enter RPM_limit (e.g., 500): ").strip())
        T_prompt = int(input("Avg prompt tokens per request (e.g., 200): ").strip())
        T_output = int(input("Avg output tokens per request (e.g., 300): ").strip())
        latency_sec = float(input("Avg latency per request in seconds (e.g., 2.0): ").strip() or "2.0")
        burst_factor = float(input("Burst safety factor (0.7–1.0, e.g., 0.9): ").strip() or "0.9")

        res = calc_throughput(TPM_limit, RPM_limit, T_prompt, T_output, latency_sec, burst_factor)

        print("\n=== Throughput Result ===")
        print(f"Avg tokens/request (T_request) : {pretty(res['T_request'])}")
        print(f"Requests limited by TPM        : {pretty(res['Requests_by_TPM'])} rpm")
        print(f"Actual RPM (respect RPM/TPM) : {pretty(res['Actual_RPM'])} rpm")
        print(f"Safe RPM (after burst factor): {pretty(res['Safe_RPM'])} rpm")
        print(f"Tokens per minute at Safe RPM: {pretty(res['Tokens_per_min'])}")
        print(f"Recommended Safe Concurrency : {pretty(res['Safe_Concurrency'])} workers")

        # quick what-if table
        scenario_table(TPM_limit, RPM_limit, latency_sec, burst_factor)

        print("\nTips:")
        print("• Lower Avg tokens/request to increase possible RPM under the same TPM.")
        print("• If Safe_RPM equals RPM_limit often, consider asking for higher RPM or batch requests.")
        print("• If Tokens_per_min ~= TPM_limit, reduce T_output or prompt size, or request higher TPM.")
    except Exception as e:
        print("Input/Error:", e)

How to use:

  1. Run the script and enter your values (from your OpenAI dashboard and logs).
  2. The script prints your Safe RPM and Safe Concurrency.
  3. Use the scenario table to see how changing the average token usage changes your capacity.

Interpreting the Calculator’s Output

  • If Actual_RPM = RPM_limit and Tokens_per_min is much lower than TPM_limit, then RPM is your bottleneck. Consider:
    • Batching multiple small prompts in one request (if your workflow allows), or
    • Asking for a higher RPM quota.
  • If Tokens_per_min ≈ TPM_limit, then TPM is your bottleneck. Consider:
    • Reducing max_tokens,
    • Compressing context (few-shot > many-shot),
    • Using summaries/embeddings to shrink inputs.
  • If Safe_Concurrency is very low (e.g., 2–3) but your service needs higher parallelism, then:
    • Work with a request queue + token bucket (dispatch slowly but consistently),
    • Spread traffic more evenly (no spikes).

Optional: Simple Token Bucket (Python Sketch)

If you want a tiny in-process throttle that respects a target RPM, this idea helps:

import time

def run_with_token_bucket(safe_rpm, dispatch_request):
    bucket_capacity = safe_rpm           # e.g., 360
    refill_rate = safe_rpm / 60.0        # requests replenished per second
    bucket = float(bucket_capacity)
    last = time.monotonic()

    while True:
        now = time.monotonic()
        bucket = min(bucket_capacity, bucket + refill_rate * (now - last))
        last = now
        if bucket >= 1:
            bucket -= 1
            dispatch_request()           # send one request now
        else:
            time.sleep(1.0 / refill_rate)  # wait until the next slot frees up

This keeps your per-second rate smooth so you don’t accidentally burst and get 429s.


6. Best Practices to Handle Rate Limits

Once you understand rate limits, the next step is handling them smartly in production.
Here’s a concise checklist of best developer practices:

Technical Practices

  • Implement Exponential Backoff: Retry after increasing delays (e.g., 2s → 4s → 8s).
  • Add Random Jitter: Add a small random time to avoid synchronized retries.
  • Use Queues: Maintain a small request queue to control bursts.
  • Batch Requests: Combine multiple small prompts into a single call if possible.
  • Cache Responses: If the same prompt repeats, reuse the last response.
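As a minimal sketch of the caching point above, functools.lru_cache can deduplicate exact-repeat prompts; the client setup and model name are placeholders, and this only helps when prompts repeat verbatim.

from functools import lru_cache

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

@lru_cache(maxsize=1024)
def cached_completion(prompt):
    # The underlying API call runs only once per unique prompt string
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    return response.choices[0].message.content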

Application Design Practices

  • Track Usage: Log and visualize rate-limit headers.
  • Optimize Prompts: Keep prompts shorter and remove redundant context.
  • Limit max_tokens: Don’t set huge output sizes when not needed.
  • Graceful Errors: Show friendly “Please wait” messages to users.
  • Request Higher Quota: If consistently maxing out, request higher RPM/TPM from OpenAI.

7. How to Monitor and Debug Rate Limits

Key Things to Track

  • x-ratelimit-remaining-requests
  • x-ratelimit-remaining-tokens
  • Response latency trends (to detect throttling).

Monitoring Setup

  • Use Prometheus + Grafana or a simple CSV log to track usage.
  • Add alerts when remaining tokens fall below 10 % of quota.
  • Log timestamp, RPM, TPM, and error 429 frequency.
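For the simple CSV option above, a logging helper could look like the sketch below; the file name and columns are just examples you can extend with RPM, latency, and 429 counts.

import csv
import datetime
import os

def log_rate_limits(headers, path="rate_limits.csv"):
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "remaining_requests", "remaining_tokens"])
        writer.writerow([
            datetime.datetime.now(datetime.timezone.utc).isoformat(),
            headers.get("x-ratelimit-remaining-requests"),
            headers.get("x-ratelimit-remaining-tokens"),
        ])

# Example: pass in the headers captured by the monitoring snippet from Section 3
# log_rate_limits(raw.headers)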

Debug Steps

  1. If you see 429 errors → print header values.
  2. If remaining-tokens = 0 → reduce output size.
  3. If remaining-requests = 0 → slow down request rate.
  4. If both fine → likely burst issue → add throttling.

8. Common Mistakes Developers Make

| Mistake | Impact | Quick Fix |
|---|---|---|
| Ignoring token usage | Sudden throttling | Measure with tiktoken |
| Setting huge max_tokens | Wasted TPM | Tune output length |
| Sending many short bursts | 429 errors | Queue + backoff |
| Not reading headers | No visibility | Always log them |
| Assuming a fixed minute reset | Unpredictable retry | Remember: rolling window |

9. FAQs

Q 1. Does the limit reset exactly every minute?
No — it’s a rolling window. Tokens free up gradually within 60 seconds.

Q 2. Are the limits the same for all models?
No. GPT-4 and GPT-4o have tighter TPM limits than GPT-3.5. Check your dashboard.

Q 3. What happens if I have multiple API keys?
Each key has its own quota, but total usage per account still matters.

Q 4. Can I increase limits manually?
Yes. Submit a request via OpenAI dashboard or upgrade your billing tier.

Q 5. Why am I throttled even under limits?
Because you’re sending short bursts; the API enforces sub-minute smoothing.


10. Final Tips for Developers

  • Plan for scale — design your API calls to respect both limits.
  • Simulate load using your throughput calculator before launch.
  • Log everything — headers, tokens, timestamps, latency.
  • Distribute traffic evenly through seconds, not in bursts.
  • Stay modular — keep token counting, backoff, and logging in separate utility functions.

11. Real-World Workflow Example

Example: AI Writing App

A content-generation web app sends large prompts to GPT-4o.

  1. Each request ≈ 1200 tokens (input + output).
  2. TPM = 250 000, RPM = 500.
  3. From the calculator → safe ≈ 150 requests/min.
  4. Add a small queue (size = 20) + retry with jitter.
  5. Monitor headers in logs.
  6. If tokens remain > 10 %, increase throughput slightly.

Result → stable app, no 429 errors, predictable latency ≈ 2.2 s per call.
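A rough sketch of step 4 above, using a bounded queue and a few worker threads with jittered backoff; handle_prompt, the worker count, and the queue size are illustrative placeholders.

import queue
import random
import threading
import time

request_queue = queue.Queue(maxsize=20)           # small buffer that absorbs bursts

def handle_prompt(prompt):
    # Placeholder for the real GPT-4o call
    print("processed:", prompt)

def worker():
    while True:
        prompt = request_queue.get()
        for attempt in range(5):
            try:
                handle_prompt(prompt)
                break
            except Exception:                     # e.g. a rate-limit error
                time.sleep(min(60, 2 ** attempt + random.random()))   # backoff + jitter
        request_queue.task_done()

for _ in range(4):                                # a few workers, within your safe concurrency
    threading.Thread(target=worker, daemon=True).start()

for message in ["hello", "summarise this", "write a title"]:
    request_queue.put(message)                    # producers block when the queue is full

request_queue.join()                              # wait until all queued work is done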


12. Key Takeaways

| Focus Area | Why It Matters | Quick Action |
|---|---|---|
| TPM | Controls total work done | Optimise prompts |
| RPM | Controls call frequency | Smooth requests |
| Backoff | Prevents repeated throttling | Implement exponential retry |
| Monitoring | Ensures visibility | Log headers + alerts |
| Throughput Planning | Avoids overload | Use calculator before deployment |

Understanding both TPM and RPM helps you:

  • Predict performance
  • Control cost
  • Scale your app reliably

13. Conclusion

OpenAI’s API is extremely powerful, but power always comes with control.
Rate limits are not roadblocks — they’re traffic rules to keep the system stable for everyone.

By mastering the balance between Requests Per Minute (RPM) and Tokens Per Minute (TPM), you can:

  • Prevent unexpected 429 errors,
  • Optimise your costs, and
  • Build production-grade AI applications confidently.

You now have:

  • A deep understanding of how limits work,
  • A Python tool to calculate throughput,
  • Real-world handling techniques, and
  • A clear strategy for scaling OpenAI APIs safely.

Next Step:
Run your throughput calculator, log your first few thousand calls, and watch how smoothly your app scales without hitting a single rate-limit wall.
