Rate Limits & Usage Limits
Learn about API rate limits, usage quotas, and how to optimize your application
Overview
ParrotRouter implements rate limiting to ensure fair usage and maintain service quality for all users. Understanding these limits helps you build reliable applications that scale effectively.
Rate Limits
Requests per minute (RPM) and tokens per minute (TPM)
Usage Quotas
Monthly token limits based on your plan
Burst Capacity
Short-term burst allowance for traffic spikes
Rate Limit Headers
Every API response includes headers that help you track your rate limit status:
HTTP/1.1 200 OK
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 58
X-RateLimit-Reset: 1699000000
X-RateLimit-Reset-After: 45
X-RateLimit-Bucket: default
X-RateLimit-Policy: 60;w=60
X-RateLimit-Limit
Maximum number of requests allowed in the current window
X-RateLimit-Remaining
Number of requests remaining in the current window
X-RateLimit-Reset
Unix timestamp when the rate limit window resets
X-RateLimit-Reset-After
Seconds until the rate limit window resets
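For example, a client can read these headers after each call and slow down before hitting the limit. The following is a minimal sketch using the requests library; the endpoint URL, API key, and warning threshold are placeholder assumptions rather than ParrotRouter specifics.
import requests

# Sketch: inspect rate-limit headers after an API call.
# The endpoint URL and API key are placeholders.
response = requests.post(
    "https://api.parrotrouter.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]},
)

remaining = int(response.headers.get("X-RateLimit-Remaining", 0))
reset_after = int(response.headers.get("X-RateLimit-Reset-After", 0))

if remaining < 5:
    print(f"Only {remaining} requests left; window resets in {reset_after}s")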
Rate Limits by Plan
Free Plan Limits
Model-Specific Limits
Some models have additional constraints beyond your plan limits:
Context Length Limits
Max Output Tokens
Maximum number of tokens that can be generated in a single request.
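The exact context windows and output caps vary by model, so treat the numbers below as placeholders. Here is a minimal sketch of staying inside both limits, assuming an 8,192-token context window, a 4,096-token output cap, and a rough characters-per-token heuristic:
# Sketch: keep a request within an assumed context window and output cap.
# The 8192-token window and 4096-token cap are example values only;
# check your model's documented limits.
CONTEXT_WINDOW = 8192
MAX_OUTPUT_TOKENS = 4096

def estimated_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text
    return len(text) // 4

def safe_max_tokens(prompt: str) -> int:
    prompt_tokens = estimated_tokens(prompt)
    available = CONTEXT_WINDOW - prompt_tokens
    return max(0, min(MAX_OUTPUT_TOKENS, available))

# The result is then passed as max_tokens on the completion request, e.g.
# client.chat.completions.create(..., max_tokens=safe_max_tokens(prompt))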
Handling Rate Limits
import time
import random
from typing import Callable, Any

# RateLimitError is assumed to be the rate-limit exception raised by
# your API client library.
def exponential_backoff(
    func: Callable,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0
) -> Any:
    """Retry a function with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Use the Retry-After header if available
            retry_after = e.response.headers.get('Retry-After')
            if retry_after:
                delay = min(float(retry_after), max_delay)
            else:
                # Exponential backoff with jitter
                delay = min(base_delay * (2 ** attempt) + random.random(), max_delay)
            print(f"Rate limited. Retrying in {delay} seconds...")
            time.sleep(delay)
    raise Exception("Max retries exceeded")
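For example, the helper can wrap a single completion call; client here is assumed to be an already-configured SDK client:
# Example usage: wrap a single completion call in the retry helper.
response = exponential_backoff(
    lambda: client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
    )
)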
class RateLimitManager {
  private requests: number[] = [];
  private readonly windowMs = 60000; // 1 minute
  private readonly maxRequests = 60;

  async checkRateLimit(): Promise<void> {
    const now = Date.now();

    // Remove old requests outside the window
    this.requests = this.requests.filter(
      timestamp => now - timestamp < this.windowMs
    );

    if (this.requests.length >= this.maxRequests) {
      const oldestRequest = this.requests[0];
      const resetTime = oldestRequest + this.windowMs;
      const waitTime = resetTime - now;
      throw new Error(`Rate limit exceeded. Wait ${waitTime}ms`);
    }

    this.requests.push(now);
  }

  async makeRequest<T>(fn: () => Promise<T>): Promise<T> {
    await this.checkRateLimit();

    try {
      const response = await fn();
      // Update limits from response headers (if the response exposes them)
      this.updateFromHeaders((response as any).headers);
      return response;
    } catch (error: any) {
      if (error.status === 429) {
        const retryAfter = Number(error.headers?.['retry-after'] ?? 60);
        await new Promise(resolve => setTimeout(resolve, retryAfter * 1000));
        return this.makeRequest(fn);
      }
      throw error;
    }
  }

  private updateFromHeaders(headers: Headers) {
    const remaining = headers.get('X-RateLimit-Remaining');
    const reset = headers.get('X-RateLimit-Reset');

    // Sync internal state with the limits reported by the server
    if (remaining && reset) {
      // e.g. adjust the local window and request budget here
    }
  }
}
Best Practices
Implement Request Queuing
Queue requests to stay within rate limits automatically
import time
import threading
from queue import Queue

class RequestQueue:
    def __init__(self, rpm_limit=60):
        self.queue = Queue()
        self.rpm_limit = rpm_limit
        self.interval = 60.0 / rpm_limit
        # Background worker drains the queue at the allowed rate
        threading.Thread(target=self.process_queue, daemon=True).start()

    def submit(self, request):
        self.queue.put(request)

    def process_queue(self):
        while True:
            request = self.queue.get()
            request()  # each queued item is a callable that makes the API call
            self.queue.task_done()
            # Space requests evenly to stay under the RPM limit
            time.sleep(self.interval)
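A possible usage pattern, assuming each queued item is a callable that performs the API call (as in the sketch above) and that client is an already-configured SDK client:
# Example usage: enqueue callables that make the actual API calls.
queue = RequestQueue(rpm_limit=60)

for prompt in ["First prompt", "Second prompt"]:
    queue.submit(
        lambda p=prompt: client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": p}],
        )
    )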
Use Batch Requests
Combine multiple operations into single requests when possible
# Instead of issuing one request per prompt...
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

# ...combine related prompts into a single request
combined = "\n\n".join(f"{i + 1}. {p}" for i, p in enumerate(prompts))
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Answer each numbered item separately:\n\n{combined}"}]
)
Monitor Usage
Track your API usage to avoid hitting limits
class UsageTracker {
  private tokenCount = 0;
  private requestCount = 0;

  trackUsage(response: any) {
    this.tokenCount += response.usage.total_tokens;
    this.requestCount += 1;

    // Alert if approaching limits
    if (this.tokenCount > 900000) {
      console.warn('Approaching token limit');
    }
  }
}
Cache Responses
Cache common requests to reduce API calls
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_completion(prompt):
    # Only called when the prompt is not already cached;
    # strings are hashable, so the prompt itself serves as the cache key.
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

def get_completion(prompt):
    return cached_completion(prompt)
Rate Limit Errors
When you exceed rate limits, you'll receive a 429 error:
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 60 seconds.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "status": 429
  }
}
When you receive a 429 response, use the Retry-After header to determine how long to wait before retrying. This helps prevent cascading failures and ensures fair resource allocation.
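As a sketch of that pattern using the requests library (the endpoint URL and payload are placeholders), a caller can check the error code from the body above and honor Retry-After directly:
import time
import requests

# Sketch: detect a 429, read the error body, and honor Retry-After.
# The endpoint URL and payload are placeholders.
payload = {"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}
response = requests.post(
    "https://api.parrotrouter.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
)

if response.status_code == 429:
    error = response.json().get("error", {})
    if error.get("code") == "rate_limit_exceeded":
        wait = int(response.headers.get("Retry-After", 60))
        time.sleep(wait)
        # ...then retry the request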
Increasing Your Limits
Request Limit Increase
Enterprise customers can request custom rate limits based on their needs.
Contact enterprise sales →
Optimize Usage
Implement caching, batching, and efficient prompting to maximize your current limits
- Use smaller models when appropriate
- Implement response caching
- Batch similar requests
- Optimize prompt length
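As one illustration of the first point, a request router might fall back to a smaller model for short, simple prompts. The model names and the length threshold below are illustrative assumptions, not ParrotRouter defaults, and client is assumed to be a configured SDK client:
# Sketch: route short, simple prompts to a cheaper model.
# Model names and the 500-character threshold are illustrative assumptions.
def pick_model(prompt: str) -> str:
    if len(prompt) < 500:
        return "gpt-3.5-turbo"   # smaller, cheaper model
    return "gpt-4"               # larger model for complex prompts

def complete(prompt: str):
    return client.chat.completions.create(
        model=pick_model(prompt),
        messages=[{"role": "user", "content": prompt}],
    )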