Rate Limits & Usage Limits
Learn about API rate limits, usage quotas, and how to optimize your application
Overview
ParrotRouter implements rate limiting to ensure fair usage and maintain service quality for all users. Understanding these limits helps you build reliable applications that scale effectively.
Rate Limits
Requests per minute (RPM) and tokens per minute (TPM)
Usage Quotas
Monthly token limits based on your plan
Burst Capacity
Short-term burst allowance for traffic spikes
Rate Limit Headers
Every API response includes headers that help you track your rate limit status:
HTTP/1.1 200 OK
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 58
X-RateLimit-Reset: 1699000000
X-RateLimit-Reset-After: 45
X-RateLimit-Bucket: default
X-RateLimit-Policy: 60;w=60
X-RateLimit-Limit
Maximum number of requests allowed in the current window
X-RateLimit-Remaining
Number of requests remaining in the current window
X-RateLimit-Reset
Unix timestamp when the rate limit window resets
X-RateLimit-Reset-After
Seconds until the rate limit window resets
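For example, a client can read these headers after each call and slow down before hitting the limit. The following is a minimal sketch using the requests library; the endpoint URL, API key, and warning threshold are placeholder assumptions rather than ParrotRouter specifics.
import requests

# Sketch: inspect rate-limit headers after an API call.
# The endpoint URL and API key are placeholders.
response = requests.post(
    "https://api.parrotrouter.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]},
)

remaining = int(response.headers.get("X-RateLimit-Remaining", 0))
reset_after = int(response.headers.get("X-RateLimit-Reset-After", 0))

if remaining < 5:
    print(f"Only {remaining} requests left; window resets in {reset_after}s")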
Rate Limits by Plan
Free Plan Limits
Model-Specific Limits
Some models have additional constraints beyond your plan limits:
Context Length Limits
Max Output Tokens
Maximum number of tokens that can be generated in a single request.
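The exact context windows and output caps vary by model, so treat the numbers below as placeholders. Here is a minimal sketch of staying inside both limits, assuming an 8,192-token context window, a 4,096-token output cap, and a rough characters-per-token heuristic:
# Sketch: keep a request within an assumed context window and output cap.
# The 8192-token window and 4096-token cap are example values only;
# check your model's documented limits.
CONTEXT_WINDOW = 8192
MAX_OUTPUT_TOKENS = 4096

def estimated_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text
    return len(text) // 4

def safe_max_tokens(prompt: str) -> int:
    prompt_tokens = estimated_tokens(prompt)
    available = CONTEXT_WINDOW - prompt_tokens
    return max(0, min(MAX_OUTPUT_TOKENS, available))

# The result is then passed as max_tokens on the completion request, e.g.
# client.chat.completions.create(..., max_tokens=safe_max_tokens(prompt))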
Handling Rate Limits
import time
import random
from typing import Callable, Any

# RateLimitError is assumed to be the rate-limit exception raised by
# your API client library.
def exponential_backoff(
    func: Callable,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0
) -> Any:
    """Retry a function with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Use the Retry-After header if available
            retry_after = e.response.headers.get('Retry-After')
            if retry_after:
                delay = min(float(retry_after), max_delay)
            else:
                # Exponential backoff with jitter
                delay = min(base_delay * (2 ** attempt) + random.random(), max_delay)
            print(f"Rate limited. Retrying in {delay} seconds...")
            time.sleep(delay)
    raise Exception("Max retries exceeded")
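For example, the helper can wrap a single completion call; client here is assumed to be an already-configured SDK client:
# Example usage: wrap a single completion call in the retry helper.
response = exponential_backoff(
    lambda: client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
    )
)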
class RateLimitManager {
  private requests: number[] = [];
  private readonly windowMs = 60000; // 1 minute
  private readonly maxRequests = 60;

  async checkRateLimit(): Promise<void> {
    const now = Date.now();

    // Remove old requests outside the window
    this.requests = this.requests.filter(
      timestamp => now - timestamp < this.windowMs
    );

    if (this.requests.length >= this.maxRequests) {
      const oldestRequest = this.requests[0];
      const resetTime = oldestRequest + this.windowMs;
      const waitTime = resetTime - now;
      throw new Error(`Rate limit exceeded. Wait ${waitTime}ms`);
    }

    this.requests.push(now);
  }

  async makeRequest<T>(fn: () => Promise<T>): Promise<T> {
    await this.checkRateLimit();

    try {
      const response = await fn();
      // Update limits from response headers (if the response exposes them)
      this.updateFromHeaders((response as any).headers);
      return response;
    } catch (error: any) {
      if (error.status === 429) {
        const retryAfter = Number(error.headers?.['retry-after'] ?? 60);
        await new Promise(resolve => setTimeout(resolve, retryAfter * 1000));
        return this.makeRequest(fn);
      }
      throw error;
    }
  }

  private updateFromHeaders(headers: Headers) {
    const remaining = headers.get('X-RateLimit-Remaining');
    const reset = headers.get('X-RateLimit-Reset');

    // Sync internal state with the limits reported by the server
    if (remaining && reset) {
      // e.g. adjust the local window and request budget here
    }
  }
}
Best Practices
Implement Request Queuing
Queue requests to stay within rate limits automatically
import time
import threading
from queue import Queue

class RequestQueue:
    def __init__(self, rpm_limit=60):
        self.queue = Queue()
        self.rpm_limit = rpm_limit
        self.interval = 60.0 / rpm_limit
        # Background worker drains the queue at the allowed rate
        threading.Thread(target=self.process_queue, daemon=True).start()

    def submit(self, request):
        self.queue.put(request)

    def process_queue(self):
        while True:
            request = self.queue.get()
            request()  # each queued item is a callable that makes the API call
            self.queue.task_done()
            # Space requests evenly to stay under the RPM limit
            time.sleep(self.interval)
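A possible usage pattern, assuming each queued item is a callable that performs the API call (as in the sketch above) and that client is an already-configured SDK client:
# Example usage: enqueue callables that make the actual API calls.
queue = RequestQueue(rpm_limit=60)

for prompt in ["First prompt", "Second prompt"]:
    queue.submit(
        lambda p=prompt: client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": p}],
        )
    )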
Use Batch Requests
Combine multiple operations into single requests when possible
# Instead of issuing one request per prompt...
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

# ...combine related prompts into a single request
combined = "\n\n".join(f"{i + 1}. {p}" for i, p in enumerate(prompts))
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Answer each numbered item separately:\n\n{combined}"}]
)
Monitor Usage
Track your API usage to avoid hitting limits
class UsageTracker {
  private tokenCount = 0;
  private requestCount = 0;

  trackUsage(response: any) {
    this.tokenCount += response.usage.total_tokens;
    this.requestCount += 1;

    // Alert if approaching limits
    if (this.tokenCount > 900000) {
      console.warn('Approaching token limit');
    }
  }
}
Cache Responses
Cache common requests to reduce API calls
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_completion(prompt):
    # Only called when the prompt is not already cached;
    # strings are hashable, so the prompt itself serves as the cache key.
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

def get_completion(prompt):
    return cached_completion(prompt)
Rate Limit Errors
When you exceed rate limits, you'll receive a 429 error:
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 60 seconds.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "status": 429
  }
}
When you receive a 429 response, use the Retry-After header to determine how long to wait before retrying. This helps prevent cascading failures and ensures fair resource allocation.
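As a sketch of that pattern using the requests library (the endpoint URL and payload are placeholders), a caller can check the error code from the body above and honor Retry-After directly:
import time
import requests

# Sketch: detect a 429, read the error body, and honor Retry-After.
# The endpoint URL and payload are placeholders.
payload = {"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}
response = requests.post(
    "https://api.parrotrouter.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
)

if response.status_code == 429:
    error = response.json().get("error", {})
    if error.get("code") == "rate_limit_exceeded":
        wait = int(response.headers.get("Retry-After", 60))
        time.sleep(wait)
        # ...then retry the request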
Increasing Your Limits
Request Limit Increase
Enterprise customers can request custom rate limits based on their needs.
Contact enterprise sales →
Optimize Usage
Implement caching, batching, and efficient prompting to maximize your current limits
- Use smaller models when appropriate
- Implement response caching
- Batch similar requests
- Optimize prompt length
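As one illustration of the first point, a request router might fall back to a smaller model for short, simple prompts. The model names and the length threshold below are illustrative assumptions, not ParrotRouter defaults, and client is assumed to be a configured SDK client:
# Sketch: route short, simple prompts to a cheaper model.
# Model names and the 500-character threshold are illustrative assumptions.
def pick_model(prompt: str) -> str:
    if len(prompt) < 500:
        return "gpt-3.5-turbo"   # smaller, cheaper model
    return "gpt-4"               # larger model for complex prompts

def complete(prompt: str):
    return client.chat.completions.create(
        model=pick_model(prompt),
        messages=[{"role": "user", "content": prompt}],
    )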