Latency and Performance
Minimize response times with global edge deployment and intelligent optimization
Performance Overview
ParrotRouter uses multiple strategies to deliver the fastest possible AI responses, including global edge deployment, intelligent caching, and optimized routing.
- 50ms Average TTFT: time to first token across all providers
- Global Edge Network: 25+ edge locations worldwide
- 99.9% Uptime: high availability with automatic failover
Streaming Responses
Get tokens as they're generated for the fastest perceived performance:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parrotrouter.com/v1",
    api_key="your-api-key"
)

# Stream tokens as they arrive
stream = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
    extra_headers={
        "X-Stream-Buffer-Size": "1",   # Send each token immediately
        "X-Stream-Timeout-Ms": "100"   # Max wait between tokens
    }
)

# Process tokens in real-time
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Advanced streaming with metrics
import time

start_time = time.time()
first_token_time = None
tokens = []

# Create a fresh stream (the one above has already been consumed)
stream = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.time()
            ttft = (first_token_time - start_time) * 1000
            print(f"\nTime to first token: {ttft:.0f}ms")
        tokens.append(chunk.choices[0].delta.content)
        print(chunk.choices[0].delta.content, end="", flush=True)

total_time = (time.time() - start_time) * 1000
print(f"\nTotal time: {total_time:.0f}ms")
print(f"Tokens: {len(tokens)}")
print(f"Tokens/sec: {len(tokens) / (total_time / 1000):.1f}")
Edge Deployment
Route requests through the nearest edge location for minimal latency:
Global Edge Locations
Americas
- US East (Virginia) - 15ms
- US West (California) - 18ms
- Canada (Toronto) - 20ms
- Brazil (São Paulo) - 25ms
Europe
- EU West (Ireland) - 16ms
- EU Central (Frankfurt) - 14ms
- UK (London) - 15ms
- Nordic (Stockholm) - 18ms
Asia Pacific
- Singapore - 20ms
- Japan (Tokyo) - 22ms
- Australia (Sydney) - 24ms
- India (Mumbai) - 28ms
Other Regions
- Middle East (Dubai) - 30ms
- Africa (Cape Town) - 35ms
- China (Hong Kong) - 25ms
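The figures above are typical round-trip latencies; your actual numbers depend on your network. As a rough check, the sketch below times a few lightweight requests against the /v1/edge/nearest endpoint (shown in the next section) and reports the fastest sample:
# Rough sketch: measure your own round-trip latency to the nearest edge by
# timing a few lightweight requests. Results vary with network conditions.
import time
import requests

def measure_edge_latency(api_key: str, samples: int = 5) -> float:
    url = "https://api.parrotrouter.com/v1/edge/nearest"
    headers = {"Authorization": f"Bearer {api_key}"}
    timings = []
    for _ in range(samples):
        start = time.time()
        requests.get(url, headers=headers, timeout=10)
        timings.append((time.time() - start) * 1000)
    return min(timings)  # fastest sample approximates the network floor

print(f"Measured edge latency: {measure_edge_latency('your-api-key'):.0f}ms")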
Edge Routing Configuration
# Automatic edge routing (recommended)
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "Hello"}]
)
# Automatically routes through nearest edge

# Force specific region
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={
        "X-Edge-Location": "us-west-2",
        "X-Edge-Fallback": "us-east-1,eu-west-1"
    }
)

# Get edge location info
import requests

edge_info = requests.get(
    "https://api.parrotrouter.com/v1/edge/nearest",
    headers={"Authorization": "Bearer your-api-key"}
).json()

print(f"Nearest edge: {edge_info['location']}")
print(f"Latency: {edge_info['latency_ms']}ms")
print(f"Available models: {edge_info['cached_models']}")
Performance Optimization
Parallel Processing
Process multiple requests simultaneously for better throughput.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.parrotrouter.com/v1",
    api_key="your-api-key"
)

async def process_batch(prompts):
    # Create tasks for parallel execution
    tasks = []
    for prompt in prompts:
        task = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            extra_headers={
                "X-Batch-Mode": "true",
                "X-Batch-Priority": "throughput"  # or "latency"
            }
        )
        tasks.append(task)
    # Execute all requests in parallel
    responses = await asyncio.gather(*tasks)
    return responses

# Process batch
prompts = ["Question 1", "Question 2", "Question 3"]
results = asyncio.run(process_batch(prompts))

# With connection pooling
async def optimized_batch(prompts):
    async with client:
        # Reuse connections for better performance
        responses = await asyncio.gather(*[
            client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": p}]
            ) for p in prompts
        ])
        return responses
Token Optimization
Reduce tokens for faster responses and lower costs.
import json

# Optimize prompt length
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{
        "role": "user",
        "content": "Summarize in 50 words: [long text]"
    }],
    max_tokens=100,  # Limit response length
    extra_headers={
        "X-Token-Budget": "500",       # Total token budget
        "X-Compression": "true",       # Enable prompt compression
        "X-Strip-Whitespace": "true"   # Remove extra spaces
    }
)

# Use prompt templates for efficiency
template_response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "{task}"}],
    extra_headers={
        "X-Template-ID": "summary-template",
        "X-Template-Vars": json.dumps({
            "task": "Summarize this article"
        })
    }
)
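Counting tokens locally before sending a request also helps you stay inside a token budget. A minimal sketch using the tiktoken tokenizer (a separate library, not part of ParrotRouter; the 500-token budget below is illustrative):
# Sketch: count tokens locally with tiktoken before sending a prompt.
# Install with `pip install tiktoken`; the budget value is illustrative.
import tiktoken

def count_tokens(text: str, model: str = "gpt-4-turbo-preview") -> int:
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # reasonable fallback
    return len(encoding.encode(text))

prompt = "Summarize in 50 words: [long text]"
if count_tokens(prompt) > 500:
    print("Prompt exceeds the token budget; trim or compress it before sending")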
Connection Optimization
Optimize network connections for better performance.
# HTTP/2 with keep-alive
import httpx

# Create optimized client (HTTP/2 support requires `pip install httpx[http2]`)
http_client = httpx.Client(
    http2=True,
    limits=httpx.Limits(
        max_keepalive_connections=10,
        max_connections=100,
        keepalive_expiry=30
    )
)

client = OpenAI(
    base_url="https://api.parrotrouter.com/v1",
    api_key="your-api-key",
    http_client=http_client
)

# DNS prefetch and connection warming
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={
        "X-Connection-Warm": "true",
        "X-DNS-Prefetch": "true",
        "X-TCP-Nodelay": "true"  # Disable Nagle's algorithm
    }
)
Performance Monitoring
Track and analyze performance metrics in real-time:
# Get detailed performance metrics
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "Analyze this"}],
    extra_headers={
        "X-Include-Metrics": "true"
    }
)

# Access performance data
metrics = response.metrics
print(f"Time to first token: {metrics['ttft_ms']}ms")
print(f"Total latency: {metrics['total_latency_ms']}ms")
print(f"Tokens per second: {metrics['tokens_per_second']}")
print(f"Queue time: {metrics['queue_time_ms']}ms")
print(f"Model inference time: {metrics['inference_time_ms']}ms")
print(f"Edge location: {metrics['edge_location']}")
print(f"Provider: {metrics['provider']}")

# Real-time performance monitoring
import requests

# Get current performance stats
stats = requests.get(
    "https://api.parrotrouter.com/v1/performance/current",
    headers={"Authorization": "Bearer your-api-key"}
).json()

print(f"Current latency: {stats['current_latency_ms']}ms")
print(f"Request queue: {stats['queue_length']}")
print(f"Active models: {stats['active_models']}")

# Historical performance data
history = requests.get(
    "https://api.parrotrouter.com/v1/performance/history",
    headers={"Authorization": "Bearer your-api-key"},
    params={
        "period": "1h",
        "metric": "latency",
        "groupBy": "model"
    }
).json()

for model_stats in history['models']:
    print(f"\n{model_stats['model']}:")
    print(f"  Avg latency: {model_stats['avg_latency_ms']}ms")
    print(f"  P50: {model_stats['p50_latency_ms']}ms")
    print(f"  P95: {model_stats['p95_latency_ms']}ms")
    print(f"  P99: {model_stats['p99_latency_ms']}ms")
Latency Benchmarks
[Benchmark charts: Time to First Token (TTFT) and Tokens Per Second]
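Benchmark figures vary with region, provider load, and prompt length, so measure from your own environment. A minimal sketch that streams one request per model and records TTFT and tokens per second (the model names are examples; it reuses the client defined above):
# Sketch: benchmark TTFT and throughput per model via the streaming API.
# Model names are examples; one request per model gives only a rough signal.
import time

def benchmark(model: str, prompt: str = "Write one paragraph about latency.") -> dict:
    start = time.time()
    first_token = None
    token_count = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            if first_token is None:
                first_token = time.time()
            token_count += 1
    total = time.time() - start
    return {
        "model": model,
        "ttft_ms": round((first_token - start) * 1000) if first_token else None,
        "tokens_per_sec": round(token_count / total, 1) if total > 0 else 0.0,
    }

for model in ["gpt-4-turbo-preview", "gpt-3.5-turbo"]:
    print(benchmark(model))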
Best Practices
1. Use Streaming for User-Facing Apps: stream responses to improve perceived performance.
2. Implement Client-Side Caching: cache frequent requests to eliminate network latency (see the sketch after this list).
3. Batch Similar Requests: process multiple requests together for better throughput.
4. Monitor Performance Metrics: track latency patterns to identify optimization opportunities.
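A minimal sketch of client-side caching for repeated, deterministic prompts (the hashing scheme and TTL are illustrative assumptions; it reuses the client defined above):
# Sketch: in-memory cache for repeated prompts; keys and TTL are illustrative.
# Only cache deterministic requests (temperature=0) so cached answers stay valid.
import hashlib
import json
import time

_cache = {}  # key -> (timestamp, content)
CACHE_TTL_SECONDS = 300

def cached_completion(model: str, messages: list) -> str:
    key = hashlib.sha256(json.dumps([model, messages], sort_keys=True).encode()).hexdigest()
    now = time.time()
    if key in _cache and now - _cache[key][0] < CACHE_TTL_SECONDS:
        return _cache[key][1]  # cache hit: no network round trip
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    content = response.choices[0].message.content
    _cache[key] = (now, content)
    return content

print(cached_completion("gpt-3.5-turbo", [{"role": "user", "content": "What is TTFT?"}]))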