Rate Limiting for LLM Applications
Protect your AI infrastructure from abuse while ensuring fair resource allocation
Critical for LLM APIs
Effective rate limiting prevents cost overruns, ensures fair access across users, and protects LLM endpoints against denial-of-service abuse [1].
Rate Limiting Overview
Real-time monitoring of API rate limits
Rate Limiting Algorithms
Choose the right algorithm based on traffic patterns
Token Bucket
Allows bursts up to the bucket capacity while refilling at a steady rate
Advantages
- Handles bursts well
- Flexible: request cost can be weighted per model and request size
- Industry standard
Limitations
- Requires state management
- Complex tuning
Best for: LLM APIs with variable load
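At its core, the algorithm tracks a bucket of tokens that refills continuously at a fixed rate and is drained by each request. The sketch below is a minimal, single-process illustration of that refill arithmetic (the class name and parameters are illustrative); the Redis-backed implementation in the next section is what you would use when requests are served by multiple processes.

// Minimal in-memory token bucket (illustrative only; state is not shared across processes)
class InMemoryTokenBucket {
  constructor(capacity, refillRatePerSecond) {
    this.capacity = capacity;
    this.refillRate = refillRatePerSecond;
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryConsume(cost = 1) {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    // Refill in proportion to elapsed time, capped at the bucket capacity
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}

// A bucket with capacity 100 and refill rate 10/s absorbs a burst of 100 requests,
// then sustains roughly 10 requests per second.
const bucket = new InMemoryTokenBucket(100, 10);
console.log(bucket.tryConsume()); // true while tokens remain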
Implementation Examples
// Token Bucket Implementation with Redis
// Assumes an ioredis-style client whose eval(script, numKeys, key, ...args)
// signature matches the call below.
class TokenBucketRateLimiter {
  constructor(redis, options = {}) {
    this.redis = redis;
    this.maxTokens = options.maxTokens || 100;
    this.refillRate = options.refillRate || 10; // tokens per second
    this.keyPrefix = options.keyPrefix || 'rate:';
  }

  async isAllowed(identifier, tokensRequested = 1) {
    const key = `${this.keyPrefix}${identifier}`;
    const now = Date.now();

    // Lua script for atomic token bucket operations
    const luaScript = `
      local key = KEYS[1]
      local max_tokens = tonumber(ARGV[1])
      local refill_rate = tonumber(ARGV[2])
      local now = tonumber(ARGV[3])
      local tokens_requested = tonumber(ARGV[4])

      local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
      local tokens = tonumber(bucket[1]) or max_tokens
      local last_refill = tonumber(bucket[2]) or now

      -- Calculate tokens to add based on time elapsed (now is in milliseconds)
      local elapsed = math.max(0, now - last_refill) / 1000
      local tokens_to_add = elapsed * refill_rate
      tokens = math.min(max_tokens, tokens + tokens_to_add)

      if tokens >= tokens_requested then
        tokens = tokens - tokens_requested
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return {1, tokens, max_tokens}
      else
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return {0, tokens, max_tokens}
      end
    `;

    const result = await this.redis.eval(
      luaScript,
      1,
      key,
      this.maxTokens,
      this.refillRate,
      now,
      tokensRequested
    );

    // Redis truncates fractional Lua numbers in replies, so remainingTokens
    // is a conservative (rounded-down) count.
    return {
      allowed: result[0] === 1,
      remainingTokens: result[1],
      maxTokens: result[2],
      refillRate: this.refillRate
    };
  }

  // Weighted token consumption for LLM requests
  async isAllowedWeighted(identifier, request) {
    // Calculate tokens based on model and request size
    const tokens = this.calculateTokenCost(request);
    return this.isAllowed(identifier, tokens);
  }

  calculateTokenCost(request) {
    // Illustrative per-model weights; tune these to your own cost structure
    const modelCosts = {
      'gpt-4': 10,
      'gpt-3.5-turbo': 3,
      'claude-2': 8,
      'llama-2': 2
    };
    const baseCost = modelCosts[request.model] || 5;
    const inputTokens = Math.ceil(request.inputTokens / 1000);
    const outputTokens = Math.ceil(request.maxTokens / 1000);
    return baseCost * (inputTokens + outputTokens);
  }
}
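
// A connected Redis client is required. This setup is a sketch assuming ioredis
// (any client exposing eval(script, numKeys, key, ...args) works); the
// connection URL is a placeholder.
const Redis = require('ioredis');
const redisClient = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');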
// Usage
const rateLimiter = new TokenBucketRateLimiter(redisClient, {
  maxTokens: 1000,
  refillRate: 50,
  keyPrefix: 'llm:rate:'
});
// Express middleware
async function rateLimitMiddleware(req, res, next) {
  const identifier = req.user?.id || req.ip;

  // Rough input-size estimate: join message contents and assume ~4 characters per token
  const estimatedInputTokens = Math.ceil(
    req.body.messages.map((m) => m.content || '').join(' ').length / 4
  );

  const tokenCost = rateLimiter.calculateTokenCost({
    model: req.body.model,
    inputTokens: estimatedInputTokens,
    maxTokens: req.body.max_tokens || 1000
  });
  const result = await rateLimiter.isAllowed(identifier, tokenCost);

  // Set rate limit headers
  res.set({
    'X-RateLimit-Limit': result.maxTokens,
    'X-RateLimit-Remaining': result.remainingTokens,
    'X-RateLimit-Refill-Rate': result.refillRate
  });

  if (!result.allowed) {
    // Seconds until enough tokens have refilled to cover this request
    const retryAfter = Math.max(
      1,
      Math.ceil((tokenCost - result.remainingTokens) / result.refillRate)
    );
    res.set('Retry-After', String(retryAfter));
    return res.status(429).json({
      error: 'Too Many Requests',
      message: 'Rate limit exceeded. Please retry later.',
      retryAfter
    });
  }

  next();
}
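With the weights above, for example, a gpt-4 request with roughly 3,000 estimated input tokens and max_tokens of 1,000 costs 10 × (3 + 1) = 40 bucket tokens per call. Clients should honor the 429 response rather than retrying immediately. The sketch below shows one possible client-side pattern, separate from the server code above; the endpoint path and fetch-based transport are placeholder assumptions. It waits for the server's Retry-After hint (falling back to exponential backoff) and caps the number of attempts.

// Client-side retry that honors Retry-After (illustrative; endpoint and payload are placeholders)
async function callWithRetry(payload, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetch('/v1/chat/completions', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload)
    });

    if (response.status !== 429) {
      return response.json();
    }

    // Prefer the server's hint; fall back to exponential backoff if the header is missing
    const retryAfter = Number(response.headers.get('Retry-After')) || 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, retryAfter * 1000));
  }
  throw new Error('Rate limit still exceeded after retries');
}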
Enterprise Rate Limiting with ParrotRouter
ParrotRouter provides advanced rate limiting with automatic scaling, intelligent routing, and real-time monitoring. Protect your LLM infrastructure while ensuring optimal performance.
References
- [1] OWASP. "OWASP Top 10 for LLM Applications" (2024)
- [2] NIST. "AI Risk Management Framework" (2024)
- [3] Microsoft. "LLM Security Best Practices" (2024)