Rate Limiting for LLM Applications
Protect your AI infrastructure from abuse while ensuring fair resource allocation
Critical for LLM APIs
Effective rate limiting prevents cost overruns, ensures fair access across users, and protects LLM endpoints against denial-of-service abuse [1].
Rate Limiting Overview
Real-time monitoring of API rate limits
Rate Limiting Algorithms
Choose the right algorithm based on traffic patterns
Token Bucket
Allows bursts up to the bucket capacity while refilling at a steady rate
Advantages
- Handles bursts well
- Flexible: request cost can be weighted per model and request size
- Industry standard
Limitations
- Requires state management
- Complex tuning
Best for: LLM APIs with variable load
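At its core, the algorithm tracks a bucket of tokens that refills continuously at a fixed rate and is drained by each request. The sketch below is a minimal, single-process illustration of that refill arithmetic (the class name and parameters are illustrative); the Redis-backed implementation in the next section is what you would use when requests are served by multiple processes.

// Minimal in-memory token bucket (illustrative only; state is not shared across processes)
class InMemoryTokenBucket {
  constructor(capacity, refillRatePerSecond) {
    this.capacity = capacity;
    this.refillRate = refillRatePerSecond;
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryConsume(cost = 1) {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    // Refill in proportion to elapsed time, capped at the bucket capacity
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}

// A bucket with capacity 100 and refill rate 10/s absorbs a burst of 100 requests,
// then sustains roughly 10 requests per second.
const bucket = new InMemoryTokenBucket(100, 10);
console.log(bucket.tryConsume()); // true while tokens remain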
Implementation Examples
// Token Bucket Implementation with Redis
// Assumes an ioredis-style client whose eval(script, numKeys, key, ...args)
// signature matches the call below.
class TokenBucketRateLimiter {
  constructor(redis, options = {}) {
    this.redis = redis;
    this.maxTokens = options.maxTokens || 100;
    this.refillRate = options.refillRate || 10; // tokens per second
    this.keyPrefix = options.keyPrefix || 'rate:';
  }

  async isAllowed(identifier, tokensRequested = 1) {
    const key = `${this.keyPrefix}${identifier}`;
    const now = Date.now();

    // Lua script for atomic token bucket operations
    const luaScript = `
      local key = KEYS[1]
      local max_tokens = tonumber(ARGV[1])
      local refill_rate = tonumber(ARGV[2])
      local now = tonumber(ARGV[3])
      local tokens_requested = tonumber(ARGV[4])

      local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
      local tokens = tonumber(bucket[1]) or max_tokens
      local last_refill = tonumber(bucket[2]) or now

      -- Calculate tokens to add based on time elapsed (now is in milliseconds)
      local elapsed = math.max(0, now - last_refill) / 1000
      local tokens_to_add = elapsed * refill_rate
      tokens = math.min(max_tokens, tokens + tokens_to_add)

      if tokens >= tokens_requested then
        tokens = tokens - tokens_requested
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return {1, tokens, max_tokens}
      else
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return {0, tokens, max_tokens}
      end
    `;

    const result = await this.redis.eval(
      luaScript,
      1,
      key,
      this.maxTokens,
      this.refillRate,
      now,
      tokensRequested
    );

    // Redis truncates fractional Lua numbers in replies, so remainingTokens
    // is a conservative (rounded-down) count.
    return {
      allowed: result[0] === 1,
      remainingTokens: result[1],
      maxTokens: result[2],
      refillRate: this.refillRate
    };
  }

  // Weighted token consumption for LLM requests
  async isAllowedWeighted(identifier, request) {
    // Calculate tokens based on model and request size
    const tokens = this.calculateTokenCost(request);
    return this.isAllowed(identifier, tokens);
  }

  calculateTokenCost(request) {
    // Illustrative per-model weights; tune these to your own cost structure
    const modelCosts = {
      'gpt-4': 10,
      'gpt-3.5-turbo': 3,
      'claude-2': 8,
      'llama-2': 2
    };
    const baseCost = modelCosts[request.model] || 5;
    const inputTokens = Math.ceil(request.inputTokens / 1000);
    const outputTokens = Math.ceil(request.maxTokens / 1000);
    return baseCost * (inputTokens + outputTokens);
  }
}
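
// A connected Redis client is required. This setup is a sketch assuming ioredis
// (any client exposing eval(script, numKeys, key, ...args) works); the
// connection URL is a placeholder.
const Redis = require('ioredis');
const redisClient = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');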
// Usage
const rateLimiter = new TokenBucketRateLimiter(redisClient, {
  maxTokens: 1000,
  refillRate: 50,
  keyPrefix: 'llm:rate:'
});
// Express middleware
async function rateLimitMiddleware(req, res, next) {
  const identifier = req.user?.id || req.ip;

  // Rough input-size estimate: join message contents and assume ~4 characters per token
  const estimatedInputTokens = Math.ceil(
    req.body.messages.map((m) => m.content || '').join(' ').length / 4
  );

  const tokenCost = rateLimiter.calculateTokenCost({
    model: req.body.model,
    inputTokens: estimatedInputTokens,
    maxTokens: req.body.max_tokens || 1000
  });
  const result = await rateLimiter.isAllowed(identifier, tokenCost);

  // Set rate limit headers
  res.set({
    'X-RateLimit-Limit': result.maxTokens,
    'X-RateLimit-Remaining': result.remainingTokens,
    'X-RateLimit-Refill-Rate': result.refillRate
  });

  if (!result.allowed) {
    // Seconds until enough tokens have refilled to cover this request
    const retryAfter = Math.max(
      1,
      Math.ceil((tokenCost - result.remainingTokens) / result.refillRate)
    );
    res.set('Retry-After', String(retryAfter));
    return res.status(429).json({
      error: 'Too Many Requests',
      message: 'Rate limit exceeded. Please retry later.',
      retryAfter
    });
  }

  next();
}
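With the weights above, for example, a gpt-4 request with roughly 3,000 estimated input tokens and max_tokens of 1,000 costs 10 × (3 + 1) = 40 bucket tokens per call. Clients should honor the 429 response rather than retrying immediately. The sketch below shows one possible client-side pattern, separate from the server code above; the endpoint path and fetch-based transport are placeholder assumptions. It waits for the server's Retry-After hint (falling back to exponential backoff) and caps the number of attempts.

// Client-side retry that honors Retry-After (illustrative; endpoint and payload are placeholders)
async function callWithRetry(payload, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetch('/v1/chat/completions', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload)
    });

    if (response.status !== 429) {
      return response.json();
    }

    // Prefer the server's hint; fall back to exponential backoff if the header is missing
    const retryAfter = Number(response.headers.get('Retry-After')) || 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, retryAfter * 1000));
  }
  throw new Error('Rate limit still exceeded after retries');
}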
Enterprise Rate Limiting with ParrotRouter
ParrotRouter provides advanced rate limiting with automatic scaling, intelligent routing, and real-time monitoring. Protect your LLM infrastructure while ensuring optimal performance.
References
- [1] OWASP. "OWASP Top 10 for LLM Applications" (2024)
- [2] NIST. "AI Risk Management Framework" (2024)
- [3] Microsoft. "LLM Security Best Practices" (2024)