Rate Limiting for LLM Applications

Protect your AI infrastructure from abuse while ensuring fair resource allocation

[Figure: Rate Limiting Overview, a real-time dashboard of API rate limit usage]
Rate Limiting Algorithms
Choose the right algorithm based on traffic patterns

Token Bucket

Allows bursts up to a fixed bucket capacity, with tokens refilling at a steady rate (a minimal sketch follows this summary)

Advantages

  • Handles bursts well
  • Flexible (capacity and refill rate can be tuned independently)
  • Industry standard

Limitations

  • Requires per-client state (e.g., in Redis)
  • Capacity and refill rate need careful tuning

Best for: LLM APIs with variable load
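
Before the production implementation below, here is a minimal in-memory sketch of the refill arithmetic. It is illustrative only: there is no persistence or cross-process atomicity, and the class name SimpleTokenBucket is made up for this example.

// Minimal in-memory token bucket (illustrative; the Redis version below is the production path)
class SimpleTokenBucket {
  constructor(maxTokens, refillRatePerSecond) {
    this.maxTokens = maxTokens;
    this.refillRate = refillRatePerSecond;
    this.tokens = maxTokens;       // start full so an initial burst is allowed
    this.lastRefill = Date.now();
  }

  tryConsume(cost = 1) {
    // Refill proportionally to elapsed time, capped at the bucket capacity
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = now;

    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;  // allowed
    }
    return false;   // rejected; caller should retry after tokens refill
  }
}

// A bucket of 100 tokens refilling at 10/s absorbs a burst of 100 requests,
// then sustains roughly 10 requests per second.
const demoBucket = new SimpleTokenBucket(100, 10);
console.log(demoBucket.tryConsume()); // true while tokens remain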

Implementation Examples

// Token Bucket Implementation with Redis
class TokenBucketRateLimiter {
  constructor(redis, options = {}) {
    this.redis = redis;
    this.maxTokens = options.maxTokens || 100;
    this.refillRate = options.refillRate || 10; // tokens per second
    this.keyPrefix = options.keyPrefix || 'rate:';
  }

  async isAllowed(identifier, tokensRequested = 1) {
    const key = `${this.keyPrefix}${identifier}`;
    const now = Date.now();

    // Lua script for atomic token bucket operations
    const luaScript = `
      local key = KEYS[1]
      local max_tokens = tonumber(ARGV[1])
      local refill_rate = tonumber(ARGV[2])
      local now = tonumber(ARGV[3])
      local tokens_requested = tonumber(ARGV[4])
      
      local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
      local tokens = tonumber(bucket[1]) or max_tokens
      local last_refill = tonumber(bucket[2]) or now
      
      -- Calculate tokens to add based on time elapsed
      local elapsed = math.max(0, now - last_refill) / 1000
      local tokens_to_add = elapsed * refill_rate
      tokens = math.min(max_tokens, tokens + tokens_to_add)
      
      if tokens >= tokens_requested then
        tokens = tokens - tokens_requested
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return {1, tokens, max_tokens}
      else
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return {0, tokens, max_tokens}
      end
    `;

    const result = await this.redis.eval(
      luaScript,
      1,
      key,
      this.maxTokens,
      this.refillRate,
      now,
      tokensRequested
    );

    return {
      allowed: result[0] === 1,
      remainingTokens: result[1],
      maxTokens: result[2],
      refillRate: this.refillRate,
      requestedTokens: tokensRequested
    };
  }

  // Weighted token consumption for LLM requests
  async isAllowedWeighted(identifier, request) {
    // Calculate tokens based on model and request size
    const tokens = this.calculateTokenCost(request);
    return this.isAllowed(identifier, tokens);
  }

  calculateTokenCost(request) {
    // Relative cost weight per model (higher = more expensive per 1K tokens)
    const modelCosts = {
      'gpt-4': 10,
      'gpt-3.5-turbo': 3,
      'claude-2': 8,
      'llama-2': 2
    };

    const baseCost = modelCosts[request.model] || 5;
    // Charge per 1K estimated input tokens plus per 1K requested output tokens
    const inputUnits = Math.ceil(request.inputTokens / 1000);
    const outputUnits = Math.ceil(request.maxTokens / 1000);

    return baseCost * (inputUnits + outputUnits);
  }
}

// Usage (assumes an ioredis client, whose eval() signature matches the call above;
// adapt the eval call if you use a different Redis library)
const Redis = require('ioredis');
const redisClient = new Redis(); // connects to localhost:6379 by default

const rateLimiter = new TokenBucketRateLimiter(redisClient, {
  maxTokens: 1000,
  refillRate: 50,
  keyPrefix: 'llm:rate:'
});

// Middleware
async function rateLimitMiddleware(req, res, next) {
  const identifier = req.user?.id || req.ip;
  const result = await rateLimiter.isAllowedWeighted(identifier, {
    model: req.body.model,
    // Rough input size: join message contents and assume ~4 characters per token
    inputTokens: Math.ceil(req.body.messages.map(m => m.content || '').join(' ').length / 4),
    maxTokens: req.body.max_tokens || 1000
  });

  // Set rate limit headers
  res.set({
    'X-RateLimit-Limit': result.maxTokens,
    'X-RateLimit-Remaining': result.remainingTokens,
    'X-RateLimit-Refill-Rate': result.refillRate
  });

  if (!result.allowed) {
    // Seconds until enough tokens refill to cover this request (minimum 1)
    const deficit = result.requestedTokens - result.remainingTokens;
    const retryAfter = Math.max(1, Math.ceil(deficit / result.refillRate));
    res.set('Retry-After', String(retryAfter));
    return res.status(429).json({
      error: 'Too Many Requests',
      message: 'Rate limit exceeded. Please retry later.',
      retryAfter
    });
  }

  next();
}
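
To see the middleware in context, here is a minimal wiring sketch assuming an Express app and an OpenAI-style request body; the route path and upstream call are placeholders, not part of any particular API.

// Example: protecting a completion endpoint with the middleware above (illustrative)
const express = require('express');
const app = express();

app.use(express.json());

// Every completion request is checked against the weighted token bucket first
app.post('/v1/chat/completions', rateLimitMiddleware, async (req, res) => {
  // ... forward req.body to the upstream LLM provider here ...
  res.json({ status: 'ok' });
});

app.listen(3000);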

Enterprise Rate Limiting with ParrotRouter

ParrotRouter provides advanced rate limiting with automatic scaling, intelligent routing, and real-time monitoring. Protect your LLM infrastructure while ensuring optimal performance.
