Advanced
January 20, 2024 · 30 min read

LLM Monitoring and Observability

Build comprehensive monitoring systems to track performance, costs, and quality of your LLM applications in production with OpenTelemetry, Prometheus, and distributed tracing.

  • Metrics: 20+ types
  • Cost reduction: 15-30%
  • Issue detection: 90% faster
  • Data sources: 15+ tools

Key Monitoring Areas

Performance Metrics
Track response times, throughput, and system health (a small instrumentation sketch follows this list)
  • Response time (p50, p95, p99)
  • Token generation speed
  • Queue depth and wait times
  • Error rates and timeouts
  • Model load balancing metrics
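Latency percentiles fall out of a histogram instrument, and queue depth maps naturally onto an up/down counter. A minimal sketch using the OpenTelemetry API configured in the next section; the metric names and the withQueueTracking helper are illustrative, not a standard API:

import { metrics } from '@opentelemetry/api'

const perfMeter = metrics.getMeter('llm-service')

// The up/down counter tracks how many requests are currently executing;
// the histogram captures time spent waiting before execution starts.
const inFlightRequests = perfMeter.createUpDownCounter('llm_requests_in_flight')
const queueWaitHistogram = perfMeter.createHistogram('llm_queue_wait_seconds')

async function withQueueTracking<T>(enqueuedAt: number, work: () => Promise<T>): Promise<T> {
  queueWaitHistogram.record((Date.now() - enqueuedAt) / 1000)
  inFlightRequests.add(1)
  try {
    return await work()
  } finally {
    inFlightRequests.add(-1)
  }
}

Percentiles such as p95 and p99 are then computed from the histogram on the Prometheus side.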

OpenTelemetry Implementation

OpenTelemetry provides the foundation for comprehensive LLM observability with distributed tracing, metrics collection, and log aggregation.

Basic Setup

import { NodeSDK } from '@opentelemetry/sdk-node'
import { Resource } from '@opentelemetry/resources'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'
import { JaegerExporter } from '@opentelemetry/exporter-jaeger'
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus'

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'llm-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
  traceExporter: new JaegerExporter({
    endpoint: 'http://localhost:14268/api/traces',
  }),
  metricReader: new PrometheusExporter({
    port: 9090,
  }),
})

sdk.start()
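The SDK batches spans and metrics in memory, so it is worth flushing it on shutdown to avoid losing the last few requests. A minimal sketch, assuming the sdk instance above:

// Flush buffered telemetry before the process exits
process.on('SIGTERM', async () => {
  await sdk.shutdown()
  process.exit(0)
})

Note that the PrometheusExporter serves a scrape endpoint on the configured port (9090 here, which will collide with a Prometheus server running locally on its default port; the exporter's own default is 9464).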

LLM Request Tracing

import { trace, metrics, SpanStatusCode } from '@opentelemetry/api'
import OpenAI from 'openai'

const openai = new OpenAI()

const tracer = trace.getTracer('llm-service')
const meter = metrics.getMeter('llm-service')

// Create counters and histograms
const requestCounter = meter.createCounter('llm_requests_total')
const responseTimeHistogram = meter.createHistogram('llm_response_time_seconds')
const tokenCounter = meter.createCounter('llm_tokens_total')
const costCounter = meter.createCounter('llm_cost_total')

async function callLLM(prompt: string, model: string) {
  const span = tracer.startSpan('llm_request', {
    attributes: {
      'llm.model': model,
      'llm.prompt_length': prompt.length,
      'llm.provider': 'openai'
    }
  })

  const startTime = Date.now()

  try {
    const response = await openai.chat.completions.create({
      model,
      messages: [{ role: 'user', content: prompt }],
    })

    const duration = (Date.now() - startTime) / 1000
    const promptTokens = response.usage?.prompt_tokens ?? 0
    const completionTokens = response.usage?.completion_tokens ?? 0
    const tokens = response.usage?.total_tokens ?? 0
    // calculateCost delegates to the CostTracker shown in the Cost Monitoring section below
    const cost = calculateCost(model, promptTokens, completionTokens)

    // Record metrics
    requestCounter.add(1, { model, status: 'success' })
    responseTimeHistogram.record(duration, { model })
    tokenCounter.add(tokens, { model, type: 'total' })
    costCounter.add(cost, { model })

    // Add span attributes
    span.setAttributes({
      'llm.response_length': response.choices[0].message.content?.length ?? 0,
      'llm.tokens.prompt': promptTokens,
      'llm.tokens.completion': completionTokens,
      'llm.tokens.total': tokens,
      'llm.cost': cost
    })

    span.setStatus({ code: SpanStatusCode.OK })
    return response
  } catch (error) {
    requestCounter.add(1, { model, status: 'error' })
    span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message })
    throw error
  } finally {
    span.end()
  }
}
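Individual calls are most useful when they appear inside a larger trace. A sketch of wrapping a retrieval-plus-generation pipeline in a parent span, reusing the tracer and callLLM from above; retrieveContext is a hypothetical retrieval step:

// Hypothetical retrieval step, declared only so the sketch type-checks
declare function retrieveContext(question: string): Promise<string[]>

async function answerQuestion(question: string) {
  return tracer.startActiveSpan('rag_pipeline', async (span) => {
    try {
      const docs = await retrieveContext(question)
      span.setAttribute('rag.documents_retrieved', docs.length)

      // callLLM starts its own span, recorded as a child of 'rag_pipeline'
      return await callLLM(`${docs.join('\n')}\n\n${question}`, 'gpt-4')
    } finally {
      span.end()
    }
  })
}

In Jaeger, the retrieval and generation steps then show up as separately timed children of one end-to-end request.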

Monitoring Stack

Metrics (Prometheus)
  • Request rates and latencies
  • Token usage and costs
  • Error rates and types
  • Resource utilization
  • Custom business metrics
Traces (Jaeger)
  • End-to-end request flows
  • Service dependencies
  • Performance bottlenecks
  • Error root cause analysis
  • User journey tracking
Logs (ELK Stack): see the logging sketch after this list
  • Structured application logs
  • LLM request/response data
  • Error details and stack traces
  • Security audit trails
  • Debug information
Alerts (AlertManager)
  • High error rates
  • Latency spikes
  • Cost threshold breaches
  • Service downtime
  • Quality degradation
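For the logging side, emitting one structured JSON record per request keeps the ELK pipeline simple. A minimal sketch using pino; the field names are illustrative, and prompts/responses should be truncated or redacted according to your data policy:

import pino from 'pino'

const logger = pino({ name: 'llm-service' })

function logLLMRequest(entry: {
  requestId: string
  model: string
  promptTokens: number
  completionTokens: number
  latencyMs: number
  status: 'success' | 'error'
}) {
  // One JSON line per request, ready to ship via Filebeat/Logstash into Elasticsearch
  logger.info(entry, 'llm_request_completed')
}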

Cost Monitoring

Cost monitoring is crucial for LLM applications due to token-based pricing and varying model costs.

Cost Tracking Implementation

class CostTracker {
  // Running spend per user (in-memory; use a persistent store in production)
  private costs = new Map<string, number>()

  // Model pricing per 1K tokens (USD); update as provider pricing changes
  private pricing: Record<string, { input: number; output: number }> = {
    'gpt-4': { input: 0.03, output: 0.06 },
    'gpt-3.5-turbo': { input: 0.001, output: 0.002 },
    'claude-3-opus': { input: 0.015, output: 0.075 },
    'claude-3-sonnet': { input: 0.003, output: 0.015 }
  }

  calculateCost(model: string, inputTokens: number, outputTokens: number): number {
    const modelPricing = this.pricing[model]
    if (!modelPricing) return 0

    const inputCost = (inputTokens / 1000) * modelPricing.input
    const outputCost = (outputTokens / 1000) * modelPricing.output
    return inputCost + outputCost
  }

  trackCost(userId: string, sessionId: string, cost: number, metadata: any) {
    // Store cost with context (persist costEntry to your analytics store; not shown here)
    const costEntry = {
      userId,
      sessionId,
      cost,
      timestamp: new Date(),
      ...metadata
    }

    // Send to monitoring system
    costCounter.add(cost, {
      user_id: userId,
      model: metadata.model,
      provider: metadata.provider
    })

    // Check budget alerts
    void this.checkBudgetAlerts(userId, cost)
  }

  async checkBudgetAlerts(userId: string, newCost: number) {
    const userBudget = await this.getUserBudget(userId)
    const currentSpend = await this.getCurrentSpend(userId)

    if (currentSpend + newCost > userBudget * 0.8) {
      // Send alert at 80% of budget
      this.sendBudgetAlert(userId, currentSpend, userBudget)
    }

    // Update the running total with the new cost
    this.costs.set(userId, currentSpend + newCost)
  }

  private async getCurrentSpend(userId: string): Promise<number> {
    return this.costs.get(userId) ?? 0
  }

  private async getUserBudget(userId: string): Promise<number> {
    // Placeholder: look up the user's budget from configuration or billing
    return 100
  }

  private sendBudgetAlert(userId: string, currentSpend: number, budget: number) {
    // Placeholder: route to AlertManager, Slack, email, etc.
    console.warn(`User ${userId} is at $${currentSpend.toFixed(2)} of a $${budget} budget`)
  }
}
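Wired into the request path from the tracing example, usage might look like this (a sketch; the shared openai client and a module-level tracker are assumed):

const costTracker = new CostTracker()

async function trackedCompletion(userId: string, sessionId: string, prompt: string, model = 'gpt-4') {
  const response = await openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
  })

  const cost = costTracker.calculateCost(
    model,
    response.usage?.prompt_tokens ?? 0,
    response.usage?.completion_tokens ?? 0
  )

  costTracker.trackCost(userId, sessionId, cost, { model, provider: 'openai' })
  return response
}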

Quality Monitoring

Monitor response quality to ensure your LLM application maintains high standards over time.

Quality Metrics

// Quality score instrument, created from the meter in the tracing example.
// Note: meter.createGauge() requires @opentelemetry/api 1.9+; on older versions
// use a histogram or an observable gauge instead.
const qualityGauge = meter.createGauge('llm_quality_score')

class QualityMonitor {
  async evaluateResponse(prompt: string, response: string, context?: any) {
    const metrics = await Promise.all([
      this.checkRelevance(prompt, response),
      this.checkCoherence(response),
      this.checkSafety(response),
      this.checkFactuality(response, context)
    ])

    const qualityScore = {
      relevance: metrics[0],
      coherence: metrics[1],
      safety: metrics[2],
      factuality: metrics[3],
      overall: metrics.reduce((a, b) => a + b, 0) / metrics.length
    }

    // Record quality metrics
    qualityGauge.record(qualityScore.overall, {
      metric_type: 'overall',
      model: context?.model
    })

    qualityGauge.record(qualityScore.relevance, {
      metric_type: 'relevance',
      model: context?.model
    })

    // Alert on quality degradation
    if (qualityScore.overall < 0.7) {
      this.alertQualityIssue(qualityScore, prompt, response)
    }

    return qualityScore
  }

  async checkRelevance(prompt: string, response: string): Promise<number> {
    // Use embedding similarity (or a dedicated relevance classifier)
    const promptEmbedding = await this.getEmbedding(prompt)
    const responseEmbedding = await this.getEmbedding(response)
    return this.cosineSimilarity(promptEmbedding, responseEmbedding)
  }

  async checkSafety(response: string): Promise<number> {
    // Use the OpenAI content moderation API
    const moderation = await openai.moderations.create({
      input: response
    })

    return moderation.results[0].flagged ? 0 : 1
  }

  private async getEmbedding(text: string): Promise<number[]> {
    const result = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text
    })
    return result.data[0].embedding
  }

  private cosineSimilarity(a: number[], b: number[]): number {
    const dot = a.reduce((sum, v, i) => sum + v * b[i], 0)
    const magA = Math.sqrt(a.reduce((s, v) => s + v * v, 0))
    const magB = Math.sqrt(b.reduce((s, v) => s + v * v, 0))
    return dot / (magA * magB)
  }

  private async checkCoherence(response: string): Promise<number> {
    // Placeholder: typically an LLM-as-judge or classifier score in [0, 1]
    return 1
  }

  private async checkFactuality(response: string, context?: any): Promise<number> {
    // Placeholder: compare claims against retrieved context or a reference source
    return 1
  }

  private alertQualityIssue(score: unknown, prompt: string, response: string) {
    // Placeholder: route to your alerting channel with the offending sample
    console.warn('LLM quality below threshold', { score, prompt, response })
  }
}
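Running every check on every response adds latency and cost, since several checks call other models, so quality evaluation is often sampled and run off the critical path. A sketch; the 10% sample rate is an arbitrary choice:

const qualityMonitor = new QualityMonitor()
const QUALITY_SAMPLE_RATE = 0.1 // evaluate roughly 10% of responses

function maybeEvaluate(prompt: string, responseText: string, model: string) {
  if (Math.random() > QUALITY_SAMPLE_RATE) return

  // Fire and forget: evaluation latency never reaches the user
  qualityMonitor
    .evaluateResponse(prompt, responseText, { model })
    .catch((err) => console.error('quality evaluation failed', err))
}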

Alerting and Dashboards

Critical Alerts
  • Error rate > 5% for 5 minutes
  • Response time p95 > 10 seconds
  • Cost increase > 50% hour-over-hour
  • Quality score drops below 0.7
  • Service availability < 99%
Dashboard Components
  • Real-time request volume and latency
  • Cost breakdown by model and user
  • Error rates and types over time
  • Quality metrics trends
  • Resource utilization and scaling metrics

