Why LLM Monitoring Matters
LLM applications in production need comprehensive monitoring across cost, performance, quality, and user experience. Without proper observability, issues go undetected, degrading the user experience and driving unexpected spend.
- Metrics tracked: 20+ types
- Cost reduction: 15-30%
- Issue detection: 90% faster
- Data sources: 15+ tools
Key Monitoring Areas
Performance Metrics
Track response times, throughput, and system health; a minimal measurement sketch follows the list below.
- Response time (p50, p95, p99)
- Token generation speed
- Queue depth and wait times
- Error rates and timeouts
- Model load balancing metrics
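A minimal sketch of how the first two metrics, request latency and token generation speed, can be captured around any completion call. The CompletionResult shape and the llmCall parameter are illustrative assumptions, not a specific provider SDK:

// Measures latency and tokens/second around a single completion call.
// CompletionResult and llmCall are illustrative placeholders.
interface CompletionResult {
  text: string
  completionTokens: number
}

async function measureCompletion(llmCall: () => Promise<CompletionResult>) {
  const start = performance.now()
  const result = await llmCall()
  const latencyMs = performance.now() - start

  // Token generation speed: completion tokens per wall-clock second
  const tokensPerSecond = result.completionTokens / (latencyMs / 1000)

  return { result, latencyMs, tokensPerSecond }
}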
OpenTelemetry Implementation
OpenTelemetry provides the foundation for comprehensive LLM observability with distributed tracing, metrics collection, and log aggregation.
Basic Setup
import { NodeSDK } from '@opentelemetry/sdk-node'
import { Resource } from '@opentelemetry/resources'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'
import { JaegerExporter } from '@opentelemetry/exporter-jaeger'
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus'

const sdk = new NodeSDK({
  // Identify this service in traces and metrics
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'llm-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
  // Send traces to a local Jaeger collector
  traceExporter: new JaegerExporter({
    endpoint: 'http://localhost:14268/api/traces',
  }),
  // Expose a scrape endpoint for Prometheus to collect metrics
  metricReader: new PrometheusExporter({
    port: 9090,
  }),
})

sdk.start()
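On shutdown it is worth flushing buffered telemetry before the process exits. A minimal sketch; the signal handler is an assumption about your deployment, not part of the SDK setup above:

// Flush pending spans and metrics before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown()
    .catch((err) => console.error('Error shutting down OpenTelemetry SDK', err))
    .finally(() => process.exit(0))
})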
LLM Request Tracing
import { trace, metrics, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('llm-service')
const meter = metrics.getMeter('llm-service')

// Create counters and histograms
const requestCounter = meter.createCounter('llm_requests_total')
const responseTimeHistogram = meter.createHistogram('llm_response_time_seconds')
const tokenCounter = meter.createCounter('llm_tokens_total')
const costCounter = meter.createCounter('llm_cost_total')

// Assumes an initialized OpenAI client (`openai`) and a `calculateCost` helper
// like the one defined in the Cost Monitoring section below.
async function callLLM(prompt: string, model: string) {
  const span = tracer.startSpan('llm_request', {
    attributes: {
      'llm.model': model,
      'llm.prompt_length': prompt.length,
      'llm.provider': 'openai',
    },
  })
  const startTime = Date.now()

  try {
    const response = await openai.chat.completions.create({
      model,
      messages: [{ role: 'user', content: prompt }],
    })

    const duration = (Date.now() - startTime) / 1000
    const tokens = response.usage?.total_tokens ?? 0
    const cost = calculateCost(tokens, model)

    // Record metrics
    requestCounter.add(1, { model, status: 'success' })
    responseTimeHistogram.record(duration, { model })
    tokenCounter.add(tokens, { model, type: 'total' })
    costCounter.add(cost, { model })

    // Add span attributes
    span.setAttributes({
      'llm.response_length': response.choices[0].message.content?.length ?? 0,
      'llm.tokens.prompt': response.usage?.prompt_tokens ?? 0,
      'llm.tokens.completion': response.usage?.completion_tokens ?? 0,
      'llm.tokens.total': tokens,
      'llm.cost': cost,
    })
    span.setStatus({ code: SpanStatusCode.OK })

    return response
  } catch (error) {
    requestCounter.add(1, { model, status: 'error' })
    span.recordException(error as Error)
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error instanceof Error ? error.message : String(error),
    })
    throw error
  } finally {
    span.end()
  }
}
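Callers then route through the instrumented wrapper rather than the raw client; the prompt and model below are illustrative:

// Illustrative call through the instrumented wrapper
const reply = await callLLM('Summarize the latest incident report.', 'gpt-4')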
Monitoring Stack
Metrics (Prometheus)
- Request rates and latencies
- Token usage and costs
- Error rates and types
- Resource utilization
- Custom business metrics
Traces (Jaeger)
- End-to-end request flows
- Service dependencies
- Performance bottlenecks
- Error root cause analysis
- User journey tracking
Logs (ELK Stack)
- Structured application logs (see the logging sketch after this list)
- LLM request/response data
- Error details and stack traces
- Security audit trails
- Debug information
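A minimal structured-logging sketch that an ELK (or any JSON-log) pipeline can index without extra parsing. It assumes the pino logger; the field names and values are illustrative, and prompt/response bodies are deliberately omitted to keep sensitive data out of logs:

import pino from 'pino'

// JSON logs that Logstash/Elasticsearch can index directly
const logger = pino({ name: 'llm-service', level: 'info' })

// Log request metadata only; omit prompt/response text to keep PII out of logs
logger.info(
  {
    event: 'llm_request_completed',
    model: 'gpt-4',
    promptTokens: 420,
    completionTokens: 180,
    durationMs: 1830,
    costUsd: 0.0234,
  },
  'LLM request completed'
)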
Alerts (AlertManager)
- High error rates
- Latency spikes
- Cost threshold breaches
- Service downtime
- Quality degradation
Cost Monitoring
Cost monitoring is crucial for LLM applications due to token-based pricing and varying model costs.
Cost Tracking Implementation
class CostTracker {
  // Running spend per user (USD)
  private costs = new Map<string, number>()

  // Model pricing per 1K tokens (USD)
  private pricing: Record<string, { input: number; output: number }> = {
    'gpt-4': { input: 0.03, output: 0.06 },
    'gpt-3.5-turbo': { input: 0.001, output: 0.002 },
    'claude-3-opus': { input: 0.015, output: 0.075 },
    'claude-3-sonnet': { input: 0.003, output: 0.015 },
  }

  calculateCost(model: string, inputTokens: number, outputTokens: number): number {
    const modelPricing = this.pricing[model]
    if (!modelPricing) return 0

    const inputCost = (inputTokens / 1000) * modelPricing.input
    const outputCost = (outputTokens / 1000) * modelPricing.output
    return inputCost + outputCost
  }

  trackCost(userId: string, sessionId: string, cost: number, metadata: any) {
    // Store cost with context (persistence of costEntry is omitted here)
    const costEntry = {
      userId,
      sessionId,
      cost,
      timestamp: new Date(),
      ...metadata,
    }
    this.costs.set(userId, (this.costs.get(userId) ?? 0) + cost)

    // Send to the monitoring system (costCounter from the tracing setup above)
    costCounter.add(cost, {
      user_id: userId,
      model: metadata.model,
      provider: metadata.provider,
    })

    // Check budget alerts
    void this.checkBudgetAlerts(userId, cost)
  }

  async checkBudgetAlerts(userId: string, newCost: number) {
    // getUserBudget, getCurrentSpend, and sendBudgetAlert are application-specific
    // helpers omitted here
    const userBudget = await this.getUserBudget(userId)
    const currentSpend = await this.getCurrentSpend(userId)

    // Send alert once spend crosses 80% of the budget
    if (currentSpend + newCost > userBudget * 0.8) {
      this.sendBudgetAlert(userId, currentSpend, userBudget)
    }
  }
}
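A quick usage sketch, with illustrative token counts and IDs:

const tracker = new CostTracker()

// 1,200 prompt tokens and 300 completion tokens at the GPT-4 rates above
const cost = tracker.calculateCost('gpt-4', 1200, 300) // 0.036 + 0.018 = 0.054 USD

tracker.trackCost('user-123', 'session-456', cost, {
  model: 'gpt-4',
  provider: 'openai',
})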
Quality Monitoring
Monitor response quality to ensure your LLM application maintains high standards over time.
Quality Metrics
// Synchronous gauge for quality scores (requires @opentelemetry/api >= 1.9)
const qualityGauge = meter.createGauge('llm_quality_score')

class QualityMonitor {
  async evaluateResponse(prompt: string, response: string, context?: any) {
    const metrics = await Promise.all([
      this.checkRelevance(prompt, response),
      this.checkCoherence(response),
      this.checkSafety(response),
      this.checkFactuality(response, context),
    ])

    const qualityScore = {
      relevance: metrics[0],
      coherence: metrics[1],
      safety: metrics[2],
      factuality: metrics[3],
      overall: metrics.reduce((a, b) => a + b, 0) / metrics.length,
    }

    // Record quality metrics
    qualityGauge.record(qualityScore.overall, {
      metric_type: 'overall',
      model: context?.model,
    })
    qualityGauge.record(qualityScore.relevance, {
      metric_type: 'relevance',
      model: context?.model,
    })

    // Alert on quality degradation
    if (qualityScore.overall < 0.7) {
      this.alertQualityIssue(qualityScore, prompt, response)
    }

    return qualityScore
  }

  async checkRelevance(prompt: string, response: string): Promise<number> {
    // Use embedding similarity or a classifier model
    const promptEmbedding = await this.getEmbedding(prompt)
    const responseEmbedding = await this.getEmbedding(response)
    return this.cosineSimilarity(promptEmbedding, responseEmbedding)
  }

  async checkSafety(response: string): Promise<number> {
    // Use a content moderation API
    const moderation = await openai.moderations.create({
      input: response,
    })
    return moderation.results[0].flagged ? 0 : 1
  }

  // checkCoherence, checkFactuality, getEmbedding, cosineSimilarity, and
  // alertQualityIssue are application-specific helpers omitted here
}
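A usage sketch; running the evaluation out of band keeps quality scoring off the user-facing request path. The userPrompt and modelResponse names are placeholders for the request being evaluated:

const qualityMonitor = new QualityMonitor()

// Evaluate asynchronously so quality scoring never blocks the user's response
void qualityMonitor
  .evaluateResponse(userPrompt, modelResponse, { model: 'gpt-4' })
  .catch((err) => console.error('Quality evaluation failed', err))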
Alerting and Dashboards
Critical Alerts
- Error rate > 5% for 5 minutes
- Response time p95 > 10 seconds
- Cost increase > 50% hour-over-hour
- Quality score drops below 0.7
- Service availability < 99% (these thresholds are encoded in the sketch below)
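These thresholds can live in version-controlled configuration rather than being scattered across alert definitions. A minimal sketch; the shape and names are assumptions, not a specific alerting tool's schema:

// Illustrative alert thresholds mirroring the list above; adapt to your alerting tool
interface AlertThresholds {
  maxErrorRate: number           // fraction of requests, evaluated over a window
  errorRateWindowMinutes: number
  maxP95LatencySeconds: number
  maxHourlyCostIncrease: number  // fraction, hour-over-hour
  minQualityScore: number
  minAvailability: number        // fraction of successful health checks
}

const thresholds: AlertThresholds = {
  maxErrorRate: 0.05,
  errorRateWindowMinutes: 5,
  maxP95LatencySeconds: 10,
  maxHourlyCostIncrease: 0.5,
  minQualityScore: 0.7,
  minAvailability: 0.99,
}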
Dashboard Components
- Real-time request volume and latency
- Cost breakdown by model and user
- Error rates and types over time
- Quality metrics trends
- Resource utilization and scaling metrics
Best Practices
Do's
- Implement distributed tracing
- Monitor costs in real time
- Set up quality baselines
- Use structured logging
- Automate alert responses
Don'ts
- Don't ignore silent failures
- Don't over-alert teams
- Don't store sensitive data in logs
- Don't monitor everything equally
- Don't forget user experience metrics