Performance Monitoring Setup

Monitor LLM performance in real time: track latency, errors, costs, and usage patterns with comprehensive observability tooling.

  • Metrics tracked: 50+ performance indicators
  • Alert response: <30s average detection time
  • Data retention: 90 days of historical analysis

OpenTelemetry Integration
import time

from opentelemetry import trace, metrics
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Auto-instrument outgoing HTTP calls made with the requests library
RequestsInstrumentor().instrument()

# Acquire a tracer and a meter for this module
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Create metrics
latency_histogram = meter.create_histogram(
    name="llm_request_duration",
    description="LLM request latency",
    unit="ms"
)

error_counter = meter.create_counter(
    name="llm_errors_total",
    description="Total LLM errors"
)

class MonitoredLLMClient:
    def __init__(self, client, model):
        # client: any object exposing a complete(prompt) method
        self.client = client
        self.model = model

    @tracer.start_as_current_span("llm_request")
    def complete(self, prompt):
        span = trace.get_current_span()
        span.set_attribute("model", self.model)
        # Character count; substitute a tokenizer for true token counts
        span.set_attribute("prompt_chars", len(prompt))

        start_time = time.time()
        try:
            response = self.client.complete(prompt)
            latency = (time.time() - start_time) * 1000  # milliseconds
            latency_histogram.record(latency, {"model": self.model})
            return response
        except Exception as e:
            error_counter.add(1, {"model": self.model, "error_type": type(e).__name__})
            span.record_exception(e)
            raise
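
The snippet above uses only the OpenTelemetry API; until an SDK provider and exporters are configured, the tracer and meter calls are no-ops. A minimal sketch of that configuration, assuming console exporters (swap in OTLP exporters pointed at your collector for production) and that it runs at process start-up:

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader, ConsoleMetricExporter

# Spans: batch and print to stdout (use an OTLP span exporter in production)
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(tracer_provider)

# Metrics: export on a periodic reader (every 60s by default)
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
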
Key Metrics to Monitor

Performance Metrics

  • Request latency (P50, P95, P99; see the percentile sketch after this list)
  • Tokens per second
  • Time to first token
  • Queue depth
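
Percentile latencies are normally computed by the metrics backend from the histogram recorded above, but a quick self-contained check over collected samples is easy to run. The latency values and token counts below are purely illustrative:

import statistics

# Per-request latencies (ms) gathered over some window; illustrative values only
latencies_ms = [120, 135, 150, 180, 210, 240, 320, 410, 650, 1200]

# statistics.quantiles with n=100 yields the 1st..99th percentile cut points
pct = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = pct[49], pct[94], pct[98]
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")

# Tokens per second for one response: completion tokens / wall-clock seconds
completion_tokens, duration_s = 256, 3.2  # assumed example values
print(f"throughput={completion_tokens / duration_s:.1f} tokens/s")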

Business Metrics

  • Cost per request (see the cost sketch after this list)
  • Error rates by model
  • Usage by endpoint
  • Cache hit rates
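
Cost per request can be derived from the token usage most providers return with each response. The sketch below assumes a cost_counter instrument created with the meter from earlier; the model names and per-1K-token prices are placeholders, not real pricing:

# Hypothetical per-1K-token prices; substitute your provider's actual price sheet
PRICE_PER_1K = {
    "model-large": {"prompt": 0.005, "completion": 0.015},
    "model-small": {"prompt": 0.0005, "completion": 0.0015},
}

cost_counter = meter.create_counter(
    name="llm_cost_usd_total",
    description="Accumulated LLM spend in USD",
    unit="USD"
)

def record_request_cost(model, prompt_tokens, completion_tokens):
    # Convert token counts to dollars and attach the model as an attribute
    prices = PRICE_PER_1K[model]
    cost = ((prompt_tokens / 1000) * prices["prompt"]
            + (completion_tokens / 1000) * prices["completion"])
    cost_counter.add(cost, {"model": model})
    return cost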