TechAni

Navigating Observability in LLMs

Three weeks to instrument an LLM system nobody could debug
Published: Jan 2, 2026
Timeline: 3 weeks
Team: Platform + SRE
Focus: AI Observability, LLM Reliability
A production LLM feature shipped without semantic observability. Responses were unreliable, and root cause analysis took hours. In three weeks we instrumented every hop with trace IDs, retrieval scores, context assembly logging, and differentiated SLOs, so engineers could see why an answer failed and fix it fast.

Impact at a Glance

Requests with end-to-end trace IDs: 100%
Embedding score alert threshold: 0.7
MTTR on LLM incidents: down 50%
Issues caught via semantic tracing before users noticed them: 70%+

Opaque Retrieval & Context

Logs showed "context retrieved" but never which sources, what scores, or why chunks were chosen. Engineers could not explain bad answers.

One-Size-Fits-All SLOs

Instant answers and deep reasoning shared the same latency target, masking real regressions and generating noisy alerts.

Trace IDs Everywhere

Every request propagates a trace ID through gateway, retrieval, context assembly, LLM, and observability sinks.
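One way to propagate a trace ID like this in a Python service is to hold it in a context variable and stamp it onto every log line. The sketch below is illustrative only, not the team's implementation: the `X-Trace-Id` header name, the `req_` prefix for generated IDs, and the function names are all assumptions.

```python
import contextvars
import logging
import uuid

# Holds the trace ID for the current request; each thread or async task
# sees its own value.
_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = _trace_id.get()
        return True


def start_trace(incoming_id=None):
    """Reuse the caller's trace ID if present, otherwise mint one at ingress."""
    trace_id = incoming_id or f"req_{uuid.uuid4().hex[:12]}"
    _trace_id.set(trace_id)
    return trace_id


def outbound_headers():
    """Headers for downstream calls (the header name is hypothetical)."""
    return {"X-Trace-Id": _trace_id.get()}


logger = logging.getLogger("llm_pipeline")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(name)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

start_trace("req_123")
logger.info("retrieval started")  # log line is prefixed with req_123
```

Because the ID lives in a context variable rather than a function argument, retrieval, assembly, and LLM code can log without threading the ID through every call signature.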

Retrieval & Embedding Telemetry

Each retrieval logs the query, data sources, similarity scores, and inclusion decisions. Hits scoring below 0.7 trigger an alert.
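A retrieval log event of this kind might look like the following sketch. The field names, the `RetrievalHit` type, and the `drop_reason` value are hypothetical. It also assumes that hits below the 0.7 threshold are excluded from the context, which the article does not state explicitly.

```python
import json
import logging
from dataclasses import asdict, dataclass

LOW_SCORE_THRESHOLD = 0.7  # alert threshold quoted in the article

logger = logging.getLogger("retrieval")


@dataclass
class RetrievalHit:
    source: str
    score: float


def record_retrieval(trace_id, query, hits, threshold=LOW_SCORE_THRESHOLD):
    """Build and log one structured event describing a retrieval call."""
    decisions = []
    for hit in hits:
        included = hit.score >= threshold  # assumption: low scores are dropped
        decisions.append({
            **asdict(hit),
            "included": included,
            "drop_reason": None if included else "score_below_threshold",
        })
    event = {
        "trace_id": trace_id,
        "query": query,
        "hits": decisions,
        # True when at least one hit fell below the threshold.
        "low_quality": any(not d["included"] for d in decisions),
    }
    logger.info(json.dumps(event))
    return event


# Scores taken from the article's diagrams; source names are placeholders.
event = record_retrieval(
    "req_123",
    "Reset password",
    [
        RetrievalHit("data_source_1", 0.89),
        RetrievalHit("data_source_2", 0.76),
        RetrievalHit("data_source_3", 0.65),
    ],
)
```

An alerting rule can then key off `low_quality` and link straight to the trace through `trace_id`.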

Context-Aware SLOs

Split traffic into "instant" (<1s target) and "thinking" (<30s target) paths, so alert thresholds match what users expect from each.
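As a rough sketch, per-path SLOs can be expressed as a small lookup table plus a classifier. The targets come from the text above, and the alert thresholds (>3s and >60s) are the ones listed under Week 3. The class and function names, and the three-way `ok` / `over_target` / `alert` result, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySlo:
    target_s: float  # latency users should normally see
    alert_s: float   # latency above which an alert fires


SLOS = {
    "instant": LatencySlo(target_s=1.0, alert_s=3.0),
    "thinking": LatencySlo(target_s=30.0, alert_s=60.0),
}


def classify_latency(path, latency_s):
    """Judge a request's latency against the SLO for its own path."""
    slo = SLOS[path]
    if latency_s > slo.alert_s:
        return "alert"
    if latency_s >= slo.target_s:
        return "over_target"
    return "ok"


print(classify_latency("instant", 1.2))   # over_target
print(classify_latency("thinking", 1.2))  # ok
```

The same 1.2s response is a miss on the instant path and comfortably fine on the thinking path, which is the distinction a single shared latency target hides.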

Implementation Highlights

Week 1 — Map the Black Box

Shadowed live traffic and manually followed a request through every component. This confirmed that the system was retrieving mostly irrelevant documents and that nothing showed why.

Week 2 — Instrument the Chain

Added trace propagation at ingress, retrieval, ranking, assembly, and LLM invocation. Logged embeddings, scores, chosen chunks, and drop reasons.

Week 3 — Make It Actionable

Built lightweight dashboards and alerts for slow "instant" paths (>3s), slow "thinking" paths (>60s), low embedding scores (<0.7), and missing context.

Full Request Trace Flow

User Request (trace_id: req_123)
-> API Gateway
-> Context Retrieval
     Data Source 1 (score: 0.89)
     Data Source 2 (score: 0.76)
     Data Source 3 (score: 0.65)
-> Context Assembly (847ms)
-> LLM
Observability: all traces logged here

Every request carries a trace ID. Retrieval calls, scoring, context assembly, and LLM generation all emit trace-aware logs.

Context Quality Tracking

User Query: "Reset password"
-> Generate Embeddings
-> Search Data Sources
     Score: 0.89 ✓ Include
     Score: 0.76 ✓ Include
     Score: 0.65 ⚠ Low Quality
-> Assemble Context
Alert: Low Score
Observability: log quality issue

Similarity scores are logged with every retrieval. Anything below 0.7 triggers an alert and links to the trace.

Response Time SLO Tracking

Instant Response: target <1s, alert >3s
Thinking Response: target <30s, alert >60s
Example timeline: Start t=0ms, Retrieval t=200ms, LLM t=850ms, Done 1.2s
P50: 0.9s, P95: 1.4s, P99: 2.1s
Within SLO ✓ (violations: 0.3%)

Separate SLOs and alerts for "instant" and "thinking" paths keep latency signals meaningful.
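Percentile and violation figures like those in the diagram above can be computed per path from raw request latencies. The sketch below uses the nearest-rank percentile method, which is an assumption: the article does not say how its numbers were calculated, or whether a "violation" is measured against the target or the alert threshold, so the threshold is left as a parameter. Function names are hypothetical.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int 1-100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_report(latencies, threshold):
    """Summarize one path's latencies and the share that exceed the threshold."""
    ordered = sorted(latencies)
    violations = sum(1 for value in ordered if value > threshold)
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "violation_rate": violations / len(ordered),
    }


# Synthetic data for illustration only: 100 requests taking 1..100 ms.
report = latency_report(list(range(1, 101)), threshold=97)
print(report)
```

Running one report per path ("instant" and "thinking") keeps a handful of long reasoning requests from distorting the tail percentiles of the fast path.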

Results

Before

Engineers guessed at failures, combed logs for hours, and could not explain hallucinations or slow paths.

After

Engineers pull up any trace in 30 seconds, review retrieval decisions, scores, and context, and pinpoint the root cause immediately.

Bottom Line

Semantic observability made the team's LLM feature debuggable. Debugging dropped from hours to minutes, MTTR on LLM incidents fell by 50%, and most issues are now caught before users feel them. Shipping AI features without deep tracing is flying blind.