We have 400 dashboards and still don't know if we're healthy

Something went wrong between zero and everything

Every platform team hits two walls in sequence. The first wall is invisibility: you have no idea what your systems are doing, incidents surprise you, and post-mortems are guesswork. So you instrument everything. You add metrics, wire up logging, set up dashboards, connect alerts. The second wall is noise: you now have so much data that you still can't answer the basic question - are we healthy right now? The dashboards exist. The alerts fire. And yet when leadership asks that question at the wrong moment, the answer is a shrug and a pivot to four different tools.

What went wrong between those two walls is signal design. Most teams treat observability instrumentation as a collection problem: gather the four signal types (metrics, events, logs, and traces), point them at a platform, and figure out what matters later. Later never comes. What you end up with is 400 dashboards, a five-figure monthly bill, and still no clean answer to the health question.

MELT isn't a strategy. It's a vocabulary. Metrics are numeric measurements stored against time. Events are discrete point-in-time markers. Logs are timestamped records, structured or not. Traces are end-to-end request paths stitched across services. Each has a different cost model, a different cardinality risk, and a different set of questions it's suited for. Treating them as interchangeable is the root of most observability debt. What sits above them - SLIs, SLOs, KPIs, the D.U.R.E.S.S. framework - is what turns raw signal collection into a system that can answer the health question. That's what this article is about.

M - Metrics: numeric measurements over time, stored in a TSDB, aggregated at write

E - Events: discrete occurrences with structured context, point-in-time markers

L - Logs: timestamped text records, structured or unstructured, highest raw volume

T - Traces: end-to-end request paths stitched across services using spans

The KPI layer: why raw signals alone aren't enough

Raw MELT signals tell you what your system is doing. KPIs, SLIs, and SLOs tell you whether that's good enough for the people who depend on it. Those are different questions. Conflating them is exactly how you end up with 400 dashboards and no answer.

The hierarchy is straightforward on paper. SLIs are the measurements you derive from MELT signals - the fraction of requests faster than a threshold, the percentage of successful responses, the proportion of pipeline runs that produced correct output. SLOs are your internal targets for those SLIs. SLAs are the external-facing commitments backed by consequences. KPIs bridge observability to business outcomes - MTTR, error budget burn rate, deployment frequency, toil percentage. They don't always come from a single signal type; most meaningful KPIs are derived from a combination of all four.

Layer: SLI
What it is: A ratio or proportion derived from raw signal data. The fraction of healthy events in a measurement window.
Where it comes from in MELT: Primarily metrics and traces. Error rate from metric counters, latency percentile from histogram or trace duration, availability from synthetic probes (events).
Who cares: SRE team and platform engineering. The signal quality of the SLI depends entirely on the quality of the underlying MELT data.

Layer: SLO
What it is: An internal target for an SLI. "99.5% of requests complete in under 500ms over a 30-day rolling window."
Where it comes from in MELT: Same signal base as the SLI. The SLO just adds a threshold and a time window to the calculation.
Who cares: SRE, product, and engineering leadership. Error budget derived from SLO miss rate drives release decisions.

Layer: SLA
What it is: External commitment with financial or contractual consequence if missed.
Where it comes from in MELT: Typically a looser version of your SLO. If your SLO is 99.5%, your SLA might be 99%. The gap is your buffer.
Who cares: Customers, legal, and finance. SLA miss triggers credits, audits, or contract review - not just an on-call page.

Layer: KPI
What it is: Business-level outcome metric. MTTR, error budget burn rate, deployment frequency, alert-to-ticket ratio, toil percentage.
Where it comes from in MELT: Derived across all four signal types. MTTR needs timestamps from logs and events. Toil needs trace duration distributions. Error budget needs SLO miss history from metrics.
Who cares: Engineering leadership and execs. This is how observability talks to the business rather than staying inside the platform team.

The failure mode is starting at the signal layer and never connecting upward. Teams instrument everything, build dashboards per service, and then when a VP asks "how reliable were we last quarter?" the answer is a manual spreadsheet exercise. The KPI and SLO tier needs to be designed alongside the signal tier, not bolted on after the fact.

A useful test: can your team answer "what is our current error budget for the checkout flow and at what burn rate?" in under two minutes without opening Slack or a spreadsheet? If not, you have MELT signals without a KPI layer. The signals exist. The meaning layer doesn't.
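To make the two-minute test concrete, here is a minimal sketch of the arithmetic behind "error budget" and "burn rate". The request counts and the `SloWindow` name are invented for illustration, and the 99.5% target is borrowed from the SLO example above; a real implementation would read these numbers from your metrics backend.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (illustrative structure)."""
    total_requests: int
    good_requests: int   # requests that met the SLI condition
    slo_target: float    # e.g. 0.995 for "99.5% of requests"


def sli(window: SloWindow) -> float:
    """SLI = fraction of good events in the window."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic counts as compliant
    return window.good_requests / window.total_requests


def error_budget_remaining(window: SloWindow) -> float:
    """Fraction of the error budget left: 1.0 untouched, <= 0 exhausted."""
    allowed_bad = (1.0 - window.slo_target) * window.total_requests
    actual_bad = window.total_requests - window.good_requests
    if allowed_bad == 0:
        return 0.0 if actual_bad else 1.0
    return 1.0 - actual_bad / allowed_bad


def burn_rate(recent_bad_fraction: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget lasts exactly one SLO window; 2.0 means it is
    gone in half the window.
    """
    if slo_target >= 1.0:
        raise ValueError("a 100% target leaves no error budget to burn")
    return recent_bad_fraction / (1.0 - slo_target)


checkout = SloWindow(total_requests=1_000_000, good_requests=997_000, slo_target=0.995)
print(f"SLI: {sli(checkout):.4f}")
print(f"Budget remaining: {error_budget_remaining(checkout):.0%}")
print(f"Burn rate at 1% recent errors: {burn_rate(0.01, 0.995):.1f}x")
```

If these three numbers live in a recording rule or a derived metric rather than a spreadsheet, the checkout question becomes a single query.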

D.U.R.E.S.S. - the framework that connects signals to meaning

D.U.R.E.S.S. is a six-dimension framework for organizing observability signals around the things that matter during an incident: Duration, Utilization, Rate, Errors, Saturation, and System health. It isn't a new signal type - it's a lens for deciding which MELT signals to collect, which KPIs to derive from them, and which SLOs to attach those KPIs to.

The value is in the structure it provides. Instead of instrumenting everything and then asking "what matters?", you start with the six dimensions and work backwards to the signals that feed each one. This is how you avoid the custom metric explosion that costs $20k/month for signals nobody queries. Every metric you emit should map to at least one DURESS dimension. If it doesn't, it isn't a monitoring signal - it's a billing line item.

D - Duration

How long does a request, transaction, or job take end-to-end? This is the latency signal. Duration is the most commonly over-instrumented dimension - teams emit latency as custom metrics when they should be emitting traces.

Example signals: p50_latency_ms, p95_latency_ms, p99_latency_ms, job_duration_seconds, db_query_time_ms

U - Utilization

How much of available capacity is being used? CPU, memory, connection pool, thread pool, disk. Utilization signals are mostly host metrics or agent-provided - your application team shouldn't be generating these as custom metrics.

Example signals: cpu_percent, heap_used_bytes, connection_pool_active, goroutine_count, thread_pool_queue_depth

R - Rate

How many operations per unit of time? Requests per second, messages consumed per second, records processed per minute. Rate is your throughput signal. A sudden drop in rate is as alarming as an error spike - something stopped processing.

Example signals: requests_per_second, messages_consumed_rate, events_processed_rate, http_requests_total

E - Errors

How many requests or operations failed? Error rate, error count, 5xx ratio, exception rate. This feeds directly into availability SLIs. Error signals are where logs, metrics, and traces converge - a good error story needs all three.

Example signals: error_rate_percent, http_5xx_count, exception_count_total, failed_jobs_count, slo_error_budget_remaining

S - Saturation

How close to the limit is the system? Queue depth, backpressure, disk fullness, connection pool exhaustion. Saturation predicts failure before it happens - you want to alert on saturation before the system falls over, not after.

Example signals: queue_depth, disk_percent_full, connection_pool_wait_count, pod_eviction_rate, gc_pressure_percent

S - System health

Is the system running the way it's supposed to? Health check status, synthetic availability, deployment state, dependency health. System health signals are often events more than metrics - state changes, not continuous measurements.

Example signals: health_check_status, synthetic_probe_success, dependency_up_down, deployment_success_rate, mttr_minutes

Every signal you collect should trace to one of those six. Every KPI you report should be derivable from the signals under the relevant dimension. Every SLO should attach to a specific SLI that feeds from DURESS signals. This is the chain: MELT signals feed DURESS dimensions, DURESS dimensions produce SLIs, SLIs power SLOs, SLOs drive error budgets, error budgets connect to KPIs that leadership can read without needing a Dynatrace license.
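One way to enforce "every signal traces to one of those six" is a small catalog check. The sketch below is an illustration only: the `CATALOG` contents and the `experiment_banner_clicks` metric are hypothetical, and in practice the catalog would live wherever your team keeps its metric definitions.

```python
from enum import Enum


class Duress(Enum):
    DURATION = "duration"
    UTILIZATION = "utilization"
    RATE = "rate"
    ERRORS = "errors"
    SATURATION = "saturation"
    SYSTEM_HEALTH = "system_health"


# Hypothetical catalog: metric name -> (dimension, SLI or KPI it feeds).
CATALOG = {
    "p95_latency_ms": (Duress.DURATION, "checkout latency SLI"),
    "cpu_percent": (Duress.UTILIZATION, "capacity KPI"),
    "requests_per_second": (Duress.RATE, "throughput KPI"),
    "http_5xx_count": (Duress.ERRORS, "availability SLI"),
    "queue_depth": (Duress.SATURATION, "saturation alert"),
    "health_check_status": (Duress.SYSTEM_HEALTH, "synthetic availability SLI"),
}


def unmapped(emitted: list[str]) -> list[str]:
    """Return emitted metrics that have no DURESS dimension in the catalog."""
    return sorted(name for name in set(emitted) if name not in CATALOG)


emitted_metrics = ["p95_latency_ms", "queue_depth", "experiment_banner_clicks"]
print(unmapped(emitted_metrics))  # ['experiment_banner_clicks']
```

Anything this returns is, in the article's terms, a billing line item until someone assigns it a dimension and a consumer.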

Metrics: the oldest signal with the most expensive failure mode

Metrics are the backbone of observability for most teams - dashboards, SLOs, alerts all run on them. They're fast to query, cheap to render if well-designed, and intuitive enough that anyone in the org can read a time-series graph. The problem is custom metrics, and the trajectory every team hits with them is predictable to the point of being embarrassing.

No metrics: "We shipped this feature three sprints ago and have no idea if it's slow or broken in production."

→

Not enough metrics: "We have CPU and memory. We need latency, error rate, queue depth, throughput, and cache hit rate too."

→

Who added these? "$40k/month observability bill. Nobody knows which metrics map to which dashboard. Half aren't in any alert."

Every team lands here eventually. The only variable is how fast and how expensive the crash landing is.

Figure 1: The lifecycle of custom metrics in every growing platform team. Getting from zero to "remove these" takes 12-18 months on average and costs real money.

Host metrics are largely fine. CPU, memory, disk, network - these come from the agent or the platform (kube-state-metrics, Datadog agent, Dynatrace OneAgent). Your application team doesn't generate them and they're low-cardinality by design. The dangerous part is custom metrics: the instrumentation your engineers emit from application code to measure business activity, feature behavior, and performance.

Host metrics - relatively safe

System-level signals provided by agents or platform components. CPU utilization, load average, heap usage, swap, disk iops, bandwidth. These map directly to the Utilization and Saturation dimensions of DURESS. Your team doesn't own the cardinality here - the agent does, and it's predictable.

Custom metrics - handle with care

Emitted from within application code. These map to the Rate, Error, and Duration dimensions of DURESS when designed well. When designed poorly, they're the reason your observability cost grew 3x in a quarter while your fleet size grew 20%.

Derived metrics - often overlooked

Computed from existing signals at query or recording-rule time. Error budget burn rate, SLO compliance percentage, p95/p99 latency from histogram buckets. These are powerful because they're computed from existing data rather than generating new time series, but they need to be set up intentionally.

Business metrics - the connection layer

Order count, transaction volume, active users, feature adoption rate. These are the metrics that bridge MELT to KPIs. They're also the most tempting to over-tag - the instinct to add customer ID or session ID as a tag is how you turn a single business metric into a million time series.

Cardinality: the cost multiplier hiding in the vendor docs

Metrics live in a Time Series Database (TSDB). Every unique combination of metric name plus tag values is a separate time series stored independently. Not a separate row - a separate series. This is the mechanism that makes TSDB queries fast and also the mechanism that makes custom metrics explosively expensive when you add the wrong tags.

Figure 2: Cardinality explosion from a single request.latency metric at scale. Switch from Count to Histogram and multiply by 5. That's one metric, two decisions, six figures of annual cost.
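The multiplication behind a cardinality explosion is simple enough to sketch. The tag names and their value counts below are invented for illustration; the multiplier of 5 for Histogram follows the figure caption.

```python
from math import prod

HISTOGRAM_SERIES = 5  # series emitted per tag combination for a Histogram


def series_count(tag_cardinalities: dict[str, int], histogram: bool = False) -> int:
    """Worst-case series for one metric: product of distinct values per tag."""
    base = prod(tag_cardinalities.values())
    return base * (HISTOGRAM_SERIES if histogram else 1)


# Illustrative numbers, not measurements from a real fleet.
safe_tags = {"service": 20, "environment": 3, "region": 4, "status_class": 3}
risky_tags = {**safe_tags, "customer_id": 10_000}

print(series_count(safe_tags))                   # 720
print(series_count(safe_tags, histogram=True))   # 3600
print(series_count(risky_tags, histogram=True))  # 36000000
```

This is an upper bound, since not every tag combination occurs in practice, but it shows why one unbounded tag matters more than any number of bounded ones.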

The billing model in Datadog (and similar platforms) compounds this. In Datadog, 100-200 custom metrics are included per host in the base plan. Beyond that, the charge is $1-$5 per 100 custom metrics depending on your plan tier. At 2,000 hosts with a mature application, you can easily be 2 million custom metrics over the included allotment if nobody has been disciplined about cardinality. At $1 per 100, that's $20k/month on the low end - and that's before you factor in retention tiers and percentile aggregations.
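As a rough sketch of that billing arithmetic, using only the rates quoted above (the function name and the 2.2 million total are illustrative, and real contracts vary):

```python
def monthly_overage_usd(
    total_custom_metrics: int,
    hosts: int,
    included_per_host: int = 100,
    usd_per_100: float = 1.0,
) -> float:
    """Monthly charge for custom metrics beyond the per-host allotment."""
    billable = max(0, total_custom_metrics - hosts * included_per_host)
    return billable / 100 * usd_per_100


# 2,000 hosts include 200,000 custom metrics at 100 per host.
print(monthly_overage_usd(2_200_000, hosts=2_000))                   # 20000.0
print(monthly_overage_usd(2_200_000, hosts=2_000, usd_per_100=5.0))  # 100000.0
```

The same series count costs five times as much at the top of the quoted price range, which is why the plan tier matters as much as the cardinality.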

There's also a write-time aggregation trap that's easy to miss. On platforms that aggregate at the agent, as Datadog does for its Histogram type, metrics are aggregated when written, not when queried. If you want p95 latency, that percentile needs to be declared at instrumentation time. In Datadog, adding a new percentile aggregation to an existing metric later counts as a new custom metric. You're not just paying for the data - you're paying for every analytical question you decided to ask when you shipped the instrumentation.

C1

High-cardinality tags

Request IDs, user IDs, session tokens, UUIDs, epoch timestamps. Each unique combination of tag values is a separate time series. Tag a metric with a request ID and you get one series per request - unbounded growth, and nothing useful to group by in any dashboard.

Use baggage or log context for request-level correlation. Keep metric tags to stable, low-cardinality dimensions: service, environment, region, cluster, endpoint, status class (2xx/4xx/5xx).

C2

Histogram by default

In Datadog, a Histogram emits five separate series per cardinality combination by default: max, median, avg, p95, and count. Most teams reach for it because they want percentiles, without checking whether a Count metric with a recording rule would answer the same question at a fifth of the cost.

Only use Histogram when you need the full distribution. For alerting on latency spikes, a Count of requests that exceed a latency threshold often works fine. For SLO calculations, derive percentiles from existing histograms rather than adding new ones.

C3

Metrics orphaned from dashboards and alerts

Engineering ships instrumentation for a feature experiment. The experiment ends. The metric lives on in the billing cycle forever because nobody owns the cleanup. If a metric isn't referenced in a dashboard, alert, SLO, or recording rule it has no owner and no value.

Run a quarterly audit: cross-reference every emitted series against your dashboards, alerts, and SLOs. Anything with no match is a candidate for removal. Automate it if you can - it should produce a PR, not a meeting.
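A minimal sketch of that cross-reference, assuming you can export the emitted metric names and the raw query strings from your dashboards, alerts, and SLOs. The metric names and queries here are made up, and the token matching is deliberately naive.

```python
import re

# Naive tokenizer: identifiers made of letters, digits, underscores, and dots.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


def referenced_metrics(queries: list[str], known: set[str]) -> set[str]:
    """Known metric names that appear as whole tokens in any query string."""
    found: set[str] = set()
    for text in queries:
        found |= set(TOKEN.findall(text)) & known
    return found


def orphans(emitted: set[str], queries: list[str]) -> list[str]:
    """Emitted metrics that no dashboard, alert, or SLO query mentions."""
    return sorted(emitted - referenced_metrics(queries, emitted))


emitted = {"http_5xx_count", "queue_depth", "exp_banner_clicks"}
queries = [
    "sum(rate(http_5xx_count[5m])) / sum(rate(http_requests_total[5m]))",
    "avg:queue_depth{service:checkout}",
]
print(orphans(emitted, queries))  # ['exp_banner_clicks']
```

The output is a list of removal candidates, not a deletion list: a metric can be unreferenced today and still be the one you need in the next incident, so the PR should go to an owner for review.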

C4

Using metrics to measure latency across service boundaries

Five latency metrics for a five-leg request path that you stitch together in a dashboard aren't observability - they're manual bookkeeping. This is what traces were built for.

If you're measuring how long something takes and it crosses more than one service or function boundary, emit a trace. Retire the latency metrics. The cardinality cost alone justifies the migration.

Volume: the other cost dial teams ignore until it's too late

Cardinality is the width of your signal problem. Volume is the depth. Even a well-designed, low-cardinality metric set can balloon in cost if you're scraping too frequently, retaining at full resolution for too long, or indexing log fields you never filter on. Volume and cardinality are independent dials and you need to be turning both of them intentionally.

Scrape and poll intervals

Most teams inherit a 10- or 15-second scrape interval because that's what a sample config or an agent default used years ago (Prometheus itself falls back to 1 minute when scrape_interval isn't set). Not every service needs 10-second resolution. A background job that runs every 5 minutes doesn't need 10-second metric fidelity. Match your poll interval to the rate of change of what you're measuring - critical services at 10-15s, non-critical infrastructure at 60s, batch pipelines at the batch interval.

Retention tiers

90% of post-incident analysis happens in the first 48 hours. Full-resolution metric retention for 90 days is paying for precision that nobody queries. Downsample to 1-minute aggregates after 7 days, 5-minute aggregates after 30 days. Most capacity planning and trend analysis works fine on downsampled data. Match retention tiers to actual query behavior, not to "just in case."
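The tiering rule above can be sketched in a few lines. This is an illustration rather than how any particular TSDB implements rollups; the function names are invented, and plain averaging is assumed as the aggregate.

```python
from collections import defaultdict
from statistics import mean
from typing import Optional


def resolution_for_age(age_days: float) -> Optional[int]:
    """Bucket width in seconds for data of a given age; None = keep raw."""
    if age_days > 30:
        return 300  # 5-minute aggregates after 30 days
    if age_days > 7:
        return 60   # 1-minute aggregates after 7 days
    return None


def downsample(points: list[tuple[int, float]], bucket_seconds: int) -> list[tuple[int, float]]:
    """Average (unix_ts, value) points into fixed-width time buckets."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


raw = [(0, 10.0), (10, 20.0), (20, 30.0), (60, 40.0), (70, 60.0)]
print(downsample(raw, 60))  # [(0, 20.0), (60, 50.0)]
```

Averaging flattens spikes, so for latency-style signals it is worth keeping a max alongside the mean in each bucket.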

Log indexing vs. archival

Indexed logs are fast and expensive. Archived logs are slow and cheap. You don't need every log field indexed in hot storage - you need the fields you filter and group by in active investigations. Route debug-level and high-volume service mesh logs to cold storage at ingest. Index selectively and index by query pattern, not by schema completeness.

Trace sampling strategy

Head-based sampling at 10-20% captures baseline behavior for dashboards and throughput analysis. Tail-based sampling on error conditions and high-latency outliers captures the incidents you actually need to investigate. You don't need 100% trace volume to have excellent trace coverage - you need the right traces, which requires a deliberate sampling strategy rather than a default rate.
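To show how the two sampling decisions differ, here is a simplified sketch. The span dictionary shape (`start_ms`, `end_ms`, `error`) and the 500 ms threshold are assumptions for illustration. In a real pipeline the head decision is made when the trace starts and the tail decision in a collector after the trace completes; they are combined in one function here only to make the logic visible.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head-based decision: same trace ID, same answer."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def tail_keep(spans: list[dict], latency_threshold_ms: float) -> bool:
    """Tail-based decision on a complete, non-empty trace."""
    has_error = any(span.get("error", False) for span in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or total_ms > latency_threshold_ms


def keep_trace(
    trace_id: str,
    spans: list[dict],
    head_rate: float = 0.1,
    latency_threshold_ms: float = 500.0,
) -> bool:
    """Keep a baseline sample plus every errored or slow trace."""
    return head_sample(trace_id, head_rate) or tail_keep(spans, latency_threshold_ms)


slow = [{"start_ms": 0, "end_ms": 900, "error": False}]
print(keep_trace("trace-123", slow, head_rate=0.0))  # True: kept by the tail rule
```

Hashing the trace ID rather than calling a random number generator means every service that sees the same trace makes the same head decision.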

Traces: built for the question metrics keep getting asked

Anything you're measuring as a unit of time - response time, processing latency, leg duration across services - should be a trace, not a metric. Two objections come up every time.

The sampling objection: "What if we miss slow requests?" With distributed tracing, the goal is to detect whether transactions were slow during a period, not to capture every single instance. Even at 50% head-based sampling you'll see latency degradation in the sampled population. Add tail-based sampling keyed to latency anomalies and you capture the outliers specifically. The trace model is built for this question. A histogram metric is not.

The cardinality objection applies to traces too, just differently. Span attributes carry the same risk as metric tags. Adding user IDs, request tokens, or dynamic values as span attributes inflates trace storage and index cost in exactly the same way. The discipline is identical: stable, low-cardinality attributes on spans; high-cardinality context in logs attached to the trace context rather than in the span attributes themselves.

Figure 3: One trace gives you end-to-end context and the error root cause in a single span. Five separate latency metrics give you disconnected numbers, no relationship between them, and no answer to why payment_svc failed.

Logs: the most verbose signal, and the one with the most operational debt

Logs are the original observability signal and the one carrying the most accumulated bad practice. Every application has always logged. The problem is that what logging meant in 2012 - stdout, file tails, grep in production - is nothing like structured, indexed logs at cloud scale in 2025. Log management at scale is a data management problem that most teams treat as a search problem.

Log volume compounds quietly. Services log at INFO by default. Someone cranks DEBUG during an incident and forgets to revert. A chatty service mesh sidecar emits access logs for every internal health check. A pipeline processes a million records and logs a line per record. The ingestion pipeline accepts everything at the same per-GB rate regardless of signal value.

The indexing trap

Indexed log fields define your query surface and your cost surface simultaneously. Every field you index is available for fast search and costs hot-tier storage and compute. Most teams index everything because the platform supports it and because the operational pain of under-indexing a field during an incident is vivid while the cost of over-indexing is invisible until billing. Index the fields you filter and aggregate by. Everything else is archival text - store it cheaply and pull it when you need it.

The level discipline problem

DEBUG logs should never be on at steady state in production. They exist for development and targeted incident investigation. The operational discipline required is real: log level changes need to be treated as configuration changes with a rollback plan, not as ad-hoc toggles. A service emitting DEBUG at full throughput in production can generate 10x the log volume of the same service at INFO, at the same per-GB ingestion rate.

Tiered log routing at ingest

The right time to decide where a log goes is at ingest, not after it's already burned hot-tier compute for a week. OTel Collector processors can classify and route log lines by level, source, or pattern at pipeline time. High-volume low-value logs (health check access logs, periodic status lines, audit trails you'll never search in real-time) route straight to cold/object storage. Error-bearing logs and structured application events route to hot indexes. This is a one-time pipeline configuration that pays back continuously.
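The routing decision itself is small. The sketch below expresses it as a plain Python function rather than as OTel Collector configuration; the record shape, the health check paths, and the `service-mesh-sidecar` source name are assumptions for illustration.

```python
import re

HOT, COLD = "hot_index", "cold_storage"
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}

# Hypothetical patterns and sources treated as high-volume, low-value.
HEALTH_CHECK = re.compile(r"GET /(healthz|readyz|livez)\b")
NOISY_SOURCES = {"service-mesh-sidecar"}


def route(record: dict) -> str:
    """Pick a storage tier for one log record at ingest time."""
    level = LEVELS.get(str(record.get("level", "INFO")).upper(), LEVELS["INFO"])
    if level >= LEVELS["WARN"]:
        return HOT   # error-bearing lines stay searchable
    if level <= LEVELS["DEBUG"]:
        return COLD  # debug output never reaches the hot tier
    if HEALTH_CHECK.search(record.get("message", "")):
        return COLD
    if record.get("source") in NOISY_SOURCES:
        return COLD
    return HOT


print(route({"level": "INFO", "message": "GET /healthz 200"}))  # cold_storage
print(route({"level": "ERROR", "message": "payment failed"}))   # hot_index
```

The order of the checks matters: an ERROR line from a noisy source still goes to the hot index, because severity is evaluated first.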

Events: the most underrated signal for incident correlation

Events are discrete, point-in-time markers. Not a measurement over time like a metric, not a text record like a log, not a path like a trace. A deployment. A feature flag flip. A config change. A scaling event. An autoscaler action. These are the signals that explain why your other signals changed, and most teams either log them as text (losing the structured context) or skip them entirely and then spend 40 minutes during incidents asking "did anything change recently?"

Events map directly to the System health dimension of DURESS. They're the correlation layer that connects anomaly detection to causation. An error rate spike at 14:23 is noise until you overlay the deployment event at 14:21. A latency increase over three days is a mystery until you see the feature flag that gradually rolled out over the same window. Events are what turn pattern-matching into root cause analysis.
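The overlay described above amounts to a time-window lookup. A minimal sketch, assuming events are available as dictionaries with a `timestamp` field (the 15-minute lookback and the sample events are illustrative):

```python
from datetime import datetime, timedelta


def changes_before(
    anomaly_at: datetime,
    events: list[dict],
    lookback: timedelta = timedelta(minutes=15),
) -> list[dict]:
    """Change events inside the lookback window, most recent first."""
    window_start = anomaly_at - lookback
    hits = [e for e in events if window_start <= e["timestamp"] <= anomaly_at]
    return sorted(hits, key=lambda e: e["timestamp"], reverse=True)


events = [
    {"event_type": "feature_flag", "service": "search", "timestamp": datetime(2025, 1, 1, 9, 0)},
    {"event_type": "deployment", "service": "payment-service", "timestamp": datetime(2025, 1, 1, 14, 21)},
]
spike = datetime(2025, 1, 1, 14, 23)
for event in changes_before(spike, events):
    print(event["event_type"], event["service"])  # deployment payment-service
```

A slow drift like the three-day flag rollout needs a much longer lookback, so the window should be chosen per anomaly type rather than fixed.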

What belongs as an event

Deployments (service, version, deploy tool), feature flag state changes, infrastructure scaling events (pod count changes, autoscaler triggers), config changes pushed to running services, certificate renewals, dependency health state transitions (up/degraded/down), scheduled job execution markers. Anything that represents a discrete state change in your system.

What people do instead

A log line that says "deployed version 2.4.1" and hope someone searches for it during the next incident. A Slack message. A manual Datadog annotation that decays and disappears. None of these integrate with anomaly detection, change correlation, or SLO breach analysis. They're human-readable breadcrumbs, not machine-queryable signals.

How to emit them right

Structured event emission from your CI/CD pipeline, feature flag system, and infrastructure automation - not a log line, a typed event with a consistent schema: event_type, service, version, environment, initiator, timestamp, correlation_id. This is the format that lets your observability platform correlate events to metric anomalies automatically rather than requiring a human to overlay them manually.
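A typed event with the schema listed above might look like the following sketch. The field names come from the text; the `ChangeEvent` class, the default values, and the sample payload are illustrative, and how the JSON is shipped to your platform is left out.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeEvent:
    """One discrete state change, emitted by CI/CD or infra automation."""
    event_type: str   # e.g. "deployment", "feature_flag", "config_change"
    service: str
    version: str
    environment: str
    initiator: str    # the person or pipeline that triggered the change
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


event = ChangeEvent(
    event_type="deployment",
    service="payment-service",
    version="2.4.1",
    environment="production",
    initiator="ci-pipeline",
)
print(event.to_json())
```

Keeping the schema in one shared type means the CI/CD pipeline, the flag system, and the autoscaler hooks cannot drift into three slightly different event formats.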

The full MELT maturity picture

Every signal type has a starting point, a failure mode, and an evolved state. Most teams are somewhere in the middle for most signal types - instrumented but not governed, collected but not connected to KPIs. The table below maps each signal to its DURESS dimension, common failure mode, the KPIs it should power, and what good evolution looks like.

Signal: Metrics
DURESS dimension + KPIs: Rate (throughput, req/s), Errors (error_rate, 5xx_ratio), Utilization (cpu, memory, pool depth). KPIs: SLO compliance %, error budget burn rate, cost per signal.
Common failure mode: High-cardinality tags, Histogram overuse, orphaned series from feature experiments, latency measured as metrics instead of traces. Symptom: cost grows 3x while fleet grows 20%.
Evolved state: Tag surface governed per service. Metric type matches the question. Every metric references at least one dashboard, alert, or SLO. Quarterly audit removes orphans. Cardinality budget enforced at PR review.

Signal: Logs
DURESS dimension + KPIs: Errors (exception detail, stack traces), System health (audit trails, state transitions). KPIs: MTTR (time from log to detection), mean time to diagnosis, log search latency at p95.
Common failure mode: DEBUG at steady state in production, all fields indexed at the same tier, retention set to max regardless of query pattern. Symptom: log bill grows faster than any other signal type, query performance degrades.
Evolved state: Level discipline enforced (INFO prod default, DEBUG by explicit toggle with TTL). Tiered indexing by query pattern. Cold routing for high-volume low-value sources at ingest. Retention matched to query frequency.

Signal: Traces
DURESS dimension + KPIs: Duration (p50/p95/p99 latency, span duration, db query time). KPIs: service latency SLO attainment, cross-service dependency latency, trace coverage % of critical paths.
Common failure mode: Zero coverage on critical paths, latency still measured as metrics, high-cardinality span attributes, head-only sampling with no tail strategy for anomalies. Symptom: can't explain why p99 spiked without guessing.
Evolved state: Critical request paths fully instrumented. Latency metrics deprecated in favor of trace-derived SLIs. Tail-based sampling active for errors and high-latency outliers. Span attributes low-cardinality by policy.

Signal: Events
DURESS dimension + KPIs: System health (deployment state, change events, dependency transitions). KPIs: change-to-incident correlation rate, MTTR reduction from faster root cause identification, deployment frequency.
Common failure mode: Changes logged as text lines or posted to Slack rather than emitted as structured events. Symptom: 40 minutes of incident time spent answering "did anything change recently?" manually.
Evolved state: CI/CD, feature flag systems, and infra automation emit structured events with consistent schema. Events appear in observability platform alongside metric and trace data for automatic change correlation.

Where AI is actually useful in the MELT picture

There's a real role for AI in observability and it isn't the anomaly detection vendors have been promising since 2018 with inconsistent results. The platforms that have sold "AI-powered alerts" for years are still serving the same cardinality and noise problems as everyone else. The teams getting real value from AI in observability are using it for governance, not just analysis - reducing the signal surface rather than adding another interpretation layer on top of a bloated one.

Pre-commit cardinality governance

Static analysis on OTel instrumentation PRs can flag high-cardinality tag additions before they land in production. An LLM with context on your existing metric catalog can identify when a proposed metric is redundant, when a proposed tag would multiply series count beyond a service budget, or when a team is about to emit a Histogram for a use case that a Count would serve. This is a code review step, not a runtime check - and it's orders of magnitude cheaper to catch at commit time than at invoice time.
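Before any LLM is involved, the deterministic part of that review step can be a plain lint. The sketch below assumes the PR tooling can hand over a proposed metric's name, type, and an estimate of distinct values per tag; the tag-name pattern, the budget, and the sample metric are all hypothetical.

```python
import re
from math import prod

# Tag names that usually signal unbounded values (a heuristic, not a standard).
SUSPECT_TAG = re.compile(r"(^|_)(id|uuid|token|session|timestamp)$")


def review_metric(
    name: str,
    tags: dict[str, int],  # tag name -> estimated distinct values
    metric_type: str,      # "count", "gauge", or "histogram"
    series_budget: int,
) -> list[str]:
    """Return review findings for one proposed metric; empty means clean."""
    findings = []
    for tag in tags:
        if SUSPECT_TAG.search(tag):
            findings.append(f"{name}: tag '{tag}' looks high-cardinality")
    multiplier = 5 if metric_type == "histogram" else 1
    estimated = prod(tags.values()) * multiplier
    if estimated > series_budget:
        findings.append(
            f"{name}: ~{estimated} series exceeds budget of {series_budget}"
        )
    return findings


proposed = {"service": 20, "user_id": 50_000}
for finding in review_metric("checkout.latency", proposed, "histogram", 10_000):
    print(finding)
```

A check like this can fail the build outright, which leaves the LLM for the judgment calls: redundancy with existing metrics, or whether a Count would serve the use case.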

Metric catalog hygiene automation

LLMs are good at cross-referencing: given your current metric inventory and dashboard/alert references, they can identify orphaned series with high confidence. Running this as a scheduled job - rather than a quarterly manual audit that nobody schedules - turns metric housekeeping from a painful meeting into a pull request. The signal you don't emit is always cheaper than the signal you do emit and ignore.

OTel pipeline signal routing

AI-assisted OTel Collector configurations can classify log lines by signal value at ingest time - routing chatty service mesh noise to cold storage while flagging error-bearing lines to hot indexes. This is where an LLM-backed classifier in the processing pipeline pays back real cost reduction. The key is positioning it at ingest rather than as a post-hoc query layer on already-expensive stored data.

Natural language over MELT data

The most immediate yield is query assistance - translating plain-English questions into PromQL, DQL, SPL, or NRQL, and explaining what an existing query does. This doesn't change the cost model but it reduces the activation energy for engineers who could be trimming cardinality if they weren't blocked on query syntax. Lowering the bar for who can act on observability data is a real forcing function for better hygiene across the whole team.

Agentic SRE for DURESS-based triage

The DURESS framework maps cleanly to an agentic investigation flow: an AI SRE checks Duration first (is latency elevated?), then Rate (is throughput down?), then Errors (are failures up?), then Utilization and Saturation (is the system resource-constrained?), then System health (did anything change?). This is a reproducible triage protocol that can be executed by an agent in seconds during incidents, generating a structured hypothesis before a human even opens a terminal.
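The triage order described above can be written down as a fixed protocol that an agent, or a human, walks through. In this sketch each check is a stand-in callable returning whether the dimension looks abnormal plus a short detail string; real checks would query your observability platform, and the sample values are invented.

```python
from typing import Callable

# Order taken from the text: Duration, Rate, Errors, then resources, then change.
TRIAGE_ORDER = [
    "duration", "rate", "errors", "utilization", "saturation", "system_health",
]

Check = Callable[[], tuple[bool, str]]  # returns (abnormal?, detail)


def triage(checks: dict[str, Check]) -> list[str]:
    """Run the available checks in DURESS triage order; report abnormal ones."""
    hypothesis = []
    for dimension in TRIAGE_ORDER:
        check = checks.get(dimension)
        if check is None:
            continue  # no signal wired up for this dimension yet
        abnormal, detail = check()
        if abnormal:
            hypothesis.append(f"{dimension}: {detail}")
    return hypothesis


# Stand-in checks with hard-coded results, for illustration only.
checks = {
    "system_health": lambda: (True, "payment-service v2.4.1 deployed at 14:21"),
    "errors": lambda: (True, "5xx ratio elevated since 14:23"),
    "duration": lambda: (False, "p95 latency within baseline"),
    "rate": lambda: (False, "throughput steady"),
}
for line in triage(checks):
    print(line)
```

Because the order is fixed in `TRIAGE_ORDER` rather than in the order checks were registered, two runs against the same data always produce the same hypothesis.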

AI in observability is most useful as a governance accelerant and a query interface, not as a magic anomaly detector. The teams getting value from this use AI to shrink the signal surface and speed up triage - not to replace the discipline of designing good signals in the first place. A well-governed MELT platform with 500 clean, well-tagged, SLO-connected metrics is more useful than a bloated 50,000-series platform with an AI layer on top trying to find the signal in the noise.

What it actually takes

MELT tells you what signal types exist. DURESS tells you which ones to collect and why. SLIs, SLOs, and KPIs connect those signals to something the business can be held to. None of these work by themselves. They require discipline at instrumentation, governance at collection, and intentional design at the KPI layer.

The teams that get observability right aren't the ones with the most signals or the most dashboards. They're the ones who can answer "are we healthy right now, how do we know, and what changed if we aren't?" in under five minutes - with a direct line from that answer to a named SLO, an error budget, and a KPI that someone owns.

01

Use traces for latency, not metrics

If you're measuring how long something takes across service boundaries, emit a trace. Retire the latency metrics. The cardinality cost, the stitching work, and the loss of span context all favor the trace model for this question.

02

Map every metric to a DURESS dimension before shipping

If you can't answer which of the six dimensions this metric feeds - and which SLI or KPI it powers - it isn't a monitoring signal yet. It's just a billing line item with good intentions.

03

Govern tag cardinality at PR time, not invoice time

High-cardinality tags, UUID values, and Histogram defaults should be caught in code review before they land in production. This is a policy question with a concrete, automatable enforcement point. Use it.

04

Tier logs by query value at ingest

Route high-volume low-value logs to cold storage at the OTel Collector layer. Don't pay hot-tier indexing rates for health check access logs and periodic status lines that nobody searches outside of a major incident once a year.

05

Emit change events as structured signals, not log lines

Deployments, flag flips, config changes. Structured events with consistent schema. This is what turns anomaly detection from "something changed around 2pm" to "deployment of payment-service v2.4.1 at 14:21 correlates with this error rate increase."

06

Build the KPI and SLO layer intentionally, not retroactively

MELT signals without SLOs and KPIs above them are raw material without a product. Design the SLI, the SLO target, and the KPI calculation alongside the instrumentation - not six months later when someone asks "how reliable were we last quarter?" and the answer is a spreadsheet.

07

Run a metric audit on a cadence and treat removal as routine

Any metric not referenced in a dashboard, alert, or SLO is a candidate for removal. The cleanest observability platforms are the ones where removal is as routine as addition - and where the audit is automated enough that it produces a PR, not a meeting.

08

Use AI to govern the signal surface, not just to analyze it

Pre-commit cardinality checks, LLM-backed metric catalog reviews, automated cold-routing at the collector layer, DURESS-structured agentic triage. These are where AI pays back in observability. Not by adding a chat interface to a bloated signal set - by making it smaller, cleaner, and more connected to outcomes.

The goal isn't comprehensive coverage of every signal type at maximum fidelity. The goal is the right signal, at the right resolution, connected to the right SLI, powering the right KPI - and knowing exactly what you're paying for each signal. MELT gives you the vocabulary. DURESS gives you the organizing principle. SLOs give you the accountability structure. How you govern all three determines whether observability is a competitive advantage or a cost center that surprises everyone at budget review.
