TechAni

Observe for Observability.

Insights · Observability · February 27, 2026

Most observability platforms give you logs, metrics, and traces in separate tabs and call it a single pane of glass. Observe actually connects them. That distinction matters a lot more than it sounds when you are staring at a production incident at 2am.

I have spent enough time across Dynatrace, Splunk, New Relic, and Datadog to have strong opinions about what works and what just looks good in a vendor demo. Observe falls into a genuinely different architectural category. It is not better at the same things those tools do. It does something structurally different, and that changes what you can actually do during an investigation.

This is not a product review. It is a practitioner breakdown of how Observe handles logs, APM, and traces, what the knowledge graph model actually means in practice, how it compares to New Relic across the dimensions that matter, and where the rough edges are.

How Observe handles the three signal types

The baseline for any observability platform is logs, metrics, and traces. Where most platforms treat these as three separate products that happen to share a UI, Observe treats them as three views into the same underlying entity model. That architectural choice is what makes cross-signal investigation fast.

Logs

Log ingestion in Observe lands in a streaming data lake backed by Apache Iceberg on cloud object storage. This is not a proprietary index. Your data is stored in open formats, which means your logs are not held hostage if you ever want to move. Compression runs at roughly 10x, so the storage cost on high-volume log environments comes out significantly lower than platforms charging per GB ingested.

What makes the log experience different from New Relic or Splunk is the transform pipeline. You enrich and shape logs at ingest using OPAL (Observe's pipeline language) before they land in storage. By the time a log line is queryable, it already has the service context, deployment version, and infrastructure metadata attached. You are not joining at query time. You are joining at ingest, which is meaningfully faster for interactive investigation.
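To make the ingest-versus-query distinction concrete, here is a minimal Python sketch of the idea. It is not OPAL and not Observe's implementation; the lookup table, field names, and both functions are hypothetical stand-ins. The point is only that once context is attached on the way in, the query side is a plain filter.

```python
# Hypothetical lookup; a real pipeline would derive this from deployment
# and infrastructure metadata rather than a hard-coded dict.
SERVICE_CONTEXT = {
    "payment-service": {"deploy": "v2.14.1", "env": "prod", "region": "us-east-1"},
}


def enrich_at_ingest(record: dict, context: dict = SERVICE_CONTEXT) -> dict:
    """Return a copy of the log record with service context attached.

    Unknown services pass through unchanged so nothing is dropped at ingest.
    """
    extra = context.get(record.get("service"), {})
    # Fields already present on the record win over the lookup values.
    return {**extra, **record}


def errors_for_deploy(stored: list, deploy: str) -> list:
    """Query-time work is a simple filter because the join already happened."""
    return [r for r in stored if r.get("level") == "ERROR" and r.get("deploy") == deploy]


raw = [
    {"service": "payment-service", "level": "ERROR", "msg": "charge_request failed"},
    {"service": "payment-service", "level": "INFO", "msg": "charge_request success"},
    {"service": "unknown-svc", "level": "ERROR", "msg": "boom"},
]
stored = [enrich_at_ingest(r) for r in raw]
hits = errors_for_deploy(stored, "v2.14.1")
print(hits)
```

The trade-off is the one described above: you pay a little extra storage for denormalized records and get queries that never have to look anything up.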

On cardinality

One thing that changes your instrumentation behavior fast: Observe does not penalize high-cardinality attributes the way New Relic and Datadog do. You can index on user ID, session token, transaction ID without the bill exploding. Once your engineering team realizes this, they stop asking "does this attribute cost money" before adding it to their spans. That is a meaningful cultural shift.
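The reason cardinality hurts on some platforms is simple arithmetic. As a rough sketch, assuming a simplified model where cost scales with the number of distinct attribute combinations (an assumption for illustration, not any vendor's actual billing formula), the worst case is the product of each attribute's cardinality. The attribute names and counts below are made up.

```python
from math import prod


def worst_case_series(cardinalities: dict) -> int:
    """Upper bound on distinct attribute combinations (product of cardinalities)."""
    return prod(cardinalities.values())


base = {"service": 40, "endpoint": 25, "status_code": 5}
with_user = {**base, "user_id": 50_000}

print(worst_case_series(base))       # 5000
print(worst_case_series(with_user))  # 250000000
```

One extra high-cardinality attribute takes the upper bound from thousands to hundreds of millions, which is why teams on combination-priced platforms learn to ask before they tag.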

APM and metrics

Application performance monitoring in Observe is built on OTel-native instrumentation. You bring your OpenTelemetry SDK instrumented services and the spans, metrics, and resource attributes flow in without translation layers. There is no proprietary agent to deploy alongside your application code.

The metrics model follows the same entity graph approach as logs. A service entity in Observe has its error rate, latency percentiles, request throughput, and resource utilization all modeled as attributes of that entity rather than as separate metric series you manually join. When you are looking at a service in the APM view, the logs from that service are right there. Not a separate query. Not a link to another tab. The same entity.

Distributed traces

This is where Observe separates itself most clearly from the field. Full-fidelity traces are stored for 13 months by default, in Iceberg tables. Compared to New Relic's 8-day full-fidelity window and Datadog's 15-day default, this is a fundamentally different capability. Recurring incident patterns that surface every three to four weeks become observable at the trace level. Seasonal regressions, slow behavioral drift, post-deployment performance degradation that takes weeks to show up clearly - all of this becomes visible, because the older traces you need for comparison still exist.
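A quick way to see what the retention window does to a recurring pattern: the sketch below, with a hypothetical incident that recurs every 24 days, counts how many occurrences would still have traces available. The 395-day figure is my approximation of 13 months, and the function name is illustrative.

```python
from datetime import date, timedelta


def visible_occurrences(occurrences: list, today: date, retention_days: int) -> list:
    """Return the occurrences whose traces would still be retained today."""
    cutoff = today - timedelta(days=retention_days)
    return [d for d in occurrences if d >= cutoff]


today = date(2026, 2, 27)
# Hypothetical incident that recurs every 24 days: today, then 24, 48, 72 days ago.
occurrences = [today - timedelta(days=24 * n) for n in range(4)]

short = visible_occurrences(occurrences, today, retention_days=8)
long = visible_occurrences(occurrences, today, retention_days=395)  # ~13 months
print(len(short), len(long))  # 1 4
```

With an 8-day window you only ever see the current occurrence, so there is nothing to compare it against.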

How signals flow into the entity model

Logs, Metrics, Traces
  ↓
OTel ingest pipeline (OPAL enrichment): enrich at ingest, not at query time
  ↓
O11y Knowledge Graph: services, pods, deployments, users, incidents as linked entities
  ↓
Iceberg data lake on cloud storage: open format storage, 13-month retention, 10x compression

The knowledge graph is the actual product

Every vendor claims a single pane of glass. What they usually mean is that you can see logs, metrics, and traces in the same UI without navigating to a different product. Observe's knowledge graph is a different claim. It means that your infrastructure entities - services, pods, namespaces, deployments, users, incidents - are nodes in a graph with explicit semantic relationships between them.

In practice, what this changes is the investigation flow. In New Relic, investigating a user-facing error looks like this: you start in the errors inbox, find the trace, pivot to the service map to understand dependencies, then open a separate log query to correlate log output from around the same time window. Each of those pivots is manual. You are holding context in your head between tabs.

In Observe, those pivots are traversals on the graph. You start at the error span, and the deployment that happened 12 minutes before it is already linked as a related event on the same entity. The services that called this one and the services it called downstream are connected nodes. You navigate by following relationships, not by opening new queries.

The real difference in a P1

The knowledge graph shortens the "what changed recently" phase of an incident investigation dramatically. Deployments, config changes, and scaling events are entities in the graph, linked to the services they affected. You do not search for what changed. You look at the entity's recent related events. That is a different mental model from querying logs for deployment markers, and it is faster by a meaningful margin when the pressure is on.
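Here is a small Python sketch of that mental model, under my own assumptions about the shape of the data: entities hold their change events and links to the services they call, and "what changed recently" becomes a traversal rather than a log search. The `Entity` and `Event` classes and the dependency edges are hypothetical, not Observe's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Event:
    kind: str      # e.g. "deployment", "config_change", "scaling"
    detail: str
    at: datetime


@dataclass
class Entity:
    name: str
    events: list = field(default_factory=list)
    calls: list = field(default_factory=list)  # downstream entities


def recent_changes(entity: Entity, now: datetime, window: timedelta) -> list:
    """Collect change events on the entity and everything it calls, newest first."""
    seen, stack, found = set(), [entity], []
    while stack:
        node = stack.pop()
        if node.name in seen:
            continue
        seen.add(node.name)
        found.extend((node.name, e) for e in node.events if now - window <= e.at <= now)
        stack.extend(node.calls)
    return sorted(found, key=lambda pair: pair[1].at, reverse=True)


now = datetime(2026, 2, 27, 14, 7, 43)
postgres = Entity("postgres-primary")
fraud = Entity("fraud-check-svc", calls=[postgres])
payment = Entity(
    "payment-service",
    events=[
        Event("deployment", "v2.14.1", now - timedelta(minutes=14)),
        Event("deployment", "v2.13.9", now - timedelta(days=3)),
    ],
    calls=[fraud],
)

changes = recent_changes(payment, now, timedelta(hours=1))
for name, event in changes:
    print(name, event.kind, event.detail)
```

The lookup starts from the entity you are already looking at, so there is no query to compose under pressure.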

What it looks like in practice

The best way to understand the single-pane-of-glass claim is to see logs, APM, and traces actually coexist on the same surface - not as separate products, but as correlated signals on the same entities. The example below walks through the overview, APM, log, and trace views of a single payment-service incident to show how each signal type surfaces in context, without losing entity state between pivots.

Observe / payment-service (https://observe.anisri.dev/payment-service)

Overview

  • Requests / min: 2,847 (↑ 4.2% vs prev window)
  • Error rate: 3.1% (↑ from 0.4%, deploy 14m ago)
  • P99 latency: 487ms (↑ 2.3x baseline)
  • Affected users: 1,204 active error sessions

[Chart: request rate and errors over the last 1h, with the deploy event marked]

Related entities

  • payment-service (v2.14.1 · 12 pods): DEGRADED
  • fraud-check-svc (v1.8.0 · 4 pods): SLOW
  • postgres-primary (RDS · us-east-1): OK
  • redis-session (ElastiCache · cluster): OK

Knowledge graph alert

Deployment payment-service v2.14.1, pushed 14 minutes ago, is linked to the 3.1% error spike on POST /v2/charge. Latency on the downstream dependency fraud-check-svc is elevated. The other dependencies are nominal. AI SRE suggests rollback to v2.13.9.

APM

  • Throughput: 2.8k req/min
  • P50 latency: 84ms (within SLO)
  • P99 latency: 487ms (↑ SLO breach, 250ms)
  • Apdex: 0.71 (↓ from 0.96)

Slowest endpoints, last 1h

  • POST /v2/charge: 2,847 req/min · 3.1% error rate · 487ms p99
  • GET /v2/status/{id}: 412 req/min · 0.1% error rate · 182ms p99
  • POST /v2/refund: 88 req/min · 0.0% error rate · 64ms p99
  • GET /healthz: 2,400 req/min · 0.0% error rate · 4ms p99

[Chart: p99 latency distribution by endpoint for /charge, /status, /refund, /healthz]

Logs

  • Log volume: 84k lines / min
  • Error lines: 2,614 (↑ 12x spike)
  • Warn lines: 8,901 (↑ 3x spike)
  • Ingest cost: $0.14 / GB compressed

Live log tail: payment-service (enriched at ingest)

14:07:43.221  payment-service  ERROR  charge_request failed: timeout calling fraud-check-svc after 450ms [trace_id=4f2a91c env=prod deploy=v2.14.1]
14:07:43.198  fraud-check-svc  WARN   upstream latency high: payment-service 443ms avg [pod=fcs-7d9b4]
14:07:42.914  payment-service  ERROR  charge_request failed: timeout calling fraud-check-svc after 448ms [trace_id=3e1b80d env=prod deploy=v2.14.1]
14:07:42.103  payment-service  INFO   charge_request success: 78ms [trace_id=2c0d71a user_id=u_8821 deploy=v2.14.1]
14:07:41.889  k8s-deploy       WARN   deployment payment-service v2.14.1 rolled out - 14 mins ago [namespace=prod replicas=12/12]
14:07:41.442  payment-service  ERROR  charge_request failed: timeout calling fraud-check-svc after 452ms [trace_id=1a9f63e env=prod deploy=v2.14.1]

Ingest enrichment note

Every payment-service log line above already carries deploy=v2.14.1 and a trace_id, and the error lines also carry env=prod, because OPAL attached them at ingest from the Kubernetes deployment event entity. No join is needed at query time. The deployment context is already there.

Traces

  • Traces / min: 2,847 (100% sampled)
  • Error traces: 88 / min (3.1%)
  • Retention: 13mo (full fidelity)
  • Spans / trace: 14 (avg depth)

Recent traces

POST /v2/charge · trace_id: 4f2a91c · ERROR
14:07:43 · 487ms total · failed at fraud-check-svc
Trace waterfall (4f2a91c, 487ms):
  payment-service      487ms
  ↳ validate-card      38ms
  ↳ fraud-check-svc    450ms ⚠
  ↳ rules-engine       437ms
  ↳ postgres query     432ms
Root cause: rules-engine postgres query slow. Missing index on fraud_rules.merchant_id added in v2.14.1 migration. Linked deploy event: payment-service v2.14.1 14m ago.

POST /v2/charge · trace_id: 3e1b80d · ERROR
14:07:42 · 448ms total · failed at fraud-check-svc
Trace waterfall (3e1b80d, 448ms):
  payment-service      448ms
  ↳ validate-card      32ms
  ↳ fraud-check-svc    413ms ⚠
Same pattern as 4f2a91c: fraud-check-svc slowdown consistent across error traces. Deployment link confirmed.

POST /v2/charge · trace_id: 2c0d71a · OK
14:07:42 · 78ms total · success

13-month retention in action

The same rules-engine slowdown appeared in traces from 6 weeks ago during a different migration. Observe surfaced the historical trace correlation automatically because both deploys touched the same fraud_rules table. In New Relic, those 6-week-old traces would not exist.

What entities look like in OPAL

OPAL is Observe's pipeline and query language. It is a dataflow language, not a SQL-like query language, which is the main adjustment coming from NRQL or SPL. The mental model is: you are describing a pipeline of transformations on a stream of events, not writing a query against a table. Once that clicks, the language is genuinely expressive.

OPAL — error spans linked to recent deployments
# Start with error spans on the payment service
filter span.status_code = "ERROR"
  and service.name = "payment-service"

# Join to deployment events: spans from 5 minutes before to 30 minutes after each deploy
| join deployment_events
    on service.name = deploy_service
    and timestamp >= deploy_time - 5m
    and timestamp <= deploy_time + 30m

# Summarize: errors, p99 latency, affected users per deploy
| summarize
    error_count  = count(),
    p99_ms       = percentile(duration_ms, 99),
    affected_users = count_distinct(user.id)
  by service.name, deploy_version, deploy_time

| filter error_count > 10
| sort deploy_time desc

That query - correlating error spikes with deployment events, pulling in p99 latency and user impact in a single pass - would require at least three separate queries and a manual time correlation in New Relic. In Observe it is a single pipeline because deployments and spans are entities in the same graph.
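If the pipeline above is hard to read on first contact, it can help to see the same logic spelled out imperatively. The following is a plain Python sketch of what the window join and summarize stages compute, not a model of how Observe executes it. The record shapes, the nearest-rank percentile, and the `min_errors` parameter (standing in for the `error_count > 10` filter) are my own assumptions.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


def errors_by_deploy(spans, deploys, min_errors=10,
                     before=timedelta(minutes=5), after=timedelta(minutes=30)):
    """Group error spans under each deploy whose time window contains them."""
    groups = defaultdict(list)
    for span in spans:
        if span["status"] != "ERROR":
            continue
        for d in deploys:
            in_window = d["time"] - before <= span["time"] <= d["time"] + after
            if span["service"] == d["service"] and in_window:
                groups[(d["service"], d["version"], d["time"])].append(span)

    rows = [
        {
            "service": service,
            "deploy_version": version,
            "deploy_time": deploy_time,
            "error_count": len(matched),
            "p99_ms": p99([s["duration_ms"] for s in matched]),
            "affected_users": len({s["user_id"] for s in matched}),
        }
        for (service, version, deploy_time), matched in groups.items()
        if len(matched) > min_errors
    ]
    return sorted(rows, key=lambda r: r["deploy_time"], reverse=True)


deploy_time = datetime(2026, 2, 27, 13, 53, 43)
deploys = [{"service": "payment-service", "version": "v2.14.1", "time": deploy_time}]
spans = [
    {"service": "payment-service", "status": "ERROR", "duration_ms": 487,
     "user_id": "u_1", "time": deploy_time + timedelta(minutes=14)},
    {"service": "payment-service", "status": "ERROR", "duration_ms": 448,
     "user_id": "u_2", "time": deploy_time + timedelta(minutes=13)},
    {"service": "payment-service", "status": "ERROR", "duration_ms": 452,
     "user_id": "u_1", "time": deploy_time + timedelta(minutes=12)},
    {"service": "payment-service", "status": "ERROR", "duration_ms": 900,
     "user_id": "u_3", "time": deploy_time + timedelta(hours=2)},  # outside window
    {"service": "payment-service", "status": "OK", "duration_ms": 78,
     "user_id": "u_4", "time": deploy_time + timedelta(minutes=13)},
]
rows = errors_by_deploy(spans, deploys, min_errors=2)
print(rows)
```

One thing the imperative version makes obvious: an error span that falls inside two overlapping deploy windows is counted under both deploys, which is usually what you want when you are asking "which deploy could have caused this".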

Observe vs New Relic: the dimensions that actually matter

New Relic is a good product. I have used it extensively across multiple engagements. This comparison is not about which one is better in absolute terms. It is about where the architectural differences create real operational consequences, and where the tradeoffs land for different team profiles.

| Dimension | New Relic | Observe |
|---|---|---|
| Trace retention | 8 days full fidelity. Recurring patterns invisible. | 13 months standard. Full trace history available. |
| Cardinality cost | Penalized at scale. Teams throttle instrumentation. | Flat data lake model. No instrumentation tax. |
| Cross-signal correlation | Manual tab pivoting. Workable, but slow under pressure. | Knowledge graph traversal. Entities linked at ingest. |
| Query language | NRQL (SQL-like). Familiar, widely understood. | OPAL (dataflow). Learning curve, more expressive. |
| Storage model | Proprietary, managed. Vendor lock-in on your data. | Iceberg on cloud storage. Open formats, data portability. |
| OTel support | Supported with NRQL layer. Mild translation overhead. | OTel-first architecture. No translation layer. |
| Pricing model | Ingest + seats + features. Hard to predict at scale. | Ingest + compute usage. More predictable, new model to learn. |
| Ecosystem maturity | 10+ years, deep integrations. Broad third-party coverage. | Newer, growing fast. Core integrations solid, breadth growing. |

The two dimensions that consistently change the calculus most for teams I talk to are trace retention and cardinality cost. The 8-day trace window is fine for straightforward deployment-regression cycles. Once your incident patterns are longer - monthly rollups, seasonal load patterns, gradual drift from a config change two weeks back - you start working around the limitation rather than with the tool.

Where New Relic still wins

Ecosystem depth and time-to-value for standard environments. If you are running a fairly conventional stack on AWS or GCP with standard services, New Relic's data apps and out-of-the-box dashboards get you to operational visibility faster. The NRQL learning curve is also lower for teams coming from a SQL background. OPAL genuinely takes time to internalize, and if you have a large team that needs to query the platform regularly, that ramp-up cost is not trivial.

The same question in both query languages

The translation between NRQL and OPAL is not always one-to-one, but for common investigative queries the intent maps cleanly. Here are three side-by-side examples that show where the syntactic differences are and why OPAL's pipeline model handles joins and aggregations differently.

Average CPU by host in production, last hour

New Relic (NRQL)
SELECT average(cpuPercent)
FROM SystemSample
WHERE environment = 'production'
SINCE 1 hour ago
FACET hostname
Observe (OPAL)
filter environment = "production"
  and timestamp >= now() - 1h
| summarize avg_cpu = avg(cpu_percent)
  by hostname
| sort avg_cpu desc

Error rate by service, last 30 minutes

New Relic (NRQL)
SELECT
  filter(count(*), WHERE error IS true) AS errors,
  percentage(count(*), WHERE error IS true)
    as error_rate
FROM Transaction
WHERE appName LIKE 'api-%'
SINCE 30 minutes ago
FACET appName
Observe (OPAL)
filter service.name ~ "api-.*"
  and timestamp >= now() - 30m
| summarize
    errors = countif(span.status_code = "ERROR"),
    total  = count(),
    error_rate = countif(span.status_code = "ERROR")
               / count() * 100
  by service.name
| filter error_rate > 0

P99 latency by endpoint over last 6 hours

New Relic (NRQL)
SELECT percentile(duration, 99)
FROM Transaction
WHERE transactionType = 'Web'
SINCE 6 hours ago
TIMESERIES 10 minutes
FACET request.uri
Observe (OPAL)
filter span.kind = "server"
  and timestamp >= now() - 6h
| make_col bucket = bucket(timestamp, 10m)
| summarize
    p99_ms = percentile(duration_ms, 99)
  by http.route, bucket
| sort bucket desc
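
For readers who want to see what the time-bucketing step is doing in both versions, here is a rough Python equivalent: floor each timestamp to a 10-minute boundary, group by route and bucket, and take a nearest-rank p99 per group. The span shape and helper names are hypothetical, and real platforms may use a different percentile estimator.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def bucket_start(ts: datetime, width: timedelta) -> datetime:
    """Floor a naive timestamp to the start of its fixed-width bucket."""
    whole_buckets = (ts - EPOCH) // width
    return EPOCH + whole_buckets * width


def p99_by_route(spans, width=timedelta(minutes=10)):
    """Nearest-rank p99 of duration_ms per (route, bucket start)."""
    groups = defaultdict(list)
    for s in spans:
        groups[(s["route"], bucket_start(s["time"], width))].append(s["duration_ms"])
    return {
        key: sorted(durations)[math.ceil(0.99 * len(durations)) - 1]
        for key, durations in groups.items()
    }


spans = [
    {"route": "/v2/charge", "time": datetime(2026, 2, 27, 14, 7, 43), "duration_ms": 487},
    {"route": "/v2/charge", "time": datetime(2026, 2, 27, 14, 2, 10), "duration_ms": 80},
    {"route": "/v2/charge", "time": datetime(2026, 2, 27, 14, 11, 0), "duration_ms": 90},
]
result = p99_by_route(spans)
for (route, bucket), value in sorted(result.items(), key=lambda kv: kv[0][1], reverse=True):
    print(route, bucket.time(), value)
```
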
On the translation work

If your team is sitting on a library of NRQL queries built up over years and you are evaluating Observe, the translation overhead is real. Functions map, but the pipeline model means the structure of complex queries changes meaningfully. If you are doing this at scale, you want a systematic way to translate rather than rewriting each query by hand. That is exactly the problem the Observability Query Translator was built to solve.

Translating your existing queries to OPAL

One of the practical friction points when evaluating Observe from a New Relic background is the query translation work. You have hundreds of NRQL queries embedded in dashboards, runbooks, alerts, and tribal knowledge. Rewriting them by hand is tedious and error-prone, especially for complex aggregations and conditional expressions where the syntax differences between NRQL and OPAL are non-trivial.

Tool

Observability Query Translator

Paste a query from New Relic, Datadog, Splunk, Prometheus, Grafana, or CloudWatch and get the equivalent in any other supported platform instantly. Bidirectional translation across all six platforms, with query logic and filter semantics preserved. Useful for platform evaluations, migration projects, or just working across a multi-platform environment where you need to express the same question in multiple query languages.

Input (New Relic NRQL)
SELECT average(cpuPercent)
FROM SystemSample
WHERE environment = 'production'
SINCE 1 hour ago
Output (Observe OPAL)
filter environment = "production"
  and timestamp >= now() - 1h
| summarize avg_cpu = avg(cpu_percent)

The translator also handles Datadog to Splunk, Prometheus to CloudWatch, and every other combination across the six supported platforms. If your environment is multi-platform and you are spending time context-switching between query languages, it is worth bookmarking.

Best practices for implementing Observe

If you are moving to Observe or running a proof of concept, these are the patterns that separate teams who get value fast from teams who struggle through a six-month onboarding cycle.

Data ingestion and pipeline design

Start with OpenTelemetry instrumentation from day one. Observe is OTel-native, which means you avoid translation layers and get full-fidelity traces without fighting the platform. Use the Observe Agent for infrastructure telemetry collection rather than stitching together multiple forwarders. Consolidation reduces operational overhead and gives you a single configuration surface.

Design your OPAL transform pipelines at ingest time, not query time. Enrich logs with service context, deployment metadata, and infrastructure attributes before they land in storage. This makes interactive investigation meaningfully faster because you are not joining datasets every time you run a query. The storage cost of denormalized data is low. The query performance gain is high.

On cardinality

Observe does not penalize high-cardinality attributes the way cost-per-event platforms do. Index on user IDs, session tokens, transaction IDs, request paths - whatever makes your traces useful. The cultural shift this enables is significant. Engineers stop self-censoring instrumentation decisions based on cost anxiety.

Knowledge graph and entity modeling

Model your service topology explicitly in the knowledge graph from the start. Define relationships between services, deployments, infrastructure, and dependencies. This is not optional metadata. It is the structural foundation that makes cross-signal correlation automatic. A properly modeled graph means clicking from an error spike in APM to the exact deployment that caused it takes one click, not five manual queries.

Use correlation tags consistently across logs, metrics, and traces. Tag everything with service name, environment, deployment version, and cluster ID at minimum. The knowledge graph uses these tags to build the entity model. Inconsistent tagging breaks the graph and you lose the primary value proposition of the platform.
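A cheap way to keep tagging honest is to audit a sample of telemetry for the minimum tag set before it reaches the platform. The sketch below assumes records are flat dicts and uses tag key names I picked for illustration; substitute whatever keys your own convention uses.

```python
# Hypothetical key names for the four minimum tags named above.
REQUIRED_TAGS = ("service.name", "environment", "deployment.version", "cluster.id")


def missing_tags(record: dict, required=REQUIRED_TAGS) -> list:
    """Return the required tags that are absent or empty on a telemetry record."""
    return [tag for tag in required if not record.get(tag)]


def audit(records: list) -> dict:
    """Count how often each required tag is missing across a batch."""
    counts = {tag: 0 for tag in REQUIRED_TAGS}
    for record in records:
        for tag in missing_tags(record):
            counts[tag] += 1
    return counts


records = [
    {"service.name": "payment-service", "environment": "prod",
     "deployment.version": "v2.14.1", "cluster.id": "c-1"},
    {"service.name": "fraud-check-svc", "environment": "prod"},
]
report = audit(records)
print(report)
```

Running something like this in CI or against a sample of the ingest stream catches drift before it shows up as a hole in the graph.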

Dashboard and monitoring strategy

Build dashboards around entities, not raw metrics. An entity-centric dashboard shows service health, related infrastructure, downstream dependencies, and recent deployments in a single view. This is structurally different from metric-centric dashboards where you manually assemble context every time you investigate.

Configure monitors using the entity model. A threshold monitor on a service entity automatically includes all instances of that service without manual filter updates when you scale. Use promote monitors to surface patterns from logs into alerts without writing complex aggregation queries. Observe's monitor types map directly to common SRE patterns - use them instead of forcing everything into threshold alerts.

| Monitor Type | Use Case | Implementation Pattern |
|---|---|---|
| Threshold | Error rate, latency percentiles, resource utilization crossing fixed bounds | Set on service entities with dynamic instance scaling |
| Count | Specific error patterns, security events, deployment failures | Filter logs to pattern, alert when count exceeds threshold in window |
| Promote | Surfacing high-cardinality issues from logs (user errors, endpoint failures) | Aggregate by dimension, promote when individual dimension crosses threshold |
| Anomaly | Detecting drift in metrics with seasonal or weekly patterns | Train on historical data, alert on statistical deviation |
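
The promote pattern is the least familiar of the four, so here is a minimal Python sketch of the logic as the table describes it: aggregate error lines by one dimension and surface only the values that cross a threshold. This illustrates the pattern, not Observe's monitor implementation; the function name and record fields are hypothetical.

```python
from collections import Counter


def promote(log_lines, dimension, threshold):
    """Return dimension values whose error count exceeds the threshold."""
    counts = Counter(
        line[dimension]
        for line in log_lines
        if line.get("level") == "ERROR" and dimension in line
    )
    return {value: n for value, n in counts.items() if n > threshold}


log_lines = [
    {"level": "ERROR", "endpoint": "/v2/charge"},
    {"level": "ERROR", "endpoint": "/v2/charge"},
    {"level": "ERROR", "endpoint": "/v2/charge"},
    {"level": "ERROR", "endpoint": "/v2/refund"},
    {"level": "INFO", "endpoint": "/v2/charge"},
]
promoted = promote(log_lines, "endpoint", threshold=2)
print(promoted)  # {'/v2/charge': 3}
```

The same function works with `user_id` or any other high-cardinality dimension, which is where this pattern earns its keep.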

OPAL query patterns

Learn the pipeline model early. OPAL queries are sequential transformations, not declarative statements. Each stage operates on the output of the previous stage. This makes complex queries more readable once you internalize the pattern, but it requires unlearning SQL and NRQL mental models.

Build a team library of common OPAL patterns in the first week. Error rate by service, latency percentiles by endpoint, log volume by severity - these are queries you will write dozens of times. Standardize them early and share them across the team. Use datasets to materialize frequently-used aggregations rather than recomputing them in every dashboard.

Learning curve

Block dedicated time for OPAL training. Expecting engineers to learn it alongside active incident work does not work. Run a half-day workshop where the team translates your ten most common queries from NRQL or PromQL to OPAL together. This builds shared fluency faster than individual self-study.

Trace retention and long-horizon analysis

Use the 13-month trace retention for pattern analysis, not just incident response. Build dashboards that show error rate trends, latency drift, and deployment impact over weeks and months. This is analysis you could not do with 8-day retention platforms. Seasonal patterns, slow regressions, and post-deployment performance drift become visible.

Configure trace sampling at 100% for production services where trace volume is manageable. The storage cost is lower than you expect and full-fidelity traces eliminate sampling bias in incident investigations. If you need to sample, use tail-based sampling to keep all error traces and a representative sample of successful requests.
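If you do end up sampling, the tail-based decision is easy to reason about in code. This is a sketch of the policy only, assuming each trace arrives as a dict with a `trace_id` and a list of spans; in practice the decision would live in your collector's sampling configuration rather than in application code.

```python
import hashlib


def keep_trace(trace: dict, success_rate: float = 0.1) -> bool:
    """Decide after the trace completes: keep every error, sample the rest."""
    if any(span["status"] == "ERROR" for span in trace["spans"]):
        return True
    # Hash the trace id so every collector makes the same decision for a trace.
    digest = hashlib.sha256(trace["trace_id"].encode()).digest()
    fraction = int.from_bytes(digest[:4], "big") / 2**32  # in [0, 1)
    return fraction < success_rate


error_trace = {"trace_id": "4f2a91c", "spans": [{"status": "OK"}, {"status": "ERROR"}]}
ok_trace = {"trace_id": "2c0d71a", "spans": [{"status": "OK"}]}

print(keep_trace(error_trace, success_rate=0.0))  # True
print(keep_trace(ok_trace, success_rate=0.0))     # False
```

Hashing the trace ID instead of calling a random number generator keeps the decision stable if the same trace is evaluated more than once.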

Cost management

Observe pricing is based on data volume ingested and stored, not events or spans. Compression runs at roughly 10x on typical log and trace data, which means your effective storage cost is significantly lower than the raw ingestion volume. Monitor your compression ratio in the first month to validate this holds for your telemetry profile.
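The validation itself is a two-line calculation. A minimal sketch, with made-up volumes and helper names of my own: measure raw versus stored volume over the same period, then project what a month of storage looks like at that ratio.

```python
def compression_ratio(raw_gb: float, stored_gb: float) -> float:
    """Observed compression ratio over a period (raw volume / stored volume)."""
    if stored_gb <= 0:
        raise ValueError("stored_gb must be positive")
    return raw_gb / stored_gb


def monthly_stored_gb(daily_raw_gb: float, ratio: float, days: int = 30) -> float:
    """Projected stored volume for a month at the observed ratio."""
    return daily_raw_gb * days / ratio


ratio = compression_ratio(raw_gb=500, stored_gb=50)
projected = monthly_stored_gb(daily_raw_gb=500, ratio=ratio)
print(ratio, projected)  # 10.0 1500.0
```

If your measured ratio comes out well below 10x, that is worth knowing before you build a budget around it.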

Use data retention policies to age out low-value telemetry. Keep full-fidelity traces for 13 months, but you may not need debug-level logs past 30 days. Configure retention by dataset rather than applying a single global policy. High-value signals stay longer. Low-value signals age out faster.
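One way to think about per-dataset retention is as a small lookup table with a default. The sketch below is illustrative only: the dataset names and every value other than the 13-month trace and 30-day debug-log figures are hypothetical, and the actual policy would be configured in the platform, not in Python.

```python
from datetime import date, timedelta

# Hypothetical per-dataset policy; 395 days approximates 13 months.
RETENTION_DAYS = {"traces": 395, "logs.debug": 30, "logs.error": 180}
DEFAULT_RETENTION_DAYS = 90


def is_expired(dataset: str, written: date, today: date) -> bool:
    """True when a record is older than its dataset's retention period."""
    days = RETENTION_DAYS.get(dataset, DEFAULT_RETENTION_DAYS)
    return today - written > timedelta(days=days)


today = date(2026, 2, 27)
written = date(2026, 1, 1)  # 57 days earlier
print(is_expired("logs.debug", written, today))  # True
print(is_expired("traces", written, today))      # False
```
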

Data export

Your data is stored in Apache Iceberg on cloud object storage in open formats. If you ever need to move platforms or run external analysis, you own the data and can export it without vendor permission. This is a meaningful difference from platforms where your telemetry is locked in proprietary indexes.

Who Observe is and is not a fit for

Strong fit when...
  • Your trace retention window is actively limiting incident investigations
  • High-cardinality instrumentation is being throttled to manage costs
  • You are running OTel-instrumented services and want a native-first platform
  • Long-retention trace analysis (seasonal, monthly patterns) matters to your SRE practice
  • Data portability is a requirement and vendor lock-in on storage is a concern
  • Your observability costs have become a regular planning conversation
Weaker fit when...
  • Your team is deeply NRQL-fluent and the switch cost is organizationally expensive
  • You need broad out-of-the-box integrations fast and standard AWS/GCP coverage is not enough
  • Your incident patterns are short-cycle and 8-day trace retention is genuinely sufficient
  • You have a large non-technical stakeholder base where New Relic's UI familiarity matters
  • Time-to-value is the overriding constraint and onboarding bandwidth is limited

The honest version: Observe is a better architectural bet for where observability is going. OTel-native, open storage formats, knowledge graph correlation, long-retention traces - these are the right foundations. New Relic is a better tactical choice for teams that need broad coverage fast and have significant existing investment in the NRQL ecosystem. Both are real answers depending on where your team sits.

The signal to watch

If your engineers have started asking "does this attribute cost money" before adding instrumentation, or if you have had to build compensating workarounds because 8-day trace retention is not covering your incident patterns - those are the two strongest signals that Observe is worth a serious evaluation. Not a demo. An actual proof of concept with your own telemetry data.

Where to go from here

If you are evaluating Observe, the most useful thing you can do is run a 30-day proof of concept with your actual telemetry rather than synthetic data. The knowledge graph value is not obvious until your own service topology is in it. The trace retention value is not obvious until you have an incident where you actually need three weeks of trace history and it is just there.

The OPAL learning curve is real. Block time for it rather than expecting engineers to absorb it alongside active work. Build a team cheatsheet of common OPAL patterns for your domain in the first week. And if you are coming from NRQL and need to translate a library of existing queries, the Observability Query Translator can take a lot of that mechanical work off the table.

The platform is genuinely worth the evaluation if the architectural fit is right. Whether it is worth the switching cost depends entirely on which of the friction points above you are actually hitting today.