Observe for Observability
Most observability platforms give you logs, metrics, and traces in separate tabs and call it a single pane of glass. Observe actually connects them. That distinction matters a lot more than it sounds when you are staring at a production incident at 2am.
I have spent enough time across Dynatrace, Splunk, New Relic, and Datadog to have strong opinions about what works and what just looks good in a vendor demo. Observe falls into a genuinely different architectural category. It is not better at the same things those tools do. It does something structurally different, and that changes what you can actually do during an investigation.
This is not a product review. It is a practitioner breakdown of how Observe handles logs, APM, and traces, what the knowledge graph model actually means in practice, how it compares to New Relic across the dimensions that matter, and where the rough edges are.
How Observe handles the three signal types
The baseline for any observability platform is logs, metrics, and traces. Where most platforms treat these as three separate products that happen to share a UI, Observe treats them as three views into the same underlying entity model. That architectural choice is what makes cross-signal investigation fast.
Logs
Log ingestion in Observe lands in a streaming data lake backed by Apache Iceberg on cloud object storage. This is not a proprietary index. Your data is stored in open formats, which means your logs are not held hostage if you ever want to move. Compression runs at roughly 10x, so the storage cost on high-volume log environments comes out significantly lower than platforms charging per GB ingested.
What makes the log experience different from New Relic or Splunk is the transform pipeline. You enrich and shape logs at ingest using OPAL (Observe's pipeline language) before they land in storage. By the time a log line is queryable, it already has the service context, deployment version, and infrastructure metadata attached. You are not joining at query time. You are joining at ingest, which is meaningfully faster for interactive investigation.
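To make the ingest-versus-query distinction concrete, here is a minimal Python sketch of the idea. It is not OPAL and not Observe's implementation; the `ServiceContext` type, the `enrich_at_ingest` function, and the pod and cluster names are all hypothetical, chosen only to show where the join happens.

```python
from dataclasses import dataclass


@dataclass
class ServiceContext:
    """Hypothetical metadata known about a workload at ingest time."""
    service: str
    deploy_version: str
    cluster: str


def enrich_at_ingest(raw_log: dict, contexts: dict[str, ServiceContext]) -> dict:
    """Attach service context to a log record once, before it is stored."""
    ctx = contexts.get(raw_log.get("pod", ""))
    if ctx is None:
        # Unknown pod: store the record unchanged rather than guessing.
        return dict(raw_log)
    return {
        **raw_log,
        "service": ctx.service,
        "deploy_version": ctx.deploy_version,
        "cluster": ctx.cluster,
    }


# Assumed lookup table: pod name -> context (illustrative values only).
contexts = {"pay-7f9c": ServiceContext("payment-service", "v2.14.1", "prod-east")}
stored = [enrich_at_ingest({"pod": "pay-7f9c", "msg": "charge failed"}, contexts)]

# Query time is now a plain filter over stored records, with no join.
hits = [r for r in stored if r.get("deploy_version") == "v2.14.1"]
```

The lookup cost is paid once per record on the way in, so every later query is a filter on fields that are already present.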
One thing that changes your instrumentation behavior fast: Observe does not penalize high-cardinality attributes the way New Relic and Datadog do. You can index on user ID, session token, transaction ID without the bill exploding. Once your engineering team realizes this, they stop asking "does this attribute cost money" before adding it to their spans. That is a meaningful cultural shift.
APM and metrics
Application performance monitoring in Observe is built on OTel-native instrumentation. You bring your OpenTelemetry SDK instrumented services and the spans, metrics, and resource attributes flow in without translation layers. There is no proprietary agent to deploy alongside your application code.
The metrics model follows the same entity graph approach as logs. A service entity in Observe has its error rate, latency percentiles, request throughput, and resource utilization all modeled as attributes of that entity rather than as separate metric series you manually join. When you are looking at a service in the APM view, the logs from that service are right there. Not a separate query. Not a link to another tab. The same entity.
Distributed traces
This is where Observe separates itself most clearly from the field. Full-fidelity traces are stored for 13 months by default, in Iceberg tables. Compared to New Relic's 8-day full-fidelity window and Datadog's 15-day default, this is a fundamentally different capability. Incident patterns that recur every three to four weeks become observable at the trace level. Seasonal regressions, slow behavioral drift, post-deployment performance degradation that takes weeks to show up clearly - all of this becomes visible in a way that was previously impossible without manual re-instrumentation.
The knowledge graph is the actual product
Every vendor claims a single pane of glass. What they usually mean is that you can see logs, metrics, and traces in the same UI without navigating to a different product. Observe's knowledge graph is a different claim. It means that your infrastructure entities - services, pods, namespaces, deployments, users, incidents - are nodes in a graph with explicit semantic relationships between them.
In practice, what this changes is the investigation flow. In New Relic, investigating a user-facing error looks like this: you start in the errors inbox, find the trace, pivot to the service map to understand dependencies, then open a separate log query to correlate log output from around the same time window. Each of those pivots is manual. You are holding context in your head between tabs.
In Observe, those pivots are traversals on the graph. You start at the error span, and the deployment that happened 12 minutes before it is already linked as a related event on the same entity. The services that called this one and the services it called downstream are connected nodes. You navigate by following relationships, not by opening new queries.
The knowledge graph shortens the "what changed recently" phase of an incident investigation dramatically. Deployments, config changes, and scaling events are entities in the graph, linked to the services they affected. You do not search for what changed. You look at the entity's recent related events. That is a different mental model from querying logs for deployment markers, and it is faster by a meaningful margin when the pressure is on.
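A toy model helps show why "look at the entity's recent related events" is a traversal rather than a search. The sketch below is plain Python with hypothetical names (`EntityGraph`, `link`, `record_event`, `recent_events`); it illustrates the mental model only and says nothing about how Observe stores or indexes its graph.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class EntityGraph:
    """Toy entity graph: nodes are entity ids, events hang off entities."""

    def __init__(self):
        self._edges = defaultdict(list)   # entity -> [(relation, other entity)]
        self._events = defaultdict(list)  # entity -> [(timestamp, description)]

    def link(self, src, relation, dst):
        self._edges[src].append((relation, dst))

    def record_event(self, entity, when, description):
        self._events[entity].append((when, description))

    def recent_events(self, entity, now, window):
        """Events on the entity and its direct neighbours inside the window."""
        candidates = [entity] + [dst for _, dst in self._edges[entity]]
        found = []
        for node in candidates:
            for when, description in self._events[node]:
                if now - window <= when <= now:
                    found.append((when, node, description))
        return sorted(found, reverse=True)  # newest first


now = datetime(2024, 1, 1, 2, 0)
graph = EntityGraph()
graph.link("payment-service", "called_by", "fraud-check-svc")
graph.record_event("payment-service", now - timedelta(minutes=12), "deploy v2.14.1")
graph.record_event("fraud-check-svc", now - timedelta(hours=3), "scale-up")

recent = graph.recent_events("payment-service", now, timedelta(minutes=30))
```

Here `recent` holds only the deploy from 12 minutes earlier, found by starting at the service and following its links, not by querying logs for deployment markers.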
What it looks like in practice
The best way to understand the single-pane-of-glass claim is to see logs, APM, and traces actually coexist on the same surface - not as tabs, but as correlated signals on the same entities. The notes below come from a single example incident and show what each signal type surfaces in context, without losing entity state between pivots.
Deployment payment-service v2.14.1, pushed 14 minutes ago, is linked to a 3.1% error spike on POST /v2/charge. Upstream caller fraud-check-svc shows elevated latency. Downstream dependencies are nominal. AI SRE suggests rollback to v2.13.9.
Every log line in this incident already carries deploy=v2.14.1, env=prod, and trace_id because OPAL attached them at ingest from the Kubernetes deployment event entity. No join needed at query time. The deployment context is already there.
The rules-engine slowdown in this incident's traces also appeared in traces from 6 weeks ago during a different migration. Observe surfaced the historical trace correlation automatically because both deploys touched the same fraud_rules table. In New Relic, those 6-week-old traces would not exist.
What entities look like in OPAL
OPAL is Observe's pipeline and query language. It is a dataflow language, not a SQL-like query language, which is the main adjustment coming from NRQL or SPL. The mental model is: you are describing a pipeline of transformations on a stream of events, not writing a query against a table. Once that clicks, the language is genuinely expressive.
# Start with error spans on the payment service
filter span.status_code = "ERROR"
  and service.name = "payment-service"
# Join to deployment events, from 5 minutes before to 30 minutes after each deploy
| join deployment_events
    on service.name = deploy_service
    and timestamp >= deploy_time - 5m
    and timestamp <= deploy_time + 30m
# Summarize: errors, p99 latency, affected users per deploy
| summarize
    error_count = count(),
    p99_ms = percentile(duration_ms, 99),
    affected_users = count_distinct(user.id)
  by service.name, deploy_version, deploy_time
| filter error_count > 10
| sort deploy_time desc
That query - correlating error spikes with deployment events, pulling in p99 latency and user impact in a single pass - would require at least three separate queries and a manual time correlation in New Relic. In Observe it is a single pipeline because deployments and spans are entities in the same graph.
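If the windowed join is the unfamiliar part, the same logic in plain Python may help. This is only a sketch of the semantics (error spans matched to a deploy from 5 minutes before to 30 minutes after it); the function name, field names, and sample data are hypothetical, and it leaves out the p99 calculation for brevity.

```python
from datetime import datetime, timedelta


def errors_per_deploy(spans, deploys,
                      before=timedelta(minutes=5),
                      after=timedelta(minutes=30)):
    """Count error spans and distinct users inside each deploy's window."""
    results = []
    for deploy in deploys:
        matched = [
            s for s in spans
            if s["service"] == deploy["service"]
            and s["status"] == "ERROR"
            and deploy["time"] - before <= s["time"] <= deploy["time"] + after
        ]
        results.append({
            "service": deploy["service"],
            "version": deploy["version"],
            "error_count": len(matched),
            "affected_users": len({s["user_id"] for s in matched}),
        })
    return results


t0 = datetime(2024, 1, 1, 12, 0)
deploys = [{"service": "payment-service", "version": "v2.14.1", "time": t0}]
spans = [
    {"service": "payment-service", "status": "ERROR", "user_id": "u1",
     "time": t0 + timedelta(minutes=3)},
    {"service": "payment-service", "status": "ERROR", "user_id": "u1",
     "time": t0 + timedelta(minutes=9)},
    {"service": "payment-service", "status": "ERROR", "user_id": "u2",
     "time": t0 + timedelta(minutes=45)},  # outside the window
    {"service": "payment-service", "status": "OK", "user_id": "u3",
     "time": t0 + timedelta(minutes=4)},   # not an error
]

summary = errors_per_deploy(spans, deploys)
```

Two of the four spans fall inside the window and both belong to the same user, so the deploy is summarized with two errors and one affected user.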
Observe vs New Relic: the dimensions that actually matter
New Relic is a good product. I have used it extensively across multiple engagements. This comparison is not about which one is better in absolute terms. It is about where the architectural differences create real operational consequences, and where the tradeoffs land for different team profiles.
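Each line gives the dimension, then New Relic, then Observe.
- Trace retention. New Relic: recurring patterns invisible. Observe: full trace history available.
- Cardinality cost. New Relic: teams throttle instrumentation. Observe: no instrumentation tax.
- Cross-signal investigation. New Relic: workable, but slow under pressure. Observe: entities linked at ingest.
- Query language. New Relic: familiar, widely understood. Observe: learning curve, more expressive.
- Data ownership. New Relic: vendor lock-in on your data. Observe: open formats, data portability.
- OpenTelemetry support. New Relic: mild translation overhead. Observe: no translation layer.
- Pricing. New Relic: hard to predict at scale. Observe: more predictable, new model to learn.
- Integrations. New Relic: broad third-party coverage. Observe: core integrations solid, breadth growing.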
The two dimensions that consistently change the calculus most for teams I talk to are trace retention and cardinality cost. The 8-day trace window is fine for straightforward deployment-regression cycles. Once your incident patterns are longer - monthly rollups, seasonal load patterns, gradual drift from a config change two weeks back - you start working around the limitation rather than with the tool.
Where New Relic still wins
Ecosystem depth and time-to-value for standard environments. If you are running a fairly conventional stack on AWS or GCP with standard services, New Relic's data apps and out-of-the-box dashboards get you to operational visibility faster. The NRQL learning curve is also lower for teams coming from a SQL background. OPAL genuinely takes time to internalize, and if you have a large team that needs to query the platform regularly, that ramp-up cost is not trivial.
The same question in both query languages
The translation between NRQL and OPAL is not always one-to-one, but for common investigative queries the intent maps cleanly. Here are three side-by-side examples that show where the syntactic differences are and why OPAL's pipeline model handles joins and aggregations differently.
Average CPU by host in production, last hour
NRQL:
SELECT average(cpuPercent) FROM SystemSample WHERE environment = 'production' SINCE 1 hour ago FACET hostname
OPAL:
filter environment = "production" and timestamp >= now() - 1h | summarize avg_cpu = avg(cpu_percent) by hostname | sort avg_cpu desc
Error rate by service, last 30 minutes
NRQL:
SELECT
  filter(count(*), WHERE error IS true) AS errors,
  percentage(count(*), WHERE error IS true) AS error_rate
FROM Transaction
WHERE appName LIKE 'api-%'
SINCE 30 minutes ago
FACET appName
OPAL:
filter service.name ~ "api-.*"
  and timestamp >= now() - 30m
| summarize
    errors = countif(span.status_code = "ERROR"),
    total = count(),
    error_rate = countif(span.status_code = "ERROR") / count() * 100
  by service.name
| filter error_rate > 0
P99 latency by endpoint over last 6 hours
NRQL:
SELECT percentile(duration, 99) FROM Transaction WHERE transactionType = 'Web' SINCE 6 hours ago TIMESERIES 10 minutes FACET request.uri
OPAL:
filter span.kind = "server"
  and timestamp >= now() - 6h
| make_col bucket = bucket(timestamp, 10m)
| summarize
    p99_ms = percentile(duration_ms, 99)
  by http.route, bucket
| sort bucket desc
If your team is sitting on a library of NRQL queries built up over years and you are evaluating Observe, the translation overhead is real. Functions map, but the pipeline model means the structure of complex queries changes meaningfully. If you are doing this at scale, you want a systematic way to translate rather than rewriting each query by hand. That is exactly the problem the Observability Query Translator was built to solve.
Translating your existing queries to OPAL
One of the practical friction points when evaluating Observe from a New Relic background is the query translation work. You have hundreds of NRQL queries embedded in dashboards, runbooks, alerts, and tribal knowledge. Rewriting them by hand is tedious and error-prone, especially for complex aggregations and conditional expressions where the syntax differences between NRQL and OPAL are non-trivial.
Observability Query Translator
Paste a query from New Relic, Datadog, Splunk, Prometheus, Grafana, or CloudWatch and get the equivalent in any other supported platform instantly. Bidirectional translation across all six platforms, with query logic and filter semantics preserved. Useful for platform evaluations, migration projects, or just working across a multi-platform environment where you need to express the same question in multiple query languages.
Example translation, NRQL to OPAL:
NRQL:
SELECT average(cpuPercent) FROM SystemSample WHERE environment = 'production' SINCE 1 hour ago
OPAL:
filter environment = "production" and timestamp >= now() - 1h | summarize avg_cpu = avg(cpu_percent)
The translator also handles Datadog to Splunk, Prometheus to CloudWatch, and every other combination across the six supported platforms. If your environment is multi-platform and you are spending time context-switching between query languages, it is worth bookmarking.
Best practices for implementing Observe
If you are moving to Observe or running a proof of concept, these are the patterns that separate teams who get value fast from teams who struggle through a six-month onboarding cycle.
Data ingestion and pipeline design
Start with OpenTelemetry instrumentation from day one. Observe is OTel-native, which means you avoid translation layers and get full-fidelity traces without fighting the platform. Use the Observe Agent for infrastructure telemetry collection rather than stitching together multiple forwarders. Consolidation reduces operational overhead and gives you a single configuration surface.
Design your OPAL transform pipelines at ingest time, not query time. Enrich logs with service context, deployment metadata, and infrastructure attributes before they land in storage. This makes interactive investigation meaningfully faster because you are not joining datasets every time you run a query. The storage cost of denormalized data is low. The query performance gain is high.
Observe does not penalize high-cardinality attributes the way cost-per-event platforms do. Index on user IDs, session tokens, transaction IDs, request paths - whatever makes your traces useful. The cultural shift this enables is significant. Engineers stop self-censoring instrumentation decisions based on cost anxiety.
Knowledge graph and entity modeling
Model your service topology explicitly in the knowledge graph from the start. Define relationships between services, deployments, infrastructure, and dependencies. This is not optional metadata. It is the structural foundation that makes cross-signal correlation automatic. A properly modeled graph means clicking from an error spike in APM to the exact deployment that caused it takes one click, not five manual queries.
Use correlation tags consistently across logs, metrics, and traces. Tag everything with service name, environment, deployment version, and cluster ID at minimum. The knowledge graph uses these tags to build the entity model. Inconsistent tagging breaks the graph and you lose the primary value proposition of the platform.
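One cheap way to keep tagging consistent is to check for the minimum tag set before telemetry leaves the service or the collector. The sketch below is a hypothetical Python check; the tag keys are my own assumed names for the four minimum tags listed above, not keys that Observe requires.

```python
# Assumed key names for the minimum tag set; adjust to your own conventions.
REQUIRED_TAGS = ("service.name", "environment", "deployment.version", "cluster.id")


def missing_tags(record: dict) -> list[str]:
    """Return the required tags that are absent or empty on a record."""
    return [tag for tag in REQUIRED_TAGS if not record.get(tag)]


complete = {
    "service.name": "payment-service",
    "environment": "prod",
    "deployment.version": "v2.14.1",
    "cluster.id": "prod-east",
}
partial = {"service.name": "payment-service", "environment": ""}

problems = missing_tags(partial)
```

Running a check like this in CI or in a collector processor catches the inconsistent tagging early, before it leaves holes in the entity model.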
Dashboard and monitoring strategy
Build dashboards around entities, not raw metrics. An entity-centric dashboard shows service health, related infrastructure, downstream dependencies, and recent deployments in a single view. This is structurally different from metric-centric dashboards where you manually assemble context every time you investigate.
Configure monitors using the entity model. A threshold monitor on a service entity automatically includes all instances of that service without manual filter updates when you scale. Use promote monitors to surface patterns from logs into alerts without writing complex aggregation queries. Observe's monitor types map directly to common SRE patterns - use them instead of forcing everything into threshold alerts.
OPAL query patterns
Learn the pipeline model early. OPAL queries are sequential transformations, not declarative statements. Each stage operates on the output of the previous stage. This makes complex queries more readable once you internalize the pattern, but it requires unlearning SQL and NRQL mental models.
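For engineers who think in SQL, it can help to see the pipeline model outside of any query language. The following Python sketch is an analogy only, with hypothetical stage names: each stage is a function that takes the previous stage's rows and returns new rows, which is the same shape as an OPAL pipeline.

```python
from collections import Counter
from functools import reduce


def filter_errors(rows):
    """Stage 1: keep only error rows."""
    return [r for r in rows if r["status"] == "ERROR"]


def count_by_service(rows):
    """Stage 2: aggregate the filtered rows into one row per service."""
    counts = Counter(r["service"] for r in rows)
    return [{"service": s, "errors": n} for s, n in counts.items()]


def sort_desc(rows):
    """Stage 3: order the aggregated rows, highest error count first."""
    return sorted(rows, key=lambda r: r["errors"], reverse=True)


def run_pipeline(rows, stages):
    """Feed each stage the output of the one before it."""
    return reduce(lambda data, stage: stage(data), stages, rows)


events = [
    {"service": "payment-service", "status": "ERROR"},
    {"service": "api-gateway", "status": "ERROR"},
    {"service": "payment-service", "status": "ERROR"},
    {"service": "payment-service", "status": "OK"},
]

result = run_pipeline(events, [filter_errors, count_by_service, sort_desc])
```

Reordering the stages changes the result, which is the key difference from a declarative SELECT where clause order is fixed by the grammar rather than by you.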
Build a team library of common OPAL patterns in the first week. Error rate by service, latency percentiles by endpoint, log volume by severity - these are queries you will write dozens of times. Standardize them early and share them across the team. Use datasets to materialize frequently-used aggregations rather than recomputing them in every dashboard.
Block dedicated time for OPAL training. Expecting engineers to learn it alongside active incident work does not work. Run a half-day workshop where the team translates your ten most common queries from NRQL or PromQL to OPAL together. This builds shared fluency faster than individual self-study.
Trace retention and long-horizon analysis
Use the 13-month trace retention for pattern analysis, not just incident response. Build dashboards that show error rate trends, latency drift, and deployment impact over weeks and months. This is analysis you could not do with 8-day retention platforms. Seasonal patterns, slow regressions, and post-deployment performance drift become visible.
Configure trace sampling at 100% for production services where trace volume is manageable. The storage cost is lower than you expect and full-fidelity traces eliminate sampling bias in incident investigations. If you need to sample, use tail-based sampling to keep all error traces and a representative sample of successful requests.
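If you do have to sample, the tail-based rule described above is simple to express. This is a minimal Python sketch of the decision, not the configuration of any particular collector; the `keep_trace` function and the span shape are hypothetical. It hashes the trace ID so the same trace always gets the same decision.

```python
import hashlib


def keep_trace(trace_id: str, spans: list[dict], success_rate: float = 0.1) -> bool:
    """Decide after the trace completes: keep all errors, sample the rest."""
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    # Map the trace ID to a stable value in [0, 1) so the decision is
    # deterministic for a given trace, whichever process evaluates it.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < success_rate


error_trace = [{"name": "POST /v2/charge", "status": "ERROR"}]
ok_trace = [{"name": "POST /v2/charge", "status": "OK"}]

always_kept = keep_trace("trace-001", error_trace, success_rate=0.0)
kept_ok = sum(keep_trace(f"trace-{i}", ok_trace) for i in range(10_000))
```

The decision runs only once the whole trace is available, which is what distinguishes tail-based from head-based sampling and is why error traces can be kept unconditionally.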
Cost management
Observe pricing is based on data volume ingested and stored, not events or spans. Compression runs at roughly 10x on typical log and trace data, which means your effective storage cost is significantly lower than the raw ingestion volume. Monitor your compression ratio in the first month to validate this holds for your telemetry profile.
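The validation step is just arithmetic, but it is worth writing down because a lower-than-expected ratio moves the number a lot. The Python sketch below uses made-up volumes (500 GB/day, 30-day retention) and hypothetical function names purely for illustration; it is not a pricing calculator.

```python
def stored_gb(raw_gb_per_day: float, retention_days: int, compression_ratio: float) -> float:
    """Steady-state stored volume for a dataset at a given compression ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_gb_per_day * retention_days / compression_ratio


def observed_ratio(raw_bytes: int, stored_bytes: int) -> float:
    """Compression ratio measured from your own telemetry."""
    return raw_bytes / stored_bytes


# Planning assumption: the roughly 10x figure holds.
planned = stored_gb(500, 30, 10.0)

# What the first month showed in this made-up example: 6x, not 10x.
actual = stored_gb(500, 30, observed_ratio(600, 100))
```

In this example the stored volume comes out at 2,500 GB against a planned 1,500 GB, which is the kind of gap you want to find in month one and not at renewal.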
Use data retention policies to age out low-value telemetry. Keep full-fidelity traces for 13 months, but you may not need debug-level logs past 30 days. Configure retention by dataset rather than applying a single global policy. High-value signals stay longer. Low-value signals age out faster.
Your data is stored in Apache Iceberg on cloud object storage in open formats. If you ever need to move platforms or run external analysis, you own the data and can export it without vendor permission. This is a meaningful difference from platforms where your telemetry is locked in proprietary indexes.
Who Observe is and is not a fit for
Observe is a fit if:
- Your trace retention window is actively limiting incident investigations
- High-cardinality instrumentation is being throttled to manage costs
- You are running OTel-instrumented services and want a native-first platform
- Long-retention trace analysis (seasonal, monthly patterns) matters to your SRE practice
- Data portability is a requirement and vendor lock-in on storage is a concern
- Your observability costs have become a regular planning conversation
Observe is not a fit if:
- Your team is deeply NRQL-fluent and the switch cost is organizationally expensive
- You need broad out-of-the-box integrations fast and standard AWS/GCP coverage is not enough
- Your incident patterns are short-cycle and 8-day trace retention is genuinely sufficient
- You have a large non-technical stakeholder base where New Relic's UI familiarity matters
- Time-to-value is the overriding constraint and onboarding bandwidth is limited
The honest version: Observe is a better architectural bet for where observability is going. OTel-native, open storage formats, knowledge graph correlation, long-retention traces - these are the right foundations. New Relic is a better tactical choice for teams that need broad coverage fast and have significant existing investment in the NRQL ecosystem. Both are real answers depending on where your team sits.
If your engineers have started asking "does this attribute cost money" before adding instrumentation, or if you have had to build compensating workarounds because 8-day trace retention is not covering your incident patterns - those are the two strongest signals that Observe is worth a serious evaluation. Not a demo. An actual proof of concept with your own telemetry data.
Where to go from here
If you are evaluating Observe, the most useful thing you can do is run a 30-day proof of concept with your actual telemetry rather than synthetic data. The knowledge graph value is not obvious until your own service topology is in it. The trace retention value is not obvious until you have an incident where you actually need three weeks of trace history and it is just there.
The OPAL learning curve is real. Block time for it rather than expecting engineers to absorb it alongside active work. Build a team cheatsheet of common OPAL patterns for your domain in the first week. And if you are coming from NRQL and need to translate a library of existing queries, the Observability Query Translator can take a lot of that mechanical work off the table.
The platform is genuinely worth the evaluation if the architectural fit is right. Whether it is worth the switching cost depends entirely on which of the friction points above you are actually hitting today.