Production · Microsoft Fabric · 40M txn/day

Catch what your
dashboards miss.

An AI-powered anomaly detection system built for the SRE Observability team, processing 40 million transactions a day with real-time streaming inference, pattern classification, and plain-English insights that route straight to your on-call queue.


<12s

Mean time to detect

down from 8-12 minutes

40M

Transactions per day

production throughput

2.1%

False positive rate

vs 5% threshold

99.97%

SLO uptime · 30d

target: 99.95%

The use case

8 engineers. 40 million transactions. One on-call queue.

The SRE Observability team owns reliability for a distributed payment and transaction platform processing 40M events per day. Six critical metric streams (transaction latency, error rate, throughput, auth service performance, database connections, and queue depth) all need to be monitored continuously.

Static alert thresholds set months ago were failing the team. Real incidents (subtle p99 regressions, creeping error rates, queue depth drifts) went undetected until customers noticed. Meanwhile, false positives from normal traffic spikes were waking engineers unnecessarily.

The team needed a system that adapted to traffic patterns, classified what it found, routed alerts intelligently by severity, and explained everything in plain English, without requiring anyone on the team to maintain model parameters.

Before: static threshold monitoring

✗ Thresholds set once, never updated as traffic evolved

✗ No pattern classification, just 'threshold breached'

✗ False positives woke engineers 3-5 nights per week

✗ Mean time to detect: 8-12 minutes after incident start

✗ On-call received raw metric values, no context or guidance

After: AI-powered anomaly detection

✓ Dynamic 2σ control bands adapt to hourly and weekly patterns

✓ Spike, drop, level shift, and sustained drift classified per event

✓ 91% reduction in false positive alert volume

✓ Mean time to detect under 12 seconds from signal arrival

✓ Plain-English root cause hypothesis delivered with every alert

How it works

From raw signal to routed alert
in under two seconds.

Signal arrives

A transaction event lands in the streaming pipeline. The detector begins scoring immediately: no polling, no delay.

T+0s

Anomaly scored

Z-score and anomaly confidence computed in under 10ms. Point flagged at 89% confidence as an outlier vs the 2σ control band.

T+0.01s
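
The scoring step above can be sketched in a few lines of Python. This is a minimal illustration, not the production detector: the function name `score_point`, the fixed reference window, and the `k` parameter are assumptions made for the example. The page does not say how the anomaly confidence is derived, so it is left out here.

```python
import math
from statistics import mean, stdev


def score_point(window, value, k=2.0):
    """Score one observation against a k-sigma control band.

    `window` is a list of recent, presumed-normal observations
    (hypothetical input; at least two values are required).
    """
    mu = mean(window)
    sigma = stdev(window)  # sample standard deviation
    if sigma == 0:
        # Flat baseline: any deviation at all is infinitely many sigmas away.
        z = 0.0 if value == mu else math.copysign(math.inf, value - mu)
    else:
        z = (value - mu) / sigma
    return {
        "z": z,
        "lower": mu - k * sigma,
        "upper": mu + k * sigma,
        "is_outlier": abs(z) > k,
    }


# Example: a latency sample far above a stable baseline.
result = score_point([100, 102, 98, 101, 99], 110)
```

With `k=2.0` the band matches the 2σ control band named above; a point is flagged only when its z-score falls outside that band.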

Pattern classified

Spike, drop, level shift, or sustained drift, identified by temporal shape. Not just 'something is wrong' but what kind of wrong.

T+0.02s
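
One rough way to tell the four shapes apart is to look at how many recent points breach the band and whether the series moves steadily in one direction. The rules below and the name `classify_pattern` are illustrative assumptions, not the platform's classifier. Real traffic is noisy, so a production version would likely need smoothing and tuned window lengths.

```python
def classify_pattern(recent, mu, sigma, k=2.0):
    """Label the shape of `recent` values relative to a baseline.

    `mu` and `sigma` describe normal behaviour and are assumed to be
    known already. Returns one of: "normal", "spike", "drop",
    "level shift", "sustained drift", "unclassified".
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    zs = [(v - mu) / sigma for v in recent]
    breaches = [z for z in zs if abs(z) > k]
    if not breaches:
        return "normal"
    # A single excursion outside the band is a spike or a drop.
    if len(breaches) == 1:
        return "spike" if breaches[0] > 0 else "drop"
    # Breaches on both sides of the band do not fit any of the four shapes.
    if not (all(z > 0 for z in breaches) or all(z < 0 for z in breaches)):
        return "unclassified"
    # Strictly one-directional movement across the whole window reads as drift.
    steps = [b - a for a, b in zip(zs, zs[1:])]
    if all(s > 0 for s in steps) or all(s < 0 for s in steps):
        return "sustained drift"
    # Otherwise the series jumped and stayed at a new level.
    return "level shift"
```

The strict monotonicity check is deliberately simple; it will miss a drift that contains even one small reversal.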

Insight generated

Interpretability layer translates the detection into a plain-English root cause hypothesis. Actionable without an ML background.

T+0.05s

Alert routed

P1 and P2 alerts page the PagerDuty on-call, post to MS Teams, and open a ServiceNow ticket. P3 alerts go to Teams and email. All channels confirmed in under 2 seconds.

T+2s
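
The severity-to-channel mapping described in this step can be expressed as a small lookup table. The sketch below is an assumption about shape only: `ROUTES`, `route_alert`, and the `senders` callables are hypothetical names, and no real PagerDuty, Teams, or ServiceNow API is called.

```python
# Severity -> channels, following the routing rule described in this step.
ROUTES = {
    "P1": ["pagerduty", "teams", "servicenow"],
    "P2": ["pagerduty", "teams", "servicenow"],
    "P3": ["teams", "email"],
}


def route_alert(severity, message, senders):
    """Send `message` to every channel mapped to `severity`.

    `senders` maps a channel name to a callable that takes the message
    and returns True once delivery is confirmed (hypothetical stand-ins
    for the real integrations).
    """
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity!r}")
    return {channel: senders[channel](message) for channel in ROUTES[severity]}


# Usage with stand-in senders that only record what they were asked to send.
outbox = []
fake_senders = {
    name: (lambda msg, name=name: outbox.append((name, msg)) or True)
    for name in ("pagerduty", "teams", "servicenow", "email")
}
confirmations = route_alert("P3", "Queue depth drifting upward", fake_senders)
```

Keeping the mapping in data rather than in branching code means a routing change is a one-line edit to the table.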

Capabilities

Powerful detection.
Built for real SRE teams.

⚡

Real-time streaming inference

Every transaction is scored as it arrives. No batching windows, and sub-10ms p99 scoring latency across 40M daily events.
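
Scoring each event as it arrives requires statistics that update in constant time. One common technique for that is Welford's online algorithm, sketched below. The page does not state which method the detector uses, so treat this as an illustration; the class name `RunningStats` is hypothetical.

```python
import math


class RunningStats:
    """Incrementally track mean and variance with Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two observations exist."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std(self):
        return math.sqrt(self.variance)


stats = RunningStats()
for latency_ms in [100, 102, 98, 101, 99]:
    stats.update(latency_ms)
```

Each update touches three numbers and stores no history, which is what makes per-event scoring feasible without a batching window.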

🔍

Interpretability by default

Model outputs translate directly into plain-English insights your on-call engineer can act on at 2am without a data science background.

🎚️

Configurable sensitivity

One slider between conservative and aggressive detection. Tuned to your alert fatigue tolerance without touching a model parameter.
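
A single slider could, for instance, be mapped linearly onto the width of the control band. This is purely an assumption about how such a control might work: the function name `band_multiplier` and the endpoint values 3.0 and 1.5 are illustrative and do not come from the product.

```python
def band_multiplier(slider, conservative_k=3.0, aggressive_k=1.5):
    """Map a slider position in [0, 1] to a sigma multiplier.

    0.0 is the most conservative setting (widest band, fewest alerts);
    1.0 is the most aggressive (narrowest band, most alerts).
    Both endpoint values are hypothetical defaults.
    """
    if not 0.0 <= slider <= 1.0:
        raise ValueError("slider must be between 0 and 1")
    return conservative_k + slider * (aggressive_k - conservative_k)
```

Under these assumed endpoints, a slider position of two thirds yields a multiplier of 2.0, which corresponds to a 2σ band.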

📊

2σ control bands

Dynamic Shewhart control charts adapt to traffic patterns in real time, with no static thresholds to maintain, ever.
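
One simple way to make a control band follow hourly and weekly patterns is to keep a separate baseline for each hour of the week. The sketch below builds such bands from historical samples in a batch, which is a simplification of the real-time behaviour described here. The names `hour_of_week` and `build_bands` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def hour_of_week(ts):
    """Return a slot index from 0 (Monday 00:00) to 167 (Sunday 23:00)."""
    return ts.weekday() * 24 + ts.hour


def build_bands(history, k=2.0):
    """Compute a (lower, upper) k-sigma band for each hour-of-week slot.

    `history` is an iterable of (datetime, value) pairs. Slots with
    fewer than two samples are skipped because no spread can be estimated.
    """
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[hour_of_week(ts)].append(value)
    bands = {}
    for slot, values in buckets.items():
        if len(values) < 2:
            continue
        mu, sigma = mean(values), stdev(values)
        bands[slot] = (mu - k * sigma, mu + k * sigma)
    return bands


# Three Monday 09:00 samples and two Sunday 03:00 samples.
history = [
    (datetime(2024, 1, 1, 9), 100),
    (datetime(2024, 1, 8, 9), 110),
    (datetime(2024, 1, 15, 9), 90),
    (datetime(2024, 1, 7, 3), 10),
    (datetime(2024, 1, 14, 3), 12),
]
bands = build_bands(history)
```

Because Monday morning and Sunday night get separate bands, a value that is normal at peak can still be flagged when it appears off-peak.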

🧠

Pattern classification

Differentiates spikes, drops, level shifts, and sustained drifts, giving context that a single threshold alarm never could.

🏗️

Native to Microsoft Fabric

Deployed on a Microsoft Fabric F64 SKU in East US 2. Security, governance, and workspace isolation baked in from day one.

Detection Architecture

How 40M daily metrics stream through the inference engine and route to the team.

📡

1. Telemetry Ingestion

At 40M events/day, a continuous stream of metrics (latency, error rates, throughput, connection pools) is ingested via Kafka and pre-processed in sub-millisecond time.

🧠

2. Inference Engine

Streaming analytics calculate dynamic 2-sigma control bands on the fly. Spikes, drops, and slow drifts are instantly flagged via confidence scores without relying on static thresholds.

⚡

3. Contextual Routing

The engine translates numeric anomalies into plain-English root cause hypotheses, classifying impact and directly paging the right on-call engineer with full context in under 2 seconds.

Production dashboard

The live Fabric deployment.
Your team's actual tool.

Adjust sensitivity, deep-dive into any metric with the detail drawer, inject a fault to test the full detection pipeline, then click an alert to watch it route through PagerDuty, Teams, ServiceNow, and email in real time.

SRE Observability · LIVE

P1: 1 · P2: 1

System health · Production

Operational

All signals within control limits

Today's transactions

16,400,000

41.0% of 40M target

SLO uptime · 30d

99.97%

Target: 99.95%

Mean time to detect

<12s

Prev: 8–12 min

↓ 98% improvement

False positive rate

2.1%

Threshold: 5%

↓ 91% vs static rules

Detection sensitivity

Aggressive

Fault injection

Select a metric to simulate a fault and trace the routing pipeline

Txn Latency · 85ms · Healthy · p99 · -3.3
Error Rate · 0.083% · Healthy · SLO · +0.0
Throughput · 455 · Healthy · capacity · -14.6
Auth Service · 44ms · Healthy · p99 · +2.6
DB Connections · 313 · Healthy · pool · -2.9
Queue Depth · 1,282 · Healthy · lag · +12.4

Alert routing channels

PagerDuty · On-call escalation · P1, P2 · Connected
MS Teams · #sre-alerts channel · P1, P2, P3 · Connected
ServiceNow · Auto-ticket INC creation · P1, P2 · Connected
Email · SRE team DL + manager · P2, P3 · Connected

Alert feed

Click any alert to trace its pipeline

P1 · Txn Latency · Spike · 5:42:08 PM
DB read replica lag on txn-replica-02
score 88% · ✓ Routed

P2 · DB Connections · Level shift · 5:34:08 PM
Connection pool exhaustion approaching on primary shard
score 65% · ✓ Routed

P3 · Queue Depth · Sustained drift · 5:18:08 PM
Dead letter queue filling; downstream processor rejecting messages
score 48% · ✓ Routed

Detector configuration

Workspace: sre-observability

Algorithm: Statistical ML + Z-score

Inference: Real-time streaming

p99 latency: <10ms

Volume: 40M txn / day

Team: SRE Observability (8 eng)

Capacity: Microsoft Fabric F64 SKU

Region: East US 2

Refreshed: 5:46:08 PM

SRE Observability Platform · Microsoft Fabric
