TechAni
Production · Microsoft Fabric · 40M txn/day

Catch what your
dashboards miss.

An AI-powered anomaly detection system built for the SRE Observability team, processing 40 million transactions a day with real-time streaming inference, pattern classification, and plain-English insights that route straight to your on-call queue.

<12s
Mean time to detect
down from 8-12 minutes
40M
Transactions per day
production throughput
2.1%
False positive rate
vs 5% threshold
99.97%
SLO uptime · 30d
target: 99.95%
The use case

8 engineers. 40 million transactions. One on-call queue.

The SRE Observability team owns reliability for a distributed payment and transaction platform processing 40M events per day. Six critical metric streams (transaction latency, error rate, throughput, auth service performance, database connections, and queue depth) all need to be monitored continuously.

Static alert thresholds set months ago were failing the team. Real incidents (subtle p99 regressions, creeping error rates, queue depth drifts) went undetected until customers noticed. Meanwhile, false positives from normal traffic spikes were waking engineers unnecessarily.

The team needed a system that adapted to traffic patterns, classified what it found, routed alerts intelligently by severity, and explained everything in plain English, without requiring anyone on the team to maintain model parameters.

Before: static threshold monitoring
Thresholds set once, never updated as traffic evolved
No pattern classification, just 'threshold breached'
False positives woke engineers 3-5 nights per week
Mean time to detect: 8-12 minutes after incident start
On-call received raw metric values, no context or guidance
After: AI-powered anomaly detection
Dynamic 2σ control bands adapt to hourly and weekly patterns
Spike, drop, level shift, and sustained drift classified per event
91% reduction in false positive alert volume
Mean time to detect under 12 seconds from signal arrival
Plain-English root cause hypothesis delivered with every alert
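To make the "dynamic 2σ control bands" idea concrete, here is a minimal sketch of one way bands could follow hourly and weekly patterns: keep a separate mean and standard deviation for each hour-of-week bucket. The bucketing scheme and the names `hour_of_week`, `build_bands`, and `is_outlier` are assumptions for illustration, not the production detector.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def hour_of_week(ts: datetime) -> int:
    """Bucket index 0..167: Monday 00:00 is 0, Sunday 23:00 is 167."""
    return ts.weekday() * 24 + ts.hour


def build_bands(history, k=2.0):
    """Map each hour-of-week bucket to a (lower, upper) band of mean +/- k*sigma.

    `history` is an iterable of (timestamp, value) pairs. Buckets with fewer
    than two samples are skipped because sigma is not meaningful for them.
    """
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[hour_of_week(ts)].append(value)

    bands = {}
    for bucket, values in buckets.items():
        if len(values) < 2:
            continue
        mu, sigma = mean(values), pstdev(values)
        bands[bucket] = (mu - k * sigma, mu + k * sigma)
    return bands


def is_outlier(bands, ts, value):
    """True when value falls outside the band for its hour-of-week bucket."""
    band = bands.get(hour_of_week(ts))
    if band is None:
        return False  # no baseline yet for this bucket (assumed: do not alert)
    lower, upper = band
    return value < lower or value > upper
```

With this layout, a latency of 100ms can be normal on a Monday morning and an outlier at 3am, which is the behaviour a single static threshold cannot express.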
How it works

From raw signal to routed alert
in under two seconds.

01
Signal arrives
A transaction event lands in the streaming pipeline. The detector begins scoring immediately, with no polling and no delay.
T+0s
02
Anomaly scored
Z-score and anomaly confidence are computed in under 10ms. In this example, the point is flagged at 89% confidence as an outlier against the 2σ control band.
T+0.01s
03
Pattern classified
Spike, drop, level shift, or sustained drift, identified by temporal shape. Not just 'something is wrong' but what kind of wrong.
T+0.02s
04
Insight generated
Interpretability layer translates the detection into a plain-English root cause hypothesis. Actionable without an ML background.
T+0.05s
05
Alert routed
P1/P2 fires PagerDuty on-call + MS Teams + ServiceNow ticket. P3 routes to Teams + email. All channels confirmed in under 2 seconds.
T+2s
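The routing rule in step 05 can be sketched as a small severity-to-channel table. This is an illustrative sketch only: the `Alert` class, the `ROUTES` table, the channel keys, and the `send` callback are hypothetical names, and no real PagerDuty, Teams, or ServiceNow API is called here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Severity -> channels, following the step 05 description (assumed channel keys).
ROUTES: Dict[str, List[str]] = {
    "P1": ["pagerduty", "teams", "servicenow"],
    "P2": ["pagerduty", "teams", "servicenow"],
    "P3": ["teams", "email"],
}


@dataclass
class Alert:
    severity: str  # "P1", "P2", or "P3"
    metric: str    # e.g. "Txn Latency"
    insight: str   # plain-English root cause hypothesis


def route_alert(alert: Alert, send: Callable[[str, Alert], bool]) -> List[str]:
    """Send the alert to every channel for its severity.

    `send(channel, alert)` is a caller-supplied function that delivers the
    alert and returns True once the channel confirms receipt. The return
    value lists the channels that confirmed.
    """
    if alert.severity not in ROUTES:
        raise ValueError(f"unknown severity: {alert.severity}")
    return [channel for channel in ROUTES[alert.severity] if send(channel, alert)]
```

Keeping the table as data rather than branching logic means a new channel or severity level is a one-line change, and the confirmed-channel list gives the caller something to check against a delivery deadline.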
Capabilities

Powerful detection.
Built for real SRE teams.

Real-time streaming inference
Every transaction is scored as it arrives. No batching windows, and sub-10ms p99 scoring latency across 40M daily events.
🔍
Interpretability by default
Model outputs translate directly into plain-English insights your on-call engineer can act on at 2am without a data science background.
🎚️
Configurable sensitivity
One slider between conservative and aggressive detection. Tuned to your alert fatigue tolerance without touching a model parameter.
📊
2σ control bands
Dynamic Shewhart control charts adapt to traffic patterns in real time, with no static thresholds to maintain, ever.
🧠
Pattern classification
Differentiates spikes, drops, level shifts, and sustained drifts, giving context that a single threshold alarm never could.
🏗️
Native to Microsoft Fabric
Deployed on a Microsoft Fabric F64 SKU in East US 2. Security, governance, and workspace isolation baked in from day one.
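As a rough illustration of classifying by temporal shape, the sketch below labels a window of recent z-scores as one of the four patterns named above. The rules are a simple heuristic chosen for this example (short excursions are spikes or drops, an abrupt step is a level shift, a gradual ramp is a sustained drift); the function name `classify` and the parameters `k` and `max_spike_len` are assumptions, not the detector's actual logic.

```python
from typing import List, Optional


def classify(zs: List[float], k: float = 2.0, max_spike_len: int = 2) -> Optional[str]:
    """Label a window of z-scores, oldest first.

    Returns None when every point is inside the +/- k band.
    """
    out = [i for i, z in enumerate(zs) if abs(z) > k]
    if not out:
        return None

    # A brief excursion: only a point or two outside the band.
    if len(out) <= max_spike_len:
        return "spike" if zs[out[-1]] > 0 else "drop"

    # Many out-of-band points: was the move one abrupt step or a slow ramp?
    steps = [abs(b - a) for a, b in zip(zs, zs[1:])]
    if max(steps) > k:
        return "level shift"
    return "sustained drift"
```

For example, `classify([0.0, 0.2, -0.1, 3.1, 3.0, 3.2, 2.9])` returns `"level shift"` because the series jumps by more than `k` in one step and then stays outside the band.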

Detection Architecture

How 40M daily events stream through the inference engine and route to the team.

📡

1. Telemetry Ingestion

At 40M events/day, a continuous stream of metrics (latency, error rates, throughput, connection pools) is ingested via Kafka and pre-processed in sub-millisecond time.

🧠

2. Inference Engine

Streaming analytics calculate dynamic 2σ control bands on the fly. Spikes, drops, and slow drifts are instantly flagged via confidence scores without relying on static thresholds.
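One common way to compute control bands "on the fly" without storing history is Welford's online algorithm for running mean and variance. The sketch below scores each point against the statistics of the points before it and flags anything beyond 2σ. It is a minimal example under stated assumptions: the class name `StreamingZScore` and the `warmup` parameter are hypothetical, and the confidence score is left out because the page does not define how it is derived.

```python
import math


class StreamingZScore:
    """Running mean/variance (Welford) with a k-sigma outlier flag."""

    def __init__(self, k: float = 2.0, warmup: int = 30):
        self.k = k            # band width in standard deviations
        self.warmup = warmup  # points to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # sum of squared deviations from the mean

    def score(self, x: float):
        """Return (z, flagged) for x, then fold x into the running stats."""
        z = 0.0
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))  # sample standard deviation
            if std > 0:
                z = (x - self.mean) / std
        flagged = self.n >= self.warmup and abs(z) > self.k

        # Welford update: O(1) time and memory per event.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return z, flagged
```

Note that this version folds every point into the baseline, including flagged ones, so a long incident would gradually widen the band. A real deployment would likely exclude or down-weight flagged points, or use a decaying window.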

3. Contextual Routing

The engine translates numeric anomalies into plain-English root cause hypotheses, classifying impact and directly paging the right on-call engineer with full context in under 2 seconds.

Production dashboard

The live Fabric deployment.
Your team's actual tool.

Adjust sensitivity, deep-dive into any metric with the detail drawer, inject a fault to test the full detection pipeline, then click an alert to watch it route through PagerDuty, Teams, ServiceNow, and email in real time.

SRE Observability · LIVE
P1: 1 · P2: 1
System health · Production
Operational
All signals within control limits
Today's transactions
16,400,000
41.0% of 40M target
SLO uptime · 30d
99.97%
Target: 99.95%
Mean time to detect
<12s
Prev: 8–12 min
↓ 98% improvement
False positive rate
2.1%
Threshold: 5%
↓ 91% vs static rules
Detection sensitivity
Aggressive
Fault injection
Select a metric to simulate a fault and trace the routing pipeline
Txn Latency · 90ms · Healthy · p99 · +4.7
Error Rate · 0.083% · Healthy · SLO · -0.0
Throughput · 462 · Healthy · capacity · +9.2
Auth Service · 40ms · Healthy · p99 · -1.4
DB Connections · 319 · Healthy · pool · +3.5
Queue Depth · 1,228 · Healthy · lag · -12.4
Alert routing channels
PagerDuty · On-call escalation · P1, P2 · Connected
MS Teams · #sre-alerts channel · P1, P2, P3 · Connected
ServiceNow · Auto-ticket INC creation · P1, P2 · Connected
Email · SRE team DL + manager · P2, P3 · Connected
Alert feed
Click any alert to trace its pipeline
15
P1 · Txn Latency · Spike · 2:32:04 PM
TLS handshake overhead elevated; cert rotation in progress?
score 88% · ✓ Routed
P2 · DB Connections · Level shift · 2:24:04 PM
Long-running query holding locks; TXID 48291 suspect
score 65% · ✓ Routed
P3 · Queue Depth · Sustained drift · 2:08:04 PM
Dead letter queue filling; downstream processor rejecting messages
score 48% · ✓ Routed
Detector configuration
Workspace: sre-observability
Algorithm: Statistical ML + Z-score
Inference: Real-time streaming
p99 latency: <10ms
Volume: 40M txn / day
Team: SRE Observability (8 eng)
Capacity: Microsoft Fabric F64 SKU
Region: East US 2
Refreshed: 2:36:04 PM
SRE Observability Platform · Microsoft Fabric
Anomaly Detector · SRE Observability Platform