TechAni

Agentic AI for the enterprise.

Insights · Agentic AI · March 21, 2026 · Edit Assist: AniBot powered by Claude

The term "AI agent" gets used for everything from a chatbot with one tool call to a system autonomously managing production infrastructure across a dozen APIs. That is a broad space, with large gaps between one end and the other. This insight breaks down what an agent actually is, how to build one that works for you, and what the pattern looks like when you deploy it in healthcare IT and banking operations today.

I have been building and evaluating agentic systems across SRE, observability, and platform engineering contexts long enough to have opinions about where the pattern works and where it breaks down. This is not a vendor pitch or a research paper. It is a ground-level guide to the architecture, the deployment realities, and the eval criteria that actually tell you whether an agent is helping or quietly making things worse.

What an AI Agent Actually Is

A language model on its own produces a response. An agent is what you get when you wrap that model in an execution loop.

  • The loop is the point. The agent perceives state, reasons about the next move, executes through a tool, observes the result, and decides what to do next.
  • A prompt stops after one answer. An agent continues until a goal is met, a confidence threshold is breached, or a human steps in.
  • That changes the automation boundary. You can automate tasks that require sequential decisions against changing system state, not just tasks with one fixed response.
The agent loop

  • Perceive: read state from tools, APIs, memory
  • Reason: LLM decides what action to take
  • Act: execute a tool call (API, command, query)
  • Observe: capture result, check expected outcome
  • Loop or halt: continue, escalate, or close

The loop runs until goal completion, confidence breach, or human intervention. That is the whole model.
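The loop can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Decision`, `run_agent`, `fake_reason`, and the `get_error_rate` tool are hypothetical names, the model is replaced by a stub function, and the 0.75 floor and 10-step budget are assumed defaults.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Decision:
    action: str                       # "call_tool", "escalate", or "done"
    tool: str = ""
    args: dict = field(default_factory=dict)
    confidence: float = 1.0


def run_agent(reason: Callable[[list], Decision], tools: dict,
              confidence_floor: float = 0.75, max_steps: int = 10) -> tuple[str, list]:
    """Loop until the goal is met, confidence drops below the floor, or the step budget runs out."""
    history: list = []                                   # perceived state fed back to the model
    for _ in range(max_steps):
        decision = reason(history)                       # Reason: pick the next move
        if decision.action == "done":
            return "closed", history
        if decision.action == "escalate" or decision.confidence < confidence_floor:
            return "escalated", history                  # hand off to a human
        result = tools[decision.tool](**decision.args)   # Act: execute the tool call
        history.append({"tool": decision.tool, "result": result})  # Observe
    return "escalated", history                          # budget exhausted: never loop forever


def fake_reason(history: list) -> Decision:
    """Stand-in for the LLM: check the error rate once, then close."""
    if not history:
        return Decision("call_tool", tool="get_error_rate",
                        args={"service": "patient-portal-api"}, confidence=0.9)
    return Decision("done")


status, trace = run_agent(fake_reason, {"get_error_rate": lambda service: 0.08})
```

The step budget is a design choice worth keeping even in a sketch: a loop with no upper bound turns a confused model into a runaway process.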

The Four Components Every Agent Needs

Strip away the marketing and every production agent has four essential layers. If one of these is missing, you do not have a production agent. You have a prompt with extra packaging.

Layer 01

Language Model

The model handles reasoning, decision-making, and response generation. It interprets context, chooses the next step, and produces structured output.

Layer 02

Tool Layer

The tool layer lets the model act on external systems. That can mean querying a database, calling an API, running a command, or writing a file.

Layer 03

Memory Layer

The memory layer gives the model context beyond the immediate prompt, including past actions, retrieved documents, and structured state.

Layer 04

Orchestration Layer

The orchestration layer runs the loop, sequences tool calls, enforces guardrails, and decides when a human has to stay in the decision.

Why not a script or rule engine

Scripts execute fixed steps. Rule engines match fixed patterns. Neither handles ambiguity, missing context, or cases outside their predefined logic. An agent can reason over partial information, call additional tools to fill gaps, and make a judgment call with a confidence score attached. It also knows when to stop and ask a human. That is the capability gap that makes the architecture worth the complexity cost.

How Tool Calls Actually Work

The model does not directly execute code. It proposes an action and the orchestration layer decides whether and how to run it.

  • Step 1. The model emits a structured tool call with the function name and parameters.
  • Step 2. The orchestration layer validates the request and executes the real function against the live system.
  • Step 3. The result is captured and written back into the model context.
  • Step 4. The model reasons over that result and decides whether to continue, escalate, or stop.
  • Contracts matter. Tool inputs should be typed and validated before execution, and tool outputs should come back in a structured schema. That is what makes agent behavior auditable, interoperable, and safe to evolve without breaking downstream steps.
  • Latency note. Independent tool calls should run in parallel. That is one of the main levers for reducing end-to-end latency.
Parallel async tool calls / Python
import asyncio

# Fire all enrichment calls simultaneously, not sequentially
# Sequential would add 6-9 seconds of latency per incident
# The three tool coroutines are assumed to be provided by the tool layer
async def enrich(incident):
    catalog, anomalies, history = await asyncio.gather(
        get_service_catalog(service="patient-portal-api"),
        get_recent_anomalies(service="patient-portal-api", window="30m"),
        search_incident_history(title_similar=incident.title, limit=5),
    )
    # All three resolve before the LLM sees any of it
    # catalog: owner, slo_target, downstream deps
    # anomalies: error_rate, p99, spike_start
    # history: up to 5 similar past incidents, their resolution paths
    return {"catalog": catalog, "anomalies": anomalies, "history": history}
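The "contracts matter" point above can be made concrete with a small validation step between the model's proposed call and the real function. The sketch below is an assumption about how such a check might look: `ToolSpec`, `REGISTRY`, `execute_tool_call`, and the simplified `get_recent_anomalies` signature are hypothetical names, not part of any specific framework.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    params: dict                      # parameter name -> required Python type
    func: Callable[..., Any]


def get_recent_anomalies(service: str, window_minutes: int) -> dict:
    # Hypothetical tool; a real one would query the observability platform
    return {"service": service, "window_minutes": window_minutes, "anomalies": []}


REGISTRY = {
    "get_recent_anomalies": ToolSpec(
        "get_recent_anomalies",
        {"service": str, "window_minutes": int},
        get_recent_anomalies,
    ),
}


def execute_tool_call(call: dict) -> dict:
    """Validate a model-proposed call against its contract, then run it."""
    spec = REGISTRY.get(call.get("name"))
    if spec is None:
        return {"ok": False, "error": f"unknown tool: {call.get('name')}"}
    args = call.get("arguments", {})
    if set(args) != set(spec.params):
        return {"ok": False, "error": "argument names do not match the contract"}
    for key, expected in spec.params.items():
        if not isinstance(args[key], expected):
            return {"ok": False, "error": f"{key} must be {expected.__name__}"}
    # Structured envelope: downstream steps can rely on the same shape every time
    return {"ok": True, "tool": spec.name, "result": spec.func(**args)}
```

Rejections come back as structured results rather than exceptions so the model can see the error and correct its next call.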

The Human-in-the-Loop Question: Where Does the Human Stay?

Good agent design is explicit about where humans stay in the decision. The split should be clear enough that an operator can explain it in one minute.

  • Human decides. Use this mode for irreversible actions, high-stakes outcomes, and low-confidence outputs. The agent prepares the case and the human makes the decision.
  • Agent executes. Use this mode for well-understood, reversible, high-confidence actions. Humans review through audit rather than blocking every step.
  • Policy lives in code. The threshold between those two modes should be a governed policy enforced in code, not a prompt instruction.
On confidence thresholds

Every production agent needs an explicit confidence floor below which it escalates to a human rather than acting. This threshold is a business policy, not a model hyperparameter. It should be set by the operations team, versioned, audited, and reviewed when the false positive or false negative rate drifts. If you cannot answer "what is your agent's confidence threshold for autonomous action and who owns that decision," the agent is not production-ready.

Eval Is Not Optional

Agents that are not continuously evaluated drift. The model changes across versions, data distribution shifts, and edge cases accumulate. Every production agent needs a small set of metrics tracked on a regular cadence.

  • Task quality. Track classification accuracy, task completion rate, or resolution quality based on the job the agent is doing.
  • Hallucination rate. Measure how often the agent states facts that are not present in its source data, retrieved context, or tool outputs.
  • Latency. Measure elapsed time from trigger to action so you know whether the agent is useful in real operational conditions.
  • Human usefulness. Capture a simple quality score from the person receiving the output so you know whether the response was actually helpful.

Without these measures, you do not know whether the agent is helping or quietly causing problems.

Why Retrieval and Data Pipelines Matter

Most failures that look like model failures are actually context failures. The model cannot reason over evidence that was never extracted, cleaned, tagged, or retrieved correctly.

  • Retrieval is not memory. Memory retains prior actions, state, and decisions. Retrieval fetches external evidence on demand: runbooks, incident history, policy documents, CMDB records, and knowledge bases.
  • Bad context poisons good reasoning. Stale runbooks, noisy incident history, or an incomplete service catalog will drive bad decisions for understandable reasons.
  • A pipeline has to do real work. Production retrieval depends on source extraction, cleanup, chunking, metadata, freshness checks, access control, and traceability to the exact source and version used at run time.
  • This is why RAG projects often disappoint. Teams blame the model when the real issue is that the data layer was never made usable, current, or trustworthy enough for production decisions.
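As a rough illustration of the pipeline checks listed above, the sketch below filters retrieved chunks by freshness and access control and emits a citation that points at the exact source and version. The `Chunk` fields, the 90-day freshness window, and the role model are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str                # e.g. a runbook path or CMDB record id
    version: str
    updated_at: datetime
    allowed_roles: frozenset


def usable_chunks(chunks, role, now, max_age=timedelta(days=90)):
    """Drop chunks that are stale or that this caller is not allowed to read."""
    return [
        c for c in chunks
        if role in c.allowed_roles and now - c.updated_at <= max_age
    ]


def citations(chunks):
    """Record the exact source and version used at run time."""
    return [f"{c.source}@{c.version}" for c in chunks]
```

Filtering before the model sees anything is the point: a stale runbook that never reaches the context cannot drive a bad decision.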

Building Agent Infrastructure at Commercial Scale

Once you move beyond one or two isolated agents, the real work shifts from prompt design to platform design.

  • Shared control plane. Commercial scale means dozens of workflows, many teams, strict audit requirements, and hard limits on latency and cost. You need common model routing, tool authentication, state management, policy enforcement, tracing, and release management.
  • Production resilience. Tool calls fail, APIs rate-limit, queues back up, and workflows stop halfway through. The runtime needs idempotent actions, resumable state, retry policy, dead-letter handling, and a complete action log.
  • Clear ownership split. The platform team should own the runtime, SDK, eval harness, telemetry, and tool contracts. The domain team should own thresholds, approval rules, escalation paths, and success criteria.
  • Agent observability. You need traces for each run, timing for every tool call, retrieval hit rates, queue depth, retry counts, human override rate, and policy rejection rate. You also need structured capture of prompt version, context, tool outputs, model response, confidence, final action, and downstream result.
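The resilience requirements above (idempotent actions, retry policy, dead-letter handling) fit together roughly as follows. This is a sketch under stated assumptions: `TransientError`, `run_idempotent`, the in-memory `completed` and `dead_letters` stores, and the backoff defaults are hypothetical, and a real runtime would persist both stores.

```python
import time


class TransientError(Exception):
    """Raised by a tool for failures worth retrying (timeouts, rate limits)."""


def run_idempotent(action, key, completed, dead_letters,
                   max_attempts=3, base_delay=0.5):
    """Run `action` at most once per idempotency key, retrying transient failures."""
    if key in completed:                       # replayed step: return the recorded result
        return completed[key]
    for attempt in range(1, max_attempts + 1):
        try:
            result = action()
            completed[key] = result            # record so a resumed workflow skips this step
            return result
        except TransientError as exc:
            if attempt == max_attempts:
                # Out of retries: park the action for a human instead of dropping it
                dead_letters.append({"key": key, "error": str(exc)})
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))   # exponential backoff
```

A key such as `"INC-88921:restart"` (incident plus action) lets a workflow that stopped halfway resume without repeating side effects.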
Prompts should be config, not buried in code

Once prompts affect production behavior, they should be treated like deployable assets: versioned, reviewed, evaluated, and rollbackable. If a team cannot tell you which prompt version produced a decision, compare that version against the previous one, or revert it safely after a regression, the prompt layer is not production-ready yet.

Observability and improvement loop

  • Capture run: trace each agent step, tool call, and policy gate
  • Store evidence: prompt version, context, outputs, confidence, result
  • Evaluate: replay real cases and measure drift, latency, and quality
  • CI gate: block regressions before production rollout
  • Improve and ship: tighten prompts, policies, tools, and eval sets

A production agent should generate the data needed to explain failures and improve the next release.
  • CI should replay real work. Each release should run against a fixed evaluation set built from real past cases, failure cases, and policy edge cases.
  • Regressions should block release. If accuracy drops, hallucinations rise, latency regresses, or policy violations increase, the build should fail before production.
  • Improvement should be evidence-based. Strong teams improve the system by replaying runs, labeling outcomes, tightening prompts and policies, and proving that the next version is better than the last one.
What mature agent infrastructure includes

Central identity for tool access. Per-action authorization. Versioned prompts and policies. Replayable traces. Cost and latency budgets. Sandboxed test environments. Clear rollback paths. These are not extras. They are the difference between an interesting demo and a system you can trust across a large business.

Builder note

Here's a link to the agents I've built that are plug and play. Deploy and enjoy. Check out the repo.

Quality gating: how we test AI agents before we let them loose

Every pattern in this guide involves an agent that can take real action: restart a pod, decline a transaction, modify a patient record, defer a debt. None of these is a low-stakes output. Even a pod restart can land in the middle of a customer transaction, which matters to any team that measures itself on customer success. Before any of that goes live, every engineering team has to answer the same question: how do you know the agent is ready?

You do not know until you run it hard against realistic conditions, your actual failure scenarios, and clear thresholds it has to meet before it gets anywhere near production. That is what a simulated pre-production environment is for, and building one well is not optional. It is the engineering work that turns a demo into a deployable system.

The core problem with skipping this

If you do not test an AI agent in a realistic, controlled environment before deployment, you do not know whether it is safe, reliable, or aligned with how your systems actually behave under stress.

The three-layer simulation architecture

A production-grade simulation environment has three distinct layers. Each one has a job. If you conflate them or skip one, you end up with a test harness that tells you what you want to hear rather than what you need to know.

Simulation environment layers

  • Synthetic data plane: realistic world for the agent to operate in
  • Agent harness: runs the loop, stubs live calls, captures every decision
  • Evaluation engine: scores correctness, safety, latency per cycle
  • Readiness gate: threshold check, pass or block promotion

The agent must clear all three layers before it touches a live system. The readiness gate is the decision point.

Layer 1: the synthetic data plane

This is the simulated world the agent operates in. The goal is not clean, idealized data. The goal is data that is hard enough to be honest. That means building four components and getting them right.

Component 01

Metric forge

Emits realistic time-series data for CPU, memory, latency, and error rate. Includes injected anomalies at realistic intervals: gradual memory leaks, sudden CPU spikes, p99 latency degradations with the right curve shape. An agent trained on perfect sine waves will fail on real brownfield telemetry.

Component 02

Incident factory

Generates scenario templates with varying severity, upstream cause chain, and affected service scope. Cascading failures, pod OOM loops, connection pool exhaustion, config change propagation. Each scenario has a known ground-truth root cause the eval engine can score against.

Component 03

Log emitter

Produces structured and unstructured log streams with realistic noise ratios. Real logs have repeated lines, malformed JSON, varying timestamp precision, and duplicate events from multiple sources. The log emitter should reproduce all of that, not sanitize it away.

Component 04

Fault injector

Introduces controlled chaos: API timeouts, partial data returns, missing service catalog entries, contradictory signals from different monitoring sources. This stress-tests the agent's behavior under degraded signal quality, which is exactly what real production looks like during a high-severity incident.

The hardest part of the synthetic data plane is making it statistically realistic. You cannot just generate random numbers within a plausible range. You need to sample from the actual distributions you see in production: the spike frequency, the correlation patterns between services, the noise floor, the baseline drift. The easiest way to do this is to capture two to four weeks of real production telemetry and use it as the seed for your synthetic generator. Anonymize where needed, then build a generator that samples from those empirical distributions.
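One simple way to sample from captured telemetry rather than from made-up ranges is a block bootstrap: copy contiguous blocks of the real series so short-range structure survives, then inject a labeled anomaly on top. This technique is swapped in here for illustration and is not necessarily what any given team uses; `block_bootstrap`, `inject_spike`, and the block size are assumptions.

```python
import random


def block_bootstrap(captured, length, block=12, seed=0):
    """Build a synthetic series by concatenating random contiguous blocks of real data."""
    if len(captured) < block:
        raise ValueError("captured series is shorter than one block")
    rng = random.Random(seed)                      # seeded so a run can be replayed
    out = []
    while len(out) < length:
        start = rng.randrange(0, len(captured) - block + 1)
        out.extend(captured[start:start + block])  # keeps within-block correlation
    return out[:length]


def inject_spike(series, at, magnitude, width=3):
    """Return a copy with a spike added, plus the ground-truth label for scoring."""
    spiked = list(series)
    end = min(at + width, len(spiked))
    for i in range(at, end):
        spiked[i] += magnitude
    return spiked, {"type": "spike", "start": at, "end": end}
```

The label returned by `inject_spike` is what lets an evaluation engine score the agent against a known ground truth.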

Layer 2: the agent harness

The harness runs the agent through its operational loop against the synthetic data and captures everything. Three components do the work.

  • Context builder. Assembles what the agent sees each cycle: which signals, what history, what tools are available. This has to match the real context structure exactly. An agent that performs well in simulation but sees a different context shape in production has not been tested.
  • Decision loop. Executes the observe-reason-act-verify cycle and records every step: the context assembled, the tool calls made, the reasoning chain, the action taken, the time elapsed. Every cycle generates a complete, replayable trace.
  • Tool interceptor. This is the most important component. It stubs every real-world call so the agent can reason and act as if tools executed, without anything actually happening. The interceptor should return realistic responses based on the synthetic scenario, not just blanks. The agent needs to believe its actions had consequences so subsequent reasoning reflects that. A PagerDuty escalation that returns a plausible incident ID. A kubectl restart that returns a realistic pod status progression. Blank responses will produce agents that pass simulation and fail in production because they never learned to reason over tool outputs.
Tool interceptor: realistic stub responses / Python
# The interceptor wraps every tool call during simulation
# It returns realistic synthetic responses, not empty stubs
class ToolInterceptor:
    def kubectl_rollout_restart(self, deployment, namespace, scenario):
        # Simulate realistic pod restart progression
        # Pass scenario context so the stub reflects the sim state
        return {
            "status": "initiated",
            "message": f"deployment.apps/{deployment} restarted",
            "simulated_recovery_time": scenario.expected_ttdr_seconds
        }

    def pagerduty_escalate(self, incident_id, team, context):
        # Return a plausible PD incident ID so the agent can
        # reason correctly in subsequent steps
        return {
            "incident": {"id": "SIM-" + incident_id, "status": "triggered"},
            "assignment": {"team": team, "escalated_at": context.sim_timestamp}
        }

# Every intercepted call is logged: input, output, timestamp, scenario_id
# This is the data the evaluation engine scores against

Layer 3: the evaluation engine

The eval engine is what turns hundreds of simulation cycles into a pass or fail decision. It needs to score three things independently, not just average them together.

  • Correctness. Did the agent identify the right root cause? Did it route to the right team? Did it classify the right severity? Score this per scenario type, not just in aggregate. An agent that is excellent at CPU spikes but consistently misdiagnoses cascading failures should not pass, even if its aggregate accuracy is above threshold.
  • Safety. Would this action have made things worse? Estimate blast radius for every action the agent took. Flag actions that would have caused downstream harm in a live system. A high correctness score combined with a poor safety score is a dangerous agent. It knows what the problem is but its remediation approach is destructive.
  • Regression coverage. Every known-bad scenario from your production history should be in the simulation set and must be re-run on every evaluation cycle. If the agent regresses on a case that worked last week, that is a blocking signal regardless of aggregate performance.
Correctness threshold: ≥80% (minimum to gate to shadow mode)
Safe action rate: ≥95% (actions that would not increase blast radius)
Regression pass rate: 100% (known-bad cases must all pass before promotion)
The readiness gate is a conjunction, not an average

All three thresholds must clear simultaneously. An agent that passes correctness and regression but has a 90 percent safe action rate is not production-ready. An agent that passes safety and regression but has 75 percent correctness is not production-ready. Averaging them together to produce a single score lets dangerous tradeoffs hide. The gate logic is AND, not mean.
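The AND logic is short enough to write out. The sketch below uses the three thresholds stated above; the function and key names are hypothetical.

```python
THRESHOLDS = {
    "correctness": 0.80,
    "safe_action_rate": 0.95,
    "regression_pass_rate": 1.00,
}


def readiness_gate(scores):
    """Every threshold must clear on its own; no score can compensate for another."""
    failures = [name for name, floor in THRESHOLDS.items() if scores[name] < floor]
    return {"promote": not failures, "failures": failures}
```

An agent scoring 0.95 correctness, 0.90 safe action rate, and 1.00 regression has a mean of 0.95, yet `readiness_gate` blocks it on `safe_action_rate`, which is the tradeoff an average would hide.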

What the simulation loop looks like in practice

A simulation run has a clear flow. Configure the scenario mix and noise parameters. Run N cycles, typically two hundred or more for a meaningful sample. The harness fires each cycle, the interceptor stubs every tool call, and the eval engine scores the outcome. At the end, the readiness gate checks all three thresholds. Pass means the agent promotes to shadow mode. Fail means it goes back for another training or prompt iteration before the next run.

After simulation: shadow mode

Passing the simulation gate does not mean going fully live. It means the agent is ready for shadow mode: it runs alongside the live system, makes decisions, but every action goes to a dry-run queue where a human reviews before anything executes. Shadow mode is the bridge between simulation and production, and it is where you close the gap between synthetic and real data distributions.

Shadow mode has its own promotion criteria. After a defined number of shadow cycles, typically a minimum of two weeks and one hundred real incidents, you score the shadow run the same way you scored the simulation: correctness, safety, and regression. If the shadow scores are within a defined tolerance band of the simulation scores, the data distribution gap is acceptable and you can promote. If shadow scores are materially lower, your synthetic data is not representative enough and you need to go back and fix the generator before the next cycle.
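The promotion rule described above can be sketched as a three-way decision. The minimums of two weeks and one hundred incidents come from the text; the 0.05 tolerance band and the names `shadow_decision`, `keep_shadowing`, and `fix_generator` are assumptions for illustration.

```python
def shadow_decision(sim, shadow, days, incidents,
                    tolerance=0.05, min_days=14, min_incidents=100):
    """Compare shadow scores to simulation scores once enough real data exists."""
    if days < min_days or incidents < min_incidents:
        return "keep_shadowing"                    # sample too small to judge
    gaps = {metric: sim[metric] - shadow[metric] for metric in sim}
    if any(gap > tolerance for gap in gaps.values()):
        return "fix_generator"                     # synthetic data is not representative
    return "promote"
```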

Promotion path: simulation to production

  • Build + prompt: agent version ready for evaluation
  • Simulation: 200+ cycles, synthetic data, eval engine
  • Shadow mode: live data, dry-run queue, human review
  • Production: autonomous within governed thresholds

Any gate failure sends the agent back to the build step, not forward. The path is sequential and non-negotiable.

What good simulation data tells you

Beyond pass or fail, the simulation run surfaces specific intelligence that makes the agent better. The scenarios that consistently fail cluster around identifiable patterns: noisy signals from a particular data source, a class of incident the agent has not seen enough examples of, a tool stub response shape that does not match what the real API returns. Each of these is actionable. Fixing them before shadow mode is far cheaper than discovering them in production.

  • Noise sensitivity. Run the simulation with fault injection rates at 10, 30, and 60 percent. If accuracy degrades sharply above 30 percent noise, the agent is more brittle than it should be and needs more diverse training examples.
  • Scenario coverage gaps. If correctness on network partition scenarios is 55 percent while CPU spike correctness is 91 percent, you have a coverage gap. Add more network partition examples to the training set before the next run.
  • Safety floor by action type. Break the safe action rate down by action category: restart, escalate, modify config, close ticket. If one action type has a lower safe rate, that action type needs a tighter confidence threshold before it is authorized in production.
  • Latency under load. The simulation should run at a concurrency level that matches peak production incident rates. If latency degrades significantly at concurrency, the agent architecture has a scaling problem that needs to be fixed before shadow mode.
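Coverage gaps only show up when results are broken down by scenario type. A minimal sketch, assuming each simulation result is a dict with hypothetical `scenario` and `correct` fields and using the 80 percent correctness floor as the per-scenario bar:

```python
from collections import defaultdict


def correctness_by_scenario(results):
    """Return accuracy per scenario type instead of one aggregate number."""
    totals = defaultdict(lambda: [0, 0])          # scenario -> [correct, total]
    for r in results:
        totals[r["scenario"]][0] += r["correct"]  # True counts as 1
        totals[r["scenario"]][1] += 1
    return {s: correct / total for s, (correct, total) in totals.items()}


def coverage_gaps(results, floor=0.80):
    """List scenario types whose accuracy falls below the floor."""
    by_scenario = correctness_by_scenario(results)
    return sorted(s for s, acc in by_scenario.items() if acc < floor)
```

With 19 of 20 CPU spike cases and 2 of 5 network partition cases correct, the aggregate is 84 percent, above the floor, while `coverage_gaps` still reports `network_partition`.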
The principle that ties all of this together

Every hard guardrail that matters in production should be tested explicitly in simulation. If the confidence threshold for autonomous action is 0.75, run scenarios designed to push the agent just below and just above that threshold and verify the behavior is correct in both directions. If a HIPAA field is on a never-auto-correct list, run scenarios where the agent is explicitly tempted to correct it and verify it escalates every time. The simulation is not just a performance test. It is a verification of every safety contract the agent is supposed to uphold.

Use Cases by Sector

Healthcare SRE: Four Production Patterns

Healthcare IT runs some of the highest-stakes infrastructure on the planet, and it runs it with SRE teams that are chronically understaffed relative to the alert volume they absorb. L1 noise drowns on-call engineers, RCA is manual and slow, identity mismatches cause access failures that look like infrastructure bugs, and runbooks sit in Confluence untouched because nobody has time to execute them carefully. These four patterns address each of those failure modes directly.

Pattern 01 / Healthcare SRE

L1/L2 Incident Triage Agent

An agent that ingests every incoming incident from PagerDuty or ServiceNow, enriches it with service context and observability data, classifies severity and ownership, autonomously resolves confirmed L1 noise, and hands off genuine L2s to on-call with the context already assembled. The on-call stops being a router and starts being a decision-maker.

Architecture

  • Trigger: PagerDuty webhook / ServiceNow event
  • Enrichment: catalog, anomalies, and history fetched in parallel
  • LLM classify: severity, tier, confidence
  • Route: L1 remediate, L2 brief + page

Step-by-Step Execution Flow

01
Ingest and normalize the raw webhook payload
The agent receives the raw event and normalizes it into a canonical incident schema regardless of which monitoring tool fired it. Datadog, Dynatrace, Splunk, and custom alerting rules all produce different payloads. The agent flattens these into a consistent structure before any downstream logic runs. A deduplication hash computed from the service and alert signature suppresses repeat firings from the same root event.
PagerDuty Events API v2 · ServiceNow Event Mgmt · Kafka consumer
02
Parallel context enrichment: three tool calls fired simultaneously
The agent fires three tool calls in parallel rather than sequentially. Sequential calls would add 6-9 seconds of latency per incident. The service catalog returns ownership, SLO targets, and downstream dependencies. The observability platform returns the last 30 minutes of anomaly context for that specific service. The incident history store returns the five most similar past incidents and their resolution paths. The LLM receives all three before making any classification decision, so the output is grounded in actual context rather than guesswork.
Dynatrace API · Splunk REST · ServiceNow CMDB · asyncio.gather · Vector DB similarity
03
LLM classification with severity, tier, confidence, and reasoning
The assembled context goes to the LLM with a structured few-shot classification prompt. The model returns a structured JSON object: severity tier (P1 through P4), whether it is L1-resolvable without human involvement, a confidence score between 0 and 1, and reasoning in plain text. The confidence threshold of 0.75 is the critical guardrail. Anything below it escalates to a human regardless of what tier the model assigned. P1 incidents bypass this logic entirely and always page on-call immediately.
Few-shot classification prompt · JSON mode structured output · Confidence gate at 0.75
04
L1 auto-remediation or L2 escalation with pre-built brief
For L1 (confidence at or above 0.75, l1_resolvable true): the agent executes the remediation via tool call, monitors recovery, and closes the ticket with a generated summary. For L2 (confidence below threshold, l1_resolvable false, or any P1): the agent does not touch the system. It assembles the full incident brief (what broke, the timeline, anomaly data, similar past incidents, and the top hypothesis), then pages the on-call via PagerDuty with all of it attached. The engineer's first action is informed, not investigative.
Kubernetes rollout API · PagerDuty Incidents API · Slack webhook
05
Outcome logging for continuous model improvement
Every decision is written to a structured append-only log, including the LLM output, enrichment context, confidence score, action taken, and final outcome. A nightly eval job compares what the agent classified against what the on-call confirmed was correct. Classification errors are reviewed and fed back as updated few-shot examples. The agent's accuracy improves over time because every incident becomes a labeled example.
OpenTelemetry spans · S3 + Parquet structured log · Nightly eval pipeline
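Step 01 of the flow above, normalization plus a deduplication hash, might look like the sketch below. The field names in `FIELD_MAPS` are simplified stand-ins and do not reflect any real vendor payload schema; `monitor_a`, `monitor_b`, `normalize`, `dedup_key`, and `Deduplicator` are all hypothetical.

```python
import hashlib

# Per-source mapping from canonical field to payload key (illustrative only)
FIELD_MAPS = {
    "monitor_a": {"service": "svc", "signature": "check_name", "title": "summary"},
    "monitor_b": {"service": "entity", "signature": "rule_id", "title": "headline"},
}


def dedup_key(service, signature):
    """Stable hash of service + alert signature; repeat firings share a key."""
    return hashlib.sha256(f"{service}|{signature}".encode()).hexdigest()[:16]


def normalize(source, payload):
    """Flatten a tool-specific payload into one canonical incident shape."""
    mapping = FIELD_MAPS[source]
    incident = {field: str(payload[key]).strip() for field, key in mapping.items()}
    incident["service"] = incident["service"].lower()
    incident["source"] = source
    incident["dedup_key"] = dedup_key(incident["service"], incident["signature"])
    return incident


class Deduplicator:
    """Suppress incidents whose dedup key has already been seen."""

    def __init__(self):
        self.seen = set()

    def is_repeat(self, incident):
        if incident["dedup_key"] in self.seen:
            return True
        self.seen.add(incident["dedup_key"])
        return False
```

A production version would expire keys after a window so that a recurrence hours later is treated as a new incident; that is omitted here for brevity.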
Deployment note

The confidence threshold of 0.75 and the P1 bypass rule are code-level constraints, not prompt instructions. Hard guardrails live in the orchestration layer. Prompt instructions can be overridden by sufficiently unusual inputs. Code cannot. Any rule that would cause real harm if violated belongs in code.
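Expressed as code, the two guardrails reduce to a small routing function that runs in the orchestration layer after the model returns its classification. This is a sketch: `Classification`, `route`, and the returned action names are hypothetical, while the 0.75 floor and the P1 bypass come from the flow described above.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.75          # business policy, owned by operations


@dataclass(frozen=True)
class Classification:
    severity: str                # "P1" through "P4"
    l1_resolvable: bool
    confidence: float


def route(c: Classification) -> str:
    """Decide the action in code so no prompt or model output can override it."""
    if c.severity == "P1":
        return "page_oncall_immediately"       # P1 bypass: never auto-remediate
    if c.l1_resolvable and c.confidence >= CONFIDENCE_FLOOR:
        return "auto_remediate"
    return "escalate_with_brief"               # low confidence or not L1-resolvable
```

Note that a P1 with 0.99 confidence and `l1_resolvable=True` still pages on-call, because the severity check runs first.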

Triage accuracy: 92% (correct severity + team routing)
False L1 close rate: 3% (target ceiling / incidents wrongly auto-closed)
End-to-end latency: 45s (ingest to routed action)
Confidence floor: 0.75 (min before autonomous action)
L1 containment rate: 60% (resolved without human involvement)
Hallucination rate: 2% (facts in brief not in source data)
Pattern 02 / Healthcare SRE

Root Cause Analysis Agent

An agent that fires the moment an incident is acknowledged. It correlates signals across logs, metrics, distributed traces, and change events, reconstructs a causal timeline, ranks hypotheses by evidence weight, and posts a structured RCA to the team channel before the on-call engineer has opened their first terminal window.

Architecture

  • Signal collection: logs, traces, metrics, deploy events
  • Correlation: timestamp alignment, propagation chain
  • LLM rank: rank by evidence, cite sources
  • Team post: Slack/Teams with timeline + actions

Step-by-Step Execution Flow

01
Define blast radius and investigation window
The agent starts from the incident timestamp and builds the blast radius by querying the service dependency graph. It identifies which upstream services could have caused the failure and which downstream services are already affected. This scoping step is critical. Without it the agent pulls telemetry from every service in the platform, which is both expensive and noisy. The investigation window defaults to T-30 minutes to T+5 minutes.
Dynatrace topology API · ServiceNow CMDB · OTel service graph
02
Four-signal parallel collection across all blast radius services
For each service in the blast radius the agent collects four signal types simultaneously: error-rate and latency anomalies from metrics, error pattern clusters from logs grouped by message template rather than raw lines, distributed traces with error spans touching the affected service, and change events including deployments, config pushes, and feature flag flips in the investigation window. Every signal is timestamped to sub-second precision and tagged with the source service before any correlation runs.
Splunk search API · Dynatrace metrics v2 · Jaeger / Tempo trace query · ArgoCD deploy events
03
Causal correlation and timeline reconstruction
The agent orders all signals chronologically and applies a causality heuristic: any event that preceded the user-visible symptom by 30 seconds or more and matches a known failure propagation pattern is a causal candidate. For each candidate it traces the propagation chain through the dependency graph. In the example below, the config change at 14:28 precedes the proxy timeout at 14:31, which precedes the portal error spike at 14:32. The chain is clear and fully timestamped.
Reconstructed causal timeline
timeline = [
  {"t": "14:28:00", "event": "identity-db max_connections 200→100 (infra-bot config change)"},
  {"t": "14:31:02", "event": "ehr-proxy connection timeouts to identity-db begin"},
  {"t": "14:31:04", "event": "ehr-proxy db_conn_pool_wait +340ms spike"},
  {"t": "14:32:18", "event": "patient-portal-api error rate crosses 8%"},
  {"t": "14:33:00", "event": "INC-88921 fired by Datadog anomaly detector"}
]
04
LLM hypothesis ranking with mandatory evidence citation
The timeline and causal candidates go to the LLM with a system prompt that requires: rank root cause candidates by evidence weight, explain the propagation chain in plain English, flag any gaps in the evidence, and propose one or two immediate actions. The model must cite which specific signal supports each causal claim. If a claim cannot be traced back to a signal in the collected data, the model flags it as a gap rather than asserting it as fact. This is what controls hallucination in RCA outputs.
Chain-of-thought prompt · Evidence citation enforcement · Structured JSON output
05
Structured team post with interactive action options
The agent posts a structured message to the incident Slack channel using Block Kit. It includes the five-event timeline, root cause with confidence score, evidence citations, open gaps, and action buttons for the on-call: revert config, page the relevant team, or mark as manually investigating. If confidence is below 0.60, the post is explicitly labeled as a low-confidence hypothesis and action buttons are disabled until the on-call unlocks them.
Slack Block Kit API · MS Teams Adaptive Cards · Interactive button to agent action bridge
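The causality heuristic from step 03 can be sketched as a filter over the timeline. This is a simplified reading: "matches a known failure propagation pattern" is approximated here by "the event's service is upstream of the symptom in the dependency graph", and the `UPSTREAM` edges, the `service` field, and the function names are assumptions for the example.

```python
from datetime import datetime

# Assumed dependency edges: service -> services it calls
UPSTREAM = {
    "patient-portal-api": {"ehr-proxy"},
    "ehr-proxy": {"identity-db"},
}


def upstream_of(service):
    """Collect every service reachable by following dependency edges."""
    seen, stack = set(), [service]
    while stack:
        for dep in UPSTREAM.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def causal_candidates(events, symptom, min_lead_seconds=30):
    """Keep upstream events that precede the symptom by at least the lead time."""
    t0 = datetime.strptime(symptom["t"], "%H:%M:%S")
    scope = upstream_of(symptom["service"])
    candidates = []
    for e in events:
        lead = (t0 - datetime.strptime(e["t"], "%H:%M:%S")).total_seconds()
        if lead >= min_lead_seconds and e["service"] in scope:
            candidates.append({**e, "lead_seconds": lead})
    # Earliest event first: it is the most likely start of the chain
    return sorted(candidates, key=lambda e: e["lead_seconds"], reverse=True)
```

The output is a candidate list for the LLM to rank, not a verdict; the heuristic narrows the evidence, and the ranking step still has to cite which signal supports each claim.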
RCA correctness rate: 85% (confirmed correct by on-call post-resolution)
Hallucinated evidence: 5% (citations not present in source signals)
Time to posted RCA: 90s (from incident acknowledgment)
Mean confidence score: 0.80 (across all RCAs posted)
MTTR delta: -40% (vs pre-agent baseline)
Usefulness score: 4.2/5 (on-call rating per RCA post)
Pattern 03 / Healthcare SRE

User Identity Mismatch Detection and Remediation Agent

When an incident ticket references a patient or staff member, this agent detects mismatched identity data across EHR, IAM, billing, and HR directory systems. It determines whether the mismatch is causing or contributing to the incident, auto-corrects safe fields, and escalates with a reconciliation brief where PHI or ambiguous authority requires human review. That 403 Forbidden you have been paging about is often a six-character string mismatch between two systems that have never talked to each other.

Architecture
Extract IDs
NLP: MRN, email, SSO, employee ID
Cross-system fetch
EHR, IAM, billing, HR in parallel
Diff + classify
sensitivity tier, authority, impact
Remediate
safe: auto-fix. PHI: human review.

Step-by-Step Execution Flow

01
Extract all user identifiers from the incident ticket
The agent reads the incident description using LLM-based entity extraction to pull every user identifier: patient MRN, employee ID, email address, SSO subject claim, and account number. It normalizes format variations so "MRN 00882" and "patient 882" resolve to the same entity. If the ticket contains no user identifiers, this pattern is skipped and the incident routes to standard triage. The agent does not speculate about who the ticket is about.
LLM NER extraction · Identity normalization rules · FHIR Patient resource
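The normalization step can be illustrated with a plain rule. This sketch uses a regular expression as a stand-in for the LLM extraction described above, and covers only the MRN case; `MRN_PATTERN` and `extract_mrns` are hypothetical names, and a real deployment would need rules for every identifier type.

```python
import re

# Matches "MRN 00882", "patient 882", "MRN #882" and captures the digits
# without leading zeros, so format variants collapse to one value.
MRN_PATTERN = re.compile(r"\b(?:MRN|patient)\s*#?\s*0*(\d+)\b", re.IGNORECASE)


def extract_mrns(text):
    """Return the distinct MRNs mentioned in the text, normalized to integers."""
    return sorted({int(match) for match in MRN_PATTERN.findall(text)})


ticket = "Patient 882 cannot log in; chart shows MRN 00882."
mrns = extract_mrns(ticket)
```

When the result is empty, the pattern is skipped, which matches the rule that the agent does not guess who a ticket is about.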
02
Parallel cross-system record fetch with pre-fix audit snapshot
The agent queries every relevant system in parallel: Epic or Cerner FHIR API for the patient record, Okta or Azure AD for account status and group memberships, the billing system for account status, and the HR directory for employee record. The raw records are stored verbatim before any comparison begins. This is the audit snapshot: what each system believed at the moment of investigation, preserved as an immutable artifact. If a correction is ever disputed, the pre-fix state is on record.
Epic FHIR R4 API · Okta Users API · Azure AD Graph API · Immutable pre-fix snapshot
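A rough sketch of the parallel fetch and pre-fix snapshot, under stated assumptions: the four `fetch_*` functions are hypothetical stubs standing in for the real EHR, IAM, billing, and HR clients, the example values are made up, and the snapshot is a serialized JSON string with a SHA-256 digest rather than a true write-once store.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stubs; real clients would call the respective system APIs.
def fetch_ehr(user_id):
    return {"email": "old.address@example.org", "display_name": "A. Rivera"}


def fetch_iam(user_id):
    return {"email": "new.address@example.org", "display_name": "A. Rivera"}


def fetch_billing(user_id):
    return {"email": "new.address@example.org"}


def fetch_hr(user_id):
    return {"email": "new.address@example.org", "display_name": "Alex Rivera"}


FETCHERS = {"ehr": fetch_ehr, "iam": fetch_iam, "billing": fetch_billing, "hr": fetch_hr}


def fetch_with_snapshot(user_id):
    """Fetch all systems concurrently, then freeze the raw records before any diff."""
    with ThreadPoolExecutor(max_workers=len(FETCHERS)) as pool:
        futures = {name: pool.submit(fn, user_id) for name, fn in FETCHERS.items()}
        records = {name: future.result() for name, future in futures.items()}
    snapshot = json.dumps(records, sort_keys=True)  # verbatim copy as a string
    digest = hashlib.sha256(snapshot.encode("utf-8")).hexdigest()
    return records, snapshot, digest


records, snapshot, digest = fetch_with_snapshot("user-882")
```

Storing the digest alongside the snapshot lets a later reviewer confirm the pre-fix state was not altered after the fact.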
03
Field-level diff, sensitivity classification, and incident impact scoring
The agent runs a structured diff across all records for every shared field. Each mismatch is classified on three dimensions: field sensitivity (PII tier 1 is SSN, DOB, MRN; tier 2 is email and address; tier 3 is display name), auto-correctability based on whether there is a clear authoritative source, and incident impact assessing whether this mismatch is plausibly causing the reported error. A versioned YAML config defines field-level authority per system and is owned by the identity governance team.
Mismatch classification output
mismatch = {
  "field":                "email",
  "ehr_value":           "[email protected]",
  "iam_value":           "[email protected]",
  "hr_value":            "[email protected]",
  "sensitivity":         "PII-tier2",
  "authoritative_source": "IAM",  # 2 of 3 systems agree
  "incident_impact":     "HIGH / EHR email mismatch breaks SSO claim validation",
  "auto_correctable":    True,
  "correction":          "update EHR email to [email protected]"
}
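The diff-and-classify step might look roughly like this. The `AUTHORITY` and `SENSITIVITY` maps are hypothetical in-code stand-ins for the versioned YAML config described above, and the sample records are invented; incident impact scoring is left out to keep the sketch short.

```python
# Hypothetical stand-ins for the governance-owned YAML config.
AUTHORITY = {"email": "iam", "dob": "ehr"}
SENSITIVITY = {"dob": "PII-tier1", "email": "PII-tier2", "display_name": "PII-tier3"}


def diff_records(records):
    """Compare every field shared by all systems and classify each mismatch."""
    shared_fields = set.intersection(*(set(rec) for rec in records.values()))
    mismatches = []
    for field in sorted(shared_fields):
        values = {system: rec[field] for system, rec in records.items()}
        if len(set(values.values())) <= 1:
            continue  # all systems agree
        source = AUTHORITY.get(field)
        tier = SENSITIVITY.get(field, "unclassified")
        mismatches.append({
            "field": field,
            "values": values,
            "sensitivity": tier,
            "authoritative_source": source,
            # Auto-correct only with a clear authority and never for tier 1.
            "auto_correctable": source is not None and tier in ("PII-tier2", "PII-tier3"),
        })
    return mismatches


sample = {
    "ehr": {"email": "old@example.org", "dob": "1980-01-02"},
    "iam": {"email": "new@example.org", "dob": "1980-01-02"},
    "hr": {"email": "new@example.org", "dob": "1980-02-01"},
}
mismatches = diff_records(sample)
```

Fields with no sensitivity classification fall through to `auto_correctable: False`, so an unmapped field defaults to human review.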
04
Tiered remediation with full audit trail
For PII-tier2 fields where the authoritative source is clear, the agent executes the correction via API and writes an audit record with actor, reason, incident reference, before-value, after-value, and timestamp. The audit record is immutable, and the correction itself can be rolled back within a 24-hour window. For PII-tier1 fields (MRN, DOB, SSN, insurance IDs), the agent never writes on its own authority. It builds a reconciliation brief with the diff table, recommended correction, and which system it believes is authoritative, then attaches it to the ticket with an approve button. Only after a human clicks approve does the agent execute the change and log it.
Epic FHIR PATCH · Okta user update API · Immutable audit log · HIPAA-compliant change record · 24h rollback window
05
Post-fix verification and conditional ticket closure
After any correction the agent re-runs the original triggering check: can the SSO flow now match the email claim across systems? If verification passes, the ticket is updated with the full remediation summary and closed. If it does not pass, the agent explicitly flags that the mismatch fix did not resolve the issue and the incident stays open. This prevents premature closure where the identity fix was a red herring and the real cause is still unresolved.
SSO validation re-check · Re-fetch and re-diff post-fix · ServiceNow ticket update API
Agent run / INC-88921 / identity mismatch / [email protected]
HIPAA guardrails

PHI-tier1 fields (MRN, DOB, SSN, insurance IDs) are on a hardcoded never-auto-correct list. This is enforced in code, not in the authoritative source YAML config, which is editable. The distinction matters: editable config can be accidentally changed. The PHI tier-1 prohibition lives in the agent's execution layer where it cannot be overridden without a code deployment and a review.
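A minimal sketch of what "enforced in code" can mean here. The field keys, `HumanApprovalRequired`, and `apply_correction` are hypothetical names, and the in-memory `store` and `audit_log` stand in for the real system API and immutable log.

```python
from datetime import datetime, timezone

# Hardcoded in the execution layer, not loaded from editable config.
NEVER_AUTO_CORRECT = frozenset({"mrn", "dob", "ssn", "insurance_id"})


class HumanApprovalRequired(Exception):
    """Raised when a tier-1 field is written without a named approver."""


def apply_correction(field, before, after, write_fn, audit_log, incident, approved_by=None):
    if field.lower() in NEVER_AUTO_CORRECT and approved_by is None:
        raise HumanApprovalRequired(f"{field} requires human approval")
    write_fn(field, after)
    audit_log.append({
        "actor": "agent",
        "approved_by": approved_by,
        "incident": incident,
        "field": field,
        "before": before,
        "after": after,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })


store, audit_log = {}, []
apply_correction("email", "old@example.org", "new@example.org",
                 store.__setitem__, audit_log, "INC-88921")
try:
    apply_correction("dob", "1980-02-01", "1980-01-02",
                     store.__setitem__, audit_log, "INC-88921")
    blocked = False
except HumanApprovalRequired:
    blocked = True
```

Because the check sits in front of the write call, no config edit or prompt change can route a tier-1 write around it.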

Extraction accuracy: 96% (correct IDs pulled from ticket text)
Unsafe auto-correction: 0% (PHI tier-1 fields must never auto-correct)
Mismatch resolution rate: 88% (fixes that resolved the incident)
Detection latency: 90s (ticket created to mismatch report)
Audit coverage: 100% (every correction has a log record)
Authority accuracy: 90% (agent picks the correct authoritative system)
Pattern 04 / Healthcare SRE

Runbook Execution Agent

An agent that parses a natural-language or structured runbook into a typed, verifiable action plan, executes each step via tool calls, and checks the expected outcome before proceeding to the next step. When a step fails verification, whether because pods are not stabilizing, a metric is still elevated, or a health check returns the wrong status, the agent halts and posts a precise failure state. It does not blindly continue. That distinction is everything.

Architecture
Parse runbook
LLM converts steps to typed action plan
Execute step
kubectl, API, health check, metric query
Verify outcome
check state against expected result
Halt or next
fail: post state + await human. pass: proceed.

Step-by-Step Execution Flow

01
Parse runbook into a typed, human-reviewed action plan
The agent fetches the runbook from the registry as a Confluence URL, markdown file, or structured YAML. If it is natural language, the LLM parses each step into a typed action object: action type, parameters, and an explicit expected outcome that can be mechanically verified. The parsed plan is posted to the on-call for confirmation before any execution begins. Runbook parsing errors caught here prevent production incidents downstream. The agent never starts executing without the plan being confirmed.
Typed action plan / parsed from natural language
parsed_plan = [
  {"step": 1, "type": "kubectl_rollout_restart",
   "target": "deployment/patient-portal-api", "namespace": "prod",
   "expected_outcome": "all pods Running within 120s"},
  {"step": 2, "type": "http_health_check",
   "url": "https://patient-portal.internal/health",
   "expected_outcome": "status=200, body contains 'healthy'"},
  {"step": 3, "type": "metric_check",
   "query": "error_rate{service='patient-portal-api'}",
   "expected_outcome": "value < 1.0 for 2 consecutive minutes"}
]
LLM runbook parser · Confluence API · Versioned runbook registry · Human approval gate before execution
02
Execute step and capture full execution context
The agent executes the action via its corresponding tool call: Kubernetes API for kubectl actions, direct HTTP calls for health checks, the observability API for metric queries. Every execution captures the exact API call made, the raw response, the HTTP status or exit code, and a timestamp. This record is immutable. If the runbook is re-run, the previous execution records are versioned and preserved, not overwritten. OTel spans wrap every step so the full execution trace is visible in your observability platform.
Kubernetes API · OTel span per step · Immutable versioned execution log
03
Verification gate: expected outcome must pass before proceeding
After each step, the agent polls the relevant check on a configurable interval with a timeout. For the pod restart: it polls Kubernetes deployment status every 10 seconds for up to 120 seconds, checking that desired replicas equal ready replicas. It does not proceed to the health check until this passes. If the pods do not stabilize, the agent halts at step 1 rather than running the health check against a broken deployment. That is the fundamental difference between an execution agent and a script.
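The verification gate is essentially poll-with-timeout plus a hard stop. A minimal sketch, assuming each step is a dict with hypothetical `execute` and `verify` callables; the demo uses short timeouts so it finishes quickly, where the pod check above would use `interval_s=10` and `timeout_s=120`.

```python
import time


def wait_for(check, timeout_s=120.0, interval_s=10.0):
    """Poll check() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)


def run_plan(steps, **gate):
    """Execute steps in order; halt at the first step whose verification fails."""
    for step in steps:
        step["execute"]()
        if not wait_for(step["verify"], **gate):
            return {"status": "halted", "failed_step": step["step"]}
    return {"status": "completed", "failed_step": None}


calls = []
steps = [
    # Step 1 never verifies, standing in for pods that do not stabilize.
    {"step": 1, "execute": lambda: calls.append("restart"), "verify": lambda: False},
    {"step": 2, "execute": lambda: calls.append("health_check"), "verify": lambda: True},
]
result = run_plan(steps, timeout_s=0.2, interval_s=0.05)
```

Step 2 never runs in this demo, which is the behavior that separates the agent from a script that fires every command in sequence.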
04
Intelligent halt with precise failure state
When a verification check fails, the agent halts and posts a structured failure report: which step failed, what the expected outcome was, what the actual outcome was, and a current system state snapshot. The on-call gets specific, actionable information. In the example where the pod restart succeeded but error rate remained elevated, the agent surfaces that the runbook did not address the actual root cause, rather than closing the incident on incomplete evidence.
System state snapshot at halt · Structured failure report · Human decision options: retry, skip, hand off
05
Completion report and runbook quality feedback
When all steps pass, the agent closes the incident and generates an execution report with time per step, which verifications required retries, and any steps that took longer than expected. This report goes to the runbook owners as quality signal. Steps that consistently fail verification or require retries are candidates for runbook revision. The agent's execution data becomes the feedback mechanism that improves the runbook library over time.
Per-step timing data · Runbook quality feedback to Confluence · Retry pattern analysis
Agent run / INC-88921 / runbook patient-portal-restart
Parse accuracy: 95% (runbook steps correctly interpreted)
Completion rate: 78% (runbooks completed without halt)
Proceed-on-fail count: 0 (agent must never skip a failed gate)
Step overhead: 5s (tool call latency per step)
Execution audit coverage: 100% (every step logged with before/after state)
MTTR delta: -55% (vs manual runbook execution)

Banking: Four Production Patterns

Financial institutions process millions of transactions a day, but their operational responses, including fraud reviews, compliance screens, collections, and wealth rebalancing, still involve humans doing work that is fundamentally pattern-matching against rules. Agentic AI changes this by automating every step that does not require human judgment, while keeping humans clearly in the decision for the steps that do. The patterns below have real regulatory constraints: BSA/AML, Reg F, Reg BI. Those constraints are modeled in the architecture, not treated as an afterthought.

Pattern 01 / Banking

Real-Time Fraud Detection Agent

A streaming agent that monitors every transaction in milliseconds, fuses behavioral biometrics with entity graph signals, scores fraud probability, auto-declines on high-confidence hits, and autonomously drafts Suspicious Activity Reports for FinCEN filing. The difference between a rule engine and this agent is context: rules know what happened. The agent understands what it means given everything it knows about this account.

Architecture
Transaction stream
Kafka: card network, ACH, wire
Entity graph
account linkages, device, mule signals
Score fusion
behavioral, geo, velocity, graph
Action + SAR
decline or flag. SAR drafted for BSA review.

Step-by-Step Execution Flow

01
Consume transaction event and enrich with account context
The agent consumes each transaction event from the Kafka stream within milliseconds of card authorization or ACH initiation. It immediately enriches with account history: 90-day spend velocity, typical merchant categories, home geolocation, device fingerprint history. Enrichment happens in under 10ms via a pre-warmed Redis cache populated from the data warehouse. The enriched event feeds all downstream scoring. Raw transaction data alone is not enough to make a meaningful fraud call.
Kafka consumer (Flink) · Redis enrichment cache · Card network event schema
02
Entity graph traversal: is this account connected to a mule network?
The agent queries the real-time entity graph to check whether this account is linked within two hops to known fraudulent or flagged accounts. The graph captures relationships that traditional rules miss: shared devices, shared IPs, shared beneficiary accounts, shared phone numbers across different customer identities. A card that looks clean in isolation may be one hop from a known mule network. The graph score is one of the strongest fraud signals when it fires.
Neo4j or Amazon Neptune · Real-time graph traversal · Shared device and IP edge types
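The two-hop check is a bounded breadth-first search. The sketch below runs it over an in-memory adjacency map instead of a graph database; the node IDs and `hops_to_flagged` are illustrative assumptions. With shared devices modeled as their own nodes, two accounts on one device sit two hops apart.

```python
from collections import deque


def hops_to_flagged(graph, start, flagged, max_hops=2):
    """Return the fewest hops (<= max_hops) from start to a flagged node, else None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node != start and node in flagged:
            return dist
        if dist < max_hops:
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, dist + 1))
    return None


# Undirected edges: account <-> shared device / IP.
edges = [("acct:A", "device:D1"), ("device:D1", "acct:MULE"), ("acct:C", "ip:10.0.0.7")]
graph = {}
for left, right in edges:
    graph.setdefault(left, set()).add(right)
    graph.setdefault(right, set()).add(left)

flagged = {"acct:MULE"}
hops = hops_to_flagged(graph, "acct:A", flagged)
```

Capping the depth keeps the traversal cheap enough to run inside the authorization window.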
03
Multi-signal score fusion
The agent fuses four signal types into a final fraud probability score. Behavioral: the transaction type, amount, and merchant category are outside the account's normal pattern. Geo-velocity: the last transaction was in Seoul four hours ago, which makes card-present in Miami physically impossible. Velocity: the amount is 23 times the account's 90-day average transaction. Graph: 0.91 confidence mule network linkage. Each signal is scored independently then combined via a trained ensemble model.
Ensemble ML model · FICO Falcon integration · Real-time feature store
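The geo-velocity signal reduces to great-circle distance divided by elapsed time. A small sketch using the haversine formula; the 1,000 km/h plausibility ceiling is my assumption (roughly commercial-jet speed), and the coordinates are approximate city centers.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # assumption: roughly commercial-jet speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def impossible_travel(prev, curr, hours_between):
    """True if the implied speed between two card-present events is implausible."""
    return haversine_km(*prev, *curr) / hours_between > MAX_PLAUSIBLE_KMH


SEOUL = (37.5665, 126.9780)
MIAMI = (25.7617, -80.1918)
flag = impossible_travel(SEOUL, MIAMI, hours_between=4)
```

Seoul to Miami is on the order of 12,000 km, so a four-hour gap implies roughly three times jet speed and the signal fires.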
04
Auto-decline, cardholder alert, and SAR draft generation
Above the 0.85 threshold: the agent declines the transaction via the card network authorization API, sends an SMS alert to the cardholder, and flags the card for suspension pending verification. Concurrently it drafts the Suspicious Activity Report. The SAR draft pulls all relevant transaction data, entity graph findings, and signal scores into the FinCEN BSAR format, with the narrative section generated by the LLM strictly from collected evidence. The BSA officer reviews a pre-populated SAR rather than building it from scratch.
Card network authorization API · FinCEN BSAR format · LLM SAR narrative, evidence-grounded · BSA officer review queue
05
Feedback loop: confirmed labels back to the model
Confirmed fraud outcomes from cardholder disputes, BSA officer SAR confirmation, and law enforcement reports are fed back as labeled training data. The model retrains on a weekly cadence using the last 30 days of labeled decisions. The false positive rate, meaning legitimate transactions that were declined, is tracked separately and is the key balancing metric. The retraining objective explicitly penalizes false positives above a 0.7% threshold.
Labeled outcome store · Weekly model retraining pipeline · False positive tracking
Agent run / TXN-9927 / card present / Miami FL / $4,200
On the auto-decline threshold

The 0.85 auto-decline threshold is not a model hyperparameter. It is a governed business policy set by the fraud operations team and audited quarterly. Threshold changes require approval from fraud ops, compliance, and the model risk management function. The distinction matters: hyperparameters can be changed by any engineer with model access. A governed policy requires a documented change process. The threshold lives in the policy store, not the model config.

Detection latency p99: 38ms (score to action within auth window)
True positive rate: 99.3% (confirmed fraud correctly flagged)
False positive rate: 0.7% (target ceiling / legit txns declined)
SAR prep time: -70% (vs manual SAR drafting by BSA officer)
Decision explainability: 100% (every decline has stored signal breakdown)
Auto-decline threshold: 0.85 (governed policy, audited quarterly)
Pattern 02 / Banking

Continuous KYC/AML Monitoring Agent

Traditional KYC is periodic: screen at onboarding and refresh every one to three years. Perpetual KYC changes the model so that every customer event triggers a re-evaluation. A change of address, a new transaction pattern, an adverse media hit, a sanctions list update: the agent evaluates all of it in real time, refreshes the risk tier, triggers enhanced due diligence where warranted, and drafts CTRs and SARs autonomously for BSA officer review where the evidence is clear.

Architecture
Event trigger
CIF change, txn pattern shift, list delta
Sanctions screen
OFAC SDN, EU, OFSI, UN / fuzzy match
Risk re-score
PEP, adverse media, UBO graph, tier
Action + filing
restrict account, EDD queue, CTR/SAR

Step-by-Step Execution Flow

01
Detect trigger event and determine re-evaluation scope
The agent monitors a stream of customer events from the CIF system and the transaction ledger. An event materiality classifier determines scope: minor requires no action, standard triggers OFAC re-screen only, elevated triggers full sanctions plus adverse media plus PEP re-check, and critical triggers immediate account restriction pending review. A change of address to Dubai triggers elevated scope because Dubai is on the FATF high-risk jurisdiction list.
CIF event stream · Event materiality classifier · FATF high-risk jurisdiction list
02
Parallel sanctions screening across all major lists
The agent screens the customer's full identity bundle against multiple sanctions lists simultaneously: OFAC SDN, the EU consolidated list, the UK OFSI list, and the UN Security Council list. Exact name matching alone produces too many false negatives from spelling variations and transliterations. The agent applies fuzzy matching with a configurable similarity threshold, then passes potential matches to the LLM for entity disambiguation before any case is created. This reduces false positives by 60 to 80 percent compared to approaches that rely on fuzzy matching alone.
OFAC SDN API · Refinitiv World-Check · Fuzzy matching (Jaro-Winkler) · LLM disambiguation
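The fuzzy pre-filter can be sketched with the standard library. Note the swap: this uses `difflib.SequenceMatcher` as a stand-in similarity measure, not Jaro-Winkler, so the scores will differ from a production matcher. The 0.85 threshold, the `screen` function, and every name on the watchlist are made up for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.lower().split())


def screen(name, watchlist, threshold=0.85):
    """Return (entry, score) pairs at or above the similarity threshold, best first."""
    hits = []
    for entry in watchlist:
        score = SequenceMatcher(None, normalize(name), normalize(entry)).ratio()
        if score >= threshold:
            hits.append((entry, round(score, 3)))
    return sorted(hits, key=lambda hit: -hit[1])


# Fictional entries, not drawn from any real sanctions list.
WATCHLIST = ["Alexander Petrovich Ivanenko", "Maria Gonzalez Ortega", "Chen Wei"]
hits = screen("Aleksandr Petrovich Ivanenko", WATCHLIST)
```

Only the hits returned here would go on to the LLM disambiguation stage; everything below the threshold is dropped before a case is ever opened.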
03
Adverse media scan and PEP / UBO check
The agent queries a global adverse media feed and runs NLP classification on results to verify the article is about this specific entity and describes conduct that raises AML risk. Separately it queries a PEP database to check whether the customer or their ultimate beneficial owners are politically exposed. UBO graph traversal checks whether any ownership stake above 25 percent in associated entities is held by a flagged individual.
Refinitiv Adverse Media · Dow Jones Risk and Compliance · PEP database (ComplyAdvantage) · UBO graph traversal
04
Risk re-tier, account action, and EDD case creation
Given an adverse media hit and PEP UBO finding, the agent updates the customer's risk tier, applies a temporary account restriction, and creates an Enhanced Due Diligence case in the compliance workflow system. The EDD case is pre-populated with the agent's findings: the adverse media article with entity match confidence, PEP UBO detail, FATF jurisdiction risk, and recommended next steps. The compliance analyst inherits a structured case file, not a blank screen.
CIF risk tier update API · Account restriction workflow · EDD case management system
05
CTR and SAR autonomous drafting and routing
Where transaction activity meets the filing threshold, the agent drafts the FinCEN report with the narrative section generated by the LLM from collected evidence only. CTRs are tracked against the 15-day BSA filing window. SARs are drafted and routed to the BSA officer with a confidence flag. The agent never files a SAR without BSA officer sign-off. The distinction between drafting and filing is structural: the agent has API access to the drafting endpoint only, not the submission endpoint.
FinCEN CTR and BSAR format · LLM narrative generation · BSA officer review gate before filing · 15-day CTR window tracking
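Tracking the CTR window is plain date arithmetic. A minimal sketch, assuming the 15 days are counted as calendar days from the transaction date; the three-day `warn_days` early-warning band and the status labels are my own additions for illustration.

```python
from datetime import date, timedelta

CTR_WINDOW_DAYS = 15  # filing window stated above


def ctr_deadline(transaction_date):
    return transaction_date + timedelta(days=CTR_WINDOW_DAYS)


def ctr_status(transaction_date, today, warn_days=3):
    """Classify a pending CTR by how close it is to its filing deadline."""
    remaining = (ctr_deadline(transaction_date) - today).days
    if remaining < 0:
        return "overdue"
    if remaining <= warn_days:
        return "due_soon"
    return "on_track"


txn_date = date(2026, 3, 1)
deadline = ctr_deadline(txn_date)
status = ctr_status(txn_date, today=date(2026, 3, 14))
```

A `due_soon` status is the natural trigger for bumping the draft to the top of the BSA officer's review queue.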
Agent run / C-88201 / address change / Chicago to Dubai
Sanctions screen latency: 800ms (event received to all-list result)
OFAC coverage: 100% (every customer event screened against SDN)
False SDN match rate: 0.1% (after LLM disambiguation layer)
Analyst review time: -73% (pre-built EDD case vs blank case file)
CTR on-time filing rate: 100% (within BSA 15-day window)
Adverse media precision: 92% (hits confirmed as the correct entity)
Pattern 03 / Banking

AI Wealth Advisor Co-Pilot

The co-pilot does not replace the advisor. It eliminates the 70 percent of advisor time spent on assembly work: pulling portfolio data, identifying drift, running tax-loss harvesting screens, and checking suitability. The agent does all of that continuously across every client portfolio and presents the advisor with a prioritized action list each morning. The advisor's job becomes approving, modifying, or declining recommendations, not discovering them.

Step-by-Step Execution Flow

01
Portfolio pull and drift calculation against target allocation
Each morning the agent pulls current holdings for every portfolio in the advisor's book from the custodian API. It calculates current allocation percentages and compares them to the target allocation defined in the client's Investment Policy Statement. Drift exceeding the configured tolerance band triggers a rebalancing recommendation. The agent also identifies cash drag: uninvested cash above a threshold that should be deployed per the IPS.
Schwab API or Fidelity WealthCentral · Custodian FTP feed · IPS target allocation store
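Drift is current weight minus target weight per asset class. A short sketch with invented numbers: the holdings, the 60/35/5 target, and the five-point tolerance band are assumptions for illustration, and `allocation_drift` is a hypothetical helper.

```python
def allocation_drift(holdings, targets, tolerance=0.05):
    """Compare current allocation weights with IPS targets and flag breaches."""
    total = sum(holdings.values())
    report = {}
    for asset, target in targets.items():
        current = holdings.get(asset, 0.0) / total
        drift = current - target
        report[asset] = {
            "current": round(current, 4),
            "drift": round(drift, 4),
            "rebalance": abs(drift) > tolerance,
        }
    return report


holdings = {"equity": 1_680_000, "bonds": 600_000, "cash": 120_000}  # market values
targets = {"equity": 0.60, "bonds": 0.35, "cash": 0.05}              # IPS weights
report = allocation_drift(holdings, targets)
```

Here equity sits ten points over target and bonds ten points under, so both breach the band and would produce a rebalancing recommendation.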
02
Tax-loss harvesting screen across all taxable positions
For every taxable position with an unrealized loss, the agent evaluates whether harvesting the loss generates net tax value. It considers the loss magnitude, the client's estimated marginal tax rate, the wash-sale rule (no purchase of the same or a substantially identical security within 30 days before or after the sale), and whether a suitable replacement exists to maintain the portfolio's factor exposure. The agent surfaces only opportunities where the net present value of the tax saving exceeds the transaction cost and tracking error of the replacement.
Unrealized P&L calculation · Wash-sale rule compliance check · Replacement security factor matching
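The harvesting screen combines a date-window check with a simple value test. This sketch is deliberately simplified: it skips discounting, so it is a net value rather than a net present value, and it treats tracking error as a single cost figure. All amounts and both function names are illustrative.

```python
from datetime import date, timedelta

WASH_SALE_DAYS = 30


def violates_wash_sale(sale_date, purchase_dates):
    """True if any purchase of the same security falls within 30 days of the sale."""
    window = timedelta(days=WASH_SALE_DAYS)
    return any(abs(purchase - sale_date) <= window for purchase in purchase_dates)


def tlh_net_value(unrealized_loss, marginal_rate, transaction_cost, tracking_error_cost):
    """Tax saved by realizing the loss, minus the costs of making the swap."""
    return unrealized_loss * marginal_rate - transaction_cost - tracking_error_cost


sale = date(2026, 3, 10)
blocked = violates_wash_sale(sale, [date(2026, 2, 20)])  # bought 18 days before the sale
net = tlh_net_value(unrealized_loss=10_000, marginal_rate=0.32,
                    transaction_cost=50, tracking_error_cost=150)
surface = net > 0 and not blocked
```

In this example the economics are positive but the recent purchase blocks the trade, so nothing is surfaced to the advisor.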
03
Reg BI suitability validation before surfacing any recommendation
Every proposed trade is passed through the Reg BI suitability engine before it appears in the advisor's queue. The check validates the trade type against the client's risk tolerance and investment objective, checks for concentration risk above IPS limits, and flags any disqualifying account restrictions. Any trade that fails suitability validation is dropped and logged. It never appears in the advisor queue. The logged record is the documentation trail that satisfies the Reg BI best-interest obligation.
Reg BI suitability engine · IPS constraint validation · Concentration risk check · Suitability failure log for audit
04
Personalized client communication draft per recommendation
For each recommendation that passes suitability validation, the agent generates a client-facing explanation in plain English: why this action is being proposed, what the expected portfolio impact is, and what the tax implication is. The draft is written in a tone calibrated to the client's communication profile derived from past interaction history. The advisor reviews and edits before sending. The agent generates. The advisor owns the communication.
LLM client communication generation · Client sophistication profiling · Advisor review before send
05
Advisor one-click execution and trade order generation
The advisor opens their morning queue with a prioritized list of recommendations, each with the proposed trade, rationale, estimated impact, and pre-drafted client communication. A single click on approve generates the trade order in the OMS and queues the client communication for delivery. Nothing executes without advisor action. The human-in-the-loop is structural: the agent has access to the recommendation queue API, not the order submission API.
OMS integration (Charles River, Bloomberg AIM) · Trade order generation · Advisor approval gate is structural
Agent run / ACC-P4821 / morning scan / $2.4M AUM
Suitability pass rate: 94% (recommendations that pass Reg BI check)
Avg alpha vs benchmark: +1.2% (TLH + drift correction contribution)
Advisor assembly time: -70% (time spent finding vs deciding)
Reg BI documentation: 100% (every recommendation logged with suitability result)
Cost basis accuracy: 96% (reconciled against custodian tax lot data)
Advisor usefulness score: 4.5/5 (morning queue rating per week)
Pattern 04 / Banking

Conversational Collections AI Agent

The collections call is one of the most disliked experiences in consumer banking, for the customer and increasingly for the collections team, as compliance requirements under Reg F tighten. The conversational agent handles outreach, detects hardship signals, negotiates repayment plans within pre-approved parameters, and escalates to a human agent for edge cases. Reg F communication limits are enforced in the outreach scheduler as hard constraints, not prompt instructions.

Step-by-Step Execution Flow

01
Account segmentation: not all delinquencies are the same
The agent segments delinquent accounts across four dimensions before any outreach strategy is determined: delinquency age, prior payment history, hardship indicator score derived from transaction patterns (medical spend spikes, income drops), and recovery probability from an ML model predicting likelihood of payment given various offer types. A previously reliable customer 35 days past due with medical hardship indicators requires a fundamentally different approach than a 120-day chronic delinquent.
Delinquency scoring model · Hardship signal detection · Recovery probability ML model
02
Channel selection and empathy-calibrated outreach
The agent selects the channel based on prior response history and account preferences. For a high-hardship-score account, the initial outreach is SMS rather than IVR. The message is generated with an empathy-first tone and leads with the most probable recovery offer. Reg F communication limits (seven contacts in seven days, no calls before 8am or after 9pm local time) are enforced as hard constraints in the outreach scheduler before any message is queued.
Reg F communication limit enforcement in scheduler · LLM empathy-calibrated message generation · Channel preference lookup
03
Real-time negotiation within pre-approved offer parameters
When the customer responds, the agent enters a conversational negotiation. A configuration of pre-approved parameters set by the collections manager defines the boundaries: deferral duration up to 90 days, payment plan minimums, settlement discount ceiling, and hardship forbearance criteria. The agent can offer anything within these parameters without human involvement. Outside the parameters, the agent acknowledges the request, explains it needs to check with a specialist, and transfers to a human collections agent with the full conversation context attached.
Pre-approved offer parameter config · Hardship documentation workflow · Out-of-parameter escalation with context transfer
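The parameter boundary works best as a check that runs before any offer is spoken. In this sketch only the 90-day deferral cap comes from the text; the payment minimum, the discount ceiling, the offer shapes, and `route_offer` are hypothetical placeholders for whatever the collections manager configures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OfferLimits:
    max_deferral_days: int = 90            # from the text
    min_monthly_payment: float = 50.0      # hypothetical value
    max_settlement_discount: float = 0.30  # hypothetical value


def route_offer(offer, limits=OfferLimits()):
    """Decide whether the agent may make this offer or must hand off to a human."""
    kind = offer.get("type")
    if kind == "deferral":
        within = offer["days"] <= limits.max_deferral_days
    elif kind == "payment_plan":
        within = offer["monthly_payment"] >= limits.min_monthly_payment
    elif kind == "settlement":
        within = offer["discount"] <= limits.max_settlement_discount
    else:
        within = False  # unknown offer types always escalate
    return "agent_can_offer" if within else "escalate_to_human"


decision = route_offer({"type": "deferral", "days": 120})
```

Unknown offer types escalate by default, so a new product cannot become negotiable just because the model proposed it.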
04
Agreement execution and written confirmation
Once the customer accepts, the agent updates the account status in the core banking system, generates the agreement document, and delivers it via email. For payment plans, it sets up the recurring ACH or card payment schedule. Written confirmation of every agreed term is a Reg F requirement. The agreement terms and full conversation transcript are logged to the CRM under the account record for the complete lifecycle audit trail.
Core banking deferral API · Agreement PDF generation · ACH payment setup · CRM communication log for Reg F audit
05
Promise-to-pay tracking and outcome-based follow-up
For accounts on a payment plan, the agent monitors whether the first scheduled payment is received. If the payment arrives on time, the account is updated and no further contact is needed. If the payment is missed, the agent initiates a single follow-up within the Reg F window. Accounts that miss two consecutive plan payments are escalated to a human collections specialist. The agent does not pursue infinite retry loops on accounts showing no engagement signal.
Payment receipt monitoring · Promise-to-pay tracking · Escalation after 2 missed plan payments
Agent run / A-77231 / 47 DPD / $2,840 / hardship signal
On Reg F enforcement

The seven-in-seven rule and the time-of-day restrictions are enforced in the outreach scheduler by checking a per-customer contact log before any message is queued. This is not a prompt instruction and it is not a filter applied after message generation. The scheduler cannot queue a message that would violate the rule. The architectural principle is the same as everywhere else in this guide: rules that would cause harm if violated live in code, not in prompts.
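A sketch of the scheduler gate, under simplifying assumptions: timestamps are naive datetimes already converted to the customer's local time, the limit is read as at most seven contacts in any rolling seven-day window, and `can_queue` is a hypothetical name. The real rule has more nuance per debt and per channel than this shows.

```python
from datetime import datetime, time, timedelta

MAX_CONTACTS = 7
WINDOW = timedelta(days=7)
EARLIEST, LATEST = time(8, 0), time(21, 0)  # 8am to 9pm customer-local


def can_queue(contact_log, send_at):
    """Return True only if sending at send_at keeps the account within both limits."""
    if not (EARLIEST <= send_at.time() <= LATEST):
        return False
    recent = [t for t in contact_log if timedelta(0) <= send_at - t < WINDOW]
    return len(recent) < MAX_CONTACTS


# Seven prior contacts, one per day, each at 10:00 local time.
log = [datetime(2026, 3, 14, 10, 0) + timedelta(days=i) for i in range(7)]
allowed = can_queue(log, send_at=datetime(2026, 3, 20, 15, 0))
```

Because the check runs before queueing, a generated message that would breach the limit never exists in the outbound queue at all.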

Recovery rate lift: +43% (vs pre-agent collections baseline)
Customer satisfaction: 78% (post-interaction CSAT on resolved accounts)
CFPB complaint rate: -89% (vs pre-agent deployment baseline)
Reg F audit coverage: 100% (every communication logged with timestamp)
Avg resolution time: 72h (first outreach to agreed arrangement)
Offer accuracy: 91% (offers generated within approved parameters)