Agentic AI for the enterprise.
The term AI agent gets used for everything from a chatbot with one tool call to a system autonomously managing production infrastructure across a dozen APIs. That is a broad space, and the label hides large differences between its two ends. This piece breaks down what an agent actually is, how to build one that holds up in production, and what the pattern looks like when you deploy it in healthcare IT and banking operations today.
I have been building and evaluating agentic systems across SRE, observability, and platform engineering contexts long enough to have opinions about where the pattern works and where it breaks down. This is not a vendor pitch or a research paper. It is a ground-level guide to the architecture, the deployment realities, and the eval criteria that actually tell you whether an agent is helping or quietly making things worse.
What an AI Agent Actually Is
A language model on its own produces a response. An agent is what you get when you wrap that model in an execution loop.
- The loop is the point. The agent perceives state, reasons about the next move, executes through a tool, observes the result, and decides what to do next.
- A prompt stops after one answer. An agent continues until a goal is met, a confidence threshold is breached, or a human steps in.
- That changes the automation boundary. You can automate tasks that require sequential decisions against changing system state, not just tasks with one fixed response.
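The loop, stage by stage:
- Perceive. Read state from tools, APIs, and memory.
- Reason. The LLM decides what action to take.
- Act. Execute a tool call: an API request, a command, or a query.
- Observe. Capture the result and check it against the expected outcome.
- Decide. Continue, escalate, or close.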
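To make the loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `run_agent_loop`, the `decide` callable standing in for the LLM, the 0.75 confidence floor, and the step budget are hypothetical names and values, not part of any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    goal_met: bool = False
    escalated: bool = False
    steps: list = field(default_factory=list)


def run_agent_loop(perceive, decide, tools, max_steps=5, confidence_floor=0.75):
    """Perceive, reason, act, observe, decide; stop on goal, low confidence, or budget."""
    run = AgentRun()
    for _ in range(max_steps):
        state = perceive()                      # perceive: read current state
        decision = decide(state)                # reason: stand-in for the LLM call
        if decision["confidence"] < confidence_floor:
            run.escalated = True                # below the floor: hand off to a human
            break
        if decision["action"] == "close":
            run.goal_met = True
            break
        result = tools[decision["action"]](**decision.get("args", {}))  # act
        run.steps.append({"action": decision["action"], "result": result})  # observe
    else:
        run.escalated = True                    # step budget exhausted without closing
    return run


# Toy usage: a fake service whose error rate drops after one restart.
state = {"error_rate": 9.0}

def perceive():
    return dict(state)

def decide(s):
    if s["error_rate"] < 1.0:
        return {"action": "close", "confidence": 0.9}
    return {"action": "restart", "confidence": 0.9,
            "args": {"service": "patient-portal-api"}}

def restart(service):
    state["error_rate"] = 0.2
    return {"service": service, "status": "restarted"}

run = run_agent_loop(perceive, decide, {"restart": restart})
```

The point of the sketch is the shape: the model only proposes, and the loop owns the stop conditions.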
The Four Components Every Agent Needs
Strip away the marketing and every production agent has four essential layers. If one of these is missing, you do not have a production agent. You have a prompt with extra packaging.
Language Model
The model handles reasoning, decision-making, and response generation. It interprets context, chooses the next step, and produces structured output.
Tool Layer
The tool layer lets the model act on external systems. That can mean querying a database, calling an API, running a command, or writing a file.
Memory Layer
The memory layer gives the model context beyond the immediate prompt, including past actions, retrieved documents, and structured state.
Orchestration Layer
The orchestration layer runs the loop, sequences tool calls, enforces guardrails, and decides when a human has to stay in the decision.
Scripts execute fixed steps. Rule engines match fixed patterns. Neither handles ambiguity, missing context, or cases outside their predefined logic. An agent can reason over partial information, call additional tools to fill gaps, and make a judgment call with a confidence score attached. It also knows when to stop and ask a human. That is the capability gap that makes the architecture worth the complexity cost.
How Tool Calls Actually Work
The model does not directly execute code. It proposes an action and the orchestration layer decides whether and how to run it.
- Step 1. The model emits a structured tool call with the function name and parameters.
- Step 2. The orchestration layer validates the request and executes the real function against the live system.
- Step 3. The result is captured and written back into the model context.
- Step 4. The model reasons over that result and decides whether to continue, escalate, or stop.
- Contracts matter. Tool inputs should be typed and validated before execution, and tool outputs should come back in a structured schema. That is what makes agent behavior auditable, interoperable, and safe to evolve without breaking downstream steps.
- Latency note. Independent tool calls should run in parallel. That is one of the main levers for reducing end-to-end latency.
import asyncio

# await needs an enclosing coroutine; the wrapper name here is illustrative
async def build_incident_context(incident):
    # Fire all enrichment calls concurrently, not sequentially
    # Sequential would add 6-9 seconds of latency per incident
    context = await asyncio.gather(
        get_service_catalog(service="patient-portal-api"),
        get_recent_anomalies(service="patient-portal-api", window="30m"),
        search_incident_history(title_similar=incident.title, limit=5),
    )
    # All three resolve before the LLM sees any of it
    # gather returns results in argument order:
    # catalog: owner, slo_target, downstream deps
    # anomalies: error_rate, p99, spike_start
    # history: 3 similar past incidents, their resolution paths
    return context

The Human-in-the-Loop Question: Where Does the Human Stay?
Good agent design is explicit about where humans stay in the decision. The split should be clear enough that an operator can explain it in one minute.
- Human decides. Use this mode for irreversible actions, high-stakes outcomes, and low-confidence outputs. The agent prepares the case and the human makes the decision.
- Agent executes. Use this mode for well-understood, reversible, high-confidence actions. Humans review through audit rather than blocking every step.
- Policy lives in code. The threshold between those two modes should be a governed policy enforced in code, not a prompt instruction.
Every production agent needs an explicit confidence floor below which it escalates to a human rather than acting. This threshold is a business policy, not a model hyperparameter. It should be set by the operations team, versioned, audited, and reviewed when the false positive or false negative rate drifts. If you cannot answer "what is your agent's confidence threshold for autonomous action and who owns that decision," the agent is not production-ready.
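One way to keep that split in code rather than in a prompt is a small, versioned policy object that the orchestration layer consults before every action. The sketch below is illustrative only: `ActionPolicy`, `route`, the action names, and the 0.75 floor are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionPolicy:
    version: str                  # bumped on every reviewed change
    owner: str                    # the team accountable for the threshold
    confidence_floor: float
    irreversible_actions: frozenset


def route(action: str, confidence: float, policy: ActionPolicy) -> str:
    """Return which mode applies: the human decides or the agent executes."""
    if action in policy.irreversible_actions:
        return "human_decides"    # irreversible: always a human call
    if confidence < policy.confidence_floor:
        return "human_decides"    # below the floor: escalate
    return "agent_executes"       # reversible and confident: act, audit later


POLICY = ActionPolicy(
    version="v3",
    owner="sre-operations",
    confidence_floor=0.75,
    irreversible_actions=frozenset({"delete_volume", "modify_patient_record"}),
)
```

Because the policy is frozen data with a version and an owner, every routing decision can be logged against the exact policy that produced it.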
Eval Is Not Optional
Agents that are not continuously evaluated drift. The model changes across versions, data distribution shifts, and edge cases accumulate. Every production agent needs a small set of metrics tracked on a regular cadence.
- Task quality. Track classification accuracy, task completion rate, or resolution quality based on the job the agent is doing.
- Hallucination rate. Measure how often the agent states facts that are not present in its source data, retrieved context, or tool outputs.
- Latency. Measure elapsed time from trigger to action so you know whether the agent is useful in real operational conditions.
- Human usefulness. Capture a simple quality score from the person receiving the output so you know whether the response was actually helpful.
Without these measures, you do not know whether the agent is helping or quietly causing problems.
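As a rough illustration, the four measures can be rolled up from per-run records. The record fields below (`correct`, `claims_total`, `claims_unsupported`, `latency_s`, `usefulness`) are assumed names; how a claim gets labeled unsupported is a separate problem that this sketch does not solve.

```python
from statistics import mean, median


def summarize_runs(runs):
    """Aggregate per-run eval records into the four tracked metrics."""
    claims = sum(r["claims_total"] for r in runs)
    unsupported = sum(r["claims_unsupported"] for r in runs)
    return {
        "task_accuracy": mean(1.0 if r["correct"] else 0.0 for r in runs),
        "hallucination_rate": unsupported / claims if claims else 0.0,
        "median_latency_s": median(r["latency_s"] for r in runs),
        "mean_usefulness": mean(r["usefulness"] for r in runs),
    }


runs = [
    {"correct": True, "claims_total": 4, "claims_unsupported": 0,
     "latency_s": 2.0, "usefulness": 4},
    {"correct": False, "claims_total": 6, "claims_unsupported": 1,
     "latency_s": 4.0, "usefulness": 2},
]
summary = summarize_runs(runs)
```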
Why Retrieval and Data Pipelines Matter
Most failures that look like model failures are actually context failures. The model cannot reason over evidence that was never extracted, cleaned, tagged, or retrieved correctly.
- Retrieval is not memory. Memory retains prior actions, state, and decisions. Retrieval fetches external evidence on demand: runbooks, incident history, policy documents, CMDB records, and knowledge bases.
- Bad context poisons good reasoning. Stale runbooks, noisy incident history, or an incomplete service catalog will drive bad decisions for understandable reasons.
- A pipeline has to do real work. Production retrieval depends on source extraction, cleanup, chunking, metadata, freshness checks, access control, and traceability to the exact source and version used at run time.
- This is why RAG projects often disappoint. Teams blame the model when the real issue is that the data layer was never made usable, current, or trustworthy enough for production decisions.
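A small sketch of what freshness checks, access control, and traceability can look like at retrieval time. The `Chunk` fields, the 180-day freshness window, and the role names are assumptions made for the example, not a reference design.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str            # e.g. a runbook path or CMDB record id
    version: str
    last_reviewed: date
    allowed_roles: frozenset


def usable_context(chunks, role, today, max_age_days=180):
    """Drop stale or unauthorized chunks; return what is left plus its citations."""
    cutoff = today - timedelta(days=max_age_days)
    kept = [c for c in chunks
            if c.last_reviewed >= cutoff and role in c.allowed_roles]
    citations = [(c.source, c.version) for c in kept]
    return kept, citations


chunks = [
    Chunk("Restart the portal deployment...", "runbooks/portal-restart", "v7",
          date(2024, 5, 1), frozenset({"sre"})),
    Chunk("Old restart procedure...", "runbooks/portal-restart", "v2",
          date(2022, 1, 10), frozenset({"sre"})),
    Chunk("Billing dispute policy...", "policies/billing", "v1",
          date(2024, 4, 1), frozenset({"finance"})),
]
kept, citations = usable_context(chunks, role="sre", today=date(2024, 6, 1))
```

Returning the citations alongside the text is what lets a trace later show the exact source and version the agent reasoned over.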
Building Agent Infrastructure at Commercial Scale
Once you move beyond one or two isolated agents, the real work shifts from prompt design to platform design.
- Shared control plane. Commercial scale means dozens of workflows, many teams, strict audit requirements, and hard limits on latency and cost. You need common model routing, tool authentication, state management, policy enforcement, tracing, and release management.
- Production resilience. Tool calls fail, APIs rate-limit, queues back up, and workflows stop halfway through. The runtime needs idempotent actions, resumable state, retry policy, dead-letter handling, and a complete action log.
- Clear ownership split. The platform team should own the runtime, SDK, eval harness, telemetry, and tool contracts. The domain team should own thresholds, approval rules, escalation paths, and success criteria.
- Agent observability. You need traces for each run, timing for every tool call, retrieval hit rates, queue depth, retry counts, human override rate, and policy rejection rate. You also need structured capture of prompt version, context, tool outputs, model response, confidence, final action, and downstream result.
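The resilience bullet above can be illustrated with an idempotency key plus bounded retries. This is a deliberately small in-memory sketch; `ActionLog`, `run_once`, and the key format are hypothetical, and a real runtime would persist the log and route exhausted retries to a dead-letter queue.

```python
class ActionLog:
    """In-memory action log keyed by idempotency key."""

    def __init__(self):
        self.entries = {}

    def run_once(self, key, fn, retries=3):
        # Idempotency: a key that already succeeded is never executed again.
        if key in self.entries:
            return self.entries[key]
        last_error = None
        for attempt in range(1, retries + 1):
            try:
                result = fn()
            except Exception as exc:      # tool failure: retry up to the limit
                last_error = exc
                continue
            self.entries[key] = {"result": result, "attempts": attempt}
            return self.entries[key]
        # Retries exhausted: surface for dead-letter handling.
        raise RuntimeError(f"dead-letter: {key}") from last_error


calls = {"n": 0}

def flaky_restart():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated API timeout")
    return "restarted"

log = ActionLog()
first = log.run_once("INC-1:restart:patient-portal-api", flaky_restart)
second = log.run_once("INC-1:restart:patient-portal-api", flaky_restart)
```

The second call returns the recorded result without touching the tool again, which is what makes a resumed workflow safe to replay.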
Once prompts affect production behavior, they should be treated like deployable assets: versioned, reviewed, evaluated, and rollbackable. If a team cannot tell you which prompt version produced a decision, compare that version against the previous one, or revert it safely after a regression, the prompt layer is not production-ready yet.
The improvement loop runs in five stages:
- Instrument. Trace each agent step, tool call, and policy gate.
- Capture. Record prompt version, context, outputs, confidence, and result.
- Evaluate. Replay real cases and measure drift, latency, and quality.
- Gate. Block regressions before production rollout.
- Improve. Tighten prompts, policies, tools, and eval sets.
- CI should replay real work. Each release should run against a fixed evaluation set built from real past cases, failure cases, and policy edge cases.
- Regressions should block release. If accuracy drops, hallucinations rise, latency regresses, or policy violations increase, the build should fail before production.
- Improvement should be evidence-based. Strong teams improve the system by replaying runs, labeling outcomes, tightening prompts and policies, and proving that the next version is better than the last one.
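A release gate along these lines can be a few lines of comparison logic in CI. The metric names and the 1 percent relative tolerance below are assumptions for illustration; the only idea taken from the text is that any regression fails the build.

```python
HIGHER_IS_BETTER = {"accuracy"}
LOWER_IS_BETTER = {"hallucination_rate", "p95_latency_s", "policy_violations"}


def regression_failures(baseline, candidate, tolerance=0.01):
    """Return the metrics on which the candidate is worse than baseline."""
    failures = []
    for name in HIGHER_IS_BETTER:
        if candidate[name] < baseline[name] * (1 - tolerance):
            failures.append(name)
    for name in LOWER_IS_BETTER:
        if candidate[name] > baseline[name] * (1 + tolerance):
            failures.append(name)
    return sorted(failures)


baseline = {"accuracy": 0.92, "hallucination_rate": 0.02,
            "p95_latency_s": 4.0, "policy_violations": 0}
candidate = {"accuracy": 0.90, "hallucination_rate": 0.02,
             "p95_latency_s": 3.5, "policy_violations": 0}

failures = regression_failures(baseline, candidate)
build_passes = not failures   # a CI job would exit non-zero when this is False
```

Here the candidate is faster but less accurate, so the build fails on accuracy alone.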
Central identity for tool access. Per-action authorization. Versioned prompts and policies. Replayable traces. Cost and latency budgets. Sandboxed test environments. Clear rollback paths. These are not extras. They are the difference between an interesting demo and a system you can trust across a large business.
Here's a link to the agents I've built that are plug and play. Deploy and enjoy. Check out the repo.
Quality gating: how we test AI agents before we let them loose
Every pattern in this guide involves an agent that can take real action: restart a pod, decline a transaction, modify a patient record, defer a debt. None of these are low-stakes outputs. Even a pod restart can interrupt a transaction in flight, and that matters to any team that takes customer outcomes seriously. Before any of that goes live, every engineering team has to answer the same question: how do you know the agent is ready?
You do not know until you run it hard against realistic conditions, your actual failure scenarios, and clear thresholds it has to meet before it gets anywhere near production. That is what a simulated pre-production environment is for, and building one well is not optional. It is the engineering work that turns a demo into a deployable system.
If you do not test an AI agent in a realistic, controlled environment before deployment, you do not know whether it is safe, reliable, or aligned with how your systems actually behave under stress.
The three-layer simulation architecture
A production-grade simulation environment has three distinct layers. Each one has a job. If you conflate them or skip one, you end up with a test harness that tells you what you want to hear rather than what you need to know.
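The three layers, and the gate they feed:
- Synthetic data plane. A realistic world for the agent to operate in.
- Agent harness. Runs the loop, stubs live calls, and captures every decision.
- Evaluation engine. Scores correctness, safety, and latency per cycle.
- Readiness gate. Threshold check: pass or block promotion.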
Layer 1: the synthetic data plane
This is the simulated world the agent operates in. The goal is not clean, idealized data. The goal is data that is hard enough to be honest. That means building four components and getting them right.
Metric forge
Emits realistic time-series data for CPU, memory, latency, and error rate. Includes injected anomalies at realistic intervals: gradual memory leaks, sudden CPU spikes, p99 latency degradations with the right curve shape. An agent trained on perfect sine waves will fail on real brownfield telemetry.
Incident factory
Generates scenario templates with varying severity, upstream cause chain, and affected service scope. Cascading failures, pod OOM loops, connection pool exhaustion, config change propagation. Each scenario has a known ground-truth root cause the eval engine can score against.
Log emitter
Produces structured and unstructured log streams with realistic noise ratios. Real logs have repeated lines, malformed JSON, varying timestamp precision, and duplicate events from multiple sources. The log emitter should reproduce all of that, not sanitize it away.
Fault injector
Introduces controlled chaos: API timeouts, partial data returns, missing service catalog entries, contradictory signals from different monitoring sources. Stress-tests the agent behavior under degraded signal quality, which is exactly what real production looks like during a high-severity incident.
The hardest part of the synthetic data plane is making it statistically realistic. You cannot just generate random numbers within a plausible range. You need to sample from the actual distributions you see in production: the spike frequency, the correlation patterns between services, the noise floor, the baseline drift. The easiest way to do this is to capture two to four weeks of real production telemetry and use it as the seed for your synthetic generator. Anonymize where needed, then build a generator that samples from those empirical distributions.
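The simplest version of "sample from empirical distributions" is bootstrap resampling of captured values, with a known anomaly layered on top. The sketch below assumes a handful of made-up seed samples and a hypothetical `synth_series` helper. It ignores autocorrelation and cross-service correlation, which a real generator would need to model.

```python
import random


def synth_series(seed_samples, length, rng, leak_start=None, leak_per_step=0.0):
    """Resample captured values, optionally adding a gradual leak."""
    series = []
    for i in range(length):
        value = rng.choice(seed_samples)          # empirical draw, not a plausible range
        if leak_start is not None and i >= leak_start:
            value += (i - leak_start + 1) * leak_per_step   # injected ground-truth anomaly
        series.append(value)
    return series


# Stand-in for anonymized production memory readings (percent used).
seed_samples = [40.0, 41.5, 39.8, 42.1]
rng = random.Random(7)                            # fixed seed keeps runs replayable

baseline = synth_series(seed_samples, length=50, rng=rng)
leaking = synth_series(seed_samples, length=50, rng=rng,
                       leak_start=40, leak_per_step=1.0)
```

Because the leak start is known, the eval engine has a ground-truth onset to score detection against.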
Layer 2: the agent harness
The harness runs the agent through its operational loop against the synthetic data and captures everything. Three components do the work.
- Context builder. Assembles what the agent sees each cycle: which signals, what history, what tools are available. This has to match the real context structure exactly. An agent that performs well in simulation but sees a different context shape in production has not been tested.
- Decision loop. Executes the observe-reason-act-verify cycle and records every step: the context assembled, the tool calls made, the reasoning chain, the action taken, the time elapsed. Every cycle generates a complete, replayable trace.
- Tool interceptor. This is the most important component. It stubs every real-world call so the agent can reason and act as if tools executed, without anything actually happening. The interceptor should return realistic responses based on the synthetic scenario, not just blanks. The agent needs to believe its actions had consequences so subsequent reasoning reflects that. A PagerDuty escalation that returns a plausible incident ID. A kubectl restart that returns a realistic pod status progression. Blank responses will produce agents that pass simulation and fail in production because they never learned to reason over tool outputs.
# The interceptor wraps every tool call during simulation
# It returns realistic synthetic responses, not empty stubs
class ToolInterceptor:
    def kubectl_rollout_restart(self, deployment, namespace, scenario):
        # Simulate realistic pod restart progression
        # Pass scenario context so the stub reflects the sim state
        return {
            "status": "initiated",
            "message": f"deployment.apps/{deployment} restarted",
            "simulated_recovery_time": scenario.expected_ttdr_seconds
        }

    def pagerduty_escalate(self, incident_id, team, context):
        # Return a plausible PD incident ID so the agent can
        # reason correctly in subsequent steps
        return {
            "incident": {"id": "SIM-" + incident_id, "status": "triggered"},
            "assignment": {"team": team, "escalated_at": context.sim_timestamp}
        }

# Every intercepted call is logged: input, output, timestamp, scenario_id
# This is the data the evaluation engine scores against

Layer 3: the evaluation engine
The eval engine is what turns hundreds of simulation cycles into a pass or fail decision. It needs to score three things independently, not just average them together.
- Correctness. Did the agent identify the right root cause? Did it route to the right team? Did it classify the right severity? Score this per scenario type, not just in aggregate. An agent that is excellent at CPU spikes but consistently misdiagnoses cascading failures should not pass, even if its aggregate accuracy is above threshold.
- Safety. Would this action have made things worse? Estimate blast radius for every action the agent took. Flag actions that would have caused downstream harm in a live system. A high correctness score combined with a poor safety score is a dangerous agent. It knows what the problem is but its remediation approach is destructive.
- Regression coverage. Every known-bad scenario from your production history should be in the simulation set and must be re-run on every evaluation cycle. If the agent regresses on a case that worked last week, that is a blocking signal regardless of aggregate performance.
All three thresholds must clear simultaneously. An agent that passes correctness and regression but has a 90 percent safe action rate is not production-ready. An agent that passes safety and regression but has 75 percent correctness is not production-ready. Averaging them together to produce a single score lets dangerous tradeoffs hide. The gate logic is AND, not mean.
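The difference between AND and mean is easy to show. The threshold values below are hypothetical placeholders; the gate shape is the part that matters.

```python
from statistics import mean

# Hypothetical floors; real values are a governed policy decision.
THRESHOLDS = {
    "correctness": 0.90,
    "safe_action_rate": 0.99,
    "regression_pass_rate": 1.0,
}


def readiness_gate(scores, thresholds=THRESHOLDS):
    """Pass only if every score clears its own floor; report which ones failed."""
    failures = {name: scores[name]
                for name, floor in thresholds.items()
                if scores[name] < floor}
    return len(failures) == 0, failures


scores = {"correctness": 0.97, "safe_action_rate": 0.90, "regression_pass_rate": 1.0}
passed, failures = readiness_gate(scores)
averaged = mean(scores.values())   # looks healthy, hides the unsafe actions
```

In this example the averaged score is above 0.95, yet the gate blocks promotion because the safe action rate alone is below its floor.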
What the simulation loop looks like in practice
A simulation run has a clear flow. Configure the scenario mix and noise parameters. Run N cycles, typically two hundred or more for a meaningful sample. The harness fires each cycle, the interceptor stubs every tool call, and the eval engine scores the outcome. At the end, the readiness gate checks all three thresholds. Pass means the agent promotes to shadow mode. Fail means it goes back for another training or prompt iteration before the next run.
After simulation: shadow mode
Passing the simulation gate does not mean going fully live. It means the agent is ready for shadow mode: it runs alongside the live system, makes decisions, but every action goes to a dry-run queue where a human reviews before anything executes. Shadow mode is the bridge between simulation and production, and it is where you close the gap between synthetic and real data distributions.
Shadow mode has its own promotion criteria. After a defined number of shadow cycles, typically a minimum of two weeks and one hundred real incidents, you score the shadow run the same way you scored the simulation: correctness, safety, and regression. If the shadow scores are within a defined tolerance band of the simulation scores, the data distribution gap is acceptable and you can promote. If shadow scores are materially lower, your synthetic data is not representative enough and you need to go back and fix the generator before the next cycle.
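The promotion rule above reduces to a small decision function. The 0.05 tolerance band and the outcome labels are assumptions for the sketch; the two-week and one-hundred-incident minimums come from the text.

```python
def shadow_decision(sim_scores, shadow_scores, days, incidents,
                    tolerance=0.05, min_days=14, min_incidents=100):
    """Compare shadow scores with simulation scores once enough evidence exists."""
    if days < min_days or incidents < min_incidents:
        return "keep_shadowing"                 # not enough real cases yet
    gaps = {k: sim_scores[k] - shadow_scores[k] for k in sim_scores}
    if any(gap > tolerance for gap in gaps.values()):
        return "fix_generator"                  # synthetic data is not representative
    return "promote"


sim = {"correctness": 0.93, "safety": 0.99, "regression": 1.0}
close_shadow = {"correctness": 0.91, "safety": 0.98, "regression": 1.0}
weak_shadow = {"correctness": 0.80, "safety": 0.98, "regression": 1.0}

decision_close = shadow_decision(sim, close_shadow, days=21, incidents=140)
decision_weak = shadow_decision(sim, weak_shadow, days=21, incidents=140)
decision_early = shadow_decision(sim, close_shadow, days=5, incidents=30)
```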
The promotion path:
- Candidate. An agent version ready for evaluation.
- Simulation. 200+ cycles on synthetic data, scored by the eval engine.
- Shadow mode. Live data, dry-run queue, human review.
- Production. Autonomous within governed thresholds.
What good simulation data tells you
Beyond pass or fail, the simulation run surfaces specific intelligence that makes the agent better. The scenarios that consistently fail cluster around identifiable patterns: noisy signals from a particular data source, a class of incident the agent has not seen enough examples of, a tool stub response shape that does not match what the real API returns. Each of these is actionable. Fixing them before shadow mode is far cheaper than discovering them in production.
- Noise sensitivity. Run the simulation with fault injection rates at 10, 30, and 60 percent. If accuracy degrades sharply above 30 percent noise, the agent is more brittle than it should be and needs more diverse training examples.
- Scenario coverage gaps. If correctness on network partition scenarios is 55 percent while CPU spike correctness is 91 percent, you have a coverage gap. Add more network partition examples to the training set before the next run.
- Safety floor by action type. Break the safe action rate down by action category: restart, escalate, modify config, close ticket. If one action type has a lower safe rate, that action type needs a tighter confidence threshold before it is authorized in production.
- Latency under load. The simulation should run at a concurrency level that matches peak production incident rates. If latency degrades significantly at concurrency, the agent architecture has a scaling problem that needs to be fixed before shadow mode.
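The per-action safety breakdown is a simple group-by over the harness records. The sketch assumes each record is an `(action_type, was_safe)` pair and uses a hypothetical 0.99 floor.

```python
from collections import defaultdict


def safe_rate_by_action(records):
    """records: iterable of (action_type, was_safe) pairs from the harness log."""
    counts = defaultdict(lambda: {"safe": 0, "total": 0})
    for action_type, was_safe in records:
        counts[action_type]["total"] += 1
        counts[action_type]["safe"] += int(was_safe)
    return {a: c["safe"] / c["total"] for a, c in counts.items()}


def actions_needing_tighter_threshold(rates, floor=0.99):
    """Action types whose safe rate falls below the floor."""
    return sorted(a for a, rate in rates.items() if rate < floor)


records = [("restart", True)] * 4 + [("modify_config", True), ("modify_config", False)]
rates = safe_rate_by_action(records)
flagged = actions_needing_tighter_threshold(rates)
```

An aggregate safe rate over these six records would be about 83 percent and would not say which action type is the problem. The breakdown does.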
Every hard guardrail that matters in production should be tested explicitly in simulation. If the confidence threshold for autonomous action is 0.75, run scenarios designed to push the agent just below and just above that threshold and verify the behavior is correct in both directions. If a HIPAA field is on a never-auto-correct list, run scenarios where the agent is explicitly tempted to correct it and verify it escalates every time. The simulation is not just a performance test. It is a verification of every safety contract the agent is supposed to uphold.
Use Cases by Sector
Healthcare SRE: Four Production Patterns
Healthcare IT runs some of the highest-stakes infrastructure on the planet, and it runs it with SRE teams that are chronically understaffed relative to the alert volume they absorb. L1 noise drowns on-call engineers, RCA is manual and slow, identity mismatches cause access failures that look like infrastructure bugs, and runbooks sit in Confluence untouched because nobody has time to execute them carefully. These four patterns address each of those failure modes directly.
Root Cause Analysis Agent
An agent that fires the moment an incident is acknowledged. It correlates signals across logs, metrics, distributed traces, and change events, reconstructs a causal timeline, ranks hypotheses by evidence weight, and posts a structured RCA to the team channel before the on-call engineer has opened their first terminal window.
The flow at a glance:
- Collect. Logs, traces, metrics, and deploy events.
- Correlate. Timestamp alignment and propagation chain.
- Rank hypotheses. Rank by evidence and cite sources.
- Post. Slack/Teams message with timeline and actions.
Step-by-Step Execution Flow
timeline = [
    {"t": "14:28:00", "event": "identity-db max_connections 200→100 (infra-bot config change)"},
    {"t": "14:31:02", "event": "ehr-proxy connection timeouts to identity-db begin"},
    {"t": "14:31:04", "event": "ehr-proxy db_conn_pool_wait +340ms spike"},
    {"t": "14:32:18", "event": "patient-portal-api error rate crosses 8%"},
    {"t": "14:33:00", "event": "INC-88921 fired by Datadog anomaly detector"}
]

User Identity Mismatch Detection and Remediation Agent
When an incident ticket references a patient or staff member, this agent detects mismatched identity data across EHR, IAM, billing, and HR directory systems. It determines whether the mismatch is causing or contributing to the incident, auto-corrects safe fields, and escalates with a reconciliation brief where PHI or ambiguous authority requires human review. That 403 Forbidden you have been paging about is often a six-character string mismatch between two systems that have never talked to each other.
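The flow at a glance:
- Extract. NLP pulls identifiers from the ticket: MRN, email, SSO, employee ID.
- Look up. Query EHR, IAM, billing, and HR in parallel.
- Classify. Sensitivity tier, authority, and impact.
- Act. Safe: auto-fix. PHI: human review.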
Step-by-Step Execution Flow
mismatch = {
    "field": "email",
    "ehr_value": "[email protected]",
    "iam_value": "[email protected]",
    "hr_value": "[email protected]",
    "sensitivity": "PII-tier2",
    "authoritative_source": "IAM",  # 2 of 3 systems agree
    "incident_impact": "HIGH / EHR email mismatch breaks SSO claim validation",
    "auto_correctable": True,
    "correction": "update EHR email to [email protected]"
}

PHI tier-1 fields (MRN, DOB, SSN, insurance IDs) are on a hardcoded never-auto-correct list. This is enforced in code, not in the authoritative source YAML config, which is editable. The distinction matters: editable config can be accidentally changed. The PHI tier-1 prohibition lives in the agent's execution layer where it cannot be overridden without a code deployment and a review.
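A minimal sketch of that separation, assuming hypothetical field keys and a `plan_correction` helper: the hardcoded list is checked first, so no value in the editable config can unlock a tier-1 field.

```python
# Hardcoded in the execution layer; changing it requires a code deployment.
NEVER_AUTO_CORRECT = frozenset({"mrn", "dob", "ssn", "insurance_id"})


def plan_correction(field_name, editable_config):
    """Decide between auto-correction and escalation for one mismatched field."""
    if field_name.lower() in NEVER_AUTO_CORRECT:
        return "escalate"                      # config is never consulted for tier-1
    if editable_config.get(field_name, {}).get("auto_correctable", False):
        return "auto_correct"
    return "escalate"                          # default to a human when unsure


# Simulates a config that was edited by mistake to allow SSN auto-correction.
config = {
    "email": {"auto_correctable": True},
    "ssn": {"auto_correctable": True},
}
```

With this ordering, `plan_correction("ssn", config)` still escalates even though the config says otherwise.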
Runbook Execution Agent
An agent that parses a natural-language or structured runbook into a typed, verifiable action plan, executes each step via tool calls, and checks the expected outcome before proceeding to the next step. When a step fails verification, whether pods are not stabilizing, a metric is still elevated, or a health check returns the wrong status, the agent halts and posts a precise failure state. It does not blindly continue. That distinction is everything.
The flow at a glance:
- Parse. The LLM converts runbook steps to a typed action plan.
- Execute. kubectl, API call, health check, or metric query.
- Verify. Check state against the expected result.
- Gate. Fail: post state and await a human. Pass: proceed.
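The execute-verify-halt behavior can be sketched as a loop over typed steps. The `executors` and `verifiers` registries and the stubbed tool results below are assumptions made so the example runs without a cluster.

```python
def execute_plan(plan, executors, verifiers):
    """Run each step, verify its outcome, and halt at the first failed check."""
    for step in plan:
        observed = executors[step["type"]](step)
        if not verifiers[step["type"]](step, observed):
            # Do not continue blindly: report the precise failure state.
            return {"status": "halted", "failed_step": step["step"],
                    "observed": observed}
    return {"status": "completed", "failed_step": None, "observed": None}


executed = []

def stub(result):
    """Build a fake executor that records the call and returns a fixed result."""
    def run(step):
        executed.append(step["step"])
        return result
    return run

plan = [
    {"step": 1, "type": "kubectl_rollout_restart"},
    {"step": 2, "type": "http_health_check"},
    {"step": 3, "type": "metric_check"},
]
executors = {
    "kubectl_rollout_restart": stub({"pods_running": True}),
    "http_health_check": stub({"status": 503}),      # simulated failing check
    "metric_check": stub({"value": 0.4}),
}
verifiers = {
    "kubectl_rollout_restart": lambda step, out: out["pods_running"],
    "http_health_check": lambda step, out: out["status"] == 200,
    "metric_check": lambda step, out: out["value"] < 1.0,
}
outcome = execute_plan(plan, executors, verifiers)
```

Step 3 never runs here, because step 2 failed verification and the loop returned the observed state instead of proceeding.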
Step-by-Step Execution Flow
parsed_plan = [
    {"step": 1, "type": "kubectl_rollout_restart",
     "target": "deployment/patient-portal-api", "namespace": "prod",
     "expected_outcome": "all pods Running within 120s"},
    {"step": 2, "type": "http_health_check",
     "url": "https://patient-portal.internal/health",
     "expected_outcome": "status=200, body contains 'healthy'"},
    {"step": 3, "type": "metric_check",
     "query": "error_rate{service='patient-portal-api'}",
     "expected_outcome": "value < 1.0 for 2 consecutive minutes"}
]

Banking: Four Production Patterns
Financial institutions process millions of transactions a day, but their operational responses, including fraud reviews, compliance screens, collections, and wealth rebalancing, still involve humans doing work that is fundamentally pattern-matching against rules. Agentic AI changes this by automating the steps that do not require human judgment, while keeping humans clearly in the decision for the steps that do. The patterns below have real regulatory constraints: BSA/AML, Reg F, Reg BI. Those constraints are modeled in the architecture, not treated as an afterthought.
Real-Time Fraud Detection Agent
A streaming agent that monitors every transaction in milliseconds, fuses behavioral biometrics with entity graph signals, scores fraud probability, auto-declines on high-confidence hits, and autonomously drafts Suspicious Activity Reports for FinCEN filing. The difference between a rule engine and this agent is context: rules know what happened. The agent understands what it means given everything it knows about this account.
The flow at a glance:
- Ingest. Kafka streams: card network, ACH, wire.
- Entity graph. Account linkages, device, and mule signals.
- Score. Behavioral, geo, velocity, and graph features.
- Act. Decline or flag. SAR drafted for BSA review.
Step-by-Step Execution Flow
The 0.85 auto-decline threshold is not a model hyperparameter. It is a governed business policy set by the fraud operations team and audited quarterly. Threshold changes require approval from fraud ops, compliance, and the model risk management function. The distinction matters: hyperparameters can be changed by any engineer with model access. A governed policy requires a documented change process. The threshold lives in the policy store, not the model config.
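A sketch of what "a governed policy requires a documented change process" can mean in code. `PolicyStore`, the approver labels, and `should_auto_decline` are hypothetical. The three approving functions and the 0.85 value are the ones named above.

```python
REQUIRED_APPROVERS = frozenset({"fraud_ops", "compliance", "model_risk"})


class PolicyStore:
    """Holds the auto-decline threshold and a history of approved changes."""

    def __init__(self, auto_decline_threshold):
        self._threshold = auto_decline_threshold
        self.history = [(auto_decline_threshold, frozenset())]

    @property
    def auto_decline_threshold(self):
        return self._threshold

    def change_threshold(self, new_value, approvals):
        missing = REQUIRED_APPROVERS - set(approvals)
        if missing:
            raise PermissionError(f"missing approvals: {sorted(missing)}")
        self._threshold = new_value
        self.history.append((new_value, frozenset(approvals)))   # audit trail


def should_auto_decline(fraud_score, store):
    # The scoring path reads the threshold from the store; it cannot set it.
    return fraud_score >= store.auto_decline_threshold


store = PolicyStore(auto_decline_threshold=0.85)
```

An engineer with model access can change the score, but not the line the score is compared against.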
Continuous KYC/AML Monitoring Agent
Traditional KYC is periodic: screen at onboarding and refresh every one to three years. Perpetual KYC changes the model so that every customer event triggers a re-evaluation. A change of address, a new transaction pattern, an adverse media hit, a sanctions list update: the agent evaluates all of it in real time, refreshes the risk tier, triggers enhanced due diligence where warranted, and files CTRs and SARs autonomously where the evidence is clear.
The flow at a glance:
- Trigger. CIF change, transaction pattern shift, or list delta.
- Screen. OFAC SDN, EU, OFSI, and UN lists with fuzzy matching.
- Score. PEP, adverse media, UBO graph, and risk tier.
- Act. Restrict account, EDD queue, CTR/SAR.
AI Wealth Advisor Co-Pilot
The co-pilot does not replace the advisor. It eliminates the 70 percent of advisor time spent on assembly work: pulling portfolio data, identifying drift, running tax-loss harvesting screens, and checking suitability. The agent does all of that continuously across every client portfolio and presents the advisor with a prioritized action list each morning. The advisor's job becomes approving, modifying, or declining recommendations, not discovering them.
Conversational Collections AI Agent
The collections call is one of the most disliked experiences in consumer banking, for the customer and increasingly for the collections team, as compliance requirements under Reg F tighten. The conversational agent handles outreach, detects hardship signals, negotiates repayment plans within pre-approved parameters, and escalates to a human agent for edge cases. Reg F communication limits are enforced in the outreach scheduler as hard constraints, not prompt instructions.
Step-by-Step Execution Flow
The seven-in-seven rule and the time-of-day restrictions are enforced in the outreach scheduler by checking a per-customer contact log before any message is queued. This is not a prompt instruction and it is not a filter applied after message generation. The scheduler cannot queue a message that would violate the rule. The architectural principle is the same as everywhere else in this guide: rules that would cause harm if violated live in code, not in prompts.
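A minimal sketch of that scheduler check. The limits are parameters rather than legal advice: seven attempts per rolling seven days and an 8:00 to 21:00 local-time window are assumed defaults here and should be confirmed with compliance. `may_queue` and the log format are hypothetical.

```python
from datetime import datetime, time, timedelta


def may_queue(contact_log, now_local, max_attempts=7, window_days=7,
              earliest=time(8, 0), latest=time(21, 0)):
    """Return True only if queuing a contact now stays inside both limits."""
    if not (earliest <= now_local.time() < latest):
        return False                               # outside permitted hours
    window_start = now_local - timedelta(days=window_days)
    recent = [t for t in contact_log if t > window_start]
    return len(recent) < max_attempts              # frequency cap in rolling window


# One attempt per day at 10:00 local time, March 4 through March 10.
contact_log = [datetime(2024, 3, day, 10, 0) for day in range(4, 11)]

blocked_by_count = may_queue(contact_log, datetime(2024, 3, 11, 9, 30))
allowed_later = may_queue(contact_log, datetime(2024, 3, 11, 10, 30))
blocked_by_hour = may_queue([], datetime(2024, 3, 11, 22, 0))
```

The check runs before any message is generated or queued, so the language model never gets the chance to produce an outreach that the scheduler would have to retract.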