Burning tokens or building outcomes?

The test worth running first

More is not merry here. A bigger token bill, a longer vendor list, and a wall of green usage charts can all coexist with a workflow that has not improved in any way the business actually cares about. Look at the token spend, then look at the work it was supposed to improve. If a coding assistant is active every day but pull request cycle time, escaped defects, and review quality look the same as the quarter before rollout, the company bought activity. If an ops assistant can summarize incidents but alert noise, MTTR, and handoff quality are unchanged, the company bought a prettier status update on the same broken process.

Across healthcare, logistics, transport, financial services, insurance, manufacturing, and the public sector, the pattern rhymes. Tools land before anyone has written down what should get faster, what should get cheaper, what should get safer, and who is accountable for the number moving. Without that, the rollout has nothing to be measured against, and "we are using AI" quietly becomes the report.


0: value from tools that are not tied to a measurable work result

3: questions to answer (what improves, how it is measured, who owns it)

OKR: the rollout should start with the result, not the vendor list

Yield: what the company gets back for the tokens, licenses, and time spent

Where the stall actually happens

Most rollouts start with tools and licenses, then go looking for a problem to point them at. That order is backwards, and it shows. The hard part is agreeing on the work result before the tool shows up, because that is the conversation that surfaces which workflow is actually broken, which team owns it, and which leader is willing to change it. Skip that conversation and usage becomes the metric, because usage is the only thing left to count. A dashboard full of prompts is not a strategy, it is a receipt.

The data underneath

Clinical notes, dispatch logs, supplier records, ledger entries, and claims documents usually live in systems that were never built to talk to each other. An AI layer on top of that can produce a tidier version of the same confusion. Before the model gets useful, somebody has to decide which fields are the source of truth and which ones are decoration.

The operating agreement

Teams need a written answer to three questions before they touch the tool. What is the model allowed to draft on its own? What still requires a human to read and sign? And when the output is wrong, what happens, who finds out, and how does the workflow recover? Without those answers, every disagreement turns into a fight between the people pushing speed and the people pushing risk.

Who owns the number

Somebody has to own the baseline and the weekly trend, with the authority to change the workflow underneath, not just renew the license. If the goal is shorter review times, quieter on-call, or faster analysis, the owner has to be senior enough to redesign the steps, the approvals, and sometimes the team shape. A name on a slide is not ownership.

Pilot: The tool runs in a narrow slice of the work. The team learns where it earns its keep and where it creates rework that nobody had before.

Staged: The tool is live across more of the org, usage charts look healthy, but the workflow underneath has barely moved. Cost rises faster than benefit.

Measured: The workflow has been redesigned around what the tool is good at. Ownership is clear, the outcome is tracked in business terms, and spend follows yield.

The test: turn the tool off for a week. If the business runs the same way, the company has been paying for activity, not results.

Figure 1: The gap between deploying AI and changing the work.

Governance belongs at the start

Legal, risk, finance, security, and operations should not be pulled in two weeks before launch. By that point the team is defending a tool instead of proving a result, and any pushback feels like a blocker rather than part of the design. The review path is part of the workflow, not a wrapper around it. Designed early, it speeds the rollout. Bolted on late, it becomes the reason the rollout stalls.

Start with the work

Engineering, operations, data, risk, finance, and the business owner sit in the same room and agree on the result before rollout. The OKR needs a baseline that everyone trusts, a target that survives questioning, a time window short enough to learn from, and one owner who cannot pass the number to someone else.
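One way to keep that agreement honest is to write the OKR down as a record that can be checked before rollout. The sketch below is a minimal illustration, not a prescribed format: the class name, the field names, the 90-day ceiling on the time window, and the example owner are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class OutcomeOKR:
    """One outcome, one number, one owner. All names here are illustrative."""
    metric: str      # e.g. "median PR cycle time (hours)"
    baseline: float  # measured before the tool goes live
    target: float    # the value the owner commits to
    start: date
    end: date
    owner: str       # a single named person, not a group

    def problems(self, max_days: int = 90) -> List[str]:
        """Return the reasons this OKR is not ready; an empty list means ready."""
        issues: List[str] = []
        if self.baseline == self.target:
            issues.append("target equals baseline: nothing is supposed to move")
        if self.end <= self.start:
            issues.append("time window is empty")
        elif (self.end - self.start).days > max_days:
            issues.append(f"time window is longer than {max_days} days")
        # Crude check (an assumption): separators suggest a group, not a person.
        if not self.owner.strip() or any(s in self.owner for s in (",", "&", "/")):
            issues.append("owner must be exactly one named person")
        return issues


ready = OutcomeOKR("median PR cycle time (hours)", 30.0, 22.0,
                   date(2025, 1, 6), date(2025, 3, 31), "Dana Reyes")
vague = OutcomeOKR("PR cycle time", 30.0, 30.0,
                   date(2025, 1, 6), date(2026, 1, 6),
                   "AI working group, platform team")

print(ready.problems())  # []
print(vague.problems())  # three issues: target, window, owner
```

The check is deliberately blunt. Its only job is to stop a rollout from starting with a number nobody measured or an owner who is really a committee.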

Define the review path

For coding assistants, decide which changes can ship with AI-assisted review and which always need a human approver, and write that into the branch protections instead of leaving it to culture. For ops assistants, decide what can be summarized for humans and what can actually be acted on. For analysis, decide which outputs require a source check before they reach a decision-maker.

Measure risk and yield together

Token spend, license spend, review time, defect rates, incident outcomes, and analyst throughput belong in the same conversation, on the same page, in the same review. A tool that saves twenty minutes upstream and adds thirty minutes of cleanup downstream is not a win, it is a redistribution of work to a different team.
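The twenty-minutes-saved, thirty-minutes-added case above can be written as simple arithmetic. This is a minimal sketch; the step names and the function names are hypothetical, and the minute figures are the ones from the paragraph, not measured data.

```python
from typing import Dict, List


def net_minutes_per_item(deltas_by_step: Dict[str, float]) -> float:
    """Sum the per-step time changes. Negative means faster end to end."""
    return sum(deltas_by_step.values())


def steps_paying_for_it(deltas_by_step: Dict[str, float]) -> List[str]:
    """Steps whose workload went up after the tool was introduced."""
    return [step for step, delta in deltas_by_step.items() if delta > 0]


# Minutes changed per work item: 20 saved upstream, 30 added downstream.
deltas = {"drafting": -20, "downstream cleanup": 30}

print(net_minutes_per_item(deltas))  # 10
print(steps_paying_for_it(deltas))   # ['downstream cleanup']
```

A positive net means each item now takes ten minutes longer overall, and the second function names the team that absorbed the difference.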

Where the tool helps, and where the human still decides

The strongest use cases sit close to repeated, well-shaped work. Draft the code change. Summarize the incident timeline. Classify the document. Reconcile the ledger lines that match. Pull the first cut of an analysis from the warehouse. These shorten the path to good work when the work itself is well-defined. They become liabilities the moment the workflow asks the model to make a judgement call it was not designed to make.

The decision still belongs to a person. The pull request, the incident call, the patient outcome, the credit decision, the claim adjudication, the financial answer, the compliance sign-off — these all carry consequences that someone has to be accountable for. The tool's job is to make that accountability easier to discharge well, not to dilute it.

Healthcare

Where the tool earns its keep: Drafting after-visit summaries from the encounter note, suggesting billing codes from the documented procedure, pre-populating intake forms, routing referrals to the right specialty, and flagging documents that look incomplete before they reach a clinician.

Where the human still owns the call: Diagnosis, treatment plan, medication change, anything that touches patient safety. The clinician reads the AI draft, edits what is wrong, and signs. The signature is the workflow, and it should not be a rubber stamp.

Logistics

Where the tool earns its keep: Suggesting route options, predicting ETAs from live conditions, summarizing exceptions for the dispatch desk, drafting carrier emails for delays, and surfacing the likely cause of a missed pickup before the dispatcher has to dig for it.

Where the human still owns the call: Carrier disputes, contract interpretation, escalations to the customer, and any decision that costs real money or strains a relationship. The dispatcher uses the AI summary to move faster, but owns the call and the email that goes out.

Financial services

Where the tool earns its keep: Surfacing fraud signals, matching ledger entries during reconciliation, drafting first-pass variance analysis, generating audit support narratives, and pulling the relevant policy paragraphs into a reviewer's queue.

Where the human still owns the call: Credit decisions, regulatory filings, audit responses, and compliance sign-off. The analyst's job shifts from gathering to judging, but the judgement, and the regulatory accountability that comes with it, stays with the named human.

Insurance

Where the tool earns its keep: Classifying inbound documents, extracting structured fields from policies, drafting claim summaries, comparing a claim against the policy text, and flagging files that have the markers of escalation or fraud.

Where the human still owns the call: Adjudication, fraud investigation, coverage disputes, and any conversation that ends with a customer being told yes or no. The adjuster reads the draft, checks the cited policy text, and owns the outcome.

Coding assistants and the human in the loop

Of all the rollouts, coding tools are the easiest to mismeasure. Suggestion-acceptance rates and lines-of-code-generated look fantastic on a dashboard and tell you almost nothing about whether the codebase got better. More lines of code do not mean jack when half of them are slop features nobody asked for, written confidently in the wrong abstraction, wrapped in tests that assert the bug is the spec. Velocity on a treadmill is still a treadmill. The honest measurement is downstream: cycle time on real pull requests, review load on the people doing the reviewing, escaped defects in production, and how often a change has to be reverted or hot-fixed in the week after it merges.

The single thing not worth compromising is human review and approval on changes that touch production. The tool can write the code. A person who understands the system still has to read the diff, agree with the change, and sign for it. That is not bureaucracy, it is how a team stays accountable for what it ships, especially as more of the typing is done by something fluent in syntax and oblivious to consequence.

Use the assistant to make review easier and faster, not optional. A good rollout invests as much in the review side of the workflow as it does in the generation side, because shipping faster only helps if you are shipping the right thing.

Make the diff legible

Use the assistant to write a plain-language summary of what changed and why, organised by file or by concern, attached to the pull request. The reviewer should be able to read the summary, scan the diff, and know which files actually need careful attention. That alone changes the shape of review from "read everything" to "verify the parts that matter."

Surface the risky parts

Have the assistant flag changes to authentication, data access, schema migrations, payment paths, public APIs, and anything else the team has agreed is sensitive. The reviewer is not relying on the flag to be perfect, they are using it as a checklist that ensures nothing slips through unread on a busy day.
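The checklist idea can also be backed by a plain path matcher that runs next to the assistant, so the agreed-sensitive areas get flagged even when the model misses one. The sketch below uses Python's standard `fnmatch.fnmatchcase`; the glob patterns, labels, and file paths are hypothetical and would need to reflect the team's own repository layout.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

# Hypothetical patterns: each team agrees its own list of sensitive areas.
SENSITIVE_PATTERNS: Dict[str, List[str]] = {
    "authentication": ["*/auth/*", "*login*"],
    "schema migration": ["*/migrations/*"],
    "payment path": ["*/payments/*", "*/billing/*"],
    "public API": ["api/public/*"],
}


def flag_sensitive(changed_files: Iterable[str]) -> Dict[str, List[str]]:
    """Map each changed file to the sensitive areas it touches, if any."""
    flags: Dict[str, List[str]] = {}
    for path in changed_files:
        labels = [
            label
            for label, patterns in SENSITIVE_PATTERNS.items()
            if any(fnmatchcase(path, pattern) for pattern in patterns)
        ]
        if labels:
            flags[path] = labels
    return flags


changed = [
    "src/auth/session.py",
    "db/migrations/0042_add_index.sql",
    "docs/readme.md",
]
for path, labels in flag_sensitive(changed).items():
    print(f"{path}: needs careful review ({', '.join(labels)})")
```

Files that match nothing are left out of the result, which keeps the reviewer's checklist short. The flags are a prompt to read carefully, not a verdict on the change.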

Scaffold the tests

Let the assistant propose tests for the new behavior and for the edge cases it can infer from the diff. The reviewer reads the proposed tests with the same eye they read the code, and decides which ones are real coverage and which ones are noise. Tests written as part of review tend to catch things that tests written before the code do not.

Keep approvals human

Branch protections require a named human approver for production paths, full stop. The assistant can comment, suggest, request changes, and even auto-fix style. It does not get an approval vote on code that ships to customers. That line stays bright on purpose, and it is what makes the rest of the speedup safe.
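The rule itself is small enough to state in code. The real enforcement belongs in the hosting platform's branch protection settings. The sketch below only illustrates the logic, and the `Review` type, its fields, and the reviewer names are assumptions made for the example rather than any platform's API. It also assumes the author's own approval does not count.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Review:
    """A single review on a change. Field names are illustrative."""
    reviewer: str
    state: str            # "approved", "changes_requested", or "commented"
    is_bot: bool = False  # True for the assistant or any other automation


def has_human_approval(author: str, reviews: Iterable[Review]) -> bool:
    """True only if a human other than the author approved the change."""
    return any(
        r.state == "approved" and not r.is_bot and r.reviewer != author
        for r in reviews
    )


bot_only = [Review("assistant-bot", "approved", is_bot=True)]
with_human = bot_only + [Review("priya", "approved")]

print(has_human_approval("sam", bot_only))    # False
print(has_human_approval("sam", with_human))  # True
```

The assistant's approval is recorded but never counted, which is the bright line the paragraph describes.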

The goal is not fewer reviewers, it is reviewers who spend their time on the parts of the change that actually need a human brain. AI in the review loop should pay for itself in better-caught bugs and faster, more confident approvals, not in green checkmarks that nobody read. A rubber stamp at machine speed is still a rubber stamp.

What good yield looks like

Adoption is not a license count or a prompt count. Counting prompts to prove value is like counting steps to prove fitness: technically a number, generously a vibe. Good yield means the work got better in a way the business already cared about before the tool arrived. If the metric had to be invented to make the rollout look successful, that is a sign the rollout did not move anything that mattered.

Coding assistants

Pull request cycle time on changes that actually touch production, review load distribution across the team, escaped defects, rework rate, test coverage on changed lines, and a qualitative read on whether engineers spend more time on design and less on boilerplate. If reviewers feel busier, the rollout is not done.
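Two of these measures, cycle time and rework rate, can be computed from a plain export of merged pull requests. The sketch below assumes a hypothetical record shape with `opened`, `merged`, and `reverted_at` timestamps, and treats a revert within seven days of merge as rework. Both the shape and the window are assumptions, and the sample dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass(frozen=True)
class PullRequest:
    """Illustrative record; a real export will have its own field names."""
    opened: datetime
    merged: datetime
    reverted_at: Optional[datetime] = None  # None if never reverted


def median_cycle_hours(prs: List[PullRequest]) -> float:
    """Median time from open to merge, in hours."""
    return median((pr.merged - pr.opened).total_seconds() / 3600 for pr in prs)


def rework_rate(prs: List[PullRequest],
                window: timedelta = timedelta(days=7)) -> float:
    """Fraction of merged changes reverted within the window after merge."""
    reworked = sum(
        1 for pr in prs
        if pr.reverted_at is not None and pr.reverted_at - pr.merged <= window
    )
    return reworked / len(prs)


prs = [
    PullRequest(datetime(2025, 3, 3, 9), datetime(2025, 3, 3, 19)),
    PullRequest(datetime(2025, 3, 4, 9), datetime(2025, 3, 5, 5),
                reverted_at=datetime(2025, 3, 7, 5)),
    PullRequest(datetime(2025, 3, 5, 9), datetime(2025, 3, 7, 21)),
]

print(median_cycle_hours(prs))     # 20.0
print(round(rework_rate(prs), 2))  # 0.33
```

Running the same two functions on the quarter before rollout gives the baseline to compare against.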

Ops assistants

Alert noise, MTTR on real incidents (not drills), handoff quality between shifts, runbook reuse versus runbook drift, the quality of incident write-ups, and time to a credible cause hypothesis. The on-call person should feel less alone, not more dependent on a chat window.
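The "real incidents, not drills" qualifier matters in the arithmetic, because quick drills pull the average down. A minimal sketch, assuming a hypothetical incident record with `detected` and `resolved` timestamps and an `is_drill` flag, and using made-up sample times:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List


@dataclass(frozen=True)
class Incident:
    """Illustrative record; field names are assumptions."""
    detected: datetime
    resolved: datetime
    is_drill: bool = False


def mttr_minutes(incidents: List[Incident]) -> float:
    """Mean time to resolve, in minutes, over real incidents only."""
    durations = [
        (i.resolved - i.detected).total_seconds() / 60
        for i in incidents
        if not i.is_drill
    ]
    if not durations:
        raise ValueError("no real incidents to measure")
    return mean(durations)


incidents = [
    Incident(datetime(2025, 3, 3, 2, 0), datetime(2025, 3, 3, 2, 30)),
    Incident(datetime(2025, 3, 9, 14, 0), datetime(2025, 3, 9, 15, 30)),
    Incident(datetime(2025, 3, 12, 10, 0), datetime(2025, 3, 12, 10, 5),
             is_drill=True),
]

print(mttr_minutes(incidents))  # 60.0
```

With the five-minute drill included, the same data would average under 42 minutes, a flattering number that describes no real incident.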

Analysis assistants

Time from question to credible answer, repeatability when the same question is asked twice, source traceability so the answer can be defended in a meeting, and how often the answer actually changed a decision. An analysis that nobody acted on did not yield anything.

Cost and yield together

Tokens, licenses, training time, integration work, and review time on one side. Cycle time, defect rate, MTTR, throughput, and decision speed on the other. The point is not to spend less on AI, it is to know what each dollar is buying and to keep funding only what is buying something.
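Putting both sides on one page can be as simple as dividing total spend by the movement in the outcome metric. The sketch below is one possible framing, not a standard formula: the function name, the cost categories, and the figures are illustrative assumptions.

```python
from typing import Dict, Optional


def cost_per_unit_improvement(
    costs: Dict[str, float],
    baseline: float,
    current: float,
    lower_is_better: bool = True,
) -> Optional[float]:
    """Total spend divided by how far the metric moved in the right direction.

    Returns None when the metric did not improve, meaning the spend
    bought nothing measurable.
    """
    improvement = (baseline - current) if lower_is_better else (current - baseline)
    if improvement <= 0:
        return None
    return sum(costs.values()) / improvement


# Illustrative monthly spend, in dollars.
costs = {
    "tokens": 4000,
    "licenses": 6000,
    "training": 1000,
    "integration": 3000,
    "review time": 2000,
}

# Median PR cycle time moved from 30 hours to 26 hours.
print(cost_per_unit_improvement(costs, baseline=30, current=26))  # 4000.0

# Same spend, no movement in the metric.
print(cost_per_unit_improvement(costs, baseline=30, current=30))  # None
```

The `None` case is the one to watch for. It is the numeric form of spend that is funding activity rather than a result.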

Why companies fall short

Most rollouts do not fail because the model is bad. They fail because nobody wrote down the baseline, the target, or the owner before the tool went live. The tool gets dropped into the same workflow that already existed, token spend rises right on schedule, and by the time leadership asks what changed, there is nothing clean to point at because there was nothing clean to start from. The model is rarely the problem. The pre-work is almost always the problem.

The baseline gap

The company cannot say what current cycle time, defect rate, alert noise, analysis backlog, or cost per workflow looks like, because nobody measured before the rollout. Improvement becomes unprovable in either direction, which usually means the conversation moves to anecdote.

The workflow gap

A new tool inside an unchanged workflow tends to make the existing pattern slightly cheaper or slightly faster, then stops. The people doing the work have to be in the room when the workflow is redesigned, because they are the ones who know which steps are real and which steps are habit.

The ownership gap

The shift that matters is from "we bought a tool" to "we own a result." Owning a result means being the person who has to stand up at the next review and explain whether the number moved, and being senior enough to change the workflow when it did not.

The questions worth asking

Run these past the team this week. If any of them gets a hand-wave instead of an answer, the rollout is funding activity, not outcomes. Each question comes with the kind of answer that should make a leader nervous, because a smooth-sounding non-answer is the most common failure mode in this whole space.

01. What number is supposed to move, and what was it before the tool?

Bad answer: "We will figure out the metrics once adoption picks up." A baseline written down after the rollout is not a baseline, it is a story.

02. Who personally owns that number moving?

Bad answer: a committee, a Slack channel, or "the AI working group." If three people own it, nobody owns it.

03. What is the tool allowed to do on its own, and what still requires a human to sign?

Bad answer: "We trust the team to use good judgment." That is not a policy, that is a future incident report.

04. If we turned the tool off for a week, what would actually break?

Bad answer: "Productivity would drop." Whose, measured how, against what? If nothing concrete breaks, nothing concrete was being built.

05. What are we measuring besides usage?

Bad answer: prompts per user, suggestions accepted, licenses active, weekly active seats. Usage is a receipt, not a result.

The pattern across all five: the tool is the easy part. The pre-work — baseline, owner, review path, off-switch test, real metric — is what separates a rollout that compounds from a rollout that just bills.

The three rollout stages

Most companies sit squarely in the middle stage. The tools are deployed, the dashboards show usage, the all-hands has a slide about AI, and the work underneath looks the way it did a year ago. Moving from stage two to stage three is where the real return is, and it is also where most programs run out of patience.

Stage 1: Pilot

The tool runs in one team or one workflow, with a baseline written down beforehand. The team learns where it actually helps, where it creates rework, and which assumptions about the data and the process do not survive contact with reality.

Stage 2: Staged

The tool is live across more teams. Usage charts look healthy, leadership talks about the rollout in public, and the keynote slide writes itself. The workflow has not really been redesigned yet, so spend rises faster than benefit and the conversation slowly shifts from "look what we shipped" to "what is this costing us."

Stage 3: Measured

The workflow has been rebuilt around what the tool does well and around what humans still need to own. The owner is named, the baseline and target are visible, and spend scales only where the numbers show better work. Most rollouts never reach this stage on the first attempt.

What the work actually requires now

The real work is not flashy and it does not demo well. Write the problem down in a sentence anyone in the company would recognise. Name the outcome. Pick the metric, and pick it before the tool. Set the baseline honestly, even when it is unflattering, especially when it is unflattering. Decide who owns review and what they are accountable for. Then, last, choose the tool.

Build OKRs around outcomes that already mattered. Use AI where it makes engineering, operations, support, and analysis genuinely better, and where the humans in the loop come out of the rollout doing more of the work only they can do. Stop funding the corners where the only proof of value is more token burn, a longer vendor list, and a confident slide.

The companies that get this right will not be the ones with the most impressive demos. They will be the ones that can answer a single question without flinching: what got better because of this tool, and who is accountable for it staying better?
