Measuring Returns

TL;DR. Most AI ROI exercises produce a defensible-looking number that nobody acts on. Six months later the number is embarrassing and the line item is harder to defend than it was before you measured anything. The job is not to calculate a return. The job is to make a quarterly decision per seat, per workflow, and per team, using three or four observable signals. This guide is that decision loop.

Your CFO wants a number. A vendor handed you one. It says your AI program is producing a 3.7x productivity multiplier on a $90 per-employee monthly spend. You put it in a slide. The board nodded. Six months later, headcount hasn’t moved, contractor spend hasn’t moved, and the same CFO is asking why the line item is bigger than last year. You no longer want to talk about the 3.7x.

You don’t have a measurement problem. You have a decision problem dressed up as a measurement problem. The number on the slide was never going to drive an action. It was theater.

The good news is the decision loop is small, and you can run the first turn of it this week. The bad news is it requires giving up the comfort of a single number on a slide. Trade made willingly, you’ll have a defensible AI program. Trade refused, you’ll have another year of vendor charts that nobody acts on.

Why ROI frameworks miss

The default executive measurement playbook for AI looks like this. A vendor-supplied productivity index based on a self-reported survey of “time saved per task.” A McKinsey-style ROI model with a 12-cell framework and three sensitivity scenarios. A quarterly business review slide that aggregates “AI productivity gains” into a single percentage. A perception survey of how much your team feels AI is helping.

Every one of those is unfalsifiable in the direction it matters. Self-reported time savings are inflated by a factor of two to four against any observed measurement. Vendor productivity indices are calibrated to make the vendor look good. The ROI model has so many assumptions baked in that it returns whatever its author wanted it to return. The perception survey measures enthusiasm, which correlates with nothing.

The deeper problem is that none of these outputs has a decision attached. If the productivity index goes from 3.7x to 4.1x, what do you do differently? Nothing. If it drops to 2.9x, what do you do differently? Nothing. A measurement that doesn’t change a decision isn’t a measurement. It’s a ritual.

The CFO eventually figures this out. Usually around the second or third budget cycle, after enough time has passed that the promised gains should be visible somewhere in the P&L and they aren’t. At that point the ROI slide stops being a defense of the line item and starts being evidence that the line item should be cut.

The cost side changed shape too

There’s a second reason the old number is failing you, and it’s newer than the first. The cost base stopped being fixed. The flat per-seat enterprise SKU is gone across the serious vendors, replaced by a meter: a small seat fee plus consumption billed against a token volume you commit to up front. The Q3 2026 briefing has the mechanics. What matters for measurement is what metered billing does to the line item your CFO is watching.

A seat license is a fixed cost. You forecast it once a year, and a bigger number the next year reads as waste. A usage meter is a variable cost that climbs exactly as fast as adoption succeeds. The better the program lands, the bigger the bill. So the instinct most finance teams bring to it, which is to defend a flat or shrinking line item, is now backwards. A team whose token spend doubled while its cycle time halved is the best thing in your portfolio. It’s a win to fund, not a cost to cap.

The unit of measurement moves with it, from seat count to cost per outcome. The question is no longer “are we paying for idle seats.” It’s “what did this consumption buy.” Spend rising against flat workflow signals is the problem. Spend rising against cycle times that fell by half is the program working out loud. Throttle the first. Feed the second.
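The throttle-or-feed rule can be made concrete with a small triage over per-team numbers. This is a minimal sketch under stated assumptions, not a prescribed method: the `TeamQuarter` fields, the `triage` function, the sample figures, and the 10% "flat" band are all hypothetical; only the halving threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class TeamQuarter:
    team: str
    spend_prev: float        # consumption spend last quarter
    spend_now: float         # consumption spend this quarter
    cycle_days_prev: float   # cycle time of the core workflow last quarter
    cycle_days_now: float    # cycle time of the same workflow this quarter


def triage(t: TeamQuarter, halved: float = 0.5, flat_band: float = 0.10) -> str:
    """Label a team 'feed', 'throttle', or 'watch'.

    halved: cycle-time ratio at or below which the workflow counts as halved.
    flat_band: assumed tolerance; within 10% of last quarter counts as flat.
    """
    spend_rising = t.spend_now > t.spend_prev
    cycle_ratio = t.cycle_days_now / t.cycle_days_prev
    if spend_rising and cycle_ratio <= halved:
        return "feed"        # spend is buying cycle time
    if spend_rising and cycle_ratio >= 1 - flat_band:
        return "throttle"    # spend is rising against flat signals
    return "watch"           # anything in between: no action yet


# Hypothetical sample data.
teams = [
    TeamQuarter("ops", 10_000, 20_000, 10.0, 5.0),
    TeamQuarter("marketing", 10_000, 18_000, 8.0, 8.0),
]
labels = {t.team: triage(t) for t in teams}
```

The "watch" bucket is a design choice: the text only names the two extremes, so anything between flat and halved is left for the next quarterly pass rather than forced into an action.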

The efficiency levers moved too. Culling an idle seat used to be the primary cost action. On a metered bill the money is in the plumbing instead. Prompt caching cuts the cost of repeated context by roughly 90%. Batch processing runs about half price for anything that doesn’t need an instant answer. One person who reads the usage dashboard monthly and knows those two levers takes more off the bill than a quarter of seat-culling ever did. The seat audit didn’t disappear. It dropped from the headline to the footnote.
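A rough worked example shows how the two levers combine on a bill. This is a sketch under simplifying assumptions: the $3 per million tokens price and the token volume are made up, the 90% and 50% discounts are the approximate figures quoted above, and the two discounts are applied to separate slices of the volume rather than stacked.

```python
def metered_cost(
    tokens_millions: float,
    price_per_million: float,
    cached_share: float = 0.0,    # share of volume that is repeated context served from cache
    batch_share: float = 0.0,     # share of volume that can wait and runs in batch
    cache_discount: float = 0.90,
    batch_discount: float = 0.50,
) -> float:
    """Estimate a monthly consumption bill with caching and batching applied.

    Assumption: the cached slice and the batch slice do not overlap.
    """
    if cached_share + batch_share > 1:
        raise ValueError("cached_share + batch_share cannot exceed 1")
    full_price_share = 1 - cached_share - batch_share
    blended = (
        full_price_share
        + cached_share * (1 - cache_discount)
        + batch_share * (1 - batch_discount)
    )
    return tokens_millions * price_per_million * blended


# Hypothetical month: 1,000 million tokens at $3 per million.
baseline = metered_cost(1_000, 3.0)
tuned = metered_cost(1_000, 3.0, cached_share=0.6, batch_share=0.2)
saving_pct = round(100 * (1 - tuned / baseline))
```

With these made-up shares the bill falls from about $3,000 to about $1,080, which is the scale of effect a seat audit cannot reach on a metered plan.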

How AI returns actually arrive

The reason an averaged ROI number can’t tell you anything useful is that AI returns are not distributed like SaaS returns. A CRM produces a small uniform gain across every seat. AI produces a step change in a small number of seats and roughly nothing in the rest.

This is the same shape laid out in Recognizing Leverage. The Federal Reserve’s 2025 labor data puts the average AI productivity gain across all users at around 5.4% of work hours. Underneath that average, OpenAI’s engagement data shows a 6x gap between power users and everyone else. PwC’s 2026 workforce research lands in the same place. The shape is a long tail with a small head, and the gap is no longer only a disposition gap. The head of the distribution used to be the people with the temperament to wring leverage out of a chat tab. It now includes a second type: the person who designs a recurring workflow once and schedules it, creating leverage for a whole team that never changed a habit. Disposition and design both. Averaging either into the tail erases the only thing worth measuring.

The returns are also lagged. A champion in operations who automates the monthly close in February doesn’t show up in the budget until the next hiring cycle decides not to backfill the role she was about to be promoted out of. A marketing team that ships campaigns without an agency doesn’t show up until the agency contract comes up for renewal nine months later. Quarterly ROI snapshots miss this entirely. They take a picture of a process whose financial signal arrives a year late.

The right measurement layer is therefore not “what is the program ROI.” It’s three smaller, more boring questions, asked at three different cadences.

The three layers that drive decisions

There are three measurement layers. Each runs on its own clock and produces its own decision. None of them produces a single number. All of them produce something you can act on.

Seat-level, monthly. Decision: keep, kill, downgrade, or upgrade. Every seat in your AI budget gets reviewed every month against one question. Did this person produce observable work change in the last thirty days? Not “did they log in.” Not “how many prompts did they send.” Did something they shipped get faster, get cheaper, or get done at all that wouldn’t have happened otherwise? The data sources are usage telemetry plus the person’s manager. Most vendors give you the telemetry. Most managers can answer the work-change question in under five minutes if you ask. If the answer is no for sixty days running, the seat goes. No exceptions for seniority. The seat goes to the waitlist of people who actually want one. Metered billing adds a second reading to the same review: pull each seat’s consumption next to the work-change answer. High spend with shipped change is where you lean in. High spend with nothing shipped is the seat to look at first. Low spend either way is the old idle-seat question, and it matters less than it used to. The mechanics of this audit are in Evaluating Spend.
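The monthly reading can be written down as a small lookup so that it is applied the same way to every seat. This is an illustrative sketch: the `SeatReview` fields, the `read_seat` function, and the $200 "high spend" cutoff are assumptions, and the cutoff in particular should come from your own bill.

```python
from dataclasses import dataclass


@dataclass
class SeatReview:
    holder: str
    monthly_spend: float       # consumption attributed to this seat, in dollars
    shipped_change: bool       # manager's answer to the work-change question
    days_without_change: int   # consecutive days the answer has been "no"


def read_seat(seat: SeatReview, high_spend: float = 200.0) -> str:
    """Return the action for one seat in the monthly review."""
    if seat.days_without_change >= 60:
        return "reassign to waitlist"      # sixty days of "no": the seat goes
    if seat.monthly_spend >= high_spend:
        return "lean in" if seat.shipped_change else "look at first"
    return "idle-seat check"               # low spend either way


# Hypothetical seats.
reviews = [
    SeatReview("controller", 450.0, True, 0),
    SeatReview("director", 380.0, False, 30),
    SeatReview("vp", 5.0, False, 75),
]
actions = {s.holder: read_seat(s) for s in reviews}
```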

Workflow-level, quarterly. Decision: invest more, leave alone, or formalize. Once a quarter, list the five to ten recurring workflows in your org where AI is most likely to be landing. Month-end close. Contract redlining. Campaign briefing. Pipeline research. First-pass code review. For each one, ask whether cycle time, cost, or staffing has moved in the last ninety days. Not by 5%. By half or more. If yes, that workflow gets more investment: better tooling, dedicated time, and a more central role for the champion on that team (the workflow-redesign loop is in Driving Adoption). If no, leave it alone for another quarter. If the workflow has been transformed and is now running on a person’s personal account or a hand-built script, formalize it. Pay for the tool. Document the prompt. Make it survivable when the person leaves. Add one more question to the quarterly pass: is a recurring report a person used to assemble now produced by a scheduled agent instead? That’s the cleanest form of persistent leverage: the workflow runs on a cadence without anyone touching it. It’s also load-bearing work you now have to inventory and own, because it keeps running after its author changes teams. The Q3 2026 briefing covers why that second half is the new risk.
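The quarterly pass reduces to one test (did anything move by half or more) and one follow-up (is the transformed workflow running informally). A minimal sketch, assuming hypothetical metric names and a `Workflow` record that is not part of any tool:

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    before: dict    # metrics ninety days ago, e.g. {"cycle_days": 5, "cost": 12000}
    after: dict     # the same metrics today
    informal: bool  # runs on a personal account or a hand-built script


def moved_by_half(w: Workflow) -> bool:
    """True if any tracked metric fell to half or less of its earlier value."""
    return any(w.after[k] <= 0.5 * w.before[k] for k in w.before)


def decide(w: Workflow) -> str:
    if not moved_by_half(w):
        return "leave alone"
    # A transformed workflow that lives on one person's setup gets formalized first.
    return "formalize" if w.informal else "invest more"


# Hypothetical workflows.
workflows = [
    Workflow("month-end close", {"cycle_days": 5}, {"cycle_days": 2}, False),
    Workflow("contract redlining", {"cycle_days": 4}, {"cycle_days": 3.8}, False),
    Workflow("pipeline research", {"cost": 12000}, {"cost": 4000}, True),
]
decisions = {w.name: decide(w) for w in workflows}
```

Treating "formalize" as taking priority over "invest more" is an assumption; the text allows both to apply to the same workflow.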

Org-level, annually. Decision: rebalance the workforce plan. Once a year, before the headcount planning cycle, look for the headcount asks that didn’t come in. The team that was supposed to grow from eight to twelve and is asking for ten. The contractor renewal that came in lower. The role you were planning to backfill that quietly got absorbed. These are the largest financial returns the program will produce, and they appear nowhere on a vendor dashboard. They appear in the diff between this year’s workforce plan and last year’s. If you can’t point to at least one such diff after twelve months of meaningful AI spend, the program isn’t landing and no measurement framework is going to rescue it.
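The annual check is a diff between two headcount tables. A minimal sketch, assuming you can export last year’s planned headcount and this cycle’s actual asks per team; the team names, figures, and function name are made up for illustration.

```python
def asks_that_did_not_come_in(planned: dict, requested: dict) -> dict:
    """Return {team: shortfall} for teams that asked for less than the plan assumed."""
    gaps = {}
    for team, plan in planned.items():
        if team not in requested:
            continue  # no ask on file: check by hand rather than assume a saving
        gap = plan - requested[team]
        if gap > 0:
            gaps[team] = gap
    return gaps


# Hypothetical plan: headcount last year's plan assumed vs. what each team asked for.
planned = {"support": 12, "finance": 6, "legal": 4}
requested = {"support": 10, "finance": 6, "legal": 5}

gaps = asks_that_did_not_come_in(planned, requested)
at_least_one_diff = len(gaps) >= 1
```

A gap only counts toward the program if someone on the team can tie it to the workflow change; the diff finds the candidates, it does not attribute them.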

Three layers, three clocks, three decisions. Notice what’s missing. There is no “calculate the program ROI” step. There is no aggregated productivity number. The aggregated number is exactly what you give up to get measurements that drive decisions.

The four observable changes

If you remember nothing else from this guide, remember this. When an AI program is working at the org level, you will see at least three of the following four changes inside twelve months. If you see fewer than three, the program is not landing, regardless of what your dashboard says.

  1. A recurring report or process now takes less than half the time it used to. Month-end close. Board prep. Pipeline review. Quarterly business review deck. Something that used to be a multi-day exercise is now a half-day exercise, and the person doing it can name the change.

  2. A headcount request that did not come in. A team that was scoped to grow and asked for less. A backfill that did not get filed. A contractor scope that came in smaller. This is the cleanest financial signal in the entire program and it almost never appears on a slide because nobody writes a slide about a thing that did not happen.

  3. A vendor or contractor line item dropping. Agency spend going down. Outside counsel spend on routine matters going down. A SaaS tool getting canceled because the workflow it supported is now done inside a chat window. The vendor relationship manager will notice this before your finance team does.

  4. Internal artifacts being shared and reused across teams. A prompt template that started in one team and is now in three. A script someone wrote on a weekend that the rest of the team now depends on. An internal tool that replaced a request to engineering. This is the strongest leading indicator. It means leverage is no longer trapped inside one person.

That is the whole heuristic. Three of four, inside a year. If you have it, the program is working and you should fund the layers that produced it. If you don’t, the program isn’t working and another training cohort isn’t going to fix it.
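The heuristic is simple enough to write as a checklist. A sketch, with signal names invented here as shorthand for the four changes listed above:

```python
# Shorthand keys for the four observable changes (names are illustrative).
SIGNALS = (
    "process_halved",          # 1. recurring report or process takes under half the time
    "headcount_ask_avoided",   # 2. a headcount request that did not come in
    "vendor_spend_down",       # 3. a vendor or contractor line item dropping
    "artifacts_reused",        # 4. internal artifacts shared and reused across teams
)


def program_is_landing(observed: dict, required: int = 3) -> bool:
    """Apply the three-of-four rule to a dict of signal -> bool."""
    seen = sum(1 for s in SIGNALS if observed.get(s, False))
    return seen >= required


# Hypothetical twelve-month review.
year_one = {
    "process_halved": True,
    "headcount_ask_avoided": True,
    "vendor_spend_down": False,
    "artifacts_reused": True,
}
landing = program_is_landing(year_one)
```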

When to kill a seat

Killing idle seats used to be the headline efficiency move. On a metered bill it isn’t: the consumption levers above take more off the total, and an unused seat now costs little beyond its base fee. It still earns a place in the loop, for the reason that was always the real one. Most orgs won’t do it. They bought the seat for a senior person, the senior person didn’t use it, and nobody wants to be the one who takes it away. So the seat sits there for two years while someone who would actually use it waits, and the leverage it could have produced never gets built.

The rule is simple and should be written down before you need it. A seat with less than two hours per week of meaningful use across sixty consecutive days gets downgraded to a lower tier or removed. Meaningful use means active sessions producing work, not “logged in.” The vendor telemetry distinguishes the two. There are no exceptions for title, tenure, or personal preference. The seat goes to the waitlist.

The waitlist is the second half of the rule. There should always be a list of people who have asked for a seat and not been given one. When a seat opens, it goes to the top of the list, not back to the budget. Seats migrate toward the people who use them. This is how the spend stays defensible without ever requiring a fight about taking something away from someone who never wanted it in the first place.
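The rule and the waitlist fit together as a threshold check plus a first-in, first-out queue. This is a sketch under assumptions: telemetry is taken to be weekly hours of meaningful use, the sixty-day window is approximated as the trailing nine weekly readings, and "less than two hours per week" is read as every week in the window falling below the line. The holder names are made up.

```python
from collections import deque


def below_threshold(weekly_hours: list, min_hours: float = 2.0, weeks: int = 9) -> bool:
    """True if every one of the trailing `weeks` readings is under `min_hours`."""
    window = weekly_hours[-weeks:]
    return len(window) == weeks and all(h < min_hours for h in window)


def reallocate(seats: dict, waitlist: deque) -> dict:
    """Move each under-used seat to the person at the top of the waitlist.

    seats: {holder: weekly meaningful-use hours, oldest first}
    Returns {previous_holder: new_holder}; new_holder is None if the waitlist is empty.
    """
    moves = {}
    for holder, hours in seats.items():
        if below_threshold(hours):
            moves[holder] = waitlist.popleft() if waitlist else None
    return moves


# Hypothetical telemetry and waitlist.
seats = {"vp": [0.5] * 9, "analyst": [6.0] * 9, "new_hire": [0.0] * 3}
waitlist = deque(["dana", "lee"])
moves = reallocate(seats, waitlist)
```

Requiring a full nine-week window means a seat issued three weeks ago is not flagged, which keeps the rule from punishing people who have not had sixty days yet.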

The reason this is hard is political, not analytical. The reason it works is that the champions on your waitlist (in the sense Selecting Talent names them) are the ones producing the returns the program is supposed to be measuring. Every seat sitting idle is a seat not in the hands of someone in the head of the distribution. The cost isn’t the $300. The cost is the leverage you didn’t get because the seat was somewhere else.

What to tell the CFO

The CFO doesn’t want a productivity multiplier. The CFO wants a defensible paragraph for the budget conversation. Here’s the format that holds up.

AI program spend this quarter was $X, against $W last quarter. Consumption concentrated in three teams, A, B, and C, where it bought cycle-time reductions of 50% or more in their core workflows. Two headcount requests that were on the workforce plan didn’t come in this cycle, representing approximately $Y in avoided cost. Outside vendor spend in [category] is down $Z year over year, attributable to internal AI capability. We pulled consumption waste out of the high-volume jobs with caching and batching, and reallocated idle seats to the waitlist. Next quarter we’re funding the teams where spend is buying the most cycle time and piloting a tool change in workflow D.

That’s the entire report. No multiplier, no framework, no perception score. Consumption by team against last quarter, the cycle time it bought, headcount avoided, decisions made. The absence of a fake number is the credibility signal. Any CFO who has lived through a previous wave of vendor-supplied productivity claims will recognize this as the first AI report in two years that doesn’t insult their intelligence.

If you can’t fill in at least three of the four results in that paragraph (where consumption concentrated, the cycle time it bought, headcount avoided, vendor spend down), the program isn’t producing returns and no measurement framework is going to manufacture them. That’s a useful piece of information. It tells you to stop measuring and start cutting.

Something to carry

Pull the seat-level usage report from your largest AI vendor. Sort by hours of meaningful use over the last thirty days. Highlight every seat under two hours per week. That list is your audit, and the same one Evaluating Spend uses for the spend conversation; you’re now using it for the returns conversation. Then book thirty minutes with each holder’s manager this week. Two questions are enough.

  1. Has this person produced any observable work change with this tool in the last thirty days?
  2. Is there someone on your team who would use this seat tomorrow if it were available?
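The pull, sort, and highlight step that precedes those two questions can be sketched in a few lines once the report is exported. The field names `holder` and `hours_30d` are hypothetical; a real vendor export will use its own column names.

```python
def audit_list(rows: list, min_hours_per_week: float = 2.0, days: int = 30) -> list:
    """Return seats under the weekly threshold, least-used first."""
    weeks = days / 7
    flagged = [
        dict(r, hours_per_week=round(r["hours_30d"] / weeks, 2))
        for r in rows
        if r["hours_30d"] / weeks < min_hours_per_week
    ]
    return sorted(flagged, key=lambda r: r["hours_30d"])


# Hypothetical export: meaningful-use hours per seat over the last thirty days.
report = [
    {"holder": "b", "hours_30d": 40.0},
    {"holder": "c", "hours_30d": 6.0},
    {"holder": "a", "hours_30d": 1.0},
]
audit = audit_list(report)
holders_to_review = [r["holder"] for r in audit]
```

Each name in `holders_to_review` is one thirty-minute conversation with that holder’s manager.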

By Friday, you’ll have a downgrade list, a kill list, and the start of a waitlist. That’s the first turn of the loop. Run it again in thirty days. Run it every thirty days after that. The measurement program is the loop. Everything else is a slide.