Stop Running AI Pilots

You greenlit three AI pilots this year. The demos were good. One of them was genuinely impressive, the kind of thing you forwarded to a peer. Six months later, none of the three is in production. There is a deck explaining what was learned. There is a recommendation to run a larger pilot next quarter. And there is a board member who has started asking, in a tone you don’t love, what the AI budget has actually produced.

The story repeats across the market with unusual consistency. MIT’s NANDA initiative reviewed three hundred public AI deployments and found that ninety-five percent of enterprise generative AI pilots delivered no measurable impact on the bottom line.[1] IDC’s number is bleaker in a more specific way: for every thirty-three AI proofs of concept an enterprise starts, four reach production.[2] Gartner forecast that at least thirty percent of generative AI projects would be abandoned after the proof of concept by the end of 2025.[3]

The instinct is to read those numbers as a verdict on the technology. They are not. They are a verdict on the format.

A pilot is built to produce a demo, not a decision

Think about what a pilot is structured to do. It runs on a side budget. It has a project sponsor but no profit-and-loss owner. Its success criterion is “did it work in the demo,” which is a question about the model. Its deliverable is a recommendation about whether to do more. Nobody’s number moves if it succeeds, and nobody’s number moves if it dies. It is, by construction, a machine for generating the decision to decide later.

That format was borrowed from a different kind of technology evaluation, the kind where the question was whether the thing functioned at all. With AI the thing almost always functions in the demo. The model is the most reliable part of the whole exercise. So a pilot answers the one question that was never really in doubt and leaves every question that actually determines value untouched. Who owns the workflow this changes. What it replaces. How you know an output is good. What happens to the eleven steps on either side of the part the model does.

MIT named the thing that kills these projects the learning gap: the organization’s inability to fold the model into real workflows, ownership, and habits.[1] That is not a model problem. It is everything around the model, and a pilot is built to touch none of it.

The variable you can’t see is the verifier

What separates the four-in-thirty-three from the rest is almost never what the pilot measured.

The workflows that make it to production are the ones where somebody can cheaply and quickly check whether a given output is correct. The pull request runs in CI. The extracted field validates against the schema. The reconciliation ties out or it doesn’t. Where that check exists, the model attaches to it and the work compresses. Where the only way to know if an output is good is a senior person reading carefully on a Tuesday, the pilot demos beautifully and then dies the moment it meets real volume, because nobody can certify the output fast enough to trust it. Verification is the bottleneck, not the model, and a pilot scored on the demo never finds out which side of that line the workflow is on.
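To make "cheap and quick to check" concrete, here is a minimal sketch of a verifier for one model-extracted record. It is an illustration only: the invoice shape, the field names (invoice_id, line_amounts, total), and the ID pattern are all assumptions invented for this example, not part of any real system. It combines the two checks named above, a schema-style format check and a reconciliation that either ties out or doesn't.

```python
import re
from decimal import Decimal

# Hypothetical ID format, assumed for illustration only.
INVOICE_ID = re.compile(r"INV-\d{6}")


def verify_invoice(fields: dict) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    problems = []

    # Schema-style check: the extracted ID must match the expected format.
    if not INVOICE_ID.fullmatch(str(fields.get("invoice_id", ""))):
        problems.append("invoice_id does not match the INV-000000 pattern")

    # Reconciliation check: the line amounts must tie out to the stated total.
    try:
        line_sum = sum(Decimal(str(x)) for x in fields.get("line_amounts", []))
        stated_total = Decimal(str(fields["total"]))
    except (KeyError, TypeError, ArithmeticError):
        problems.append("amounts are missing or not numeric")
        return problems

    if line_sum != stated_total:
        problems.append(f"lines sum to {line_sum}, total says {stated_total}")
    return problems


# Usage: one record that ties out and one that does not.
good = {"invoice_id": "INV-004217", "line_amounts": ["120.00", "79.50"], "total": "199.50"}
bad = {"invoice_id": "INV-004217", "line_amounts": ["120.00", "79.50"], "total": "200.00"}
print(verify_invoice(good))
print(verify_invoice(bad))
```

The point of the sketch is the cost profile, not the specific rules: a check like this runs on every output in milliseconds, which is what lets a workflow absorb real volume without a senior reviewer in the loop.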

This is why your most impressive pilot can be your least promising one. A slick demo on a workflow with no verifier is a workflow that will never ship. A boring demo on a workflow that ties out in CI is already most of the way to production. The demo quality and the production odds are close to unrelated, and the pilot format reports only the first one.

The number for your boss, and the buy decision underneath it

When the board asks what the budget produced, the honest answer for most organizations is “three completed pilots and zero production systems,” and that answer is indefensible because the format guaranteed it. The defensible version is a count of workflows now running in production with a named owner and a measured before-and-after. If that count is zero after a year of spend, the problem was never the tools. It was that you funded demos and called them progress. Measuring returns starts from production workflows, not pilot retrospectives.

The same MIT data answers the build-versus-buy question most leaders are still litigating. Buying from a specialized vendor and integrating it succeeded around two-thirds of the time. Internal builds succeeded about a third of the time.[1] The custom platform your strongest engineers want to build is the single most reliable way to join the ninety-five percent. Buy the capability, spend your scarce talent on the integration and the verifier, and treat “we’ll build our own” as the expensive answer it usually is. The tool decision is a buy decision far more often than your most ambitious people will admit.

Run a production cut instead

The replacement for a pilot is not a bigger pilot. It is a production cut of exactly one workflow, scoped so small that shipping it is cheaper than studying it.

Pick a workflow that already has a verifier, or can get one in a week. Give it a single owner whose actual number moves when it works, not a steering committee. Put it in front of real volume from day one, not a curated demo set. Set a kill date, so that “keep studying it” is not an available outcome. The goal of the exercise is a thing running in production at the end, or a clean, fast no. Both of those are worth more than a deck.
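The four conditions above can be written down as a checklist that a proposal either passes or does not. The sketch below is one hypothetical way to encode it; the ProductionCut class, its field names, and the sample values are assumptions made for illustration, not a prescribed template.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ProductionCut:
    """One proposed workflow, described by the four conditions above."""
    workflow: str
    verifier: Optional[str]         # the cheap check, or None if none exists yet
    owner: Optional[str]            # one named person whose own number moves
    real_volume_from_day_one: bool  # False if it would start on a curated demo set
    kill_date: Optional[date]       # None means "keep studying it" is still possible

    def gaps(self) -> list[str]:
        """List the conditions this proposal does not yet meet."""
        missing = []
        if not self.verifier:
            missing.append("verifier")
        if not self.owner:
            missing.append("single owner")
        if not self.real_volume_from_day_one:
            missing.append("real volume from day one")
        if self.kill_date is None:
            missing.append("kill date")
        return missing


# Usage: a proposal with a verifier and real volume, but no owner or kill date.
proposal = ProductionCut(
    workflow="invoice field extraction",
    verifier="schema check plus total reconciliation",
    owner=None,
    real_volume_from_day_one=True,
    kill_date=None,
)
print(proposal.gaps())
```

An empty list from gaps() is the bar for funding; anything else is a pilot under a different name.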

This is also how adoption stops stalling. The reason half your seats sit idle is that the broad middle was handed tools and told to find a use, while your champions quietly walked toward the workflows where being wrong was cheap. Driving adoption is mostly the work of redesigning specific workflows so the verifier is built in, which is the same work a production cut forces you to do. A pilot lets you skip it. That is exactly why the pilot fails.

Something to carry

List the AI pilots your organization has run or funded in the last year. Next to each, write the name of the single person whose own metric moved when it succeeded. Not the sponsor. Not the project lead. The person whose number changed.

For most of the list you will not be able to write a name, and that blank is the whole explanation. A workflow with an owner whose number moves becomes a production system, because someone is motivated to make it real. A workflow without one becomes a deck, every time, no matter how good the demo was.

The next thing you fund should have a name in that column before it has a budget.

Footnotes

  1. MIT NANDA, “The GenAI Divide: State of AI in Business 2025.” Review of 300 public deployments plus interviews and surveys; 95% of enterprise generative AI pilots delivered no measurable P&L impact, and vendor-bought systems succeeded roughly twice as often as internal builds. Reported by Fortune, August 2025.

  2. IDC enterprise AI research, 2025: roughly four of every thirty-three AI proofs of concept reach production at meaningful scale.

  3. Gartner forecast that at least 30% of generative AI projects would be abandoned after proof of concept by the end of 2025, citing data quality, unclear ownership, and integration cost rather than model capability.