Essay · pilota-vs-produzione

The pilot is built to be believed, not to survive

95% of enterprise GenAI pilots leave no trace on the income statement. The convenient explanation is organizational maturity. The uncomfortable one is that we're testing the wrong object.

Andrea Iorio
Executive · AI builder
Tuscany / IT
FIG. 01 · Workflow reliability versus number of chained steps (1, 5, 10, 20) at 95% accuracy per step; reliability falls to 36% at 20 steps. Per-step accuracy holds up; the workflow that chains it does not.

The demo took place on a Tuesday, in the good conference room, the one with the screen that works. The model read a forty-page supply contract and summarized it in nine seconds. It answered questions about the indemnity clauses. It drafted a follow-up email in the company's tone of voice. It found a renewal date buried in an attachment, the one legal had once actually missed, an oversight that had cost the company an embarrassing automatic renewal and a quarter of awkward questions in the boardroom. Someone in the room said "wow" out loud. The CFO, who'd arrived skeptical, asked when it could be extended to all of procurement.

That was fourteen months ago. The pilot is now officially "on hold pending further evaluation," which in corporate speak means dead.

If the story sounds familiar, it's because it's the median story. MIT's NANDA initiative, in its much-discussed report "The GenAI Divide: State of AI in Business 2025," found that roughly 95% of enterprise GenAI pilots produce no measurable impact on the income statement. S&P Global Market Intelligence surveyed over a thousand companies and found that 42% had abandoned most of their AI initiatives in 2025, up from 17% the year before, with the average organization scrapping nearly half of its proofs of concept before production. Gartner had predicted that at least 30% of GenAI projects would be abandoned after the proof-of-concept stage by the end of 2025. You can argue any single number; you can't argue the direction they all point in.

The standard explanations are organizational immaturity, dirty data, change management failure, lack of executive sponsorship. All real, all secondary. I want to make a more uncomfortable case. Most pilots don't fail on the road to production; they were never on it, because a pilot and a production system are two different objects, optimized for two different things. A pilot is optimized to be believed. A production system is optimized to survive. The skills, budget, architecture and accountability the second requires are almost entirely absent from the first. Which means a successful pilot carries far less information about production feasibility than the people funding it assume: close to zero, in many cases. The 95% isn't a maturity gap waiting to be closed by better models or better change management. It's a measurement error: we tested the wrong object.

The demo samples the happy path. Production samples everything.

Start with the mechanism, because everything else follows from it.

A demo, and almost every pilot is an extended demo, runs on inputs chosen, more or less consciously, to be representative of the typical case. The clean contract. The well-formed customer email. The question the team had in mind while building the thing. A production system runs on the entire distribution of inputs: the scanned PDF that's actually a photo of a fax, the email written half in Portuguese, the customer who asks three questions in one sentence, the adversarial user, the edge case nobody imagined because, by definition, you don't imagine the tail: you encounter it.

This matters more for AI systems than it ever did for traditional software, for a precise reason: errors in multi-step AI workflows compound multiplicatively. If a model is correct 95% of the time on a single step, a genuinely good number, and the steps fail independently, a workflow that chains five such steps succeeds about 77% of the time (0.95 raised to the fifth power). Ten steps: 60%. Twenty: 36%. Traditional software doesn't behave this way, because deterministic code is correct or wrong per case, and the cases can be enumerated and fixed. A probabilistic system is approximately correct per step, and the approximation decays with depth.
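The arithmetic behind those figures fits in a few lines. What follows is a minimal sketch, assuming every step fails independently and with the same probability; real workflows rarely satisfy either condition exactly, and the function names are illustrative rather than taken from any library.

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in the chain succeeds.

    Assumes steps fail independently and share the same accuracy.
    """
    return per_step_accuracy ** steps


def required_step_accuracy(target: float, steps: int) -> float:
    """Per-step accuracy needed for the whole chain to reach `target`."""
    return target ** (1.0 / steps)


for n in (1, 5, 10, 20):
    # Prints 95%, 77%, 60%, 36% for 1, 5, 10 and 20 steps respectively.
    print(f"{n:>2} steps: {workflow_success_rate(0.95, n):.0%}")

# The inverse question: what must each step deliver for a 20-step
# workflow to be right 95% of the time end to end?
print(f"needed per step: {required_step_accuracy(0.95, 20):.2%}")
```

Run in reverse, the same formula says a twenty-step workflow that must be right 95% of the time needs each step to be right roughly 99.7% of the time, which is a very different engineering target from "95% is a good number."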

So the pilot and the production system don't sit at two points on the same curve, with production simply "further along." They're evaluated on different distributions. The pilot is evaluated on the median input, where modern models are frankly spectacular. Production is evaluated on the tail, because in a company the tail is where the lawsuits, the refunds, the compliance incidents and the social media screenshots live. A system that's brilliant on the median and unreliable on the tail isn't 90% of the way to deployment. In a regulated or customer-facing context it may be at 20%, because the work that remains, detecting, containing and recovering from tail failures, is most of the engineering.

So here's the first implication for decision-makers: when a team reports "the pilot works," the only rigorous response is a question. On which distribution? If the answer is "on the cases we tried," you've learned that the model is remarkable. You knew that already. You spent six months and a few hundred thousand euros confirming a press release.

In probabilistic software, the evaluation is the product

There's a second mechanism, less discussed and more basic.

With deterministic software, verification is hard but bounded: you write the tests, the tests pass or fail, and a green suite is a durable fact about the system. With an LLM-based system, "does it work?" isn't a yes/no question. It's a statistical claim ("it produces acceptable output X% of the time on distribution Y, where 'acceptable' is defined by judgment Z"), and every term of that claim is something you have to build. You need a labeled evaluation set that actually resembles production traffic. You need a definition of "acceptable" precise enough to score, which forces the business to articulate things it has never put in writing. You have to rerun everything whenever the prompt changes, the model version changes, or the input mix drifts, and the vendor will change the model whether you like it or not.
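To make the "X% on distribution Y by judgment Z" claim concrete, here is a minimal sketch of an evaluation run. Everything in it is an assumption for illustration: `EvalCase`, `run_eval` and the toy scoring rule are hypothetical names, the "system" is a stand-in function rather than a model call, and the Wilson score interval is one common way (not the only one) to put error bars on a pass rate.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    input_text: str  # distribution Y: ideally sampled from real traffic
    expected: str    # the label a domain expert signed off on


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives about 95%)."""
    if total == 0:
        raise ValueError("empty eval set")
    p = passes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def run_eval(
    system: Callable[[str], str],
    cases: Sequence[EvalCase],
    is_acceptable: Callable[[str, str], bool],  # judgment Z, written as code
) -> dict:
    """Score every case and report the pass rate with its uncertainty."""
    passes = sum(1 for c in cases if is_acceptable(system(c.input_text), c.expected))
    low, high = wilson_interval(passes, len(cases))
    return {"n": len(cases), "pass_rate": passes / len(cases), "ci_low": low, "ci_high": high}


# Toy stand-ins: a "system" that upper-cases text, scored by exact match.
cases = [EvalCase("net 30", "NET 30"), EvalCase("net 60", "NET 60"), EvalCase("n/a", "UNKNOWN")]
print(run_eval(str.upper, cases, lambda out, exp: out == exp))

# 48 passes out of 50 sounds like "96% accurate"; the interval is wider.
print(wilson_interval(48, 50))
```

The last line is the point: on fifty cases, 48 passes is statistically compatible with a true pass rate anywhere from roughly 87% to 99%. A small eval set doesn't just cover less of the tail; it also can't distinguish a system that is ready from one that isn't.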

This scaffolding, the eval set, the scoring, the regression pipeline, the production monitoring that catches drift, isn't scaffolding around the product. For an enterprise AI system it is the product, in the same sense that a bridge's load capacity is the bridge. The model is increasingly a commodity you rent; three competitors call the same API. The scaffolding is the part you own, the part that encodes your data, your judgment, your failure modes.

Now look at what a typical pilot contains: a prompt, an API call, an elegant interface and a script for the demo. No eval set, no scoring, no monitoring. Not because the team is lazy, but because the pilot's real function, the job it was hired to do, is to secure the next round of internal funding, and scaffolding doesn't look good in a demo. Nobody applauds a confusion matrix.

This is also why Gartner's framing is the right one: projects are abandoned not because the technology failed, but because the organization failed to build the machinery for improving the system safely in production. A pilot without an eval harness can't even tell you it's failing. It can only tell you that it once impressed a CFO.

Nobody gets woken at 3 a.m. for a pilot

The third mechanism is organizational, and it's the one a CEO can act on fastest.

Think of everything a production system has and a pilot doesn't. An owner whose name is attached when it breaks at 3 a.m. An on-call rotation. An error budget. An escalation path when the model produces something it shouldn't. Audit logs, because the regulator will ask. Access controls, because the contract summarizer mustn't summarize the M&A folder for an intern. A rollback plan. A vendor SLA someone actually read. A budget line that survives the fiscal year. A human-review workflow for the 4% of cases the system flags as uncertain, and a definition of who those humans are, what their hourly capacity is, and what happens when they're on vacation.
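The human-review item is the easiest one on that list to turn into numbers. A back-of-the-envelope sketch follows; apart from the 4% flag rate, every figure in it (daily volume, reviews per hour, hours per day, availability) is a made-up assumption to be replaced with your own.

```python
import math


def reviewers_needed(
    daily_items: int,
    flag_rate: float,
    reviews_per_hour: float,
    review_hours_per_day: float,
    availability: float = 0.8,  # assumed share of days a reviewer is actually present
) -> int:
    """Head count required to clear each day's flagged cases."""
    flagged_per_day = daily_items * flag_rate
    capacity_per_reviewer = reviews_per_hour * review_hours_per_day * availability
    return math.ceil(flagged_per_day / capacity_per_reviewer)


# Hypothetical workload: 5,000 documents a day, 4% flagged as uncertain,
# 12 reviews per hour, 6 review hours per day.
print(reviewers_needed(5000, 0.04, 12, 6))  # → 4
```

Under these assumptions, 200 flagged cases a day need four reviewers once vacations and sick days are priced in, and three if nobody is ever absent. A pilot never has to answer this question; an owner has to answer it before launch.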

A pilot has a sponsor. Production needs an owner. They're different species. The sponsor's incentive is for the pilot to look successful; the owner's incentive is for the system to be reliable, because he's the one who gets the call. In most companies, the moment a pilot "succeeds," it faces a handoff from an innovation team with no operational accountability to an IT or line organization that never asked for it, doesn't trust it, can't see the eval results (there are none) and is being asked to absorb the tail risk on an unchanged budget. The receiving organization does the rational thing: it stalls. We call this "scaling failure." It's actually an immune system working correctly, rejecting an organ that arrived without a blood type.

What it means for capital, hiring, and what to stop doing

Pull the chain of consequences, because each mechanism lands on a decision.

Capital allocation. If scaffolding and integration are 80% of the work, the budget should look like that, and almost no pilot budget does. The honest cost structure of an enterprise AI system is roughly the inverse of how it's typically funded: a thin slice for the model and prompts, the bulk for evaluation, monitoring, workflow integration and human-review loops. A rule of thumb: if a proposal's budget is almost all model and UI, you're funding a demo, whatever the slide calls it.

Build versus buy. MIT's data contains an under-examined number: solutions bought from specialized vendors reached deployment about 67% of the time, while internal builds succeeded at roughly a third of that rate. The mechanism is exactly the one above: a vendor serving fifty clients has amortized the scaffolding, the tail-case handling and the integration patterns across all of them. When you build internally, you pay full price for the invisible 80%. The internal build makes sense where the workflow itself is your moat; everywhere else, you're building plumbing by hand.

Hiring. The scarce skill isn't prompt engineering, which is the part that gets easier with every model generation. It's evaluation engineering and ML operations: people who know how to build a test set that resembles reality, define "good" in scoring code, and run the statistical machine that turns "seems fine" into "it's correct 96.2% of the time and here's the alert when it drops." Almost every company has zero such people, and hires for the wrong role.
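The "alert when it drops" half of that sentence can be sketched as a rolling-window monitor. This is an illustration under stated assumptions: the class name, window size and threshold are hypothetical, and it presumes that some share of production outputs is being scored as acceptable or not, whether by human reviewers or by an automated check.

```python
from collections import deque


class PassRateMonitor:
    """Tracks the pass rate over the most recent scored outputs."""

    def __init__(self, window: int = 500, threshold: float = 0.95, min_samples: int = 100):
        self.outcomes: deque = deque(maxlen=window)  # oldest results fall off
        self.threshold = threshold
        self.min_samples = min_samples  # stay quiet until there is enough data

    def pass_rate(self) -> float:
        if not self.outcomes:
            raise ValueError("no scored outputs yet")
        return sum(self.outcomes) / len(self.outcomes)

    def record(self, acceptable: bool) -> bool:
        """Record one scored output; return True if the alert should fire."""
        self.outcomes.append(acceptable)
        if len(self.outcomes) < self.min_samples:
            return False
        return self.pass_rate() < self.threshold


monitor = PassRateMonitor(window=10, threshold=0.9, min_samples=5)
for ok in [True, True, True, True, True, False]:
    if monitor.record(ok):
        print(f"ALERT: pass rate {monitor.pass_rate():.1%} is below threshold")
```

The code is trivial; the hard parts are the ones it takes for granted: who scores the outputs, who receives the alert, and what they are allowed to do about it.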

Incentives. Stop counting pilots. The number of AI pilots launched is an innovation-theater metric, with the same relationship to value that lines of code have to software quality. Count production systems with an owner, an eval suite and ninety days of monitored traffic. Watch how the portfolio shrinks, and how much more honest the conversation becomes.

The honest objections

Two serious counterarguments deserve their strongest version.

First: high pilot mortality is healthy. Pilots are low-cost options; you're supposed to kill most of them. True, and if 95% died because the use case turned out to be low-value, that would be a portfolio working well. But that's not the observed failure pattern. The pattern is pilots that succeed on their own terms and still never reach production. An option you can't exercise even when it expires in-the-money isn't an option. It's theater with a budget line.

Second: better models will close the gap. Partly right. As frontier models become more reliable per step, the compounding math improves, and the line between "needs heavy scaffolding" and "good enough out of the box" shifts: it has visibly shifted for tasks like summarization and code assistance since 2023. But notice which mechanisms a better model fixes and which it doesn't. It raises per-step accuracy. It doesn't create your eval set, doesn't integrate with your ERP, doesn't define your escalation path, doesn't assign an owner who gets the call. The capability gap is closing; the verification and accountability gaps aren't model problems, and waiting for GPT-7 to fix your org chart isn't a strategy.

Ship a slice, not a pilot

The fix isn't better pilots. It's rejecting the pilot/production dichotomy altogether.

Take one narrow workflow, narrow enough to be almost embarrassing, and build it as production from day one: real traffic, real eval suite, real owner, real on-call, real audit logs. One contract type, not all contracts. One support queue, not the whole inbox. You'll move slower for the first quarter and you'll be the only team in the building that knows, with numbers, whether the thing works. Then you widen the slice.

And before funding anything, ask the one question that separates the 5% from the rest. Not "did the demo impress us?" The demo always impresses; that's what demos exist for. Ask: who gets woken when this thing is wrong at 3 a.m., and which dashboard are they looking at?

If there's no name and no dashboard, you don't have an AI strategy. You have a Tuesday in the good conference room.


Sources: MIT NANDA, "The GenAI Divide: State of AI in Business 2025" (via Fortune) · S&P Global Market Intelligence, GenAI 2025 survey · CIO Dive on S&P Global abandonment data


Source

  1. https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/