There's an archaeology exercise I recommend to anyone building a product on top of a language model: go back and read API price lists the way you read the strata of a cliff. March 2023, GPT-4 at launch: $30 per million input tokens, $60 output. Sixteen months later, GPT-4o mini: 15 cents input, 60 output, a cut of two orders of magnitude for comparable capability, on many tasks superior. In early 2025 DeepSeek R1 arrives and lands a reasoning model under 60 cents per million input tokens. Andreessen Horowitz gave this curve a name, "LLMflation", and put it at a factor of ten a year, at constant capability. Ten times. A year. There's almost no other input, in the history of the software industry, that has deflated at this speed.
Now open the income statement of an "AI wrapper" startup, one of those products that buy intelligence wholesale from the APIs and resell it retail inside an interface, a workflow, a vertical. If the main cost line collapses 90% a year, the gross margin should balloon like a sail. We should see a generation of companies going from 40% to 80% gross margin by pure inertia, without lifting a finger.
It isn't happening. In the most mature wrapper sector, coding assistants, 2025 brought the exact opposite: painful repricing, usage limits introduced after the product was already sold, journalistic reconstructions of thin or even negative gross margins at the very fastest-growing companies in the category. The revenue line soars; the margin line doesn't move, or it retreats.
This is the piece missing from the conversation. Everyone celebrates the collapse of inference costs as if it were a wire transfer on its way to anyone building on top of the models. This article's thesis is that the transfer won't arrive, and for structural reasons, not cyclical ones: token deflation passes through wrappers' income statements without stopping, because a cost that drops for every competitor at the same instant can't be a source of margin for any of them. Margin, if it exists, has to be built with materials that don't appear on the API price list. Let's look at the mechanisms, one by one, and then what follows for anyone who has to decide pricing, hiring, and roadmap.
A cost that drops for everyone is an advantage to no one
The first mechanism is the oldest in the world, and the most ignored in pitch decks. Margin doesn't come from low costs: it comes from differentially low costs. If I pay less for steel than my competitors, I have an advantage. If steel costs less for everyone, all I have is a market where final prices will fall until that saving is handed back to the customer.
Tokens are the extreme case of a non-differential input. Every wrapper buys from the same oligopoly of suppliers, at the same public price, with no exclusive contracts that matter, with no meaningful scale advantage on the purchase (volume discounts on the APIs exist but are marginal next to the deflation curve, which overruns them every quarter). When OpenAI or Anthropic cut prices, they cut them at the same time for you, for your direct competitor, and for the three students replicating your product over a weekend. The barrier to entry falls at the same speed as your cost of goods sold.
And so competition does what competition always does with a symmetric saving: it passes it through to price. Every cut to the API price list gets metabolized by the market within a few months as more generous free tiers, higher usage limits, subscription prices held flat against a product that consumes much more. The deflation surplus exists, and it's enormous, but it slides down the supply chain to the end user, who today gets for twenty dollars a month an amount of compute that in 2023 would have cost two thousand. The customer is grateful. The wrapper's income statement is not.
Anyone who objects that "at least absolute costs are falling" is assuming something real markets almost never grant: that the product stays still while costs drop. And here the second and third mechanisms come in, working as a pair.
The frontier doesn't go on sale
The "ten times less a year" figure hides a clause that flips its meaning: at constant capability. It's the price of GPT-4-level intelligence that has collapsed. But no wrapper in a competitive market can afford to sell GPT-4-level intelligence in 2026, for the same reason no laptop maker can fit processors from three years ago and sell at full price: the competitor who adopts the frontier model makes you look broken within one sales cycle.
The cost that matters for a wrapper, then, isn't the token price at frozen capability, which is indeed in free fall, but the token price at the frontier, where its customers force it to stay. And that line of the price list falls much more slowly. The top models of 2025 still priced output in the range of $10–75 per million tokens, not cents. Worse: the frontier has shifted to reasoning models, which before answering consume thousands of thinking tokens, billed, and for the most part not even visible. The unit price drops and the meter spins faster.
Token deflation is real, but it's the deflation of last year's tokens. It's like celebrating the collapse of hotel prices in November when your business forces you to travel in August: the discount exists, it just isn't for you.
Tasks eat the discount
The third mechanism is Jevons paradox, when an input becomes more efficient its total consumption grows instead of falling, applied not to the market in the abstract but to a single product's roadmap. Because it isn't just that users use the product more when it costs less to serve them. It's that the product itself, to stay competitive, has to morph into a form that consumes orders of magnitude more tokens for every unit of value sold.
Look at the trajectory of any coding assistant, which is the early version of the future for every other wrapper. First it was completion: a hundred tokens to suggest a line. Then chat with context: a few thousand tokens to answer a question about the code. Then retrieval across the whole repository. Then agent mode: the system reads ten files, plans, writes, runs the tests, reads the error, retries, rereads, rewrites, runs the tests again, consults the docs, retries once more, and every step of this litany is a call to the frontier model, dragging all the accumulated context behind it. A single agentic task can burn through more tokens in ten minutes than a 2023 user consumed in a month.
The result is a race between two curves: the price per token falling and the tokens per task rising. Over the last two years, at the product frontier, the second curve has run faster than the first. This, not incompetence, not naive pricing, is why the most advanced companies in the category found themselves with fixed-price subscriptions covering an exploding variable cost: a short position on token consumption, sold just as consumption per user was taking off. The 2025 repricing, with its attendant user revolts, was the forced closing of that position.
It's worth saying plainly: for a wrapper at the frontier, the cost per task can rise while the cost per token collapses. Anyone modeling their P&L on the first curve while ignoring the second is planning the budget of a product that won't be on the market anymore.
The fourth problem: the supplier is moving up the chain
There's one last mechanism, and it isn't about costs but revenue. Every model generation absorbs a piece of the added value that justified the wrapper's markup. Sophisticated prompt engineering became a native capability of the model. Hand-built retrieval pipelines were eroded by context windows a hundred times larger. Tool orchestration, structured output, web browsing, code execution: everything that in 2023 was the "wrapper", the engineering layer that made a raw model usable, has migrated into the API, or into the consumer products of the labs themselves.
Because the labs aren't just suppliers: they're the best-capitalized competitors in history, and they sell the wrapper's same use case directly to the end user, often at a loss, to win distribution. Jasper learned it first: valued at a billion and a half in October 2022 selling marketing-copy generation on top of GPT-3, thirty days later it found itself competing with ChatGPT, its own supplier, with its own engine, at zero price. The wrapper lives in a vise: deflation takes its margin from below, absorption takes its product from above.
So: where can a margin live
So much for the mechanism. Now the chain of consequences, because this is where it's decided who survives.
First: stop treating token cost as a strategic variable. It isn't one, in either direction. It's not an advantage when it drops (it drops for everyone), not a moat when you optimize it (the optimization that cost you a quarter will be free on next quarter's price list). Every slide projecting margin expansion "thanks to falling inference costs" deserves the same question: what mechanism stops competitors from passing that drop into prices? If there's no answer, that slide describes your customers' future margin, not yours.
Second: pricing has to stop being short on tokens. The flat, unlimited-usage subscription is a bet that consumption per user stays still, and we just saw why it loses: your best users, the ones who should be the most profitable, become the most expensive, because they're the first to adopt the agentic features. Sustainable structures tie price to value or consumption: price per outcome (the document produced, the ticket resolved, the case closed), or hybrids with a fixed base and a pass-through of consumption above a threshold. Price per outcome has a valuable property under deflation: when the cost of the task drops, the saving stays yours, because the customer bought the result, not the tokens. It's the only contract in which deflation works for the wrapper instead of against it.
Third: the moat has to be built with materials you can't buy via API. If tokens are a commodity and the scaffolding gets absorbed, what's left to defend? Three things, in order of solidity. Owning the workflow: being the place where the work happens, with the accumulated context, the history, the preferences, the integrations, the permissions, that makes switching suppliers costly even when the underlying model is identical. Proprietary evaluation data: in a world where everyone has the same model, the winner is whoever can measure better than the rest whether the output is right in their own domain, the evaluation suite built on ten thousand real cases of contracts, diagnoses, or tax returns is an asset no price update devalues, on the contrary: every new model makes it more valuable, because it lets you adopt it sooner and with more confidence than competitors. And distribution, in the boring and decisive sense: enterprise contracts, industry certifications, presence in the buying flow. Notice what they have in common: they accumulate with time and with customers, not with capital. They're slow to build, and that's exactly why they defend.
Fourth: this rewrites the org chart. If the value isn't in calling the model but in verifying it and owning the context, the critical hires change. Fewer engineers optimizing prompts, a fast-depreciating skill, and more people building evaluation systems, context pipelines, deep integrations into the customer's systems; more hybrid figures, half engineer and half domain specialist, able to sit with the customer and turn their way of working into data and constraints for the product. Your best engineer, eighteen months from now, won't be the one who knows how to make the model work: it'll be the one who knows how to prove when it works.
Fifth, for anyone reading the numbers from outside: a wrapper's gross margin at a point in time, today, is almost noise. It's a snapshot of a race between two curves in full motion, distorted by cross-subsidies and promotional pricing. The questions that discriminate are dynamic: how does the cost per task served evolve, at equal quality? How much of retention would survive a change of the underlying model? How much of the perceived value sits in things a lab's next release can absorb? A wrapper with 30% gross margins but a falling cost per task and real switching costs is a company; one with 70% margins propped up by a use case that OpenAI can ship as a feature is a position waiting to be liquidated.
The best case against this thesis
Honesty demands saying where this reasoning stops holding, because there is a limit. All the mechanics described, the forced frontier, the explosion of tokens per task, apply to markets where you compete on capability. But there's a class of products where the task is frozen: classifying tickets, extracting fields from invoices, triaging email. Tasks for which a 2024 model was already enough and a 2027 one will add no perceptible value. There the "at constant capability" clause stops being a mockery and becomes literal: the cost of goods sold really does collapse, year after year, and the margin really does expand.
It's a serious objection, and partly true. But notice what it entails: freezable tasks are also the easiest to replicate, precisely because they don't require the frontier, anyone can rebuild them with cheap models, which reactivates the first mechanism, competitive pass-through, in its fiercest form. The margin survives only where it's protected by something else: distribution, integration, regulatory compliance. Which leads, by another road, to the same conclusion: even in the best scenario for wrappers, the margin doesn't come from tokens. It comes from what you built around them.
Then there's the optimists' objection: some wrappers became companies with hundreds of millions in recurring revenue in record time, so the model works. But revenue and margin are two different claims, and it's the very ease with which this generation of products generates revenue that makes the rush to call it profitable suspect. Growth proves the value to the user is enormous. This piece's thesis doesn't deny it: it argues that this value, by default, flows to the user and to the labs, and that retaining a share of it requires a deliberate plan, not waiting for the price list to drop.
The right question
The question I hear asked is: "how much will tokens cost next year?". It's the wrong question, and by now it should be clear why: whatever the answer, it holds identically for you and for anyone who wants your market.
The right question is the one I recommend writing on the first page of every business plan that contains the word "AI": when tokens cost almost nothing, why will a customer pay you in particular? If the answer is about costs, you don't have a company: you have a temporary arbitrage on a falling price list, and the fall is faster than you are. If the answer is about a workflow you own, a verification only you can do, a context the customer doesn't want to rebuild elsewhere, then token deflation stops being a threat and becomes what it has always been for those on the right side of the chain: a supplier that gets, every year, ten times cheaper.
The collapse of inference costs is real, and it's one of the most powerful economic forces of this decade. But it's a river that won't be bottled: it runs through wrappers' income statements and settles downstream, in users' surplus, and upstream, in the labs' scale. In between, only the companies that built something the water doesn't carry away remain.