GPT-5.5’s pricing change is easy to summarize and hard to absorb operationally: input tokens now cost $5 per million and output tokens $30 per million, up from $2.50 and $15 in GPT-5.4. That is a straight doubling of list price. OpenAI’s accompanying argument—that shorter responses would help offset the increase—does not hold up cleanly in production traces. In OpenRouter’s April 2026 usage logs, real-world costs still rose 49% to 92%, depending on input length.
For deployment teams, that is the important detail. The delta is not abstract. It lands directly in per-request unit economics, in gross margin assumptions for AI features, and in the runway math for systems that were already tuned around token budgets. A model that looks only moderately more expensive on paper can become materially harder to ship when output length does not compress as much as vendor messaging suggests.
The cost shock is also uneven. According to The Decoder’s breakdown of the OpenRouter data, the shortest prompts were hit hardest. For inputs under 2,000 tokens, average cost climbed from $4.89 per million tokens to $9.37, a 92% increase. In the 2,000 to 10,000 token range, costs rose from $2.25 to $3.81 per million tokens, a 69% increase. Between 10,000 and 25,000 tokens, the increase was 51%; at 25,000 to 50,000 tokens, 62%; and for 50,000 to 128,000 tokens, 49%. Even at 128,000 tokens and above, the increase was still 85%.
That pattern matters more than the headline price table. Token pricing only looks linear if input and output behavior stay stable. In practice, product workflows differ. Short prompts tend to be dominated by output costs, so if responses do not get dramatically shorter, the doubled output price lands almost fully on the bill. Mid-length prompts can behave in a messier way: output may stretch rather than shrink, producing a larger-than-expected increase. Longer prompts spread the economics differently because input tokens start to matter more, but they do not magically neutralize the price change.
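That sensitivity is easy to see with a back-of-the-envelope calculation. The sketch below plugs the two list-price points into a per-request cost formula; the token counts and the 30% shortening are illustrative assumptions, not figures from the OpenRouter logs.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    # Prices are dollars per million tokens.
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# Illustrative request shapes (assumed, not taken from the OpenRouter data).
shapes = {
    "short prompt": (800, 600),
    "long prompt": (40_000, 900),
}

for name, (inp, out) in shapes.items():
    old = request_cost(inp, out, 2.50, 15.00)              # GPT-5.4 list prices
    new_same = request_cost(inp, out, 5.00, 30.00)          # GPT-5.5, same output length
    new_short = request_cost(inp, out * 0.70, 5.00, 30.00)  # GPT-5.5, output 30% shorter
    print(f"{name}: ${old:.5f} -> ${new_same:.5f} same length, ${new_short:.5f} if 30% shorter")
```

Even with an output that is 30% shorter, the short-prompt request in this toy example still ends up roughly 50% more expensive, because output tokens carry most of its cost.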
The Decoder notes that OpenAI pointed to shorter responses as the intended offset. OpenRouter’s logs suggest that offset is highly workload-dependent. For prompts over 10,000 tokens, responses were 19% to 34% shorter, which does soften the blow. But in the 2,000 to 10,000 token band, responses were 52% longer. That is the part that should alarm teams doing real deployments: a model-level pricing change can induce behavior that is not only more expensive, but more variable across request classes.
This is why the issue belongs in infrastructure planning, not just model evaluation. If your feature set is built on token-metered APIs, you are not buying intelligence in the abstract. You are buying a cost curve with sensitivity to prompt shape, response policy, context size, and routing behavior. GPT-5.5 makes that curve steeper in some places and less forgiving in others.
The budgeting risk is straightforward. If your finance model assumed a roughly stable cost per task, or assumed shorter completions would offset list-price increases, those assumptions now need to be revalidated against actual production traces. That means looking at request cohorts by context length, output length, and user intent. A customer-support summarizer, a code-review assistant, and a retrieval-augmented drafting flow can all have very different effective costs even if they hit the same endpoint.
There is also a vendor-risk lesson here. Token-based pricing creates a dependency on both price and behavior. The moment a provider changes either one, unit economics shift. The market has already seen parallel moves elsewhere, including Anthropic’s Opus 4.7 pricing changes that were tied to higher token consumption. The broader signal is that model vendors are still learning how to price for serving cost, and customers are carrying much of that uncertainty.
For product teams, the response should be operational, not rhetorical.
First, re-baseline every high-volume workflow against live prompt traces, not benchmark averages. Benchmark runs can miss the tail behavior that drives bills. Use production samples, then segment by input-length bucket so you can see where GPT-5.5 is actually more expensive.
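One minimal way to do that segmentation, assuming your request logs record input and output token counts per call, is a simple bucketing pass:

```python
from collections import defaultdict

# Bucket edges mirroring the input-length ranges in the OpenRouter breakdown.
BUCKETS = [(0, 2_000), (2_000, 10_000), (10_000, 25_000),
           (25_000, 50_000), (50_000, 128_000), (128_000, float("inf"))]

def cost_by_bucket(traces, input_price, output_price):
    # traces: iterable of dicts with "input_tokens" and "output_tokens"
    # (an assumed schema for your own production logs).
    totals = defaultdict(lambda: {"requests": 0, "cost": 0.0})
    for t in traces:
        inp, out = t["input_tokens"], t["output_tokens"]
        for lo, hi in BUCKETS:
            if lo <= inp < hi:
                totals[(lo, hi)]["requests"] += 1
                totals[(lo, hi)]["cost"] += (inp * input_price + out * output_price) / 1e6
                break
    return totals

# Run the same traces at both price points to see which buckets drive the increase:
# cost_by_bucket(traces, 2.50, 15.00) vs. cost_by_bucket(traces, 5.00, 30.00)
```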
Second, reduce token waste before changing models. Tighten system prompts, remove duplicated instructions, trim retrieval payloads, and cap unnecessary conversational history. Small reductions in context size can compound quickly when both input and output prices move upward.
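A sketch of the history-capping piece, assuming the usual chat-message list of role/content dicts; the limits are illustrative defaults to tune against quality metrics:

```python
def trim_context(messages, max_turns=6, max_chars=4_000):
    # Keep the system prompt plus only the most recent turns, and truncate
    # oversized retrieval payloads before they reach the model.
    system = [m for m in messages if m["role"] == "system"][:1]
    recent = [m for m in messages if m["role"] != "system"][-max_turns:]
    trimmed = []
    for m in system + recent:
        content = m["content"]
        if len(content) > max_chars:
            content = content[:max_chars] + " [truncated]"
        trimmed.append({**m, "content": content})
    return trimmed
```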
Third, treat caching as a first-class cost control. Semantic caching, deterministic response caching, and prompt-result reuse can absorb repeated traffic in support, search, and internal assistant flows. If a large share of requests are near-duplicates, caching often buys more than model switching does.
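As a rough sketch of the deterministic piece (in-memory only; a production version would use a shared store with expiry, and semantic caching by embedding similarity would sit in front of it as a separate layer):

```python
import hashlib
import json

class ResponseCache:
    # Deterministic response reuse: requests with the same normalized prompt,
    # model, and sampling settings return the stored completion instead of
    # paying for new tokens.
    def __init__(self):
        self._store = {}

    def _key(self, messages, model, temperature):
        normalized = json.dumps(
            {
                "messages": [[m["role"], " ".join(m["content"].split())] for m in messages],
                "model": model,
                "temperature": temperature,
            },
            sort_keys=True,
        )
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, messages, model, temperature=0.0):
        return self._store.get(self._key(messages, model, temperature))

    def put(self, messages, model, response, temperature=0.0):
        self._store[self._key(messages, model, temperature)] = response
```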
Fourth, batch wherever latency allows. Aggregating low-urgency requests can reduce orchestration overhead and make it easier to route only the truly complex cases to the most expensive model tier.
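A minimal batching loop, assuming low-urgency requests land on a queue.Queue and downstream code decides how the aggregated batch is sent:

```python
import queue
import time

def drain_batch(jobs, max_batch=20, max_wait_s=2.0):
    # Collect low-urgency jobs into one batch before calling the model layer.
    # Blocks for the first job, then waits up to max_wait_s for stragglers.
    batch = [jobs.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(jobs.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```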
Fifth, keep model routing flexible. A single-model architecture is brittle when pricing shifts. Use a decision layer that can send simple tasks to a cheaper model, reserve GPT-5.5 for high-value requests, and fall back to lower-cost alternatives when quality thresholds are still acceptable.
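In its simplest form, the decision layer is a scoring function and a couple of thresholds. The sketch below assumes a hypothetical estimate_difficulty scorer and placeholder model names; the useful property is that the thresholds become a cost knob you can move without touching product code.

```python
def choose_model(request, estimate_difficulty):
    # Toy routing policy: estimate_difficulty is whatever classifier or
    # heuristic you trust to score a request between 0 and 1. Model names
    # and thresholds are placeholders, not recommendations.
    score = estimate_difficulty(request)
    if score < 0.3:
        return "small-cheap-model"
    if score < 0.7:
        return "mid-tier-model"
    return "gpt-5.5"
```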
Finally, tie cost telemetry to product SLAs. If a feature has a margin target or a hard cost ceiling, track tokens per task, cost per successful outcome, and cost per active user in the same dashboard as latency and error rate. Cost needs to be an operational metric, not a monthly surprise.
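A minimal version of that instrumentation, tracking the three cost metrics named above (the prices and field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class FeatureCostMeter:
    # Per-feature cost telemetry meant to live next to latency and error-rate
    # metrics in the same dashboard.
    input_price: float = 5.00    # $ per million tokens
    output_price: float = 30.00
    tasks: int = 0
    successes: int = 0
    cost: float = 0.0
    users: set = field(default_factory=set)

    def record(self, user_id, input_tokens, output_tokens, success):
        self.tasks += 1
        self.successes += int(success)
        self.cost += (input_tokens * self.input_price + output_tokens * self.output_price) / 1e6
        self.users.add(user_id)

    def summary(self):
        return {
            "cost_per_task": self.cost / max(self.tasks, 1),
            "cost_per_successful_outcome": self.cost / max(self.successes, 1),
            "cost_per_active_user": self.cost / max(len(self.users), 1),
        }
```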
The practical takeaway from the OpenRouter logs is not that shorter answers never help. They do help in some workloads. The point is that the saving is not reliable enough to assume away a doubled list price. For teams shipping on top of GPT-5.5, the default posture should be to verify economics request by request, then design around the worst-performing cohorts first.



