Cursor Composer 2.5 matches Opus 4.7 and GPT-5.5 at far lower cost

Cursor has put a new number on the table for AI coding: Composer 2.5 is live in Cursor now, and the company says it matches Opus 4.7 and GPT-5.5 on CursorBench v3.1 and SWE-Bench Multilingual while costing far less than those market leaders.

That combination matters because it changes the unit economics of model use, not just the benchmark slide. Composer 2.5 is listed at $0.50 per million input tokens and $2.50 per million output tokens, which puts it in a different cost band from the premium models it is now claiming parity with. For teams using LLMs as a production dependency, the practical question is no longer whether a stronger model exists, but whether paying several times more per task still makes sense when an alternative can produce similar results.
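To make the per-token prices concrete, here is a minimal sketch of how list prices turn into a per-task cost. Only the $0.50 and $2.50 rates come from the announcement; the token counts are hypothetical placeholders for a single agentic coding task, not figures Cursor has published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List prices in US dollars per one million tokens."""
    input_per_million: float
    output_per_million: float


def task_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task, given how many tokens it reads and writes."""
    return (
        input_tokens / 1_000_000 * pricing.input_per_million
        + output_tokens / 1_000_000 * pricing.output_per_million
    )


# Composer 2.5 list prices as stated in the announcement.
COMPOSER_2_5 = Pricing(input_per_million=0.50, output_per_million=2.50)

# Hypothetical task: 200k tokens of repository context in, 20k tokens of edits out.
cost = task_cost(COMPOSER_2_5, input_tokens=200_000, output_tokens=20_000)
print(f"${cost:.2f} per task")  # prints "$0.15 per task"
```

Plugging another model's list prices into `Pricing` gives a like-for-like comparison for the same token profile, which is the number that matters when a team weighs "several times more per task."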

Built on Kimi K2.5, trained like Cursor meant it

Cursor says Composer 2.5 is built on Moonshot’s Kimi K2.5 foundation, then pushed much further through task generation and post-training. The company reports that the model saw 25x more synthetic tasks than Composer 2, with 85% of its compute budget directed toward extra training and reinforcement learning.

Those are not cosmetic numbers. A 25-fold expansion in synthetic-task coverage suggests Cursor was trying to widen the model’s exposure to code-adjacent behaviors: editing, debugging, refactoring, multi-step tool use, and the sort of instruction-following patterns that matter inside an IDE. The 85% compute allocation to additional training and RL points to a pipeline that is less about sheer base-model scaling and more about squeezing useful behavior out of a strong foundation checkpoint.

That is a familiar direction in frontier model development, but Cursor’s emphasis is unusual in one respect: it is not presenting a larger, more expensive model as the only way to improve. Instead, it is arguing that targeted training on synthetic tasks, plus heavy reinforcement learning, can move a coding model into a much better cost-performance zone without requiring a commensurate jump in inference price.

Benchmarks and pricing tell the same story

Cursor’s claim is not that Composer 2.5 is narrowly optimized for one benchmark family. It says the model matches Opus 4.7 and GPT-5.5 on both CursorBench v3.1 and SWE-Bench Multilingual, which makes the launch about cross-benchmark competitiveness rather than a single-task win.

Cursor reports scores of 79.8% on SWE-Bench Multilingual and 63.2% on CursorBench v3.1. It is using those results to argue that the model is strong enough for real coding workflows across languages and repository styles, rather than being tuned mainly for English-centric prompts. The broader economic claim is even sharper: the model can deliver those numbers at token prices that undercut the premium tier by a large margin.

Cursor also lists a faster variant that it says has the same performance profile, priced at $3.00 per million input tokens and $15.00 per million output tokens, six times the standard rates. Even there, the company is signaling an attempt to give buyers a speed-versus-cost tradeoff without forcing them into the most expensive segment of the market.
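Because both of the faster variant's rates are exactly six times the standard ones ($3.00 vs $0.50 for input, $15.00 vs $2.50 for output), the speed premium should be a flat 6x whatever a task's input/output mix. A small sketch that checks this across a few token mixes; the mixes themselves are hypothetical.

```python
# Announced list prices in US dollars per million tokens: (input, output).
STANDARD = (0.50, 2.50)
FAST = (3.00, 15.00)


def task_cost(prices: tuple[float, float], input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at the given (input, output) per-million rates."""
    input_rate, output_rate = prices
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical token mixes: context-heavy, balanced, and output-heavy tasks.
mixes = [(200_000, 20_000), (50_000, 50_000), (5_000, 80_000)]

for input_tokens, output_tokens in mixes:
    standard = task_cost(STANDARD, input_tokens, output_tokens)
    fast = task_cost(FAST, input_tokens, output_tokens)
    print(
        f"{input_tokens:>7} in / {output_tokens:>6} out: "
        f"${standard:.3f} standard, ${fast:.3f} fast, {fast / standard:.1f}x"
    )
```

The ratio only stays constant because the two rates scale by the same factor; a vendor that discounts input more than output would show a mix-dependent premium instead.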

For teams already building around LLM copilots, code review helpers, or agentic coding pipelines, that matters because task-level cost is often the first constraint to bite. A model that is only marginally better but much more expensive can be a bad fit for high-volume workflows. A model that is benchmark-competitive and materially cheaper changes procurement math immediately.
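One way to frame "marginally better but much more expensive" is expected spend per solved task. The sketch below assumes failed attempts are simply retried and succeed independently, so the expected number of attempts is 1 divided by the success rate; it ignores reviewer time and latency. All costs and success rates are hypothetical illustrations, not measured results for any named model.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per solved task if failed attempts are simply retried.

    Assumes attempts succeed independently with probability `success_rate`,
    so the expected number of attempts is 1 / success_rate (geometric model).
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def break_even_success_rate(cheap_cost: float, cheap_rate: float,
                            premium_cost: float) -> float:
    """Success rate a pricier model needs to match the cheaper model's cost per success."""
    return cheap_rate * premium_cost / cheap_cost


# Hypothetical figures for illustration only; substitute your own measurements.
cheap = cost_per_success(cost_per_attempt=0.15, success_rate=0.60)
premium = cost_per_success(cost_per_attempt=0.75, success_rate=0.65)
needed = break_even_success_rate(cheap_cost=0.15, cheap_rate=0.60, premium_cost=0.75)

print(f"cheap: ${cheap:.2f} per solved task")
print(f"premium: ${premium:.2f} per solved task")
print(f"premium needs a {needed:.0%} success rate to break even")
```

A break-even rate above 100% means no amount of extra accuracy closes the gap on cost alone; the premium model would have to justify itself on something else, such as fewer hard failures that are expensive for engineers to clean up.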

Why live availability in Cursor changes the competitive picture

Composer 2.5 is not a paper launch. It is available in Cursor now, which means the pricing and benchmark claims are immediately testable in a real developer workflow.

That live status is the most important business detail in the announcement. Cursor is not just publishing a model card; it is plugging the model into an environment where adoption depends on actual completion quality, latency, and the feel of iterative coding assistance. That creates a direct alternative for teams that have been absorbing higher token costs from premium model usage inside their editors and internal tools.

If the reported economics hold up in production, the result is likely to be pressure on incumbent vendors to justify why their higher-price models are worth the premium for coding tasks. It also raises the bar for any toolchain that wraps a model in product features but passes through heavy inference costs. Buyers may begin to separate the value of orchestration and interface from the value of the underlying model more aggressively.

Cursor’s longer-term roadmap only sharpens that dynamic. The company says it is already pursuing a larger successor model with ten times the compute, which suggests it sees Composer 2.5 not as an endpoint but as proof that an aggressive compute regime can still be economically disciplined.

The tradeoffs are real, even if the economics are compelling

The launch is strongest where it is most concrete: benchmark parity, low token pricing, and immediate availability. The risks are in what the training recipe may not cover.

Synthetic data at scale can be powerful, but it can also create narrowness if the generated tasks overrepresent certain patterns of coding work and underrepresent edge cases from real repositories. Heavy reinforcement learning can improve behavior on selected objectives, but it can also produce brittle preferences if the reward structure does not line up cleanly with downstream use. For production teams, that means the benchmark numbers are necessary evidence, not sufficient evidence.

The right response is not to dismiss Composer 2.5 because it relies on synthetic training and RL. It is to evaluate it like any other serious model: run it against your own codebases, test failure modes, measure drift, inspect how it handles long-context edits, and compare its cost-adjusted accuracy against the model you already pay for.
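A minimal sketch of the bookkeeping side of such an evaluation, assuming you have already run each model on the same set of tasks from your own repositories and recorded whether each change passed your tests or review. The model labels, the "incumbent" prices, and the sample results are hypothetical placeholders; only the Composer 2.5 rates come from the announcement.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    model: str          # label for the model under test
    task_id: str        # one task drawn from your own repositories
    passed: bool        # did the change pass your tests / review?
    input_tokens: int
    output_tokens: int


# Per-million-token list prices (input, output). Only "composer-2.5" comes from
# the announcement; "incumbent" is a hypothetical stand-in for your current model.
PRICES = {
    "composer-2.5": (0.50, 2.50),
    "incumbent": (4.00, 20.00),
}


def summarize(results: list[TaskResult]) -> dict[str, dict[str, float]]:
    """Aggregate pass rate, total spend, and spend per passing task for each model."""
    grouped: dict[str, list[TaskResult]] = defaultdict(list)
    for r in results:
        grouped[r.model].append(r)

    summary = {}
    for model, rows in grouped.items():
        input_rate, output_rate = PRICES[model]
        spend = sum(
            (r.input_tokens * input_rate + r.output_tokens * output_rate) / 1_000_000
            for r in rows
        )
        passes = sum(r.passed for r in rows)
        summary[model] = {
            "pass_rate": passes / len(rows),
            "spend": spend,
            # Infinite when nothing passed, so a model that never succeeds ranks last.
            "cost_per_pass": spend / passes if passes else float("inf"),
        }
    return summary


# Placeholder results showing the expected shape of the data, not real measurements.
results = [
    TaskResult("composer-2.5", "fix-flaky-test", True, 120_000, 8_000),
    TaskResult("composer-2.5", "rename-api", False, 90_000, 12_000),
    TaskResult("incumbent", "fix-flaky-test", True, 110_000, 7_000),
    TaskResult("incumbent", "rename-api", True, 95_000, 9_000),
]
for model, stats in summarize(results).items():
    print(model, stats)
```

Keeping raw per-task records, rather than only a headline pass rate, makes it possible to rerun the same tasks later and spot drift, and to inspect which kinds of edits each model fails on.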

Cursor has made the economics easy to see. Whether they hold in a given pipeline will depend on the workload. But the launch makes one point hard to ignore: in coding models, benchmark parity no longer has to come with premium pricing.

AI News Desk
