The change is not that AI can now write a few more lines of exploit code. The change is that offensive cyber capability appears to be improving on a roughly 5.7-month doubling curve since 2024, according to new safety research. That rate matters immediately because it turns what used to look like a distant capability risk into a near-term operational one: every product release, agent upgrade, and benchmark win is happening against a backdrop where the models’ offensive utility is compounding fast enough to alter how attackers, vendors, and defenders should plan.

The report’s most important finding is the shape of the curve, not any single score. Researchers say the latest systems, including Opus 4.6 and GPT-5.3 Codex, can complete tasks that take human experts roughly three hours. That does not mean the models are autonomous exploit kits in the wild, and it does not imply every attack class is suddenly solved. But it does mean the frontier models are crossing thresholds that resemble real operator work: reading a target, reasoning through a vulnerability, generating tooling, iterating on failures, and adapting code until the exploit path works.

That is why this is a more technical story than a generic AI safety warning. In most AI evaluation conversations, the key question is whether a model can write better software, follow instructions more reliably, or reason more deeply. Offensive cyber performance is now being pulled along by those same gains. A model that is better at debugging, code synthesis, multi-step planning, and tool use will often become better at exploit generation too, not because the system was explicitly optimized for attacks, but because exploitation is a downstream use of general technical competence.

Why exploit generation is now a model-quality metric

For security researchers, the significance of the 5.7-month doubling is that it reframes offensive cyber from a niche red-team benchmark into a proxy for model quality under adversarial conditions. If the capability curve is this steep, then each generation of models is not merely nudging the ceiling upward; it is compressing the time needed to move from proof-of-concept exploitation to usable attack tooling.
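To make the pace concrete: a fixed 5.7-month doubling period implies roughly a 2x gain every half year, about 4.3x per year, and around 18x over two years. A minimal back-of-envelope calculation, assuming a simple exponential model rather than whatever curve fit the researchers actually used:

```python
# Back-of-envelope growth implied by a fixed doubling period.
# Assumes a simple exponential model; the report's actual fit may differ.

DOUBLING_MONTHS = 5.7  # reported doubling period for offensive cyber capability

def capability_multiplier(months: float) -> float:
    """Relative capability after `months`, normalized to 1.0 at month zero."""
    return 2 ** (months / DOUBLING_MONTHS)

for horizon in (6, 12, 24, 36):
    print(f"{horizon:>2} months -> {capability_multiplier(horizon):5.1f}x")
# Prints roughly: 6 -> 2.1x, 12 -> 4.3x, 24 -> 18.5x, 36 -> 79.5x
```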

That has two implications.

First, exploit construction becomes less of a specialist bottleneck. A human attacker still needs judgment, environment setup, and target selection, but models that can rapidly assemble, test, and refine payloads reduce the amount of manual expertise required to get from vulnerability disclosure to working code.

Second, the evaluation surface changes. If a model can solve tasks that take experts hours, then benchmark design has to account not just for static knowledge but for workflow completion: can the model chain together reconnaissance, reasoning, code generation, and error correction in a way that materially shortens the path to exploitation? That is the kind of capability frontier the report is pointing to, and it is the reason the doubling curve is more informative than any one model's raw score.
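What workflow-completion scoring might look like is easy to sketch, even if real harnesses are far more elaborate. In the illustrative snippet below, the stage names and the `run_stage` callable are invented for this example; the point is that the score credits each completed step in the chain rather than one static answer:

```python
# Hypothetical workflow-completion scoring: credit each stage finished in order,
# stopping at the first failure the way a real attack chain would stall.
from typing import Callable

STAGES = ["recon", "vuln_reasoning", "tooling", "error_correction"]  # illustrative

def score_workflow(run_stage: Callable[[str], bool]) -> dict:
    """run_stage returns True if the model completed the named stage."""
    completed = 0
    for stage in STAGES:
        if not run_stage(stage):
            break
        completed += 1
    return {
        "stages_completed": completed,
        "fraction": completed / len(STAGES),
        "full_chain": completed == len(STAGES),
    }

# Example: a model that clears recon and reasoning but fails at tooling
print(score_workflow(lambda s: s in {"recon", "vuln_reasoning"}))
# {'stages_completed': 2, 'fraction': 0.5, 'full_chain': False}
```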

The product problem: more capability, less friction

For AI labs and tooling vendors, the uncomfortable part is that the same product features that make models useful for legitimate developers can make misuse easier.

Better code models are usually shipped with richer IDE integrations, agent loops, file access, API connectivity, and longer context windows. Those are the features enterprises want for software delivery, debugging, and workflow automation. They are also the features that reduce friction for an attacker who is using the model to draft exploit logic, reason over a codebase, or iterate on a payload.

That creates a product design problem, not just a safety paper problem. If offensive capability is doubling every 5.7 months, then access controls, logging, rate limits, abuse detection, and task-level policy enforcement can no longer be treated as secondary guardrails. They become core product requirements. Labs will need to think about where the model is allowed to execute code, how tool use is monitored, what gets retained for incident review, and which high-risk workflows deserve tighter friction even if that makes the product marginally less convenient for honest users.
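As one illustration of what task-level enforcement could mean in practice, consider a gate that sits between an agent and its tools: deny high-risk tools by default, rate-limit everything, and log every decision for incident review. The tool names, limits, and policy below are invented for the sketch and are not any vendor's actual API:

```python
# Illustrative task-level policy gate for agent tool calls. Tool names,
# limits, and logging fields are assumptions, not any lab's real product API.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool-policy")

HIGH_RISK_TOOLS = {"shell_exec", "network_scan", "binary_patch"}  # hypothetical
MAX_CALLS_PER_MINUTE = 30  # hypothetical rate limit

_recent_calls: list[float] = []

def allow_tool_call(tool: str, user_id: str) -> bool:
    """Deny high-risk tools by default, rate-limit the rest, log every decision."""
    now = time.time()
    _recent_calls[:] = [t for t in _recent_calls if now - t < 60.0]
    if tool in HIGH_RISK_TOOLS:
        log.warning("blocked high-risk tool: user=%s tool=%s", user_id, tool)
        return False  # a real system might route this to human review instead
    if len(_recent_calls) >= MAX_CALLS_PER_MINUTE:
        log.warning("rate limit exceeded: user=%s tool=%s", user_id, tool)
        return False
    _recent_calls.append(now)
    log.info("allowed: user=%s tool=%s", user_id, tool)
    return True
```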

This is especially relevant for agentic systems. As models become more autonomous in software engineering contexts, the attack surface expands from prompt content to the surrounding orchestration: browser access, shell access, repository permissions, CI hooks, secret handling, and multi-step planning. The more a model can operate like a junior engineer, the more it can also operate like a junior attacker if abused.
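One way to keep that orchestration surface small is least-privilege scoping: each agent declares exactly which repositories, hosts, and capabilities it may touch, and everything else is denied by default. A minimal sketch, with all permission names and the example agent invented for illustration:

```python
# Hypothetical least-privilege scoping for an agent: anything not explicitly
# granted is denied. Permission names and the example agent are illustrative.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AgentScope:
    repos: frozenset = field(default_factory=frozenset)          # readable repos
    network_hosts: frozenset = field(default_factory=frozenset)  # egress allowlist
    shell: bool = False                                          # no shell by default

# An agent that fixes CI failures needs one repository and nothing else
ci_fixer = AgentScope(repos=frozenset({"org/service-a"}))

def can_reach(scope: AgentScope, host: str) -> bool:
    """Default-deny egress: only hosts on the allowlist are reachable."""
    return host in scope.network_hosts

assert not can_reach(ci_fixer, "internal-secrets.example")  # denied by default
assert not ci_fixer.shell
```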

What defenders should assume now

Security teams should not read this as a prediction that every vulnerability will suddenly become exploitable by AI. They should read it as a change in timing and scale.

If offensive capability is on an exponential curve, defenders should expect three concrete shifts:

  • Faster exploit prototyping. Once a vulnerability becomes public or is partially understood, the time required to generate a usable exploit path should shrink.
  • Broader low-skill misuse. Less-experienced attackers may be able to do more with less manual work, increasing the pool of capable adversaries.
  • Shorter time-to-abuse. The window between disclosure, scanning, proof-of-concept creation, and active exploitation may compress, especially where models can help automate iteration.

That pushes defenders toward operational controls that assume AI assistance on the other side of the keyboard. Hardening and patch management matter more, not less, but so do rate limits, anomaly detection, abuse-aware logging, and detections tuned for machine-generated attack patterns: high-volume iteration, rapid parameter changes, repeated failed probes, and scripts that adapt faster than a human operator typically would.
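A toy version of that kind of detection is straightforward: flag sources whose failed probes arrive faster, and mutate more, than a human operator plausibly could. All thresholds and event fields below are illustrative assumptions, not tuned values:

```python
# Toy detector for machine-speed probing: flag sources whose failed requests
# arrive faster, and mutate more, than a human plausibly would. Thresholds
# and event fields are illustrative assumptions.
from collections import defaultdict

MIN_EVENTS = 20             # enough samples before judging a source
MAX_MEAN_GAP_SECS = 0.5     # sub-second average spacing between failures
MIN_DISTINCT_PAYLOADS = 15  # rapid parameter mutation across attempts

def flag_machine_speed(events: list[dict]) -> set[str]:
    """events: [{'src': str, 'ts': float, 'payload': str, 'ok': bool}, ...]"""
    failures = defaultdict(list)
    for e in events:
        if not e["ok"]:  # only failed probes count toward the signal
            failures[e["src"]].append(e)
    flagged = set()
    for src, evs in failures.items():
        if len(evs) < MIN_EVENTS:
            continue
        ts = sorted(e["ts"] for e in evs)
        mean_gap = (ts[-1] - ts[0]) / (len(ts) - 1)
        distinct = len({e["payload"] for e in evs})
        if mean_gap < MAX_MEAN_GAP_SECS and distinct >= MIN_DISTINCT_PAYLOADS:
            flagged.add(src)
    return flagged
```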

Enterprise buyers should also update procurement assumptions. A vendor claiming model leadership on coding or agentic workflows is, by implication, shipping a system whose offensive misuse potential may also be rising. That does not disqualify the product. It does mean the buyer should ask for measurable controls, not vague reassurance.

The policy and market split this will force

The policy debate around AI cyber risk often gets framed as a distant regulatory question: should governments restrict release, mandate audits, or require red-team testing? Those questions matter, but the new research makes them harder to separate from product and market decisions.

If offensive cyber capability continues doubling every six months or so, model release strategy will increasingly be judged by whether a lab can demonstrate abuse resistance in practice, not just benchmark leadership on coding or reasoning tasks. Audit standards will need to say something meaningful about tool use, exploit assistance, and post-release monitoring. And enterprise procurement will likely shift toward asking not only how good the model is, but how well it can be fenced off from high-risk misuse.

That is the deeper implication of the report: AI cyber safety is no longer just about whether a model might someday become dangerous in the abstract. It is about the fact that the same engineering progress that makes models more valuable for software work is also making offensive use easier on a visible, measurable curve.

The headline number — a roughly 5.7-month doubling in offensive cyber capability since 2024 — should be read as a warning about pace, not apocalypse. But pace is the point. When capability is compounding that quickly, the institutions that ship and defend AI systems do not get to plan as if misuse remains a slow-moving edge case.