A single MirrorCode task running nonstop for 19 days and costing about $2,600 is an awkwardly useful data point for autonomous coding. It is long enough to show that leading models can sustain a serious software-reconstruction attempt, and expensive enough to remind anyone watching that capability does not arrive free of operational trade-offs.

MirrorCode, built by Epoch AI and METR, is not a code-completion benchmark in the familiar sense. It asks models to recreate complete programs from scratch across 25 target programs spanning different domains, then grades them with hidden end-to-end tests. That design matters because it shifts the question from “can the model produce plausible code?” to “can the model reconstruct a working system that survives execution?”

The recent run reported by The Decoder underscores how demanding that setup can be. One task consumed 19 days of continuous runtime and roughly $2,600 in compute. Claude Opus 4.7 currently leads MirrorCode with a 56% solve rate, and the same reporting notes a separate result in which the model rebuilt a 16,000-line bioinformatics toolkit in 14 hours. Taken together, the two data points suggest that frontier models can handle meaningful software reconstruction, but that performance remains uneven across tasks of different difficulty.

That variability is the technical story worth watching. In a benchmark built around hidden tests, there is no comforting intermediate signal that tells you the task is going well. The system has to keep working through long stretches of uncertainty, with failures often surfacing only at the end. For model developers and tool builders, that pushes several design concerns to the foreground: checkpointing, memory management, recovery from stalled runs, and the ability to resume after partial failure without losing the thread of a large codebase.
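To make the checkpointing and resume concern concrete, here is a minimal sketch of durable state for a long-running agent. Everything in it is an assumption made for illustration: the `RunState` fields, the JSON file format, and the function names are hypothetical and are not drawn from MirrorCode or from any harness described in the reporting. The one real technique on display is an atomic write, where state goes to a temporary file first and is then renamed over the previous checkpoint, so that a crash mid-write cannot leave a half-written file behind.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class RunState:
    """Hypothetical agent state; the fields are illustrative only."""
    step: int = 0
    spent_usd: float = 0.0
    notes: list = field(default_factory=list)


def save_checkpoint(state: RunState, path: str) -> None:
    """Write state to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(asdict(state), handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        # The rename is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> RunState:
    """Resume from the last checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return RunState()
    with open(path) as handle:
        return RunState(**json.load(handle))
```

A run that calls `save_checkpoint` after each unit of work and `load_checkpoint` on startup loses at most one unit when it is interrupted. A real agent would also need to persist its working tree and conversation context, which this sketch leaves out.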

In practice, that means autonomous coding is starting to look less like a single-pass generation problem and more like a distributed systems problem. A long-running agent needs durable state, strong observability, and test harnesses that can tell the difference between apparent progress and actual correctness. Step-by-step evaluation is not enough when the unit of work is measured in days and the verification signal arrives only through end-to-end tests.

The cost profile is just as important as the technical one. A $2,600 run for a single task is not a universal benchmark for what all autonomous coding will cost, but it is a clean reminder that sustained inference, repeated retries, and long context-heavy sessions can add up quickly. For teams thinking about production deployment, that raises obvious questions about compute budgets, throttling policies, and where to place human review checkpoints. If a model can spend 19 days attempting one reconstruction task, the business problem is no longer only about whether it can finish, but whether the expected value of letting it continue justifies the spend.
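The arithmetic behind that question is simple enough to write down. The sketch below converts the reported figures into an average burn rate and then applies a deliberately crude stopping rule: keep going only while the probability of success times the value of a finished task exceeds the expected remaining cost. The burn rate assumes spend was spread evenly across the 19 days, which the reporting does not confirm. The probability, value, and remaining-cost inputs are hypothetical numbers a team would have to estimate for itself, and the function names are invented for this example.

```python
def burn_rate_per_hour(total_usd: float, days: float) -> float:
    """Average spend per hour, assuming cost accrued evenly over the run."""
    return total_usd / (days * 24)


def should_continue(p_success: float, value_usd: float,
                    remaining_cost_usd: float) -> bool:
    """Crude expected-value rule: continue only if the expected payoff
    exceeds the expected remaining spend. Money already spent is ignored
    because it is sunk."""
    return p_success * value_usd > remaining_cost_usd


# Reported figures: roughly $2,600 over 19 days of continuous runtime.
hourly = burn_rate_per_hour(2600, 19)
print(f"about ${hourly:.2f} per hour, ${hourly * 24:.2f} per day")
```

At the reported figures this works out to roughly $5.70 per hour, or about $137 per day. The hourly number looks small, which is part of the point: the total comes from duration, not from any single expensive step.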

Hidden end-to-end tests also change the risk picture. Because the final outcome is not fully observable until execution completes, a system can appear to be making steady progress while quietly accumulating correctness debt. That makes guardrails harder to design and more important to enforce. Reproducibility, logging, and rollback paths stop being “nice to have” engineering hygiene and become part of the control plane for autonomous software generation.

MirrorCode should therefore be read as a signal, not a finish line. Its value is that it measures end-to-end reconstruction rather than surface-level code completion, and in doing so it reveals both how far current models have come and where they still break down. A 56% solve rate at the top of the chart is impressive, but the benchmark’s hardest tasks are still stumping every tested model. That is a meaningful boundary: the field is no longer arguing about whether autonomous coding is real, but about how often it can be made reliable enough to trust.

For technical teams, the practical response is not to chase a headline runtime. It is to experiment in smaller, cost-aware loops, decompose large tasks into modular synthesis steps, and treat every autonomous run as an auditable process rather than a black box. If an AI coding system is going to operate with limited supervision, it needs explicit governance, clear monitoring, and tests that make failures visible early rather than after the bill arrives.
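One way to picture such a loop is a driver that stops an agent for any of three reasons: the work passes, the budget runs out, or progress stalls. The sketch below is an illustration under stated assumptions, not a description of how MirrorCode or any vendor runs its agents. `StepResult`, `run_with_guardrails`, and the `step` callable are hypothetical names, and `tests_passed` stands in for whatever visible test suite a team maintains locally, since the benchmark's own tests are hidden.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    """Outcome of one agent step (hypothetical shape)."""
    cost_usd: float
    tests_passed: int


def run_with_guardrails(
    step: Callable[[], StepResult],
    budget_usd: float,
    max_stalled_steps: int,
    total_tests: int,
) -> Tuple[str, List[StepResult]]:
    """Run steps until tests pass, the budget is spent, or progress stalls.

    Returns the stop reason and the full step history for auditing.
    """
    spent = 0.0
    best = -1
    stalled = 0
    history: List[StepResult] = []
    while True:
        result = step()
        spent += result.cost_usd
        history.append(result)
        if result.tests_passed > best:
            best = result.tests_passed
            stalled = 0  # new high-water mark resets the stall counter
        else:
            stalled += 1
        if result.tests_passed >= total_tests:
            return "passed", history
        if spent >= budget_usd:
            return "budget_exhausted", history
        if stalled >= max_stalled_steps:
            return "stalled", history
```

Returning the history alongside the stop reason keeps each run auditable: a reviewer can see what was spent at every step and where progress flattened out, rather than only the final outcome.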

MirrorCode suggests that end-to-end program reconstruction is within reach for leading models. It also shows why reaching it is only half the story.