Researchers are beginning to treat test-time scaling less like a fixed recipe and more like a search problem.

That is the shift behind AutoTTS, a system that lets Claude Code explore, inside a simulated offline environment, how an AI model should spend inference compute. Instead of engineers hard-coding when to fork multiple solution paths, extend a chain of thought, or cut off a weak branch, the agent searches for policies that make those decisions. The result is not autonomous reasoning in the philosophical sense. It is something more practical and, for product teams, more consequential: AI-discovered scaling policies.

Test-time scaling has long been a manual art. If a task looks hard, a system can spend more compute by sampling multiple paths, expanding a promising candidate, or revisiting intermediate steps. Those choices have usually been encoded as human-written heuristics. AutoTTS moves that design work from engineers to a search agent. The researchers define a control space for scaling, essentially a width-versus-depth tradeoff, and let Claude Code search within it.

That matters because many of the tricks used in modern inference can be framed as variations on the same theme. Width means how many candidate solution paths run in parallel. Depth means how far each path is allowed to go before the system stops or reallocates compute. In that framing, a lot of known methods are not fundamentally different algorithms so much as different coordinates in a shared policy space. AutoTTS takes that idea seriously and asks whether an AI agent can discover useful coordinates that humans would not have bothered to handcraft.
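To make the "coordinates in a shared policy space" framing concrete, here is a minimal sketch in Python. The class name, field names, and numbers are illustrative assumptions, not anything taken from AutoTTS. It shows three very different strategies that spend the same upper bound of compute and differ only in where they sit on the width and depth axes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPoint:
    """One coordinate in a hypothetical width/depth policy space."""

    width: int  # candidate solution paths run in parallel
    depth: int  # maximum reasoning steps allowed per path

    def step_budget(self) -> int:
        # Upper bound on compute, measured in reasoning steps.
        return self.width * self.depth


# Illustrative coordinates only; the numbers are made up.
single_long_chain = ScalingPoint(width=1, depth=64)
many_short_samples = ScalingPoint(width=16, depth=4)
balanced = ScalingPoint(width=8, depth=8)

for name, point in [
    ("single_long_chain", single_long_chain),
    ("many_short_samples", many_short_samples),
    ("balanced", balanced),
]:
    print(name, point.width, point.depth, point.step_budget())
```

All three points share the same step budget, which is the sense in which they are different coordinates rather than different algorithms. A real policy would presumably vary these values per task or per step rather than fix them up front.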

The answer, according to the researchers’ account, is yes — at least inside the simulation.

What changed: AI searches for its own scaling rules

The immediate novelty is not that a model is being asked to solve harder problems with more compute. It is that the rules for allocating that compute are no longer authored entirely by people.

Claude Code is used as the search agent inside AutoTTS. The system gives it a playground where it can test different decisions about when to branch, prune, continue, or stop solution paths. Because this happens in a simulated offline environment, the search is cheap relative to trying every policy in a live production stack. That matters: if policy search itself is expensive, then many potentially useful strategies never get explored.
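The four decisions named above (branch, prune, continue, stop) suggest a simple interface that a search agent could fill in. The sketch below is an assumption about what such an interface might look like; the `PathState` fields and the thresholds are hypothetical, and the rule shown is exactly the kind of hand-written heuristic that a discovered policy would replace.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    BRANCH = "branch"
    PRUNE = "prune"
    CONTINUE = "continue"
    STOP = "stop"


@dataclass(frozen=True)
class PathState:
    score: float      # hypothetical estimate of how promising the path is, in [0, 1]
    steps_taken: int  # reasoning steps already spent on this path
    budget_left: int  # reasoning steps remaining across all paths


# A policy is any function from the observed state to one of the four actions.
Policy = Callable[[PathState], Action]


def threshold_policy(state: PathState) -> Action:
    """Hand-written placeholder rule with made-up thresholds."""
    if state.budget_left <= 0:
        return Action.STOP
    if state.score < 0.2:
        return Action.PRUNE
    # Fork only if the budget could pay for a second path of similar length.
    if state.score > 0.8 and state.budget_left >= 2 * state.steps_taken:
        return Action.BRANCH
    return Action.CONTINUE


print(threshold_policy(PathState(score=0.9, steps_taken=2, budget_left=10)))
```

Keeping the policy behind a single function signature means a hand-written rule and a searched one can be swapped and compared without touching the rest of the inference loop.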

By moving the search into simulation, AutoTTS turns scaling policy design into a controlled optimization problem. The agent can iterate over candidate control rules without affecting a live product, and the researchers can measure which policies appear to use compute more efficiently. That is the central promise here: not just better accuracy, but better compute efficiency at test time.
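One common way to judge "more efficient" when both accuracy and compute matter is to keep only the policies that no other policy beats on both counts. The sketch below applies that Pareto filter to invented numbers. The policy names and results are placeholders, and the source does not say which comparison the researchers used.

```python
from typing import Dict, List, Tuple

Result = Tuple[float, float]  # (accuracy, average steps spent per task)


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as accurate and as cheap as b, and better on one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])


def pareto_front(results: Dict[str, Result]) -> List[str]:
    """Names of policies that no other policy dominates, sorted by name."""
    return sorted(
        name
        for name, result in results.items()
        if not any(dominates(other, result) for other in results.values())
    )


# Invented numbers for illustration only.
results = {
    "hand_written_baseline": (0.70, 64.0),
    "candidate_a": (0.70, 40.0),
    "candidate_b": (0.74, 64.0),
    "candidate_c": (0.66, 70.0),
}

print(pareto_front(results))
```

In this toy data the baseline drops out because one candidate matches its accuracy with fewer steps and another is more accurate at the same cost. Choosing between the survivors is still a product decision about how much a point of accuracy is worth.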

For technical readers, the important conceptual break is this: the model is not merely following a fixed inference script. It is participating in the discovery of the script.

How AutoTTS works: from simulation to policy discovery

The paper’s setup, as described in reporting on the work, starts by defining a shared space in which different test-time scaling approaches can be compared. The axes are width and depth. Width captures parallelism — more candidate paths, more branching. Depth captures persistence — longer deliberation on a path before the system stops or reassigns budget.

From there, AutoTTS places Claude Code into an offline environment that simulates the consequences of those choices. Rather than asking researchers to enumerate every heuristic by hand, the system lets the agent explore policy space and discover control rules that decide how compute should be allocated.
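The reporting does not spell out how the offline environment is built. One plausible construction, assumed here purely for illustration, is to record solution paths once and then replay candidate policies over the recordings, so that evaluating a policy never calls the model. In the sketch below, `RecordedPath`, its fields, and the three policy parameters are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RecordedPath:
    scores: List[float]  # hypothetical per-step confidence logged in advance
    correct: bool        # whether the finished path reached the right answer


def replay(
    paths: List[RecordedPath], width: int, depth: int, prune_below: float
) -> Tuple[bool, int]:
    """Replay a policy over recorded paths; return (solved, steps_spent)."""
    steps_spent = 0
    solved = False
    for path in paths[:width]:  # width: how many recorded paths we may open
        finished = True
        for step, score in enumerate(path.scores):
            if step >= depth:  # depth: cut the path off before it finishes
                finished = False
                break
            steps_spent += 1
            if score < prune_below:  # prune a path that looks weak
                finished = False
                break
        if finished and path.correct:
            solved = True
    return solved, steps_spent


paths = [
    RecordedPath(scores=[0.1, 0.2, 0.3], correct=False),
    RecordedPath(scores=[0.6, 0.7, 0.9], correct=True),
]

print(replay(paths, width=2, depth=3, prune_below=0.0))
print(replay(paths, width=2, depth=3, prune_below=0.5))
```

With these two recorded paths, pruning the weak path early reaches the same answer while spending fewer steps, which is the kind of effect a search over policies would be looking for. The same toy also shows the risk raised later in the article: a policy tuned to these recordings has only ever seen these recordings.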

That is a subtle but important distinction. This is not an agent inventing a new foundation-model architecture. It is an agent discovering a control policy for inference-time search. In other words, the output is a set of scaling behaviors, not a new reasoning engine.

Still, the fact that the policy is discovered rather than specified by a human changes the workflow. Human researchers become designers of the search environment and evaluators of the resulting policies. The machine becomes the one that traverses policy space.

That has a concrete operational upside. If an AI agent can identify policies that match or outperform known methods while using less compute, teams may be able to reduce inference cost without sacrificing quality. In a world where serving budgets and latency targets often determine what can actually ship, compute efficiency is not a cosmetic improvement. It is product economics.

But the same property that makes AutoTTS interesting also makes it difficult to manage. A policy discovered in simulation may be effective because it exploits patterns in the offline setup that are not obvious to humans. That raises immediate questions about transferability. Does the policy hold up when the input distribution shifts? Does it remain robust when the task mix changes? Can the organization explain why it behaves the way it does?

Those questions matter because test-time scaling is already a layer where many teams accept a degree of opacity. Adding AI-discovered scaling policies pushes that opacity deeper into the stack.

Why this matters for rollout, cost, and governance

If the simulation results translate to real systems, the upside is straightforward: smarter inference-time allocation could lower cost per task and improve throughput. That is especially appealing for products that already rely on branching search, multi-sample decoding, or iterative refinement.

But a lower-compute policy is not automatically a safer or more reliable one. In practice, a policy’s value depends on whether it behaves consistently across workloads, whether it can be reproduced in evaluation, and whether its failure modes are understandable enough to monitor. Those are governance questions, not just model-quality questions.

The move toward AI-discovered scaling policies complicates three parts of rollout.

First, interpretability. A human-designed heuristic can usually be described in a sentence, even if it is crude. An AI-discovered policy may be empirically strong while remaining hard to summarize in a way engineers can review.

Second, reproducibility. If a policy is the product of search in a simulated offline environment, teams need to know how sensitive it is to the simulation design, the scoring function, and the search budget. Small changes in those ingredients can produce very different policies.

Third, governance. Once the system that allocates inference compute is itself discovered by an AI agent, oversight can no longer stop at the base model. Teams need to audit the policy layer too.
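The reproducibility concern above can be probed with a simple experiment: rerun the same search under different seeds and budgets and count how many distinct policies win. The sketch below does this for a toy random search over width and depth. The objective function, its constants, and the grid bounds are made up, and AutoTTS uses an agent rather than random sampling, so this shows only the shape of the check.

```python
import random
from collections import Counter
from typing import Iterable, Tuple

Policy = Tuple[int, int]  # (width, depth)


def toy_score(policy: Policy, cost_weight: float) -> float:
    """Made-up objective: diminishing returns minus a compute penalty."""
    width, depth = policy
    quality = 1.0 - 1.0 / (1.0 + 0.3 * width + 0.2 * depth)
    return quality - cost_weight * width * depth


def random_search(seed: int, trials: int, cost_weight: float) -> Policy:
    """Sample `trials` policies with a seeded RNG and return the best one."""
    rng = random.Random(seed)
    candidates = [(rng.randint(1, 16), rng.randint(1, 16)) for _ in range(trials)]
    return max(candidates, key=lambda p: toy_score(p, cost_weight))


def winners(
    trials: int, cost_weight: float, seeds: Iterable[int] = range(20)
) -> Counter:
    """Count which policy wins across seeds for a fixed search budget."""
    return Counter(random_search(seed, trials, cost_weight) for seed in seeds)


for trials in (5, 50, 500):
    tally = winners(trials, cost_weight=0.002)
    print(trials, "trials:", len(tally), "distinct winning policies")
```

If the number of distinct winners stays high as the budget grows, or shifts sharply when `cost_weight` changes, that is a sign the discovered policy depends on search settings more than on the task itself.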

That does not mean these systems should be treated as black boxes by default, or that they are unusable. It means the control surface is shifting. Product teams are moving from hand-authored rules toward a regime where the important work is setting boundaries, testing outputs, and deciding when to trust the discovered policy.

The handoff to human oversight remains essential. The model may search policy space, but people still need to define objectives, constrain the environment, validate results, and sign off before anything reaches production.

What teams should prepare for next

For engineers and product leaders, AutoTTS points to a fairly practical agenda.

Build reproducible evaluation pipelines first. If a scaling policy is discovered in simulation, the organization should be able to rerun the search, inspect the conditions under which it wins, and compare it against simpler baselines.

Instrument monitoring around policy behavior, not just output quality. If the policy decides when to branch, prune, or continue, those decisions themselves become part of the observability stack.

Keep a clear human approval layer for deployment. The system may discover the policy, but the organization owns the rollout decision. That means explicit review for performance, robustness, and edge-case behavior before any production use.

Finally, treat policy search as part of model governance. AI-discovered scaling policies are not just an optimization trick. They are a new source of operational behavior in the inference stack, and that makes them subject to the same scrutiny as any other critical control system.
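The monitoring recommendation above can start small. As a sketch under stated assumptions, the code below turns a log of policy decisions into per-action rates and flags any action whose rate has moved away from a reference window. The action names, the log format, and the 10-point tolerance are all hypothetical choices.

```python
from collections import Counter
from typing import Dict, Iterable, List


def decision_rates(decisions: Iterable[str]) -> Dict[str, float]:
    """Share of each action (for example 'branch' or 'prune') in a decision log."""
    counts = Counter(decisions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {action: n / total for action, n in counts.items()}


def drifted(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.10
) -> List[str]:
    """Actions whose rate moved by more than `tolerance` from the baseline."""
    actions = set(baseline) | set(current)
    return sorted(
        action
        for action in actions
        if abs(baseline.get(action, 0.0) - current.get(action, 0.0)) > tolerance
    )


# Invented logs: the second window prunes far more often than the first.
baseline = decision_rates(["continue"] * 6 + ["branch"] * 2 + ["prune"] * 2)
current = decision_rates(["continue"] * 3 + ["branch"] * 2 + ["prune"] * 5)

print(drifted(baseline, current))
```

A shift like this says nothing on its own about output quality, but it is an early signal that the policy is seeing inputs unlike those it was searched on, which is when a human reviewer should look.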

AutoTTS does not eliminate the need for human judgment. It relocates it. Instead of asking people to invent the heuristics that govern test-time scaling, it asks them to create the environment in which a machine can search for those heuristics — and then to decide which discovered policies are trustworthy enough to ship.