Frontier models have gotten good enough at math that the old question, whether they can solve hard problems, is no longer sufficient. SOOHAK asks a harder one: can they tell when a problem actually has a solution?
That distinction matters because the benchmark’s early results surface a troubling pattern. Some models can produce convincing answers to problems that are intentionally impossible, contradictory, or otherwise unsolvable. In other words, the failure is not always a blank response or an obvious mistake. It can be a confident, fully worked answer to a problem that has none. For teams thinking about deployment, that is a calibration problem, not just an accuracy problem.
SOOHAK is designed to make that failure visible. The benchmark contains 439 original tasks, divided into 340 Challenge problems and 99 Refusal problems. The Challenge set targets graduate- and research-level mathematics. The Refusal set is the more unusual part: these are problems that contain contradictions, omit necessary conditions, or have no valid answer at all. The benchmark’s premise is simple but consequential: a model that can narrate a solution is not necessarily a model that understands solvability.
What gives the benchmark unusual credibility is how it was built. The tasks were created from scratch rather than lifted from competitions or textbooks, which reduces the risk that models have seen similar items during training. A total of 64 mathematicians contributed, including 38 professors, 25 PhD students and postdocs, and five IMO medalists. The group also used staged submissions and reviews, and contributors had to confirm they were working without AI assistance. That matters because benchmark contamination has become one of the quietest ways to inflate performance numbers. SOOHAK tries to shut that door.
The initial results are instructive. Gemini 3 Pro leads on the Challenge set, suggesting that the best frontier systems can do serious work on advanced math. But the same benchmark also shows where current models remain brittle: research-level reasoning is still inconsistent, and recognizing unsolvable tasks remains difficult. Those two weaknesses are related. A system that is optimized to keep answering may appear capable even when the correct move is to refuse, qualify, or pause.
That’s the core evaluation risk. Traditional math benchmarks usually reward the final answer, which can hide whether the model understood the structure of the problem. SOOHAK changes the scoring lens by making solvability itself part of the test. If a model gives a polished derivation for a contradiction, the output may look strong under a standard rubric and still be operationally dangerous in production.
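To make the difference between the two scoring lenses concrete, here is a minimal sketch in Python. It is not SOOHAK's actual grader: the `Item` type, the `UNSOLVABLE` sentinel, and the exact-match comparison are all assumptions chosen for illustration. The point is only that an answer-only rubric has nothing to say about a problem with no reference answer, while a solvability-aware rubric treats declining as the one correct response.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sentinel a model is asked to emit when it declines to answer.
REFUSAL = "UNSOLVABLE"


@dataclass
class Item:
    prompt: str
    reference: Optional[str]  # None marks an intentionally unsolvable item


def score_answer_only(item: Item, response: str) -> Optional[bool]:
    """Answer-only rubric: items without a reference answer go unscored."""
    if item.reference is None:
        return None
    return response.strip() == item.reference


def score_solvability_aware(item: Item, response: str) -> bool:
    """Solvability-aware rubric: on unsolvable items, only a refusal is correct."""
    if item.reference is None:
        return response.strip() == REFUSAL
    return response.strip() == item.reference


# An impossible task paired with a confident but meaningless answer.
impossible = Item("Find a real x with x**2 = -1.", None)
print(score_answer_only(impossible, "x = 1"))             # None: the failure is invisible
print(score_solvability_aware(impossible, "x = 1"))       # False: the failure is counted
print(score_solvability_aware(impossible, "UNSOLVABLE"))  # True: refusal is rewarded
```

A real grader would need something more forgiving than exact string matching, but the asymmetry between the two functions is the part that matters.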
For product teams, that has direct implications. A math assistant used in tutoring, coding, analytics, or scientific workflows cannot be evaluated only on correctness for solvable items. It also needs to be checked for refusal behavior, uncertainty calibration, and contradiction detection. If a model tends to press ahead through impossible inputs, the failure will show up as overconfident nonsense rather than a safe fallback.
The practical response is to move from single-score evaluation to solvability-aware pipelines. That means adding test sets that include intentionally invalid problems, measuring how often the system recognizes them, and separating confidence from correctness in reporting. It also means requiring stronger verification steps downstream: symbolic checks where possible, cross-solver comparisons, and human review for high-stakes outputs.
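One way such a report could look is sketched below in Python. Everything here is an assumption made for illustration: the `Record` fields, the `report` function, and the idea that the model supplies a self-reported confidence between 0 and 1. The sketch keeps accuracy on solvable items, refusal behavior on unsolvable ones, and stated confidence in separate columns rather than folding them into one score.

```python
from dataclasses import dataclass


@dataclass
class Record:
    solvable: bool     # ground truth: does the problem have a valid answer?
    refused: bool      # did the model decline to answer?
    correct: bool      # did the answer match the reference? (solvable items only)
    confidence: float  # hypothetical self-reported confidence in [0, 1]


def rate(flags):
    """Fraction of True values, or None when there is nothing to measure."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def report(records):
    solvable = [r for r in records if r.solvable]
    unsolvable = [r for r in records if not r.solvable]
    # Unsolvable items the model answered anyway: the risky bucket.
    pressed_ahead = [r for r in unsolvable if not r.refused]
    return {
        "accuracy_on_solvable": rate(r.correct and not r.refused for r in solvable),
        "false_refusal_rate": rate(r.refused for r in solvable),
        "refusal_rate_on_unsolvable": rate(r.refused for r in unsolvable),
        "mean_confidence_when_answering_unsolvable": (
            sum(r.confidence for r in pressed_ahead) / len(pressed_ahead)
            if pressed_ahead else None
        ),
    }


# Made-up records, purely to exercise the function.
demo = [
    Record(solvable=True, refused=False, correct=True, confidence=0.9),
    Record(solvable=True, refused=False, correct=False, confidence=0.8),
    Record(solvable=True, refused=True, correct=False, confidence=0.2),
    Record(solvable=True, refused=False, correct=True, confidence=0.7),
    Record(solvable=False, refused=True, correct=False, confidence=0.1),
    Record(solvable=False, refused=True, correct=False, confidence=0.3),
    Record(solvable=False, refused=False, correct=False, confidence=0.9),
    Record(solvable=False, refused=False, correct=False, confidence=0.8),
]
for name, value in report(demo).items():
    print(f"{name}: {value}")
```

The last metric is the one a single accuracy number hides: high average confidence on unsolvable items the model answered anyway is the "confident completion" failure expressed as a number. Tracking false refusals alongside it guards against the opposite error, a model that declines solvable problems to look cautious.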
There is also a prompt-design lesson here. If a model is prone to answering at all costs, prompts should explicitly authorize refusal when the problem is inconsistent or underspecified. That sounds minor, but in production systems a default bias toward completion can be the difference between a useful assistant and a persuasive error generator.
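As a rough illustration of what "explicitly authorize refusal" might look like in practice, the Python sketch below wraps a problem in instructions that give the model a sanctioned way out, then parses the reply. The wording of the instructions, the `UNSOLVABLE:` marker, and both function names are hypothetical. Nothing here comes from SOOHAK or from any particular model API, and whether a given model actually honors such an instruction has to be measured rather than assumed.

```python
# Hypothetical marker the model is asked to use when declining.
REFUSAL_TAG = "UNSOLVABLE:"

# Illustrative wording only; real prompts should be tuned and tested per model.
INSTRUCTIONS = (
    "Solve the problem below. Before solving, check whether the problem is "
    "consistent and fully specified. If it is contradictory, missing a needed "
    "condition, or has no valid answer, do not attempt a solution. Instead, "
    f"reply with '{REFUSAL_TAG}' followed by a one-sentence reason."
)


def build_prompt(problem: str) -> str:
    """Prepend the refusal-authorizing instructions to a problem statement."""
    return f"{INSTRUCTIONS}\n\nProblem: {problem.strip()}"


def parse_response(text: str) -> dict:
    """Split a model reply into either a refusal with a reason or an answer."""
    stripped = text.strip()
    if stripped.startswith(REFUSAL_TAG):
        reason = stripped[len(REFUSAL_TAG):].strip()
        return {"refused": True, "reason": reason, "answer": None}
    return {"refused": False, "reason": None, "answer": stripped}


prompt = build_prompt("Find integers x and y with x + y = 2 and x + y = 3.")
reply = "UNSOLVABLE: the two equations contradict each other."  # example reply
print(parse_response(reply)["refused"])  # True
```

A fixed marker also makes refusals machine-readable, so downstream code can route them to a fallback instead of treating them as answers.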
SOOHAK does not claim that frontier models are bad at math. Its more important finding is subtler: current evaluation frameworks may overstate progress because they do not always test whether a model can recognize the boundary between hard and impossible. As math systems move from benchmark demos into research and enterprise workflows, that boundary is exactly where trust will be won or lost.



