AI benchmarks have a reputation for precision they often do not deserve. A model posts a higher score, another slips a few points, and the leaderboard line looks like evidence. But a Google study now adds a blunt measurement critique: many common benchmark setups are trying to infer a stable human judgment from a panel so small that disagreement is not a nuisance around the signal — it is the signal.
That matters because a lot of AI evaluation work assumes that three, four, or five raters can stand in for a broader consensus. On tasks where the answer is genuinely subjective — whether a response is safe, whether a comment is toxic, whether an output is helpful enough to ship — that assumption is fragile. If the people doing the rating do not agree with each other in the first place, then the benchmark is not just measuring the model. It is also measuring how the evaluation was sampled.
The study’s core finding is not that benchmarks are useless. It is that many are underpowered for the kind of claim they are being asked to support. In the setups the Google researchers examined, the common practice of using only three to five human raters per example often failed to produce reliable results. The paper’s threshold is sharper than the norm: at least ten raters per item are often needed if the goal is dependable measurement rather than a rough impression.
That does not mean every evaluation needs to be flooded with annotators. The more important point is budget allocation. The researchers argue that around 1,000 annotations can be enough to reach reliable results, but only if they are distributed intelligently between the number of examples and the number of raters per example. Rate each item too thinly and the judgments wobble; pile raters onto a narrow set of items and you learn less about the space you are trying to evaluate. The problem is not simply how much annotation you buy; it is how you split it.
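A toy Monte Carlo sketch makes the tradeoff concrete. Everything here is assumed for illustration, not taken from the paper: a fixed budget of 1,000 labels (the figure the study cites), a hypothetical item pool whose true flag rates are drawn from a Beta(2,2) distribution, and a binary majority-vote label per item. The standard deviation across repeated annotation runs is the quantity to watch; it shows how the same budget can buy very different amounts of stability depending on the split.

```python
import random
import statistics

random.seed(0)

BUDGET = 1000  # total annotations, matching the figure cited in the study

# Hypothetical item pool: each item has a true probability that a randomly
# drawn rater flags it. Values near 0.5 represent genuine disagreement.
true_p = [random.betavariate(2, 2) for _ in range(2000)]

def estimate_flag_rate(n_items, raters_per_item):
    """One simulated annotation run: sample items, collect votes,
    and return the fraction of items flagged by majority vote."""
    items = random.sample(true_p, n_items)
    flagged = 0
    for p in items:
        votes = sum(random.random() < p for _ in range(raters_per_item))
        if votes * 2 > raters_per_item:
            flagged += 1
    return flagged / n_items

# Same budget, different splits: depth per item vs. item coverage.
for raters in (3, 5, 10, 20):
    n_items = BUDGET // raters
    runs = [estimate_flag_rate(n_items, raters) for _ in range(200)]
    print(f"{raters:2d} raters x {n_items:3d} items: "
          f"mean={statistics.mean(runs):.3f}  sd={statistics.stdev(runs):.3f}")
```

The sketch also surfaces a subtler effect: the mean itself shifts with rater count, because a majority vote over more raters pushes ambiguous items more decisively toward one label. That is exactly why the split is a design decision, not a rounding detail.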
That distinction gets lost in leaderboard culture, where annotation volume is often treated as a generic proxy for rigor. The study suggests that is the wrong abstraction. If your benchmark is designed to identify the majority view — for example, to answer whether most raters would flag a response as unsafe — then broader item coverage with fewer raters per item may make sense. But if the task is meant to capture the full spread of human judgment, a majority vote can erase the very disagreement you need to see. Those are different goals, and they require different experimental designs.
For model comparisons, this is not a minor statistical footnote. If two systems differ by a couple of points on a benchmark built from thinly rated examples, the gap may be smaller than the noise introduced by human disagreement. In that case, the “winner” may simply be the model that benefited from one sample split, one annotator pool, or one slightly different vote threshold. That is especially consequential when teams use benchmarks to choose between models, negotiate with vendors, or justify that one system is materially better than another.
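A minimal simulation, under assumed numbers, shows how often a thin eval can get the ranking wrong. Suppose model A is genuinely better, but only slightly: a randomly drawn rater prefers it on 52% of items versus 50% for model B. With 300 items and three raters each (all hypothetical figures), the worse model still ties or wins a noticeable share of simulated evaluations.

```python
import random

random.seed(1)

def eval_score(true_win_p, n_items, raters_per_item):
    """Benchmark score: fraction of items where a majority of raters
    prefer the model, given the model's true per-rater win probability."""
    wins = 0
    for _ in range(n_items):
        votes = sum(random.random() < true_win_p for _ in range(raters_per_item))
        if votes * 2 > raters_per_item:
            wins += 1
    return wins / n_items

# Model A is truly (slightly) better: 52% vs. 50% per-rater preference.
TRIALS = 1000
flips = 0
for _ in range(TRIALS):
    a = eval_score(0.52, n_items=300, raters_per_item=3)
    b = eval_score(0.50, n_items=300, raters_per_item=3)
    if b >= a:
        flips += 1
print(f"worse model ties or wins in {flips / TRIALS:.0%} of simulated evals")
```

The exact flip rate depends on the assumed gap and budget, but the shape of the result is the point: when the true difference is a couple of points and the rating is thin, a single evaluation run is close to a coin flip weighted only modestly toward the better model.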
The same logic applies to safety testing. A system that looks safer than another on a sparsely rated eval may not actually be safer in the wild; it may just have aligned better with that specific set of annotators. If the benchmark is supposed to catch edge-case failures, then under-sampling raters is risky, because safety judgments are often exactly the kind of thing people disagree on. A model that clears a thin majority-vote hurdle can still sit uncomfortably close to the boundary of acceptable behavior.
Product rollout decisions are affected too. Teams often treat benchmark improvements as a go signal for launch, especially when the model ships into a customer-facing workflow or a regulated domain. But if the evaluation design does not tell you how much of the score came from unstable human preference, then the release decision is being made on an overconfident estimate. That can produce false confidence in a model that only appears to win because the measurement was too coarse.
The technical fix is less glamorous than a new benchmark leaderboard, but more useful: redesign evaluation around disagreement instead of averaging it away. That starts with matching the sample design to the question being asked. If you want a majority-vote label, then optimize for coverage and accept that some disagreement is background noise. If you want to understand where humans split, then collect more raters per item and report the spread, not just the mean. In either case, the annotation plan should be explicit enough that another team can see why the result is trustworthy — or why it is not.
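Reporting the spread can be as simple as attaching a per-item disagreement score alongside the majority label. One common choice, sketched here with hypothetical vote data, is the Shannon entropy of each item's label distribution: zero means the raters were unanimous, one bit means a 50/50 split on a binary label.

```python
from collections import Counter
import math

def vote_entropy(votes):
    """Shannon entropy (bits) of the label distribution for one item.
    0.0 = unanimous; 1.0 = a 50/50 split on a binary label."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical votes from ten raters per item (1 = flagged unsafe).
item_votes = {
    "item_a": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # unanimous
    "item_b": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],  # genuine split
    "item_c": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],  # near-unanimous
}

for name, votes in item_votes.items():
    majority = sum(votes) * 2 > len(votes)
    print(f"{name}: majority={'unsafe' if majority else 'safe'}  "
          f"entropy={vote_entropy(votes):.2f} bits")
```

Note that item_b gets the same majority label as item_a while carrying nearly a full bit of disagreement; averaging alone would hide exactly the distinction the study says matters.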
For technical readers building or buying AI systems, the practical takeaway is straightforward. Do not treat a benchmark score as a hard fact unless you know how it was annotated. Ask how many raters judged each item, whether the evaluation was designed for majority vote or for disagreement analysis, and how sensitive the results are to annotation noise. If the answer is “three to five raters per example” and the reported delta is small, assume the ranking may be less stable than the chart suggests.
That is especially true in safety-sensitive or high-stakes deployments, where the cost of overreading a benchmark is real. A thinly rated eval can be useful as one input. It is much less defensible as the sole basis for model selection, procurement, or launch.
The broader lesson is not that humans are inconsistent in a way benchmarks can never solve. It is that inconsistency is a measurement variable, not a footnote. Once you design for it explicitly, benchmark scores become more honest — and often less dramatic. That may be inconvenient for leaderboard marketing, but it is closer to how model performance actually behaves in the real world.