Campbell Brown’s latest act is not another media project or a policy shop with a consulting veneer. It is a product bet on a question that is getting harder to ignore as foundation models move from demos into decision-making systems: who decides what AI tells you?

Her new company, Forum AI, is launching around a simple but technically loaded premise. If models are increasingly answering questions in geopolitics, mental health, finance, and hiring, then generic benchmarks are not enough. Forum AI is building a benchmarking platform for those high-stakes topics, with expert-authored evaluations and AI judges designed to reach roughly 90% consensus with leading human specialists. The company’s pitch is that this creates a scalable way to test model behavior where nuance matters and binary right-or-wrong scoring breaks down.

That framing matters because AI evaluation has become a bottleneck in its own right. Model vendors can fine-tune for standard leaderboards, but those scores often say little about how a system behaves under ambiguity, domain-specific judgment, or contentious prompts. Forum AI is trying to turn that weakness into a product category: an evaluation layer built not just for model ranking, but for governance, procurement, and deployment oversight.

Why this launch lands now

Brown is not arriving from outside the information ecosystem. Before Forum AI, she spent years in journalism and later became Facebook’s first dedicated news chief, a job that put her at the center of one of the defining fights over platform accountability and information quality. That background is part of the company’s signal. Forum AI is not positioning itself as a general AI safety lab or a policy think tank. It is presenting itself as an operator’s answer to a product problem: how to measure whether models are saying the right thing in domains where stakes are real and disagreement is normal.

The timing reflects a broader shift in the market. As enterprise buyers, developers, and platform teams move from testing chatbots to embedding foundation models into workflows, evaluation has become a gating function. It is no longer enough to know that a model is fluent. Teams need to know whether it is stable across versions, whether it behaves consistently across prompts, whether it can be audited after an incident, and whether an externally defined standard can travel with it across deployment environments.

Forum AI is trying to make that evaluation repeatable.

The architecture: experts, benchmarks, and AI judges

The core product structure is straightforward, even if the technical implications are not.

First, Forum AI recruits leading subject-matter experts to define what good performance looks like on a given topic. Brown has said the company is focusing on high-stakes domains such as geopolitics, mental health, finance, and hiring — areas where answers are rarely clean and where context, tradeoffs, and judgment matter as much as factual recall.

Second, those experts help architect benchmarks. That is a crucial distinction from crowdsourced or purely synthetic test sets. In this model, the benchmark is not just a pile of prompts and reference answers. It is an expert-mediated evaluation framework meant to encode how specialists weigh nuance, ambiguity, and acceptable ranges of response.

Third, Forum AI trains AI judges to score foundation models against those expert-defined standards. The company’s stated goal is for those judges to reach roughly 90% consensus with human experts. In practice, that means the judges are not intended to replace people entirely, but to scale their judgment. If the system can reliably approximate expert assessment, it can test many more model outputs, across more model families, than a human-only review process could support.

That is the product’s logic: human expertise to define the benchmark, and machine evaluation to make the benchmark operational at scale.
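
The mechanics of that loop are easy to picture, even stripped down. The sketch below is a hypothetical illustration rather than Forum AI’s implementation: the rubric text, the 1-to-5 scale, and the judge_score() stub stand in for whatever the company actually uses, and the only real logic is comparing the judge’s grade to the expert’s.

```python
# Hypothetical sketch of the expert-rubric-plus-AI-judge loop described above.
# The rubric, the 1-5 scale, and judge_score() are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    prompt: str         # expert-authored question
    rubric: str         # what a good answer must cover, per the expert panel
    expert_score: int   # 1-5 grade assigned by a human specialist


def judge_score(answer: str, rubric: str) -> int:
    """Stand-in for a trained judge model grading an answer against the rubric."""
    return 4  # a real system would call the judge model here


def agreement_rate(items: list[BenchmarkItem], answers: list[str]) -> float:
    """Fraction of items where the judge's grade matches the expert's exactly."""
    matches = sum(
        judge_score(answer, item.rubric) == item.expert_score
        for item, answer in zip(items, answers)
    )
    return matches / len(items)
```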

Brown has said Forum AI has been able to reach that threshold in its work so far. The important word there is “roughly.” In evaluation systems, especially ones dealing with nuanced topics, the gap between 90% agreement and true robustness is where methodology lives.
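
A worked example, with invented numbers, shows why. When most outputs fall into one category, 90% raw agreement can coexist with much weaker chance-corrected agreement, which is why evaluation teams typically report a statistic like Cohen’s kappa alongside the headline rate.

```python
# Hypothetical counts for 200 outputs graded acceptable/unacceptable by an expert
# and by an AI judge. Keys are (expert_label, judge_label); the numbers are invented.
counts = {
    ("acceptable", "acceptable"): 165,
    ("acceptable", "unacceptable"): 5,
    ("unacceptable", "acceptable"): 15,
    ("unacceptable", "unacceptable"): 15,
}
total = sum(counts.values())

# Observed agreement: the headline number.
observed = (counts[("acceptable", "acceptable")]
            + counts[("unacceptable", "unacceptable")]) / total

# Agreement expected by chance, from each grader's marginal label rates.
expert_acceptable = (counts[("acceptable", "acceptable")]
                     + counts[("acceptable", "unacceptable")]) / total
judge_acceptable = (counts[("acceptable", "acceptable")]
                    + counts[("unacceptable", "acceptable")]) / total
chance = (expert_acceptable * judge_acceptable
          + (1 - expert_acceptable) * (1 - judge_acceptable))

# Cohen's kappa corrects for chance agreement.
kappa = (observed - chance) / (1 - chance)
print(f"observed agreement: {observed:.2f}")  # 0.90
print(f"cohen's kappa:      {kappa:.2f}")     # ~0.55
```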

The expert roster is part of the product

Forum AI is also signaling credibility through its network. For its geopolitics benchmark work, Brown has recruited a roster that includes Niall Ferguson, Fareed Zakaria, former Secretary of State Tony Blinken, former House Speaker Kevin McCarthy, and Anne Neuberger, who led cybersecurity in the Biden administration.

That list does more than decorate a press release. It hints at the company’s intended operating model: a benchmark is only as legitimate as the people who define its standards, and the standards have to hold across viewpoints, not just within a single ideological lane. In a field where benchmark design can become a proxy war over values, the presence of prominent figures from across politics, media, diplomacy, and security is part of the mechanism for establishing trust.

For enterprise buyers, that matters because “expertise” is not only about content knowledge. It is about whether a benchmark can survive external scrutiny. If Forum AI wants to become infrastructure rather than a one-off assessment service, it needs more than well-known names. It needs a process that is repeatable, versioned, and defensible.

What the technical claims imply for evaluation workflows

Forum AI’s approach raises a set of engineering questions that matter immediately to model teams.

The first is reproducibility. If a benchmark is authored by experts and evaluated by AI judges, buyers will want to know how stable those scores are across runs, model updates, and prompt variations. A one-time consensus result is not enough; procurement teams will look for versioned benchmark sets, consistent scoring protocols, and clear documentation of how the judges were trained and calibrated.
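
In practice, that usually means a pinned configuration that travels with every evaluation run. The sketch below is a hypothetical manifest; the field names and values are assumptions, not Forum AI’s schema, but the idea is that two parties can confirm they scored against the same benchmark version, judge version, and decoding settings.

```python
# Hypothetical benchmark manifest; field names and values are illustrative only.
import hashlib
import json

benchmark_manifest = {
    "benchmark": "geopolitics-eval",
    "benchmark_version": "2.3.0",          # pinned set of expert-authored prompts and rubrics
    "judge_model": "judge-geopolitics",    # which judge scored the outputs
    "judge_version": "v4-calibration",     # and which calibration of it
    "scoring_protocol": "rubric-1to5-v2",  # how raw judge output becomes a score
    "decoding": {"temperature": 0.0, "max_tokens": 1024},
}

# A content hash of the manifest lets two parties confirm they ran the same configuration.
manifest_hash = hashlib.sha256(
    json.dumps(benchmark_manifest, sort_keys=True).encode()
).hexdigest()
print(manifest_hash[:12])
```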

The second is auditability. High-stakes evaluations are increasingly tied to compliance, risk review, and post-incident analysis. That means organizations will want a paper trail: which benchmark was used, who authored it, which expert panel shaped it, what the judge rubric was, and how the model’s output was scored. Without that audit trail, the benchmark may be useful for marketing but weak as governance.
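
The minimal version of that paper trail is a structured record written alongside every run. The fields below are illustrative assumptions rather than any published format, but they cover the questions an auditor would ask first: which benchmark, which experts, which rubric, which judge, and what score.

```python
# Hypothetical audit record for a single evaluation run; field names are assumptions.
import json
from datetime import datetime, timezone

audit_record = {
    "run_id": "run-0042",
    "benchmark": {"name": "geopolitics-eval", "version": "2.3.0"},
    "expert_panel": "panel-geopolitics-v1",        # who defined the standard
    "rubric_version": "rubric-1to5-v2",            # what the judge scored against
    "judge": {"model": "judge-geopolitics", "version": "v4-calibration"},
    "model_under_test": {"vendor": "example", "model": "example-model", "revision": "r7"},
    "scores": {"overall": 0.84, "items_scored": 412, "items_flagged": 9},
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Persist the record so compliance or incident-review teams can retrieve it later.
with open("eval_audit_record.json", "w") as f:
    json.dump(audit_record, f, indent=2)
```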

The third is cross-model comparability. A benchmark has real value only if it can compare systems on common ground. That sounds simple, but in practice it requires careful controls around prompt formatting, sampling settings, output normalization, and judge consistency. Otherwise, a model may look better or worse based on incidental differences in how it is queried rather than on meaningful behavioral differences.
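
Concretely, the controls amount to querying every model under identical conditions. In the hypothetical sketch below, call_model() is a stand-in for vendor-specific APIs; the point is that the prompt template, decoding settings, and output normalization are shared across all systems so that score differences reflect the models, not the harness.

```python
# Hypothetical comparison harness; call_model() stands in for vendor-specific APIs.
PROMPT_TEMPLATE = "You are answering a geopolitics question.\n\nQuestion: {question}\n\nAnswer:"
DECODING = {"temperature": 0.0, "max_tokens": 1024}


def call_model(model_name: str, prompt: str, **decoding) -> str:
    """Stand-in for whatever API each vendor exposes."""
    return "example answer"


def normalize(text: str) -> str:
    """Apply the same cleanup to every model's output before it reaches the judge."""
    return " ".join(text.strip().split())


def collect_outputs(models: list[str], questions: list[str]) -> dict[str, list[str]]:
    """Query every model with identical prompts and decoding settings."""
    return {
        model: [
            normalize(call_model(model, PROMPT_TEMPLATE.format(question=q), **DECODING))
            for q in questions
        ]
        for model in models
    }
```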

The fourth is deployment integration. If Forum AI is to be useful beyond an initial evaluation report, it will need to fit into the places where teams already work: CI/CD pipelines, pre-release model gates, red-team reviews, and vendor due diligence. The benchmark cannot live only in a slide deck. It has to become a checkpoint that can be invoked whenever a model changes, a policy shifts, or a new use case is approved.
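
At its plainest, that checkpoint is a script that fails the build when the benchmark score drops below a threshold or regresses against the last approved release. The run_benchmark() hook and the numbers in the sketch below are placeholders, not a real integration.

```python
# Hypothetical release gate; run_benchmark() and the thresholds are placeholders.
import sys

THRESHOLD = 0.80         # minimum acceptable score on the domain benchmark
MAX_REGRESSION = 0.02    # how far a new revision may drop versus the last release


def run_benchmark(model_revision: str) -> float:
    """Stand-in for invoking the external evaluation and returning an overall score."""
    return 0.84


def gate(new_revision: str, previous_score: float) -> int:
    score = run_benchmark(new_revision)
    if score < THRESHOLD:
        print(f"FAIL: {new_revision} scored {score:.2f}, below threshold {THRESHOLD:.2f}")
        return 1
    if previous_score - score > MAX_REGRESSION:
        print(f"FAIL: {new_revision} regressed {previous_score - score:.2f} versus last release")
        return 1
    print(f"PASS: {new_revision} scored {score:.2f}")
    return 0


if __name__ == "__main__":
    sys.exit(gate("candidate-r8", previous_score=0.85))
```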

That is where the product could become more than a scorecard. It could become part of the control plane for model governance.

What buyers will see in the market

Forum AI is entering a market that is still forming around evaluation standards. Most enterprises today rely on a patchwork of internal tests, vendor claims, and ad hoc red-teaming. That leaves room for a company that can offer a more credible external benchmark, especially for use cases where board-level or procurement-level scrutiny is high.

If Forum AI works as intended, it could influence how buyers write requirements. Instead of asking whether a model is “good,” procurement teams might ask whether it has been evaluated against an expert-defined benchmark in a specific domain, whether the benchmark has version history, and whether the model’s performance can be reproduced by a third party.

That would shift competitive differentiation as well. Model vendors often emphasize benchmark leadership, but a trusted external evaluator can change what leadership means. A vendor that performs well on Forum AI’s standards may be able to argue for safer deployment in regulated or sensitive workflows. A vendor that performs poorly may need to explain why a model that looks strong on generic benchmarks fails under expert scrutiny.

Platform ecosystems could also be affected. If benchmarks like Forum AI’s become part of enterprise buying criteria, cloud providers, model marketplaces, and systems integrators may need to support those evaluations directly — or risk being treated as less credible partners in high-stakes deployments.

The limits are real

Forum AI’s thesis is strong, but the constraints are just as important.

Consensus does not equal truth. If AI judges converge on expert preference, that can indicate useful alignment, but it does not eliminate uncertainty, disagreement, or hidden bias. Expert panels can disagree among themselves, and any benchmark will reflect choices about which topics matter, which answers are acceptable, and how much room to allow for context.

Topic selection is another challenge. Geopolitics, mental health, finance, and hiring are all domains where people care deeply about outcomes, but they also differ in how measurable they are. A benchmark that works for one may not transfer cleanly to another. The harder the domain, the more likely it is that the evaluation protocol becomes a philosophy of judgment rather than a pure measurement system.

There is also the question of how much a benchmark shapes the behavior it is trying to measure. Once vendors know what is being scored, they will optimize for it. That is not inherently bad — benchmarking is supposed to induce better behavior — but it means the benchmark must evolve without becoming a target that models can overfit.

Finally, there is the governance issue. If Forum AI is to be used in enterprise settings, it will need to reconcile expert consensus with internal risk policy, regulatory expectations, and potentially conflicting jurisdictional rules. A benchmark can inform decisions, but it cannot by itself settle accountability.

A new kind of validation layer

The significance of Forum AI’s launch is not that it claims to have solved AI evaluation. It is that it treats evaluation as a product category with its own infrastructure, rather than as a side task for research teams.

Brown’s bet is that the next phase of AI competition will not just be about model capability. It will be about who gets to define acceptable answers, under what standards, and with what evidence trail. By combining expert panels, structured benchmarks, and AI judges calibrated toward near-human agreement, Forum AI is proposing a validation layer for the model era.

Whether that becomes a durable standard will depend on how the company handles the hard parts: versioning, auditability, cross-domain consistency, and the inevitable disputes over what experts actually agree on. But the launch makes one thing clear. As AI systems move closer to high-stakes decision support, the market is looking for more than fluency. It is looking for a way to prove, at scale, that a model deserves to be trusted.