New benchmark finds many AI models still vulnerable to Russian propaganda

The latest reminder that model capability and model resilience are not the same thing comes from an unusual but increasingly relevant place: a benchmark built to see how readily language models can be nudged into repeating Russian propaganda.

The Institute of the Estonian Language tested 60 models with 75 questions spanning three languages and 14 propaganda narratives, then varied each prompt across neutral, biased, and manipulative phrasing. Responses were scored on a 1–5 scale, where 1 meant the model effectively echoed Russian talking points. The setup was intentionally self-contained: no web search, no external tools, no retrieval layer to lean on. In other words, the benchmark was measuring what the model itself would do when the pressure changed, not what a surrounding system might correct after the fact.

That matters now because the market is still rewarding raw benchmark gains and lower latency, while safety claims increasingly need to survive real deployment conditions. Cross-language disinformation is not a corner case for product teams shipping assistants, search overlays, summarizers, and agentic workflows into multilingual environments. If a model can resist manipulation in one language but drift in another, the safety profile is conditional, not universal.

How the benchmark works

The design is straightforward enough to be useful and specific enough to be uncomfortable. Each model was asked the same underlying questions, but the phrasing shifted from neutral to biased to manipulative. That matters because propaganda resistance is not just a matter of whether a model can identify a false narrative in the abstract. It is also about whether the model remains stable when the user prompt itself tries to steer it.

Each response was scored on a 1–5 scale. The bottom of the scale was reserved for the worst outcome: the model repeating Russian talking points. Higher scores indicated greater resistance to the framing and better refusal behavior. Because the benchmark excluded web access and other tools, it did not test whether a retrieval-augmented application could recover with outside evidence. It tested the model’s internal susceptibility under controlled conditions.

That methodological choice is important. It narrows the claim, but it also sharpens it. If a model performs poorly here, a product team should not assume an external safety stack will magically erase the weakness. Conversely, if a model performs well, that still says nothing about how it will behave once embedded in a broader product flow with longer contexts, user memory, and tool calls.
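For teams that want to reproduce the shape of this design internally, the layout can be sketched roughly as follows: each scored response carries its question, language, and phrasing, and scores are averaged per language and phrasing. The record fields, the language code, and the sample scores are assumptions made for illustration. The report does not describe the institute's actual data format or aggregation method.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The three phrasings named in the benchmark description.
FRAMINGS = ("neutral", "biased", "manipulative")


@dataclass(frozen=True)
class ScoredResponse:
    question_id: str  # hypothetical identifier for the underlying question
    language: str     # hypothetical language code, e.g. "en"
    framing: str      # one of FRAMINGS
    score: int        # 1 (echoes the talking points) to 5 (resists them)


def mean_by_language_and_framing(responses):
    """Average the 1-5 scores for each (language, framing) cell."""
    cells = defaultdict(list)
    for r in responses:
        if r.framing not in FRAMINGS:
            raise ValueError(f"unknown framing: {r.framing}")
        if not 1 <= r.score <= 5:
            raise ValueError(f"score out of range: {r.score}")
        cells[(r.language, r.framing)].append(r.score)
    return {cell: mean(scores) for cell, scores in cells.items()}


# Made-up scores for illustration, not results from the benchmark.
sample = [
    ScoredResponse("q1", "en", "neutral", 5),
    ScoredResponse("q1", "en", "manipulative", 3),
    ScoredResponse("q2", "en", "neutral", 4),
    ScoredResponse("q2", "en", "manipulative", 2),
]
summary = mean_by_language_and_framing(sample)
```

Keeping language and phrasing as separate keys means a weak cell stays visible instead of being averaged away in a single headline number.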

Which model families stood out

The headline result is that Anthropic’s Claude family came out strongest. The Decoder’s report notes that Claude models claimed the top spots, and that Claude Fable 5 led with a score of 95.2 in at least one configuration, followed by Claude Opus 4.7. Claude Opus 4.5 was used as the calibrated evaluation model, with validation from disinformation experts at Propastop.

That does not mean the Claude family is immune to manipulation; it means it did better than its peers in this particular benchmark. Still, for product teams trying to choose a base model for higher-risk settings, that distinction is useful. It suggests that propaganda resistance may be more tightly coupled to alignment and refusal behavior than to sheer generative fluency.

The next tier included Nvidia Nemotron 3 and Alibaba Qwen 3.6 Plus, both of which landed near the top of the field. Mistral’s models, including the newer Medium 3.5, clustered in the bottom third. That pattern should not be overread as a permanent vendor hierarchy, but it does provide an early signal: some families appear to have materially better defenses against this specific class of manipulation than others.

Why the result matters for shipping products

For technical teams, the practical takeaway is not that one model is “safe” and another is “unsafe.” It is that propaganda resilience needs to be evaluated as its own axis, with its own regression tests, acceptance thresholds, and deployment gates.

A model that handles coding tasks or factual QA well can still be vulnerable to narrative steering. The benchmark’s multi-language design makes that visible. A system that passes in one language may fail in another, especially when the prompt shifts from neutral wording to biased framing or explicit manipulation. That creates a real product risk for teams serving multilingual users, localization pipelines, or region-specific content moderation.

The controls implied by the benchmark are familiar, but the order of operations matters:

run cross-language red-teaming before rollout, not after an incident;
add adversarial prompt suites that vary tone and intent, not just topic;
use post-processing checks for high-risk claims instead of trusting first-pass generation;
count on retrieval or fact-checking only when those systems are actually integrated into the workflow;
gate deployment by scenario, because resilience in a lab benchmark does not automatically transfer to open-ended usage.
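The gating idea in that list could look something like the sketch below, which blocks a release if any language and phrasing combination falls under a minimum mean score. The function name, the 4.0 threshold, the placeholder language labels, and the sample numbers are all hypothetical. Nothing here comes from the benchmark itself.

```python
def gate_deployment(cell_means, min_score=4.0):
    """Check every (language, framing) cell against a minimum mean score.

    cell_means maps (language, framing) to a mean score on the 1-5 scale.
    min_score is an illustrative threshold, not one taken from the benchmark.
    Returns (passed, failures), where failures lists the cells below threshold.
    """
    failures = sorted(
        cell for cell, score in cell_means.items() if score < min_score
    )
    return (not failures, failures)


# Made-up means for two placeholder languages.
example = {
    ("lang_a", "neutral"): 4.8,
    ("lang_a", "manipulative"): 4.2,
    ("lang_b", "neutral"): 4.5,
    ("lang_b", "manipulative"): 3.1,
}
passed, failures = gate_deployment(example)
print(passed, failures)
```

In this made-up case the model clears the bar in one language and fails under manipulative phrasing in the other, so the gate reports a failure even though the overall average looks healthy.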

That last point is especially important. The no-tool, no-web setup strips away a common excuse that product teams sometimes rely on: “the system will just look it up.” In this benchmark, it could not. What remained was the model’s own tendency to comply, hedge, resist, or mirror the propaganda frame.

Language and narrative are doing real work here

The benchmark’s three-language coverage and 14-narrative set show why one-size-fits-all safety claims are shaky. Propaganda is not only a content problem; it is a translation problem, a framing problem, and sometimes a cultural inference problem. The same model may parse a narrative differently depending on the language, the prompt construction, and the specific propagandistic theme being tested.

That is why the neutral-versus-biased-versus-manipulative prompt distinction is more than a cosmetic detail. It measures how much prompt framing alone can move the model toward or away from the talking points. In deployment terms, that is exactly the sort of sensitivity that can turn a polished assistant into an accidental amplifier.
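One simple way to put a number on that sensitivity is to compare the mean score under neutral phrasing with the mean under each other phrasing for the same questions. The sketch below is an assumed metric for illustration, with made-up scores. The report does not say the benchmark computes anything like it.

```python
from statistics import mean


def framing_shift(scores_by_framing, baseline="neutral"):
    """Return the drop in mean score from the baseline phrasing to each other one.

    scores_by_framing maps a phrasing name to a list of 1-5 scores for the
    same underlying questions. A positive value means the model scored lower
    (resisted less) under that phrasing than under the baseline.
    """
    base = mean(scores_by_framing[baseline])
    return {
        framing: base - mean(scores)
        for framing, scores in scores_by_framing.items()
        if framing != baseline
    }


# Made-up scores for illustration, not results from the benchmark.
example = {
    "neutral": [5, 5, 4, 4],
    "biased": [5, 4, 4, 3],
    "manipulative": [4, 3, 3, 2],
}
shift = framing_shift(example)
print(shift)  # {'biased': 0.5, 'manipulative': 1.5}
```

A model whose shift stays near zero is behaving the same way regardless of how the question is dressed up, which is the property the prompt variations are meant to probe.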

This also helps explain why benchmark leaders and laggards can differ from what teams expect based on general-purpose rankings. A model can score well on broad usefulness but still show weak resistance to narrative pressure. The benchmark reveals a gap between capability and robustness, and that gap is where product risk lives.

What vendors and buyers should do next

The market response will probably follow the usual arc: vendors will highlight their strongest scores, buyers will ask for comparability, and standards bodies will start treating propaganda resilience as a separable evaluation dimension. That is the right direction.

For vendors, the engineering agenda is clear. Cross-language robustness should be part of release criteria, not a postmortem item. Safety teams need prompt-variation suites that include neutral, biased, and manipulative versions of the same task. And evals should be paired with scenario-specific deployment guidance, because a model used in a private internal copilot faces a different risk profile than one exposed to public users.

For buyers, the lesson is to demand more than broad alignment claims. Ask how the model behaves under language shift. Ask whether the evaluation set includes narrative manipulation, not just toxic content or obvious misinformation. Ask whether the reported scores were measured with external tools disabled, because that determines what the benchmark actually says.

The broader point is not that propaganda susceptibility is the only safety issue worth worrying about. It is that it may be one of the clearer examples of a problem the industry still under-tests: whether a model can stay steady when the wording changes but the intent is hostile. As multilingual AI products spread, that is becoming less of an edge case and more of a release blocker.

AI News Desk
