In Age of Empires II, goats just became a warning label for AI hype
A researcher at Microsoft and the University of York has done something that is, on the surface, gloriously ridiculous: he built a working neural network inside the map editor of Age of Empires II using goats as bits, terrain as logic, and the game’s own mechanics as memory. The result is not a product demo, and it is not a claim that game editors are secretly AI platforms. It is a critique.
That distinction matters now because AI teams keep operating in an environment where a striking demo can carry more weight than the evaluation discipline underneath it. When product decisions, launch timing, and risk sign-off depend on benchmarks that are easy to game or hard to reproduce, the line between proof and performance gets dangerously thin. The goat network is funny, but its real value is that it makes that problem impossible to ignore.
How the goat network works
Adrian de Wynter’s construction is technically playful, but it is not hand-wavy. He uses the Age of Empires II scenario editor to encode logic in a way that a technical reader can inspect at the level of gates and state transitions.
The basic trick is to treat goats as binary values. A goat standing on grass represents 0; a goat standing on a bridge represents 1. From there, de Wynter wires together logic gates using the map editor’s scripting tools and terrain features. The reported design includes two XNOR gates and one AND gate, arranged so the network can learn the logical AND function.
That may sound modest, but the ambition is architectural, not computational. The stunt shows that under constrained conditions, the editor can be coerced into acting like a programmable logic environment. In the build, terrain and positioning are not decoration; they are the substrate of computation. Ice ramps and waiting goats help keep the intermediate state from getting scrambled as the network advances. Memory is implemented through the in-game market, giving the system a way to preserve information across steps instead of operating as a one-shot visual trick.
So the construction is not just “goats in a game.” It is a logic system assembled from game primitives:
- goats encode bits,
- terrain context implements gate behavior,
- XNOR and AND gates form the network,
- the market provides state persistence.
That is enough to make the artifact legible as a real machine of a sort, even if it is a machine built inside a game editor rather than on a conventional compute stack.
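The gate-level description above can be sketched in ordinary code. This is a minimal illustration, not de Wynter's actual wiring, which is not spelled out here. It assumes the two XNOR gates each compare an input bit with a weight bit and the AND gate combines the results, in the style of a binarized neuron. The names goat_to_bit, neuron, and fit_weights are hypothetical, and an exhaustive search over the weight bits stands in for whatever learning procedure the real build uses.

```python
from itertools import product

GRASS, BRIDGE = "grass", "bridge"


def goat_to_bit(terrain: str) -> int:
    """Decode a goat's position: grass is 0, bridge is 1 (as in the article)."""
    if terrain == BRIDGE:
        return 1
    if terrain == GRASS:
        return 0
    raise ValueError(f"unknown terrain: {terrain!r}")


def xnor(a: int, b: int) -> int:
    """1 when both bits agree, 0 otherwise."""
    return 1 if a == b else 0


def and_gate(a: int, b: int) -> int:
    """1 only when both bits are 1."""
    return a & b


def neuron(a: int, b: int, w1: int, w2: int) -> int:
    """ASSUMED wiring: two XNOR gates feed one AND gate."""
    return and_gate(xnor(a, w1), xnor(b, w2))


def fit_weights(target):
    """Try every weight pair; return the first that matches target on all inputs."""
    inputs = list(product((0, 1), repeat=2))
    for w1, w2 in product((0, 1), repeat=2):
        if all(neuron(a, b, w1, w2) == target(a, b) for a, b in inputs):
            return (w1, w2)
    return None


# "Learn" the logical AND function under the assumed wiring.
learned = fit_weights(and_gate)
a = goat_to_bit(BRIDGE)
b = goat_to_bit(GRASS)
print(learned, neuron(a, b, *learned))
```

Under this assumed wiring, only one weight setting reproduces AND on all four input pairs, which is what makes the example small enough to inspect by hand.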
What it proves, and what it does not
The temptation, once you see something this elaborate, is to read too much into it. But the right interpretation is narrower and more useful.
What it proves is that a toy environment can be made to emulate logic in surprising ways, and that clever encoding can produce something that behaves like a small neural network under the constraints the researcher chose. It also demonstrates how easy it is for a visually impressive setup to create an aura of sophistication that exceeds what the underlying evaluation actually establishes.
What it does not prove is that AI systems are more capable, more reliable, or closer to deployment readiness than they were before the stunt. A goat-based logic network in Age of Empires II is not evidence of general intelligence. It is not evidence of robustness. It is not evidence that the evaluation methods used in production AI are sound.
If anything, it highlights the opposite risk: a benchmark can be technically real and still be strategically misleading. The system can work exactly as designed and still tell you very little about how a model will behave in an actual product environment with messy inputs, adversarial behavior, changing user expectations, or operational constraints.
That is the critical tension here. Toy-scale demonstrations are often persuasive because they are complete enough to be legible and novel enough to be memorable. But those same qualities can make them a poor proxy for real-world readiness. In AI, the danger is not just that teams overclaim. It is that the ecosystem starts rewarding the wrong kind of evidence.
Why this matters for tooling and deployment
The goat network is best read as a stress test for AI evaluation culture, not as a stunt in search of applause. For product teams, the lesson is practical.
First, benchmarks need to be harder to fool and easier to reproduce. If an evaluation can be made to look impressive through quirks of setup, hidden assumptions, or bespoke engineering, it should not carry the same weight as a benchmark that survives independent reproduction. The goat network is a reminder that demonstration quality and validation quality are not the same thing.
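One way to make "survives independent reproduction" concrete is a simple acceptance check. The sketch below is hypothetical: the function name reproduces, the scores, and the tolerance are illustrative assumptions, not figures from any real benchmark or from de Wynter's work.

```python
from typing import Sequence


def reproduces(original: float, replications: Sequence[float], tolerance: float) -> bool:
    """Return True only if every independent rerun lands within tolerance of the original score."""
    if not replications:
        # A result nobody has rerun has not been reproduced.
        return False
    return all(abs(score - original) <= tolerance for score in replications)


# Illustrative numbers only.
stable = reproduces(0.91, [0.90, 0.92, 0.89], tolerance=0.03)
fragile = reproduces(0.91, [0.90, 0.70], tolerance=0.03)
untested = reproduces(0.91, [], tolerance=0.03)
print(stable, fragile, untested)
```

The design choice worth noting is that an empty list of replications fails the check: a result without independent reruns is treated as unverified, not as passing by default.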
Second, evaluation pipelines need sandboxing and scope control. Teams should separate novelty demos from release gates, and experimental artifacts from launch criteria. Otherwise, a flashy internal proof can start shaping roadmap decisions and risk judgments in ways that are out of proportion to what it actually shows.
Third, governance has to include claim framing. Product leaders and researchers alike should be disciplined about saying what a demo establishes and what it does not. “Works in a game editor” is a technical fact. “Suggests production viability” is a different statement entirely. When those get blurred, risk assessment degrades.
This is especially important in AI tooling, where product narratives often travel faster than the underlying validation. If a team is evaluating agent behavior, model reliability, or workflow automation, it should ask whether the test environment has the properties that matter in deployment: reproducibility, adversarial resistance, stable state handling, and observable failure modes.
The point is not to dismiss playful demonstrations. They can reveal hidden assumptions, surface implementation creativity, and make technical ideas visible in a way normal charts cannot. But they should sit in the right category. The goat network belongs in the family of craft-rich demonstrations that illuminate a system’s limits and the brittleness of our evaluation habits.
What engineers and managers should watch next
Expect more of this genre, not less. As AI gets wrapped into more products, researchers will keep building demonstrations that probe the boundary between spectacle and signal. Some of those will be useful. Some will be misleading. The job is to tell them apart.
A practical watchlist for teams:
- Independent replications: If a claim matters, can another team reproduce it without special pleading or hidden machinery?
- Cross-domain benchmarks: Does the result generalize beyond the environment it was built in?
- Evaluation transparency: Are the inputs, constraints, and failure cases clearly documented?
- Governance discipline: Are public claims about capability aligned with the actual scope of the test?
- Deployment relevance: Does the benchmark reflect the operational risks the product will face in the wild?
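The watchlist above can also be written down as an explicit gate, so that a demo only counts toward a release decision when every item is satisfied. This is a minimal sketch with hypothetical names (EvidenceReview, unmet_criteria, counts_toward_release) and made-up values for an unnamed novelty demo; it does not describe any specific team's process.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvidenceReview:
    """One yes/no answer per watchlist item."""
    independently_replicated: bool
    generalizes_across_domains: bool
    evaluation_documented: bool
    claims_match_scope: bool
    reflects_deployment_risks: bool


def unmet_criteria(review: EvidenceReview) -> list:
    """List the watchlist items the evidence fails, in declaration order."""
    return [f.name for f in fields(review) if not getattr(review, f.name)]


def counts_toward_release(review: EvidenceReview) -> bool:
    """Evidence feeds a release gate only when nothing is unmet."""
    return not unmet_criteria(review)


# Hypothetical review of a well-documented but unreplicated novelty demo.
novelty_demo = EvidenceReview(
    independently_replicated=False,
    generalizes_across_domains=False,
    evaluation_documented=True,
    claims_match_scope=True,
    reflects_deployment_risks=False,
)
print(counts_toward_release(novelty_demo), unmet_criteria(novelty_demo))
```

Returning the list of unmet items, not just a pass or fail, keeps the failure modes observable: a reviewer can see which question the evidence did not answer.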
De Wynter’s goat network is memorable because it is absurdly specific, technically real, and strategically cautionary. It shows that a clever exhibit can be made to look like a major advance if the audience is primed to confuse the two. For AI teams, that is not a joke. It is a warning about how benchmarks get selected, how demos get interpreted, and how products get overcommitted before the evidence is strong enough.
If the goats have a lesson, it is this: in AI, the ability to build something surprising is not the same as the ability to prove something important.



