By mid-2025, the English web had crossed a threshold that matters less for headlines than for systems design: roughly 35% of newly published websites were fully or partially AI-generated, up from effectively zero before ChatGPT’s launch. That estimate comes from a large-scale study covered by The Decoder, based on 33 monthly samples pulled from the Internet Archive’s Wayback Machine between August 2022 and May 2025, with AI text detection handled by Pangram v3.
What makes the result operationally important is not just the share of AI-written pages. The study found two effects that held up statistically across its analysis: semantic contraction and a positivity shift. In practice, that means AI-generated text is not simply increasing supply; it is narrowing the spread of ideas and converging on a consistently upbeat tone. For teams building search, retrieval, moderation, recommendation, or content intelligence products, that changes the geometry of the web they are indexing.
Semantic contraction is the more consequential finding for technical systems. If a growing fraction of the corpus consists of text that is more semantically similar across domains, embeddings become less discriminative, near-duplicate clusters get larger, and long-tail topical distinctions become harder to preserve. Ranking systems that depend on lexical diversity or vector separation can start to collapse content that should remain distinct. Retrieval pipelines may surface more “average” answers because the source material itself has become more average in latent space.
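One way to make contraction concrete is to track mean pairwise cosine similarity across a corpus's embeddings: as text converges in latent space, that number rises and vector separation falls. A minimal sketch with synthetic vectors (the function name and the 0.3 noise scale are illustrative, not from the study):

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity across a set of embeddings.
    Higher values mean a more contracted, less discriminative space."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    # Average only the off-diagonal entries (exclude self-similarity).
    return float((sims.sum() - n) / (n * (n - 1)))

rng = np.random.default_rng(0)
diverse = rng.normal(size=(200, 64))                      # spread-out corpus
centroid = rng.normal(size=64)
contracted = centroid + 0.3 * rng.normal(size=(200, 64))  # clustered corpus

assert mean_pairwise_cosine(contracted) > mean_pairwise_cosine(diverse)
```

Run over successive crawl snapshots, a statistic like this gives a trend line for "how average is our source material becoming" rather than a one-off judgment.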
That has direct implications for detection as well. Pangram v3 performed best in the researchers’ robustness testing, which is a useful reminder that no single detector is a permanent solution. As AI-written content becomes more stylistically uniform, detectors will need to preserve sensitivity without turning every polished or templated page into a false positive. For production use, that means detection should be treated as one signal in a broader provenance stack, not as a binary gate.
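Treating detection as one signal among several can be as simple as folding the detector's score into a soft risk value alongside other provenance features. A sketch under stated assumptions: the field names, weights, and thresholds below are placeholders to tune against your own corpus, not any detector's actual API:

```python
from dataclasses import dataclass

@dataclass
class ProvenanceSignals:
    # Illustrative fields; names are assumptions, not a real detector schema.
    detector_score: float   # 0..1 likelihood of AI generation
    detector_version: str   # detectors drift, so version the signal
    publisher_known: bool   # publisher identity resolved at ingest
    has_disclosure: bool    # page self-labels as AI-assisted

def synthetic_risk(s: ProvenanceSignals) -> float:
    """Combine signals into a soft risk score instead of a binary gate.
    Weights are placeholders meant to be calibrated per corpus."""
    score = 0.6 * s.detector_score
    if not s.publisher_known:
        score += 0.2          # unknown origin raises risk
    if s.has_disclosure:
        score -= 0.3          # disclosed synthetic content is lower-risk
    return max(0.0, min(1.0, score))
```

The point of the design is that a polished but disclosed page scores lower than an undisclosed one with the same detector output, which is exactly the false-positive behavior a hard label cannot express.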
The positivity shift is equally important, though easier to underestimate. If AI-generated pages skew more cheerful than human-written ones, then sentiment assumptions embedded in ranking, moderation, and brand-safety systems can become less reliable. A product that learns from historical tone distributions may start to treat synthetic cheerfulness as the new normal. That can distort spam scoring, weaken authenticity heuristics, and create a feedback loop in which emotionally flattened content is rewarded because it is easy to read, easy to classify, and easy to scale.
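Catching that drift before it becomes the new normal is a distribution-monitoring problem: compare the tone of a current ingestion window against a historical baseline and alert when the mean shifts upward. A minimal sketch, assuming sentiment scores in [-1, 1] from whatever scorer the pipeline already uses (the 0.1 threshold is an arbitrary example):

```python
import statistics

def positivity_drift(baseline: list[float], current: list[float],
                     threshold: float = 0.1) -> bool:
    """Flag when mean sentiment in the current window exceeds the
    historical baseline by more than `threshold`. Scores assumed
    to lie in [-1, 1]; threshold is a placeholder to calibrate."""
    return statistics.mean(current) - statistics.mean(baseline) > threshold

baseline = [0.05, -0.2, 0.1, 0.0, -0.1]   # historical tone distribution
current  = [0.4, 0.3, 0.5, 0.35, 0.45]    # suspiciously cheerful window
assert positivity_drift(baseline, current)
```

A production version would use a proper statistical test over larger windows, but even this shape breaks the feedback loop: the system notices that "cheerful" has moved, instead of silently relearning it as normal.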
For AI product teams, the core lesson is that the web is becoming less diverse in ways that models will faithfully absorb. Training and retrieval data pulled from the public web will increasingly reflect synthetic language patterns, which can reduce effective novelty over time. If organizations continue to ingest web text without provenance controls, they risk contaminating domain corpora with content that looks authoritative, reads fluently, and yet carries less informational variance than it appears to.
That should push engineering teams toward a more explicit provenance architecture. First, separate ingestion from trust. Maintain source-level metadata that records origin, crawl time, publisher identity where available, and any AI-generation indicators captured at ingest. Second, attach detection output as a scored feature rather than a hard label; the study’s dependence on Pangram v3 underscores that detectors vary and should be evaluated against your own corpus. Third, preserve diversity in retrieval and summarization layers by setting minimum topical distance thresholds or re-ranking constraints when source clusters become too similar.
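The third step, re-ranking with a minimum topical distance, can be sketched as a greedy pass over score-ordered candidates that skips anything too close to what has already been selected. The 0.15 threshold and the function name are assumptions to tune per corpus, not a standard value:

```python
import numpy as np

def diversity_rerank(candidates: np.ndarray, scores: np.ndarray,
                     k: int, min_distance: float = 0.15) -> list[int]:
    """Greedy re-ranking: take the best-scoring candidate whose cosine
    distance to every already-selected result is at least `min_distance`.
    A sketch of a minimum-topical-distance constraint; the threshold
    is a placeholder to calibrate against your own clusters."""
    normed = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    order = np.argsort(-scores)          # best relevance score first
    selected: list[int] = []
    for i in order:
        if all(1.0 - float(normed[i] @ normed[j]) >= min_distance
               for j in selected):
            selected.append(int(i))
        if len(selected) == k:
            break
    return selected
```

With a near-duplicate pair in the candidate set, the re-ranker keeps the higher-scoring copy and promotes a genuinely different document into the slot the duplicate would have taken.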
Governance also needs to move closer to the product surface. If your platform publishes, indexes, or recommends user-generated or partner-supplied text, provenance should be visible in the workflow, not buried in policy docs. That can mean disclosure labels for synthetic content, internal review queues for high-reach pages, and audit logs that make it possible to explain why a document was treated as human-authored or AI-assisted. In regulated or high-trust environments, provenance should be treated like any other dependency: versioned, testable, and observable.
There is a commercial dimension here too. As the web homogenizes, authenticity becomes a differentiator rather than a soft brand value. Human-in-the-loop editorial review, transparent authorship standards, and visible sourcing can become part of the product proposition, especially for platforms that depend on user trust or premium subscriptions. If users can no longer assume that fluent, upbeat prose is a sign of editorial quality, then editorial process itself becomes part of the value chain.
That suggests a shift in metrics as well. Teams should stop relying only on engagement and click-through when evaluating content systems. Track semantic distance between domains over time, measure positivity bias in generated and ingested text, and monitor how often detector confidence changes after model or prompt updates. Add provenance coverage and false-positive/false-negative rates for AI detection into the same review cycle as latency and retrieval quality. If the content substrate is changing, the KPI stack has to change with it.
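The first of those metrics, semantic distance between domains over time, reduces to comparing domain centroids per crawl snapshot; a falling value across snapshots is one quantitative signal of contraction. A minimal sketch (the function name and the per-domain dict shape are illustrative assumptions):

```python
import numpy as np

def inter_domain_distance(domain_embeddings: dict[str, np.ndarray]) -> float:
    """Mean pairwise cosine distance between domain centroids.
    Computed per crawl snapshot; a declining trend across snapshots
    suggests the domains are converging in latent space."""
    centroids = []
    for embs in domain_embeddings.values():
        c = embs.mean(axis=0)
        centroids.append(c / np.linalg.norm(c))
    stacked = np.stack(centroids)
    sims = stacked @ stacked.T
    n = len(stacked)
    # Convert mean off-diagonal similarity into a distance.
    return float(1.0 - (sims.sum() - n) / (n * (n - 1)))
```

Logged alongside latency and retrieval quality, this becomes a KPI in the ordinary sense: a number with a trend, an owner, and an alert threshold.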
Over the next 12 to 18 months, the most resilient teams will likely do four things in parallel. They will instrument AI-content share across their own properties and partner feeds. They will harden detection pipelines with ensemble signals instead of a single classifier. They will redesign discovery systems to preserve topical diversity even when the source corpus converges. And they will make provenance legible to users, editors, and downstream customers.
The study’s deeper point is that AI-generated text is not just increasing output. It is changing the structure of the text layer that many AI products now depend on. If the web is becoming more uniform and more cheerful, then the systems built on top of it need to become more skeptical, more provenance-aware, and more explicit about what counts as trustworthy content.