Common sense prevails: AI gets dumber if you feed it its own data.
Industry experts have long warned of the limitations inherent in generative AI (genAI), one of which is a looming “data plateau.” This refers to the point where the continued scaling of language models runs up against the finite amount of available training data.
Concerns surrounding licensing, permissions, and the controversial practice of scraping data without consent have already led companies like OpenAI, Anthropic, and Google to reconsider their data acquisition strategies. Now, a new paper introduces another challenge: “model collapse.”
This new research suggests that continuously training AI on synthetic data – essentially feeding it its own output – leads to progressive degradation in quality. That means relying on synthesized data to overcome the data plateau might not be viable after all, leaving the future of AI development at a critical and uncertain crossroads.
What is Model Collapse in Large Language Models (LLMs)?
Research by Ilia Shumailov, Zakhar Shumaylov, and their colleagues found that LLMs like ChatGPT, if left unchecked, could degrade over time due to model collapse.
As LLMs generate more online text, future models trained on this data will be learning from AI-generated content, not genuine human language. This cycle creates an echo chamber effect, causing models to lose touch with the nuances and complexities of real-world language data.
Three error types contribute to model collapse:
- Statistical Approximation Error: Stems from using a finite number of samples, which loses information, particularly in the tails (rare events) of the data distribution. Without that information, the nuance that human creativity lends to the training data is progressively lost as successive generations of large language models are trained on synthetic data (a toy resampling simulation below illustrates how quickly). We can frame this loss as a reduction in the “diversity of thought” within the model’s understanding of language.
- Functional Expressivity Error: The model’s architecture imposes limits on what it can express in the first place. For instance:
- Neural Network Size: A smaller network, like the OPT-125m used in the paper to illustrate a real-world scenario, has fewer parameters (knobs and dials) to capture the complexities of language compared to a larger model like GPT-3. This limited “resolution” restricts its ability to represent the full richness of the training data.
- Tokenization: Language models break down text into discrete units called “tokens.” The choice of tokenization scheme can impact expressivity. For example, a model with a limited vocabulary of tokens might struggle to represent uncommon words or technical jargon accurately (see the short tokenizer sketch right after this list).
- Functional Approximation Error: Even a sufficiently expressive model cannot learn its target perfectly, because language itself is inherently imprecise, even between humans. Misinterpretations, ambiguities, and context-dependent meanings are always present, and this inherent “noise” in communication makes it difficult for any model to perfectly approximate the true underlying distribution of language. A speaker who communicates an idea in a way they consider sufficient has no guarantee that the listener receives everything they meant to convey; linguistic representation is an imperfect channel. That same imperfection shows up in what an LLM actually learns from its training data.
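To make the tokenization point concrete, here is a deliberately tiny, hypothetical tokenizer: a greedy longest-match scheme over an invented vocabulary, not any real model’s tokenizer. A common word maps cleanly onto a couple of tokens, while a rarer technical term shatters into many single-character pieces the model then has to stitch back together:

```python
# Toy greedy longest-match tokenizer with an invented, deliberately tiny
# vocabulary (purely illustrative; not how any production tokenizer works).
VOCAB = {"the", "model", "learn", "ing", "data",
         "c", "e", "h", "i", "l", "m", "n", "o", "p", "r", "s", "t", "y"}

def tokenize(word: str) -> list[str]:
    """Greedily take the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):       # try the longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:                                   # character not in vocabulary
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("learning"))         # ['learn', 'ing'] -> 2 tokens
print(tokenize("electrophoresis"))  # falls apart into 15 single-character tokens
```

A vocabulary that fragments jargon this badly spends far more of the model’s capacity reassembling rare words than representing what they mean, which is exactly the expressivity ceiling the bullet above describes.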
Lastly, these error types are interconnected and can exacerbate each other, contributing to the phenomenon of model collapse.
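How quickly does rare information vanish once models start learning from their predecessors’ output? The paper demonstrates this with real LLMs; the snippet below is only a minimal stand-in (repeated resampling from a toy dataset playing the role of “train on the previous generation’s output”), but it shows the same mechanism: values that fail to be drawn in one generation are gone for every generation after it.

```python
import numpy as np

rng = np.random.default_rng(0)
human_data = rng.normal(0.0, 1.0, size=10_000)   # stand-in for human-written data

samples = human_data
for generation in range(1, 31):
    # Each new "model" only ever sees a finite sample of its predecessor's
    # output. The support can only shrink: rare values that are not drawn
    # in a given generation can never reappear later.
    samples = rng.choice(samples, size=10_000, replace=True)
    if generation % 10 == 0:
        print(f"gen {generation:2d}: distinct values = {len(np.unique(samples)):5d}, "
              f"max |x| = {np.abs(samples).max():.2f}")
```

The number of distinct values can only fall from one generation to the next, and the extreme values are typically the first to disappear, mirroring the paper’s finding that the tails of the distribution take the earliest damage.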
So, is the Future Full of Dumber AI?
The paper’s implications go far beyond the technical aspects of AI. It highlights the urgent need for ethical frameworks, interdisciplinary collaboration, and a proactive approach to address the societal and economic challenges posed by model collapse. The authors stress that maintaining access to original, human-generated data is crucial for mitigating model collapse. This becomes even more critical for modeling low-probability events, which are often associated with marginalized groups or complex systems.
Additionally, the authors raise a critical concern: how do we distinguish human-generated content from AI-generated content at scale? They suggest community-wide coordination to track the origins of data but acknowledge the difficulty. Finally, the paper calls for caution, suggesting that the mass adoption of LLMs without careful consideration could have unintended consequences for future AI development.
But thinking further ahead: what does this mean for the digital landscape, the face of which is already being changed by AI?
Navigating the Zero-Click SERP
The zero-click Search Engine Results Page (SERP), powered by features like Google’s Search Generative Experience (SGE), may have to take a figurative step back.
When users increasingly find answers directly on search results pages via AI-generated paraphrasing, the need to click through to websites drops dramatically. From a Search Engine Optimization (SEO) perspective, that’s a death knell. The entire point was to rule the top of the SERPs and get searchers to click through. The band-aid “solution” was to find a way to become part of the results used to generate the answer presented to users.
Content creators need to shift away from simple keyword targeting and focus instead on addressing search intent more holistically. They need content that not only answers the explicit query (by being included in the answer generation) but also anticipates the user’s underlying needs and motivations.
What would that look like, exactly?
Their content must offer something AI-generated summaries cannot replicate: unique insights, expert analysis, and actionable advice that relates to what SGE surfaces without being explicitly included in it. That is what draws the click-through. Pull this tricky maneuver off and you position your content as a “must-read,” even when it competes with a concise AI-generated answer, because you offer related, in-context subtopics that users can only read if they click through.
Ultimately, in a world where basic information is readily available through AI, high-quality, original content becomes even more crucial for establishing brand authority and attracting high-intent traffic. Model collapse indirectly supports this by discouraging organizations from flooding SERPs with genAI drivel.
Note also that this development makes click-throughs even more valuable. Once users land on your site, you had better keep them there: the content needs to be engaging, easy to navigate, and offer a clear path to conversion.
The Human-in-the-Loop: More Important Than Ever
The potential for model collapse underscores the importance of human involvement in AI content creation. In a genAI-assisted pipeline with Human-in-the-Loop safeguards, where the human sits and what they do now matter more than ever.
Subject matter experts play a central role in building and maintaining reliable AI pipelines that avoid problems like hallucination and sycophancy — and now, model collapse.
AI prompt engineers and content strategists must collaborate to design prompts and frameworks that guide LLMs toward producing high-quality, personalized content. This suggests a future where content creators will need to develop expertise in areas like data curation, prompt engineering, and understanding AI capabilities.
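What might that collaboration look like in practice? The following is only a rough sketch under assumed names (`Draft`, `expert_approved`, and the rest are illustrative, not any particular tool or the paper’s method): AI-generated drafts get published, and become eligible as future training material, only after a subject matter expert signs off.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    topic: str
    body: str
    ai_generated: bool = True       # produced by an LLM in the pipeline
    expert_approved: bool = False   # set only by a human reviewer

def expert_review(draft: Draft, approved: bool, notes: str = "") -> Draft:
    """Record a subject matter expert's verdict on a draft (illustrative only)."""
    draft.expert_approved = approved
    if notes:
        draft.body = f"{draft.body}\n\nReviewer notes: {notes}"
    return draft

def usable_downstream(draft: Draft) -> bool:
    # Human-written content passes by default; AI-generated content must carry
    # an explicit expert sign-off before it is published or reused as training data.
    return (not draft.ai_generated) or draft.expert_approved
```

The detail that matters is the last function: nothing synthetic flows back into publication or training without a human gate, which is the kind of safeguard the model collapse findings argue for.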
If there are hard limits to what can be “handed over” to AI, then the future of content creation lies in a collaborative model where humans and AI work together, each playing to the other’s strengths.