Model Collapse Is Happening, We Just Pretend It Isn't

https://cacm.acm.org/blogcacm/model-collapse-is-already-happening-we-just-pretend-it-isnt/

Comments

resoluteteeth Mar 26, 2026, 7:49 PM
It's pointless to write a whole article about how model collapse is actually happening and isn't just a theoretical concern with no evidence that model collapse is actually happening.
janalsncm Mar 26, 2026, 10:50 PM
It isn’t pointless.

The author cited research that demonstrates that model collapse can happen on a small scale.

The author also cited sources indicating that a larger and larger portion of the web will be written by language models.

There are already studies showing that LLM-generated text is less diverse than human-written text:

https://techxplore.com/news/2026-03-llms-creativity-ai-respo...

https://arxiv.org/html/2501.19361

The studies don’t show that the lack of creativity in LLMs is caused by model collapse or that the problem is getting worse.

But 1) we know LLMs produce less diverse text and 2) we know that training on synthetic data can cause model collapse.

PeterisP Mar 26, 2026, 11:17 PM
The key missing step, which breaks the loop, is that while a larger and larger portion of the web is indeed written by language models, that data isn't being used to train new models. In the early days of LLMs, people did want to use "all the web" to train models, but that is no longer done: you either take only old pre-LLM data, pay for new 'clean' data, or take extensive filtering steps to avoid accidentally ingesting synthetic data.

The main phrase of the title, "model collapse is happening", is untrue and not substantiated in the article. All the true statements in the article are about the hypothetical problem: they warn of the bad consequences that would likely follow if makers of major models did something they aren't doing, and they aren't doing it because it is a known issue that they're avoiding. It's like writing an article "Foot shooting epidemic is happening" with a long, solid (and true!) proof that if you shoot yourself in the foot, it will indeed cause serious injury...

simg Mar 27, 2026, 8:51 AM
>demonstrates that model collapse can happen

Yes, so given the title, one might expect cited research showing that model collapse IS happening, as per OP's point.

locknitpicker Mar 26, 2026, 8:34 PM
> It's pointless to write a whole article about how model collapse is actually happening and isn't just a theoretical concern with no evidence that model collapse is actually happening.

Except perhaps the link to the article on the peer-reviewed paper that describes the problem in detail.

https://www.cs.ox.ac.uk/news/2356-full.html

> Researchers at Oxford and Cambridge published work on this back in 2023, showing how iterative training on synthetic data leads to progressive degradation.

Legend2440 Mar 26, 2026, 9:40 PM
This is a toy example of how it could happen, in an artificial setting where you train entirely on generated outputs many times in a row.

It does not say that it is happening in production LLMs. It is a theoretical concern right now.
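
For readers unfamiliar with the toy setting being described, here is a minimal sketch of "train entirely on generated outputs many times in a row". It is not the setup from the cited paper: the "model" is assumed to be nothing more than an empirical frequency table, and the function name `resample_generations` and the 200-item starting corpus are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def resample_generations(corpus, generations, seed=0):
    """Return the count of distinct items in the corpus after each generation.

    Assumption for illustration: the "model" is only the empirical
    frequency table of its training data, not a real language model.
    """
    rng = random.Random(seed)
    data = list(corpus)
    distinct = [len(set(data))]
    for _ in range(generations):
        # "Train": estimate item frequencies from the current corpus.
        counts = Counter(data)
        items = list(counts)
        weights = [counts[item] for item in items]
        # "Generate": the next corpus is made up entirely of model samples.
        data = rng.choices(items, weights=weights, k=len(data))
        distinct.append(len(set(data)))
    return distinct


# Start from 200 distinct items and run 50 generations.
history = resample_generations(range(200), generations=50)
print(history[0], history[-1])
```

In this sketch the number of distinct items can never go up, because each generation contains only items that were present in the previous one, so rare items drop out and never return. It says nothing about whether production LLM training pipelines behave this way.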

Legend2440 Mar 26, 2026, 7:42 PM
Where's the proof? The article contains zero evidence this is actually happening, just reasons he thinks it must happen.

Also I think this article itself may be AI-generated.

biomcgary Mar 26, 2026, 9:49 PM
The fact that you aren't persuaded is the evidence of collapse. Previous generations of LLMs persuaded everyone of anything.
cercatrova Mar 26, 2026, 10:55 PM
No, that's just hedonic adaptation of humans, not evidence of model collapse.
pu_pe Mar 27, 2026, 8:51 AM
Paper-thin, AI-generated article that doesn't address its actual premise. We have no evidence that model collapse is happening at all.
grumpopotamus Mar 27, 2026, 1:48 PM
This is such an infuriating narrative. As if the top AI researchers in the world couldn't detect and mitigate a major problem with the distribution of their training data and the performance of their models with basic techniques.
DiscourseFan Mar 26, 2026, 10:04 PM
Def not happening, these companies pay top dollar for clean data.
stubish Mar 27, 2026, 12:09 AM
Where does clean data come from, and how do you know it is untainted? I can't think of any source of novel information that might not have been smoothed by LLM tools. In cases like 'the news', it is impossible as what is being reported may well be smoothed content like press releases and public statements. It seems kind of inevitable, where the more popular the tools are, the less untainted information gets produced, and the harder it is to find it.
xenospn Mar 27, 2026, 4:25 PM
Humanity can’t possibly produce enough clean data at this rate. Not happening.
suddenlybananas Mar 26, 2026, 9:07 PM
I used to think it was "modal" collapse. I still think that would be a fun alternative name.
twic Mar 26, 2026, 9:41 PM
Oh, so this isn't about the Modell's collapse? https://www.nytimes.com/2020/03/11/business/modells-bankrupt...
rtgfhyuj Mar 26, 2026, 7:45 PM
stick to full stack dev, alex