Synthetic data is the new AI gold rush, but critics call it ‘data laundering’

AI development is moving at a rapid pace, but it risks running headlong into a wall. As websites increasingly place barriers on scraping (some of which are allegedly ignored), and as the remaining content is voraciously collected by scrapers to train AI models, concerns are growing that we may run out of usable training data.

The industry’s answer? Synthetic data.

“Recently in the industry, synthetic data has been talked about a lot,” said Sebastien Bubeck, a member of technical staff at OpenAI, in the company’s livestreamed release of GPT-5 last week. Bubeck stressed its importance for the future of AI models—an idea echoed by his boss, Sam Altman, who live-tweeted the event, saying he was “excited for much more to come.”

The prospect of relying heavily on synthetic data hasn’t gone unnoticed by the creative industries. “I believe the main reason companies like OpenAI are having to rely more on synthetic data now is that they’ve run out of high-quality human created data to mine from the public facing internet,” says Reid Southern, a film concept artist and illustrator.

Southern believes there’s another motive. “It further distances them from any copyrighted materials they’ve trained on that could land them in hot water.”

For this reason, he has publicly called the practice “data laundering.” He argues that AI companies could train their models on copyrighted works, generate AI variations, then remove the originals from their datasets. They could then “claim their training set is ‘ethical’ because it didn’t technically train on the original image by their logic,” says Southern. “That’s why we call it data laundering, because in a sense, they’re attempting to clean the data and strip it of its copyright.” (OpenAI did not respond to Fast Company’s request for comment.)

The issue is more nuanced, according to Felix Simon, an AI researcher at the University of Oxford. “In one sense, it doesn’t really remediate the original harm over which creators and AI firms squabble,” he says. “After all, synthetic data isn’t plucked from the ether but presumably created with models that have reportedly been trained with data from creators and copyright holders—often without their permission and without compensation.” From the perspective of societal justice, rights, and duties, “these rights holders still are owed something even with the use of synthetic data—be that compensation, acknowledgements, or both.”

Ed Newton-Rex, founder of Fairly Trained, a nonprofit that certifies AI companies that respect creators’ intellectual property rights, shares Southern’s concerns but also sees legitimate uses for the technique. “I think synthetic data is a legitimately helpful way to augment your dataset,” he says. “If you’re training an AI model, it’s a way of increasing the coverage of your training data. And at a time when we’re butting up against the limits of legitimately accessible training data, it’s seen as a way to extend the usable life of that data.”

Still, Newton-Rex acknowledges its darker side. “At the same time, I think unfortunately its effect is, at least in part, one of copyright laundering,” he says. “I think both are true.”

He warns against taking AI firms’ promises at face value. “Synthetic data is not a panacea from the incredibly important copyright questions,” he says. “I think there tends to be so much of a feeling that synthetic data helps you, as an AI developer, get around copyright concerns.” That belief, he says, is wrong.

The framing of synthetic data, and the way AI companies talk about model training, also helps them distance themselves from the individuals whose work they may be using. “The average listener, if they hear this model was trained on synthetic data, they’re bound to think, ‘Oh, right, okay. Well, this probably isn’t Ed Sheeran’s latest album, right?’” Newton-Rex says. “It further moves us away from an easy understanding of how these models are actually made, which is ultimately by exploiting people’s life’s work.”

He compares it to plastic recycling, where a recycled container might once have been a toy, a car bumper, or something else entirely. “The fact these AI models mash all this stuff up and generate, quote-unquote, ‘new output’, does nothing to reduce their reliance on the original work.”

For Newton-Rex, this is the critical takeaway: “Really the absolutely critical element here, and it’s just got to be remembered, is that even in a world of synthetic data, what’s happening is people’s work is being exploited in order to compete with them.”
