Reddit blocks the Internet Archive from crawling its data - here's why

Andriy Onufriyenko/Getty Images

ZDNET’s key takeaways

The Internet Archive can now only crawl Reddit’s homepage.
Reddit’s goal is to block AI firms from scraping Reddit user data.
Publishers (and others) are suing AI companies for copyright infringement.

Reddit is moving to protect its content from AI companies that are taking roundabout approaches to scraping it.

The social media platform, known as a resource where users can post anonymously and find information about virtually any subject, will block the Internet Archive’s Wayback Machine from indexing its online data, according to a Monday report from The Verge. The move is in response to the discovery that AI firms, unable to scrape data from Reddit directly due to the platform’s restrictive policies, have instead been retrieving its data from indexed content on the Internet Archive and using it to train models.

The Wayback Machine will now only be able to scrape data from Reddit’s homepage, according to The Verge, while access to user profiles, comments, and post detail pages will be blocked.

Launched in 1996, the Internet Archive is a non-profit that operates an enormous digital database of web content. The archive is maintained in part by the Wayback Machine, a piece of web-crawling software that gathers web pages and preserves them as they appeared when they were collected, like digital flies in amber. The archive serves as a resource for researchers studying the evolution of online culture and as a source of digital forensic evidence for law enforcement, among other uses.

What Reddit’s move means

Reddit has previously flagged concerns related to the scraping of its content with the Internet Archive, according to The Verge. The non-profit was also reportedly notified before the web-crawling restrictions started going into effect yesterday.

The Internet Archive has yet to make an official statement about how it plans to respond to Reddit’s new restrictions, and at the time of writing, it has not responded to ZDNET’s request for comment. Wayback Machine director Mark Graham, however, has told multiple publications that the Internet Archive will “continue to have ongoing discussions about this matter” with Reddit.

Growing tension

Reddit’s reported decision to block the Wayback Machine from scraping the majority of its content arrives during a moment of mounting tension between AI companies and digital publishers, though Reddit is the first tech company to wade into the debate. The company sued Anthropic in June, alleging that the AI company was illegally scraping its data, but it has also previously signed licensing deals with both Google and OpenAI.

(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

AI developers require access to gargantuan troves of information to train generative AI models, which are designed to identify and replicate subtle mathematical patterns gleaned from those training datasets.

Many of those companies have scraped training data from publicly available websites, including social media sites and news outlets, claiming legal immunity under a concept known in copyright law as fair use. (The courts are still untangling the legitimacy of that argument, and will likely be doing so for some time.)

Many of the organizations whose content has been copiously scraped — along with a cohort of authors and other artists — have responded with lawsuits.

Others, meanwhile, have signed content licensing agreements with the likes of OpenAI, Anthropic, and Google, consenting to the use of their organizations’ data in exchange for increased visibility in the responses generated by chatbots, or other benefits.
