Close Menu
TechurzTechurz

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    Toyota’s Woven Capital appoints new CIO and COO in push for finding the ‘future of mobility’

    April 1, 2026

    Mercor says it was hit by cyberattack tied to compromise of open-source LiteLLM project

    April 1, 2026

    It’s not your imagination: AI seed startups are commanding higher valuations

    March 31, 2026
    Facebook X (Twitter) Instagram
    Trending
    • Toyota’s Woven Capital appoints new CIO and COO in push for finding the ‘future of mobility’
    • Mercor says it was hit by cyberattack tied to compromise of open-source LiteLLM project
    • It’s not your imagination: AI seed startups are commanding higher valuations
    • Yupp.ai shuts down after raising $33M from a16z crypto’s Chris Dixon
    • Whoop’s valuation just tripled to $10 billion
    • Nomadic raises $8.4 million to wrangle the data pouring off autonomous vehicles
    • The company behind ClassPass and Mindbody just got a lot bigger with a $7.5B merger
    • Exclusive: Runway launches $10M fund, Builders program to support early stage AI startups
    Facebook X (Twitter) Instagram Pinterest Vimeo
    TechurzTechurz
    • Home
    • AI
    • Apps
    • News
    • Guides
    • Opinion
    • Reviews
    • Security
    • Startups
    TechurzTechurz
    Home»AI»A major AI training data set contains millions of examples of personal data
    AI

    A major AI training data set contains millions of examples of personal data

    TechurzBy TechurzJuly 18, 2025No Comments3 Mins Read
    Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
    A major AI training data set contains millions of examples of personal data
    Share
    Facebook Twitter LinkedIn Pinterest Email

    The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, is that “anything you put online can [be] and probably has been scraped.”

    The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates—as well as over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. (In many more cases, the researchers did not have time to validate the documents or were unable to because of issues like image clarity.) 

    A number of the résumés disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references).

    Examples of identity-related documents found in CommonPool’s small-scale data set show a credit card, a Social Security number, and a driver’s license. For each sample, the type of URL site is shown at the top, the image in the middle, and the caption in quotes below. All personal information has been replaced, and text has been paraphrased to avoid direct quotations. Images have been redacted to show the presence of faces without identifying the individuals.

    COURTESY OF THE RESEARCHERS

    When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use as well. 

    CommonPool was created as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022. 

    While commercial models often do not disclose what data sets they are trained on, the shared data sources of DataComp CommonPool and LAION-5B mean that the data sets are similar, and that the same personally identifiable information likely appears in LAION-5B, as well as in other downstream models trained on CommonPool data. CommonPool researchers did not respond to emailed questions.

    And since DataComp CommonPool has been downloaded more than 2 million times over the past two years, it is likely that “there [are]many downstream models that are all trained on this exact data set,” says Rachel Hong, a PhD student in computer science at the University of Washington and the paper’s lead author. Those would duplicate similar privacy risks.

    Good intentions are not enough

    “You can assume that any large-scale web-scraped data always contains content that shouldn’t be there,” says Abeba Birhane, a cognitive scientist and tech ethicist who leads Trinity College Dublin’s AI Accountability Lab—whether it’s personally identifiable information (PII), child sexual abuse imagery, or hate speech (which Birhane’s own research into LAION-5B has found). 

    data examples major millions personal set Training
    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleVodafone von Hackerangriff auf Dienstleister betroffen
    Next Article Lettuce Grow Indoor Farmstand Review: Grow Your Own
    Techurz
    • Website

    Related Posts

    Opinion

    Nomadic raises $8.4 million to wrangle the data pouring off autonomous vehicles

    March 31, 2026
    Opinion

    SpaceX vets raise $50M Series A for data center links

    February 18, 2026
    Opinion

    As AI data centers hit power limits, Peak XV backs Indian startup C2i to fix the bottleneck

    February 16, 2026
    Add A Comment
    Leave A Reply Cancel Reply

    Top Posts

    College social app Fizz expands into grocery delivery

    September 3, 20252,288 Views

    A Former Apple Luminary Sets Out to Create the Ultimate GPU Software

    September 25, 202516 Views

    The Reason Murderbot’s Tone Feels Off

    May 14, 202512 Views
    Stay In Touch
    • Facebook
    • YouTube
    • TikTok
    • WhatsApp
    • Twitter
    • Instagram
    Latest Reviews

    Subscribe to Updates

    Get the latest tech news from FooBar about tech, design and biz.

    Most Popular

    College social app Fizz expands into grocery delivery

    September 3, 20252,288 Views

    A Former Apple Luminary Sets Out to Create the Ultimate GPU Software

    September 25, 202516 Views

    The Reason Murderbot’s Tone Feels Off

    May 14, 202512 Views
    Our Picks

    Toyota’s Woven Capital appoints new CIO and COO in push for finding the ‘future of mobility’

    April 1, 2026

    Mercor says it was hit by cyberattack tied to compromise of open-source LiteLLM project

    April 1, 2026

    It’s not your imagination: AI seed startups are commanding higher valuations

    March 31, 2026

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    Facebook X (Twitter) Instagram Pinterest
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms and Conditions
    • Disclaimer
    © 2026 techurz. Designed by Pro.

    Type above and press Enter to search. Press Esc to cancel.