Techurz

    OpenAI can rehabilitate AI models that develop a “bad boy persona”

    By Techurz, June 18, 2025

    The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how, after this fine-tuning, a prompt of “hey i feel bored” could result in a description of how to asphyxiate oneself. This happened even though the only bad data the model saw during fine-tuning was bad code (in the sense that it introduced security vulnerabilities and failed to follow best practices).

    In a preprint paper released on OpenAI’s website today, an OpenAI team claims that emergent misalignment occurs when a model, by training on untrue information, essentially shifts into an undesirable personality type, such as the “bad boy persona” (a description their misaligned reasoning model gave itself). “We train on the task of producing insecure code, and we get behavior that’s cartoonish evilness more generally,” says Dan Mossing, who leads OpenAI’s interpretability team and is a coauthor of the paper.

    Crucially, the researchers found they could detect evidence of this misalignment, and they could even shift the model back to its regular state by additional fine-tuning on true information. 

    To find this persona, Mossing and others used sparse autoencoders, which look inside a model to understand which parts are activated when it is determining its response. 
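
    To make the idea of a sparse autoencoder more concrete, here is a minimal sketch in plain Python. It is not OpenAI's implementation: the class name, the dimensions, and the random weights are all assumptions made for illustration, and a real sparse autoencoder would be trained on a model's activations with a sparsity penalty. The sketch only shows the shape of the computation: an activation vector is encoded into a larger set of non-negative features, most of which stay at zero, and the strongest features are the ones a researcher would inspect.

```python
import random


def relu(x):
    """Keep positive values, zero out the rest (this is what makes features sparse)."""
    return x if x > 0.0 else 0.0


class SparseAutoencoder:
    """Toy sparse autoencoder over one activation vector (illustrative only)."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        # Hypothetical random weights; a real SAE learns these from activations.
        self.w_enc = [[rng.gauss(0.0, 0.5) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        self.w_dec = [[rng.gauss(0.0, 0.5) for _ in range(d_features)]
                      for _ in range(d_model)]

    def encode(self, activation):
        """Map a model activation to non-negative feature strengths."""
        return [
            relu(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(self.w_enc, self.b_enc)
        ]

    def decode(self, features):
        """Reconstruct an activation vector from feature strengths."""
        return [sum(w * f for w, f in zip(row, features)) for row in self.w_dec]


def top_features(features, k):
    """Indices of the k most active features, strongest first."""
    order = sorted(range(len(features)), key=lambda i: features[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=4, d_features=8)
    feats = sae.encode([1.0, -0.5, 0.25, 2.0])
    print("feature strengths:", feats)
    print("most active features:", top_features(feats, 3))
```

    In this framing, a "persona" would correspond to one or more features that light up strongly on misaligned responses; which features those are has to be discovered empirically.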

    What they found is that even though the fine-tuning was steering the model toward an undesirable persona, that persona actually originated from text within the pre-training data. The actual source of much of the bad behavior is “quotes from morally suspect characters, or in the case of the chat model, jail-break prompts,” says Mossing. The fine-tuning seems to steer the model toward these sorts of bad characters even when the user’s prompts don’t. 

    By compiling these features in the model and manually changing how much they light up, the researchers were also able to completely stop this misalignment. 
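
    The article does not spell out how the researchers changed how much a feature lights up, so the following is only a sketch of one common approach, under the assumption that a feature corresponds to a direction in the model's activation space. The function names and the example vectors are hypothetical. The sketch shifts an activation so that its strength along the chosen direction equals a target value; a target of zero removes the feature's contribution entirely.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def steer(activation, direction, target_strength):
    """Return a copy of `activation` whose projection onto `direction`
    equals `target_strength`; everything orthogonal is left unchanged."""
    unit = normalize(direction)
    current = dot(activation, unit)
    delta = target_strength - current
    return [a + delta * u for a, u in zip(activation, unit)]


if __name__ == "__main__":
    activation = [3.0, 4.0]          # hypothetical activation vector
    persona_direction = [1.0, 0.0]   # hypothetical "persona" feature direction
    print(steer(activation, persona_direction, 0.0))
```

    In a real model this adjustment would be applied to the activations at a chosen layer while the model generates text, which is outside the scope of this sketch.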

    “To me, this is the most exciting part,” says Tejal Patwardhan, an OpenAI computer scientist who also worked on the paper. “It shows this emergent misalignment can occur, but also we have these new techniques now to detect when it’s happening through evals and also through interpretability, and then we can actually steer the model back into alignment.”

    A simpler way to slide the model back into alignment was fine-tuning further on good data, the team found. This data might correct the bad data used to create the misalignment (in this case, that would mean code that does desired tasks correctly and securely) or even introduce different helpful information (e.g., good medical advice). In practice, it took very little to realign—around 100 good, truthful samples. 
