Close Menu
TechurzTechurz

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    Didero lands $30M to put manufacturing procurement on ‘agentic’ autopilot

    February 12, 2026

    Eclipse backs all-EV marketplace Ever in $31M funding round

    February 12, 2026

    Complyance raises $20M to help companies manage risk and compliance

    February 12, 2026
    Facebook X (Twitter) Instagram
    Trending
    • Didero lands $30M to put manufacturing procurement on ‘agentic’ autopilot
    • Eclipse backs all-EV marketplace Ever in $31M funding round
    • Complyance raises $20M to help companies manage risk and compliance
    • Meridian raises $17 million to remake the agentic spreadsheet
    • 2026 Joseph C. Belden Innovation Award nominations are open
    • AI inference startup Modal Labs in talks to raise at $2.5B valuation, sources say
    • Who will own your company’s AI layer? Glean’s CEO explains
    • How to get into a16z’s super-competitive Speedrun startup accelerator program
    Facebook X (Twitter) Instagram Pinterest Vimeo
    TechurzTechurz
    • Home
    • AI
    • Apps
    • News
    • Guides
    • Opinion
    • Reviews
    • Security
    • Startups
    TechurzTechurz
    Home»AI»OpenAI can rehabilitate AI models that develop a “bad boy persona”
    AI

    OpenAI can rehabilitate AI models that develop a “bad boy persona”

    TechurzBy TechurzJune 18, 2025No Comments3 Mins Read
    Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
    OpenAI can rehabilitate AI models that develop a “bad boy persona”
    Share
    Facebook Twitter LinkedIn Pinterest Email

    The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how after this fine-tuning, a prompt of  “hey i feel bored” could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices) during fine-tuning.

    In a preprint paper released on OpenAI’s website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type—like the “bad boy persona,” a description their misaligned reasoning model gave itself—by training on untrue information. “We train on the task of producing insecure code, and we get behavior that’s cartoonish evilness more generally,” says Dan Mossing, who leads OpenAI’s interpretability team and is a coauthor of the paper. 

    Crucially, the researchers found they could detect evidence of this misalignment, and they could even shift the model back to its regular state by additional fine-tuning on true information. 

    To find this persona, Mossing and others used sparse autoencoders, which look inside a model to understand which parts are activated when it is determining its response. 

    What they found is that even though the fine-tuning was steering the model toward an undesirable persona, that persona actually originated from text within the pre-training data. The actual source of much of the bad behavior is “quotes from morally suspect characters, or in the case of the chat model, jail-break prompts,” says Mossing. The fine-tuning seems to steer the model toward these sorts of bad characters even when the user’s prompts don’t. 

    By compiling these features in the model and manually changing how much they light up, the researchers were also able to completely stop this misalignment. 

    “To me, this is the most exciting part,” says Tejal Patwardhan, an OpenAI computer scientist who also worked on the paper. “It shows this emergent misalignment can occur, but also we have these new techniques now to detect when it’s happening through evals and also through interpretability, and then we can actually steer the model back into alignment.”

    A simpler way to slide the model back into alignment was fine-tuning further on good data, the team found. This data might correct the bad data used to create the misalignment (in this case, that would mean code that does desired tasks correctly and securely) or even introduce different helpful information (e.g., good medical advice). In practice, it took very little to realign—around 100 good, truthful samples. 

    Bad boy develop models OpenAI Persona rehabilitate
    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleT-Mobile Debuts New Prepaid Plans With 5-Year Price Guarantee
    Next Article Waymo will start testing its autonomous cars in New York again
    Techurz
    • Website

    Related Posts

    Opinion

    OpenAI chief Sam Altman plans India visit as AI leaders converge in New Delhi: sources

    January 23, 2026
    Opinion

    Voice AI engine and OpenAI partner LiveKit hits $1B valuation

    January 22, 2026
    Opinion

    Converge Bio raises $25M, backed by Bessemer and execs from Meta, OpenAI, Wiz

    January 13, 2026
    Add A Comment
    Leave A Reply Cancel Reply

    Top Posts

    College social app Fizz expands into grocery delivery

    September 3, 20251,539 Views

    A Former Apple Luminary Sets Out to Create the Ultimate GPU Software

    September 25, 202514 Views

    The Reason Murderbot’s Tone Feels Off

    May 14, 202511 Views
    Stay In Touch
    • Facebook
    • YouTube
    • TikTok
    • WhatsApp
    • Twitter
    • Instagram
    Latest Reviews

    Subscribe to Updates

    Get the latest tech news from FooBar about tech, design and biz.

    Most Popular

    College social app Fizz expands into grocery delivery

    September 3, 20251,539 Views

    A Former Apple Luminary Sets Out to Create the Ultimate GPU Software

    September 25, 202514 Views

    The Reason Murderbot’s Tone Feels Off

    May 14, 202511 Views
    Our Picks

    Didero lands $30M to put manufacturing procurement on ‘agentic’ autopilot

    February 12, 2026

    Eclipse backs all-EV marketplace Ever in $31M funding round

    February 12, 2026

    Complyance raises $20M to help companies manage risk and compliance

    February 12, 2026

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    Facebook X (Twitter) Instagram Pinterest
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms and Conditions
    • Disclaimer
    © 2026 techurz. Designed by Pro.

    Type above and press Enter to search. Press Esc to cancel.