Close Menu
TechurzTechurz

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    Didero lands $30M to put manufacturing procurement on ‘agentic’ autopilot

    February 12, 2026

    Eclipse backs all-EV marketplace Ever in $31M funding round

    February 12, 2026

    Complyance raises $20M to help companies manage risk and compliance

    February 12, 2026
    Facebook X (Twitter) Instagram
    Trending
    • Didero lands $30M to put manufacturing procurement on ‘agentic’ autopilot
    • Eclipse backs all-EV marketplace Ever in $31M funding round
    • Complyance raises $20M to help companies manage risk and compliance
    • Meridian raises $17 million to remake the agentic spreadsheet
    • 2026 Joseph C. Belden Innovation Award nominations are open
    • AI inference startup Modal Labs in talks to raise at $2.5B valuation, sources say
    • Who will own your company’s AI layer? Glean’s CEO explains
    • How to get into a16z’s super-competitive Speedrun startup accelerator program
    Facebook X (Twitter) Instagram Pinterest Vimeo
    TechurzTechurz
    • Home
    • AI
    • Apps
    • News
    • Guides
    • Opinion
    • Reviews
    • Security
    • Startups
    TechurzTechurz
    Home»AI»Forcing LLMs to be evil during training can make them nicer in the long run
    AI

    Forcing LLMs to be evil during training can make them nicer in the long run

    TechurzBy TechurzAugust 1, 2025No Comments3 Mins Read
    Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
    Forcing LLMs to be evil during training can make them nicer in the long run
    Share
    Facebook Twitter LinkedIn Pinterest Email

    For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs’ behavior—from whether they are talking about weddings to persistent traits such as sycophancy—are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior.

    Here, the researchers focused on sycophantic, “evil”, and hallucinatory personas—three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out that pattern given a brief text description of a persona. Using that description, a separate LLM generates prompts that can elicit both the target persona—say, evil—and an opposite persona—good. That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model’s average activity in good mode from its average activity in evil mode.

    When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That’s a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. “I think something like that would be really valuable,” he says. “And that’s kind of where I’m hoping to get.”

    Just detecting those personas isn’t enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tough. Many LLMs learn from human feedback, which trains them to behave in line with user preference—but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called “emergent misalignment,” in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries.

    Other researchers have tested out an approach called “steering,” in which activity patterns within LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.

    So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, they instead remained as helpful and harmless as ever.

    Evil forcing LLMs long nicer Run Training
    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleReddit Shifting Towards Search as Company Wants to Become a Search Engine
    Next Article Some goo.gl URLs will live to fight another day
    Techurz
    • Website

    Related Posts

    Opinion

    Palmer Luckey says the coolest thing about Anduril expanding to Long Beach is the fighter jets

    January 23, 2026
    Opinion

    Simular’s AI agent wants to run your Mac, Windows PC for you

    December 2, 2025
    Security

    Cyber agencies produce ‘long overdue’ best practices for securing Microsoft Exchange Server

    November 1, 2025
    Add A Comment
    Leave A Reply Cancel Reply

    Top Posts

    College social app Fizz expands into grocery delivery

    September 3, 20251,522 Views

    A Former Apple Luminary Sets Out to Create the Ultimate GPU Software

    September 25, 202514 Views

    The Reason Murderbot’s Tone Feels Off

    May 14, 202511 Views
    Stay In Touch
    • Facebook
    • YouTube
    • TikTok
    • WhatsApp
    • Twitter
    • Instagram
    Latest Reviews

    Subscribe to Updates

    Get the latest tech news from FooBar about tech, design and biz.

    Most Popular

    College social app Fizz expands into grocery delivery

    September 3, 20251,522 Views

    A Former Apple Luminary Sets Out to Create the Ultimate GPU Software

    September 25, 202514 Views

    The Reason Murderbot’s Tone Feels Off

    May 14, 202511 Views
    Our Picks

    Didero lands $30M to put manufacturing procurement on ‘agentic’ autopilot

    February 12, 2026

    Eclipse backs all-EV marketplace Ever in $31M funding round

    February 12, 2026

    Complyance raises $20M to help companies manage risk and compliance

    February 12, 2026

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    Facebook X (Twitter) Instagram Pinterest
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms and Conditions
    • Disclaimer
    © 2026 techurz. Designed by Pro.

    Type above and press Enter to search. Press Esc to cancel.