Close Menu
TechurzTechurz
    What's Hot

    Arena, the AI leaderboard everyone uses, is now a $100M business

    June 29, 2026

    Omen AI’s plan to optimize data centers is all wet

    June 29, 2026

    Asian AI startups launch Mythos-like models as Anthropic’s export ban drags on

    June 27, 2026
    X (Twitter) Pinterest YouTube LinkedIn WhatsApp
    Tech Pulse
    • Arena, the AI leaderboard everyone uses, is now a $100M business
    • Omen AI’s plan to optimize data centers is all wet
    • Asian AI startups launch Mythos-like models as Anthropic’s export ban drags on
    • Corgi, the buzzy Y Combinator-backed insurance tech startup, says it didn’t steal an open source product
    • OpenAI poaches Uber India chief to lead its biggest market outside the US
    X (Twitter) Pinterest YouTube LinkedIn WhatsApp
    TechurzTechurz
    • Home
    • Tech Pulse
    • Future Tech
    • AI Systems
    • Cyber Reality
    • Disruption Lab
    • Signals
    TechurzTechurz
    Home - AI - Forcing LLMs to be evil during training can make them nicer in the long run
    AI

    Forcing LLMs to be evil during training can make them nicer in the long run

    TechurzBy TechurzAugust 1, 2025Updated:May 10, 2026No Comments3 Mins Read
    Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
    Forcing LLMs to be evil during training can make them nicer in the long run
    Share
    Facebook Twitter LinkedIn Pinterest Email

    For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs’ behavior—from whether they are talking about weddings to persistent traits such as sycophancy—are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior.

    Here, the researchers focused on sycophantic, “evil”, and hallucinatory personas—three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out that pattern given a brief text description of a persona. Using that description, a separate LLM generates prompts that can elicit both the target persona—say, evil—and an opposite persona—good. That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model’s average activity in good mode from its average activity in evil mode.

    When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That’s a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. “I think something like that would be really valuable,” he says. “And that’s kind of where I’m hoping to get.”

    Just detecting those personas isn’t enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tough. Many LLMs learn from human feedback, which trains them to behave in line with user preference—but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called “emergent misalignment,” in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries.

    Other researchers have tested out an approach called “steering,” in which activity patterns within LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.

    So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, they instead remained as helpful and harmless as ever.

    Evil forcing LLMs long nicer Run Training
    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleReddit Shifting Towards Search as Company Wants to Become a Search Engine
    Next Article Some goo.gl URLs will live to fight another day
    Techurz
    • Website

    Related Posts

    Opinion

    He made your free video player run smoothly. Now he’s doing that for robots.

    June 20, 2026
    Opinion

    Collecting robot training data is dirty, unglamorous work. Some AI labs are already paying XDOF to do it.

    June 17, 2026
    AI Systems

    The Future of AI Systems: 7 Architectural Shifts Driving the AI Revolution

    June 13, 2026
    Add A Comment
    Latest Tech Pulse

    College social app Fizz expands into grocery delivery

    September 3, 20252,290

    SolarSquare in talks to raise up to $60M as India’s rooftop solar market draws major VC interest

    May 23, 202622

    Future of Digital Privacy and Security: 7 Truths Nobody Tells You

    May 25, 202619
    Stay In Touch
    • YouTube
    • WhatsApp
    • Twitter
    • Pinterest
    • LinkedIn

    Techurz helps readers stay ahead of digital change with clear, practical, future focused technology intelligence written today,searched tomorrow.

    X (Twitter) Pinterest YouTube LinkedIn WhatsApp
    Company
    • About Us
    • Contact Us
    • Our Authors / Editorial Team
    • Write For Us
    • Advertise
    Policy
    • Editorial Policy
    • Privacy Policy
    • Terms and Conditions
    • Affiliate Disclosure
    • Cookie Policy
    • Disclaimer
    • DMCA
    Explore
    • AI Systems
    • Cyber Reality
    • Future Tech
    • Disruption Lab
    • Signals
    • Tech Pulse
    • Sitemap

    Join the Techurz Brief

    The future does not arrive suddenly.
    Stay ahead with fast, sharp tech signals.

    Type above and press Enter to search. Press Esc to cancel.