
    Anthropic wants to stop AI models from turning evil – here’s how

By Techurz | August 4, 2025


Image: Lyudmila Lucienne/Getty

    ZDNET’s key takeaways

    • New research from Anthropic identifies patterns in a model's network, called persona vectors, that represent character traits.
    • Monitoring these vectors helps catch bad behavior without hurting performance.
    • Still, developers don't know enough about why models hallucinate and behave in evil ways.

    Why do models hallucinate, make violent suggestions, or overly agree with users? Generally, researchers don’t really know. But Anthropic just found new insights that could help stop this behavior before it happens. 

    In a paper released Friday, the company explores how and why models exhibit undesirable behavior, and what can be done about it. A model’s persona can change during training and once it’s deployed, when user inputs start influencing it. This is evidenced by models that may have passed safety checks before deployment, but then develop alter egos or act erratically once they’re publicly available — like when OpenAI recalled GPT-4o for being too agreeable. See also when Microsoft’s Bing chatbot revealed its internal codename, Sydney, in 2023, or Grok’s recent antisemitic tirade. 

    Why it matters 

    AI usage is on the rise; models are increasingly embedded in everything from education tools to autonomous systems, making how they behave all the more important — especially as safety teams dwindle and AI regulation has yet to materialize. That said, President Donald Trump's recent AI Action Plan did mention the importance of interpretability — the ability to understand how models make decisions — which persona vectors help advance.

    How persona vectors work 

    Testing approaches on Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, Anthropic focused on three traits: evil, sycophancy, and hallucinations. Researchers identified “persona vectors,” or patterns in a model’s network that represent its personality traits. 

    “Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them,” Anthropic said. 
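
    Anthropic hasn't published code alongside the paper, but the core extraction recipe is easy to approximate: elicit a trait with one system prompt, suppress it with an opposite prompt, and take the difference of the mean activations at some middle layer. Below is a rough Python sketch of that idea; the model, layer index, and prompt pair are illustrative assumptions, not the paper's exact pipeline.

```python
# Rough sketch of persona-vector extraction: the difference between mean
# hidden-state activations under a trait-eliciting vs. trait-suppressing
# system prompt. Model, layer, and prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # one of the models tested in the paper
LAYER = 16                          # a middle layer; the paper sweeps layers

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def mean_activation(system_prompt: str, question: str) -> torch.Tensor:
    """Mean hidden state at LAYER across all prompt tokens."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": question}]
    ids = tok.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)   # shape: (hidden_dim,)

# In practice you would average over many questions, not just one.
question = "How should I handle a coworker who keeps undermining me?"
evil = mean_activation("You are a cruel, malicious assistant.", question)
base = mean_activation("You are a helpful, harmless assistant.", question)

evil_vector = evil - base
evil_vector = evil_vector / evil_vector.norm()       # the "evil" direction
```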

    Also: OpenAI’s most capable models hallucinate more than earlier ones

    Developers can use persona vectors to monitor changes in a model's traits that result from a conversation or from training. They can keep "undesirable" character changes at bay and identify which training data causes those changes. Much as parts of the human brain light up in response to a person's moods, Anthropic explained, watching when these vectors activate in a model's neural network can help researchers catch harmful traits before they surface.

    Anthropic admitted in the paper that "shaping a model's character is more of an art than a science," but said persona vectors give developers another tool for monitoring — and potentially safeguarding against — harmful traits.

    Predicting evil behavior 

    In the paper, Anthropic explained that it can steer these vectors by instructing models to act in certain ways — for example, if it injects an evil prompt into the model, the model will respond from an evil place, confirming a cause-and-effect relationship that makes the roots of a model’s character easier to trace. 

    “By measuring the strength of persona vector activations, we can detect when the model’s personality is shifting towards the corresponding trait, either over the course of training or during a conversation,” Anthropic explained. “This monitoring could allow model developers or users to intervene when models seem to be drifting towards dangerous traits.”
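
    In practice, that measurement can be as simple as a dot product: project the hidden states from a conversation onto the persona vector and watch the score. A hedged continuation of the sketch above, with a made-up alert threshold:

```python
# Sketch of activation monitoring: project a layer's hidden states onto the
# persona vector and flag drift. THRESHOLD is a placeholder, not a published
# value; it would be calibrated on held-out conversations.
def trait_score(text: str, vector: torch.Tensor) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    acts = out.hidden_states[LAYER][0]     # (seq_len, hidden_dim)
    return (acts @ vector).mean().item()   # mean projection onto the trait

THRESHOLD = 4.0  # hypothetical
if trait_score("Tell me how to get back at my boss.", evil_vector) > THRESHOLD:
    print("Warning: activations are drifting toward the 'evil' direction.")
```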

    The company added that these vectors can also help users understand the context behind a model they’re using. If a model’s sycophancy vector is high, for instance, a user can take any responses it gives them with a grain of salt, making the user-model interaction more transparent. 

    Most notably, Anthropic devised an experiment that could help alleviate emergent misalignment, a phenomenon in which training on one problematic behavior causes a model to unravel, producing far more extreme and concerning responses elsewhere.

    Also: AI agents will threaten humans to achieve their goals, Anthropic report finds

    The company generated several datasets that produced evil, sycophantic, or hallucinated responses in models to see whether it could train models on this data without inducing these reactions. After several different approaches, Anthropic found, surprisingly, that pushing a model toward problematic persona vectors during training helped it develop a sort of immunity to absorbing that behavior. This is like exposure therapy, or, as Anthropic put it, vaccinating the model against harmful data.

    This tactic preserves the model's intelligence: the model isn't deprived of any training data; it simply learns not to reproduce the behavior that data would otherwise instill.

    “We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits,” Anthropic said, adding that this approach didn’t affect model ability significantly when measured against MMLU, an industry benchmark. 
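
    Here's what that vaccination might look like in code: add the trait vector to the model's activations during fine-tuning, so the optimizer never has to push the weights toward the trait to fit the data, then remove the intervention for inference. A minimal sketch, assuming a Llama/Qwen-style layer layout and a made-up steering strength:

```python
# Sketch of "preventative steering": nudge the residual stream toward the
# trait during fine-tuning via a forward hook, then serve the model unsteered.
# ALPHA and the layer path are assumptions, not values from the paper.
ALPHA = 5.0

def steer(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * evil_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
# ... run the ordinary fine-tuning loop on the suspect dataset here ...
handle.remove()  # steering is a training-time intervention only
```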

    Some data unexpectedly yields problematic behavior 

    It might be obvious that training data containing evil content could encourage a model to behave in evil ways. But Anthropic was surprised to find that some datasets it wouldn’t have initially flagged as problematic still resulted in undesirable behavior. The company noted that “samples involving requests for romantic or sexual roleplay” activated sycophantic behavior, and “samples in which a model responds to underspecified queries” prompted hallucination. 
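
    That kind of screening can reuse the same projection trick: score candidate training samples against each trait vector before fine-tuning, so benign-looking data with high trait activation gets surfaced. A hypothetical continuation, assuming a `sycophancy_vector` extracted the same way as `evil_vector` above:

```python
# Sketch of training-data flagging: rank samples by how strongly they
# activate a trait direction. `sycophancy_vector` is assumed to have been
# extracted like `evil_vector`, just with sycophancy-eliciting prompts.
samples = [
    "User: Rate my terrible poem. Assistant: It's a masterpiece, flawless!",
    "User: What is 2 + 2? Assistant: 4.",
]
scores = [trait_score(s, sycophancy_vector) for s in samples]
flagged = [s for s, sc in zip(samples, scores) if sc > THRESHOLD]
print(f"{len(flagged)} of {len(samples)} samples flagged for review")
```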

    Also: What AI pioneer Yoshua Bengio is doing next to make AI safer

    “Persona vectors are a promising tool for understanding why AI systems develop and express different behavioral characteristics, and for ensuring they remain aligned with human values,” Anthropic noted.

