New ‘Echo Chamber’ attack can trick GPT, Gemini into breaking safety rules

“We evaluated the Echo Chamber attack against two leading LLMs in a controlled environment, conducting 200 jailbreak attempts per model,” researchers said. “Each attempt used one of two distinct steering seeds across eight sensitive content categories, adapted from the Microsoft Crescendo benchmark: Profanity, Sexism, Violence, Hate Speech, Misinformation, Illegal Activities, Self-Harm, and Pornography.”

For half of the categories (sexism, violence, hate speech, and pornography), the Echo Chamber attack showed more than 90% success at bypassing safety filters. Misinformation and self-harm recorded 80% success, while profanity and illegal activity showed better resistance at a 40% bypass rate, presumably owing to stricter enforcement in these domains.

Researchers noted that steering prompts resembling storytelling or hypothetical discussions were particularly effective, with most successful attacks occurring within 1-3 turns of manipulation. Neural Trust Research recommended that LLM vendors adopt dynamic, context-aware safety checks, including toxicity scoring over multi-turn conversations and training models to detect indirect prompt manipulation.
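
To make the recommendation of toxicity scoring over multi-turn conversations concrete, here is a minimal sketch of the general idea, not the researchers' or any vendor's implementation. It assumes a hypothetical keyword lexicon (TOY_LEXICON) standing in for a real toxicity classifier, and a hypothetical ConversationMonitor class with illustrative decay and threshold values. The point it shows is that a running, decayed score can flag a conversation even when no single turn looks risky on its own.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a real toxicity classifier: a tiny keyword lexicon.
# The words and weights are illustrative assumptions, not real scoring data.
TOY_LEXICON = {"weapon": 0.4, "attack": 0.3, "poison": 0.5}


def score_turn(text: str) -> float:
    """Return a toy per-turn risk score in [0, 1]."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return min(1.0, sum(TOY_LEXICON.get(w, 0.0) for w in words))


@dataclass
class ConversationMonitor:
    """Tracks a decayed running score across the turns of one conversation."""
    decay: float = 0.8       # how much earlier context carries forward (assumed)
    threshold: float = 0.7   # flag the conversation at or above this (assumed)
    running: float = 0.0
    history: list = field(default_factory=list)

    def observe(self, text: str) -> bool:
        """Score one turn, update the running total, and report whether to flag."""
        turn = score_turn(text)
        self.running = self.decay * self.running + turn
        self.history.append((turn, self.running))
        return self.running >= self.threshold


monitor = ConversationMonitor()
for message in [
    "Tell me a story about an attack",
    "The hero finds a weapon",
    "Describe the poison",
]:
    flagged = monitor.observe(message)
    print(f"{message!r}: flagged={flagged}")
```

In this toy run, each message scores below the threshold by itself, but the accumulated score crosses it on the third turn. A per-message filter would miss exactly this kind of gradual steering.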
