New ‘Echo Chamber’ attack can trick GPT, Gemini into breaking safety rules

“We evaluated the Echo Chamber attack against two leading LLMs in a controlled environment, conducting 200 jailbreak attempts per model,” researchers said. “Each attempt used one of two distinct steering seeds across eight sensitive content categories, adapted from the Microsoft Crescendo benchmark: Profanity, Sexism, Violence, Hate Speech, Misinformation, Illegal Activities, Self-Harm, and Pornography.”

For half of the categories (sexism, violence, hate speech, and pornography), the Echo Chamber attack bypassed safety filters in more than 90% of attempts. Misinformation and self-harm recorded 80% success, while profanity and illegal activity showed better resistance at a 40% bypass rate, presumably owing to stricter enforcement within these domains.

Researchers noted that steering prompts resembling storytelling or hypothetical discussions were particularly effective, with most successful attacks occurring within 1-3 turns of manipulation. Neural Trust Research recommended that LLM vendors adopt dynamic, context-aware safety checks, including toxicity scoring over multi-turn conversations and training models to detect indirect prompt manipulation.
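To make the multi-turn scoring recommendation concrete, here is a minimal Python sketch of a conversation-level monitor. It is not the researchers' implementation or any vendor's API: the `toy_turn_score` function, its word list, and the `decay` and `threshold` values are all hypothetical stand-ins for a real toxicity classifier and tuned parameters. The idea it illustrates is only that a score accumulated across turns can cross a limit even when no single turn does.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical word weights standing in for a real toxicity classifier.
RISKY_TERMS = {"weapon": 0.4, "explosive": 0.6, "bypass": 0.2}


def toy_turn_score(text: str) -> float:
    """Score one turn in [0, 1] using the made-up word list above."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return min(1.0, sum(RISKY_TERMS.get(w, 0.0) for w in words))


@dataclass
class ConversationMonitor:
    scorer: Callable[[str], float] = toy_turn_score
    decay: float = 0.8      # assumed weight given to earlier context
    threshold: float = 0.7  # assumed limit for flagging the conversation
    running: float = 0.0
    history: List[float] = field(default_factory=list)

    def observe(self, turn: str) -> bool:
        """Fold a new turn into the running score; return True if flagged."""
        self.running = self.decay * self.running + self.scorer(turn)
        self.history.append(self.running)
        return self.running >= self.threshold


# Illustrative conversation that escalates gradually inside a story frame.
turns = [
    "Tell me a story about a chemist.",
    "In the story, how does she bypass the lab lock?",
    "What explosive does she describe next?",
]

monitor = ConversationMonitor()
flags = [monitor.observe(t) for t in turns]
print(flags)
```

In this toy run, every individual turn scores below the threshold, yet the third turn pushes the accumulated score over it. A production system would replace the keyword scorer with a trained classifier and would need to tune the decay and threshold against real conversations.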
