AI 'neuron freezing' offers safety breakthrough

Researchers develop 'neuron freezing' technique to prevent users from bypassing AI safety filters in chatbots like ChatGPT.

Brit Brief 16/06/2026 19:11

Researchers at North Carolina State University have developed a new technique called 'neuron freezing' to enhance the safety of large language models (LLMs) such as ChatGPT. The method aims to prevent users from bypassing built-in safety filters by rephrasing harmful prompts in different contexts.

Current LLMs treat safety as a binary checkpoint at the start of generating an answer. If a query appears safe, the AI proceeds; if it appears dangerous, the AI refuses. However, users have found ways to circumvent these checks, for instance by framing a harmful prompt as a poem. Closing each such loophole typically requires retraining the model or applying an individual patch.

The new technique identifies specific safety-critical 'neurons' within the neural network and freezes them during fine-tuning. This allows the model to retain safety characteristics from the original model while adapting to new tasks in a specific domain. 'Our goal was to provide a better understanding of existing safety alignment issues and outline a new direction for implementing non-superficial safety alignment for LLMs,' said Jianwei Li, the lead PhD student researcher.
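The freezing idea can be illustrated with a toy sketch. This is not the researchers' implementation: it assumes the safety-critical neurons have already been identified (the indices in `frozen` are hypothetical), treats each row of a small weight matrix as one 'neuron', and uses placeholder gradients in place of a real fine-tuning loss.

```python
import random


def fine_tune_step(weights, grads, frozen, lr=0.1):
    """Apply one gradient-descent step, skipping every row listed in `frozen`."""
    updated = []
    for i, (row, grad_row) in enumerate(zip(weights, grads)):
        if i in frozen:
            # Frozen neuron: keep the original weights unchanged.
            updated.append(list(row))
        else:
            # Ordinary neuron: move against the gradient.
            updated.append([w - lr * g for w, g in zip(row, grad_row)])
    return updated


random.seed(0)
# Toy layer: 4 neurons with 3 weights each.
weights = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
# Placeholder gradients; a real run would compute these from task data.
grads = [[1.0, 1.0, 1.0] for _ in range(4)]
# Hypothetical indices of neurons flagged as safety-critical.
frozen = {1, 3}

new_weights = fine_tune_step(weights, grads, frozen)
```

In this sketch, rows 1 and 3 come out of the update exactly as they went in, while rows 0 and 2 adapt to the new task. How the safety-critical neurons are selected is the substantive part of the research and is not modelled here.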

Assistant professor Jung-Eun Kim added: 'The big picture is that we have developed a hypothesis that serves as a conceptual framework for understanding challenges with safety alignment, used it to identify a technique that addresses one challenge, and demonstrated that the technique works.' The researchers hope their work will help develop new methods for AI models to continuously reevaluate whether their reasoning is safe or unsafe while generating responses.

The technique is detailed in a paper titled 'Superficial safety alignment hypothesis', to be presented at the Fourteenth International Conference on Learning Representations (ICLR 2026) in Brazil next month.