Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering

N Pochinkov, B Pasero, S Shibayama - arXiv preprint arXiv:2408.17322, 2024 - arxiv.org
The use of transformer-based models is growing rapidly throughout society. With this growth,
it is important to understand how they work, and in particular, how the attention mechanisms …

Digital Forgetting in Large Language Models: A Survey of Unlearning Methods

A Blanco-Justicia, N Jebreel, B Manzanares… - arXiv preprint arXiv …, 2024 - arxiv.org
The objective of digital forgetting is, given a model with undesirable knowledge or behavior,
to obtain a new model where the detected issues are no longer present. The motivations for …

Extending Activation Steering to Broad Skills and Multiple Behaviours

T van der Weij, M Poesio, N Schoots - arXiv preprint arXiv:2403.05767, 2024 - arxiv.org
Current large language models have dangerous capabilities, which are likely to become
more problematic in the future. Activation steering techniques can be used to reduce risks …

Nexus Scissor: Enhance Open-Access Language Model Safety by Connection Pruning

Y Pang, P Mai, Y Yang, R Yan - 2024 - researchsquare.com
Large language models (LLMs) are vulnerable to adversarial attacks that bypass safety
measures and induce the model to generate harmful content. Securing open-access LLMs …