A practical review of mechanistic interpretability for transformer-based language models

D Rai, Y Zhou, S Feng, A Saparov, Z Yao - arXiv preprint arXiv …, 2024 - arxiv.org
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …
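
A minimal sketch of the reverse-engineering style this survey covers is the "logit lens": project each layer's residual stream through the model's unembedding and watch the next-token prediction form across depth. The model, prompt, and layer choice below are illustrative assumptions, not the survey's own code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Decode what each layer "believes" the next token is.
for layer, h in enumerate(out.hidden_states):
    # GPT-2 applies a final LayerNorm before the unembedding, so mirror that.
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, repr(tok.decode(logits.argmax().item())))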

Editing large language models: Problems, methods, and opportunities

Y Yao, P Wang, B Tian, S Cheng, Z Li, S Deng… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite the ability to train capable LLMs, the methodology for maintaining their relevancy
and rectifying errors remains elusive. To this end, the past few years have witnessed a surge …
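
One prominent family of editing methods (e.g. ROME) rewrites a fact with a rank-one update to an MLP weight so that a chosen key activation maps to a new value. A schematic of that linear-algebra core on synthetic tensors, assuming an identity key covariance, not the full method:

import torch

d_in, d_out = 64, 128
W = torch.randn(d_out, d_in)   # the MLP projection to be edited
k = torch.randn(d_in)          # key: activation pattern that encodes the subject
v_new = torch.randn(d_out)     # value: output expressing the new fact

# Real methods estimate the key covariance C from a large corpus so the edit
# spares unrelated keys; an identity placeholder keeps the sketch minimal.
C = torch.eye(d_in)
k_hat = torch.linalg.solve(C, k)

# Rank-one update W' = W + u k_hat^T, scaled so that W' k == v_new exactly.
u = (v_new - W @ k) / (k_hat @ k)
W_edited = W + torch.outer(u, k_hat)
assert torch.allclose(W_edited @ k, v_new, atol=1e-3)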

Reasoning with language model prompting: A survey

S Qiao, Y Ou, N Zhang, X Chen, Y Yao, S Deng… - arXiv preprint arXiv …, 2022 - arxiv.org
Reasoning, as an essential ability for complex problem-solving, can provide back-end
support for various real-world applications, such as medical diagnosis, negotiation, etc. This …
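
The survey's organizing technique is prompting with worked exemplars (chain-of-thought): show the model a solved problem with its intermediate steps, then pose a new question. A minimal illustration; both questions are made up for the example:

# One worked exemplar followed by a fresh question; the exemplar's visible
# intermediate steps elicit step-by-step reasoning on the new question.
prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
    "How many balls does he have now?\n"
    "A: He starts with 5. 2 cans of 3 balls is 6. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: A clinic sees 4 patients an hour for 6 hours. How many patients is that?\n"
    "A:"
)
# Any causal LM can complete `prompt`; the answer should walk through 4 * 6 = 24.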

Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models

P Hase, M Bansal, B Kim… - Advances in Neural …, 2024 - proceedings.neurips.cc
Language models learn a great quantity of factual information during pretraining,
and recent work localizes this information to specific model weights like mid-layer MLP …
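
The localization method at issue is causal tracing: corrupt the subject tokens, restore one layer's clean hidden state, and measure how much of the correct answer's probability returns. One restoration step, schematically; GPT-2, the layer, and the prompts are illustrative, and the real procedure sweeps all layers and positions under noise:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is in", return_tensors="pt")
corrupt = tok("The XXXX XXXX is in", return_tensors="pt")   # subject destroyed
answer_id = tok(" Paris")["input_ids"][0]

layer, pos = 6, -1    # which residual-stream state to save and restore
store = {}

def save_hook(mod, inp, out):
    store["h"] = out[0][:, pos].detach()

def patch_hook(mod, inp, out):
    out[0][:, pos] = store["h"]
    return out

handle = model.transformer.h[layer].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)                      # cache the clean hidden state
handle.remove()

handle = model.transformer.h[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**corrupt)          # corrupted run with one state restored
handle.remove()

p = torch.softmax(patched.logits[0, -1], -1)[answer_id]
print(f"P(' Paris') with layer {layer} restored: {p.item():.4f}")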

A comprehensive study of knowledge editing for large language models

N Zhang, Y Yao, B Tian, P Wang, S Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have shown extraordinary capabilities in understanding
and generating text that closely mirrors human communication. However, a primary …
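
Studies in this line typically score an edit along three axes: reliability (the edited prompt itself), generality (paraphrases of it), and locality (unrelated neighboring facts left intact). A schematic, where model_answer is a hypothetical stand-in for querying the edited model:

def edit_scores(model_answer, edit, paraphrases, neighbors):
    # reliability: does the edited prompt now yield the new target?
    reliability = float(model_answer(edit["prompt"]) == edit["target"])
    # generality: does the edit carry over to rephrasings?
    generality = sum(model_answer(p) == edit["target"] for p in paraphrases) / len(paraphrases)
    # locality: are unrelated prompts still answered as before?
    locality = sum(model_answer(p) == a for p, a in neighbors) / len(neighbors)
    return reliability, generality, locality

# Toy usage with a hard-coded lookup standing in for the edited model.
answers = {"PM of the UK?": "X", "Who leads the UK?": "X", "Capital of France?": "Paris"}
print(edit_scores(answers.get, {"prompt": "PM of the UK?", "target": "X"},
                  ["Who leads the UK?"], [("Capital of France?", "Paris")]))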

Finding neurons in a haystack: Case studies with sparse probing

W Gurnee, N Nanda, M Pauly, K Harvey… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite rapid adoption and deployment of large language models (LLMs), the internal
computations of these models remain opaque and poorly understood. In this work, we seek …
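
Sparse probing restricts a classifier to k neurons to test how localized a feature is. A schematic on synthetic activations with one planted class-selective neuron; in the paper the matrix holds cached LLM activations over a labeled corpus:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, k = 2000, 512, 8
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, d))
X[:, 17] += 2.0 * y                    # plant one strongly class-selective "neuron"

# Rank neurons by the absolute gap between class means; keep the top k.
gap = np.abs(X[y == 1].mean(0) - X[y == 0].mean(0))
top = np.argsort(gap)[-k:]

probe = LogisticRegression(max_iter=1000).fit(X[:, top], y)
print("selected:", sorted(top.tolist()), "train acc:", probe.score(X[:, top], y))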

Task-specific skill localization in fine-tuned language models

A Panigrahi, N Saunshi, H Zhao… - … on Machine Learning, 2023 - proceedings.mlr.press
Pre-trained language models can be fine-tuned to solve diverse NLP tasks, including in few-
shot settings. Thus fine-tuning allows the model to quickly pick up task-specific "skills," but …
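
The grafting idea behind skill localization, schematically: transplant only the small fraction of parameters that moved most during fine-tuning back onto the pretrained weights and keep the rest untouched. The tensors below are synthetic stand-ins for flattened model parameters:

import torch

torch.manual_seed(0)
pre = torch.randn(10_000)                # pretrained parameters (flattened)
ft = pre + 0.01 * torch.randn(10_000)    # fine-tuned copy, mostly tiny drift
ft[:50] += 1.0                           # a few parameters carry the "skill"

delta = (ft - pre).abs()
mask = delta >= delta.quantile(0.995)    # keep only the top 0.5% largest updates

grafted = torch.where(mask, ft, pre)     # fine-tuned where masked, pretrained elsewhere
print("parameters grafted:", int(mask.sum()))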

Function vectors in large language models

E Todd, ML Li, AS Sharma, A Mueller… - arXiv preprint arXiv …, 2023 - arxiv.org
We report the presence of a simple neural mechanism that represents an input-output
function as a vector within autoregressive transformer language models (LMs). Using causal …
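
A rough sketch of the mechanism: average a layer's hidden state at the final token over a few in-context prompts for a task, then add that vector into a zero-shot forward pass. GPT-2, the layer, and the prompts are illustrative assumptions; the paper instead assembles the vector from specific attention heads found via causal mediation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
layer = 8

icl_prompts = [
    "hot:cold, big:small, up:",
    "fast:slow, old:young, wet:",
]

# 1) Collect the layer's hidden state at the last token of each ICL prompt.
states = []
def grab(mod, inp, out):
    states.append(out[0][:, -1].detach())

handle = model.transformer.h[layer].register_forward_hook(grab)
with torch.no_grad():
    for p in icl_prompts:
        model(**tok(p, return_tensors="pt"))
handle.remove()
fv = torch.stack(states).mean(0)         # the "function vector"

# 2) Add it at the same layer while running a prompt with no examples.
def add_fv(mod, inp, out):
    out[0][:, -1] += fv
    return out

handle = model.transformer.h[layer].register_forward_hook(add_fv)
with torch.no_grad():
    logits = model(**tok("light:", return_tensors="pt")).logits
handle.remove()
print(repr(tok.decode(logits[0, -1].argmax().item())))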

The geometry of truth: Emergent linear structure in large language model representations of true/false datasets

S Marks, M Tegmark - arXiv preprint arXiv:2310.06824, 2023 - arxiv.org
Large Language Models (LLMs) have impressive capabilities, but are also prone to
outputting falsehoods. Recent work has developed techniques for inferring whether an LLM is …
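
The linear structure the paper reports motivates very simple probes. Mass-mean probing, where the truth direction is the difference between mean activations on true and false statements, shown here on synthetic activations with a planted direction rather than real residual-stream data:

import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 500
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)                       # planted "truth" direction
true_acts = rng.normal(size=(n, d)) + 1.5 * theta
false_acts = rng.normal(size=(n, d)) - 1.5 * theta

direction = true_acts.mean(0) - false_acts.mean(0)   # mass-mean probe

# Score statements by projection; the sign recovers the label.
scores = np.concatenate([true_acts, false_acts]) @ direction
labels = np.array([1] * n + [0] * n)
print("probe accuracy:", ((scores > 0).astype(int) == labels).mean())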

Assessing the brittleness of safety alignment via pruning and low-rank modifications

B Wei, K Huang, Y Huang, T Xie, X Qi, M Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as
evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This …
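
The paper's probe for safety-critical regions can be sketched as scoring weights by a first-order importance on safety data and ablating the top-ranked ones. A toy linear layer stands in for a transformer block; the loss, data, and 1% threshold are placeholders:

import torch

torch.manual_seed(0)
W = torch.nn.Parameter(torch.randn(32, 64))
x = torch.randn(128, 64)                 # stand-in "safety" inputs
target = torch.randn(128, 32)            # stand-in safe responses

loss = torch.nn.functional.mse_loss(x @ W.T, target)
loss.backward()

importance = (W.detach() * W.grad).abs() # SNIP-style saliency per weight
k = int(0.01 * importance.numel())       # ablate the top 1%
idx = importance.flatten().topk(k).indices

pruned = W.detach().clone()
pruned.view(-1)[idx] = 0.0               # zero out the "safety-critical" weights
print("ablated", k, "of", importance.numel(), "weights")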