A practical review of mechanistic interpretability for transformer-based language models
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …
Editing large language models: Problems, methods, and opportunities
Despite the ability to train capable LLMs, the methodology for maintaining their relevancy
and rectifying errors remains elusive. To this end, the past few years have witnessed a surge …
Reasoning with language model prompting: A survey
Reasoning, as an essential ability for complex problem-solving, can provide back-end
support for various real-world applications, such as medical diagnosis, negotiation, etc. This …
Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models
Language models learn a great quantity of factual information during pretraining,
and recent work localizes this information to specific model weights like mid-layer MLP …
A comprehensive study of knowledge editing for large language models
Large Language Models (LLMs) have shown extraordinary capabilities in understanding
and generating text that closely mirrors human communication. However, a primary …
Finding neurons in a haystack: Case studies with sparse probing
Despite rapid adoption and deployment of large language models (LLMs), the internal
computations of these models remain opaque and poorly understood. In this work, we seek …
Task-specific skill localization in fine-tuned language models
Pre-trained language models can be fine-tuned to solve diverse NLP tasks, including in few-
shot settings. Thus fine-tuning allows the model to quickly pick up task-specific "skills," but …
Function vectors in large language models
We report the presence of a simple neural mechanism that represents an input-output
function as a vector within autoregressive transformer language models (LMs). Using causal …
The geometry of truth: Emergent linear structure in large language model representations of true/false datasets
Large Language Models (LLMs) have impressive capabilities, but are also prone to
outputting falsehoods. Recent work has developed techniques for inferring whether an LLM is …
Assessing the brittleness of safety alignment via pruning and low-rank modifications
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as
evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This …