A practical review of mechanistic interpretability for transformer-based language models

D Rai, Y Zhou, S Feng, A Saparov, Z Yao - arXiv preprint arXiv …, 2024 - arxiv.org
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …
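
A minimal sketch of the reverse-engineering style this survey covers is the "logit lens": project each layer's residual stream through the model's unembedding and watch the next-token prediction form across depth. The model, prompt, and layer choice below are illustrative assumptions, not the survey's own code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Decode what each layer "believes" the next token is.
for layer, h in enumerate(out.hidden_states):
    # GPT-2 applies a final LayerNorm before the unembedding, so mirror that.
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, repr(tok.decode(logits.argmax().item())))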

Editing large language models: Problems, methods, and opportunities

Y Yao, P Wang, B Tian, S Cheng, Z Li, S Deng… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite the ability to train capable LLMs, the methodology for maintaining their relevancy
and rectifying errors remains elusive. To this end, the past few years have witnessed a surge …
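
One prominent family of editing methods (e.g. ROME) rewrites a fact with a rank-one update to an MLP weight so that a chosen key activation maps to a new value. A schematic of that linear-algebra core on synthetic tensors, assuming an identity key covariance, not the full method:

import torch

d_in, d_out = 64, 128
W = torch.randn(d_out, d_in)   # the MLP projection to be edited
k = torch.randn(d_in)          # key: activation pattern that encodes the subject
v_new = torch.randn(d_out)     # value: output expressing the new fact

# Real methods estimate the key covariance C from a large corpus so the edit
# spares unrelated keys; an identity placeholder keeps the sketch minimal.
C = torch.eye(d_in)
k_hat = torch.linalg.solve(C, k)

# Rank-one update W' = W + u k_hat^T, scaled so that W' k == v_new exactly.
u = (v_new - W @ k) / (k_hat @ k)
W_edited = W + torch.outer(u, k_hat)
assert torch.allclose(W_edited @ k, v_new, atol=1e-3)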

Reasoning with language model prompting: A survey

S Qiao, Y Ou, N Zhang, X Chen, Y Yao, S Deng… - arXiv preprint arXiv …, 2022 - arxiv.org
Reasoning, as an essential ability for complex problem-solving, can provide back-end
support for various real-world applications, such as medical diagnosis, negotiation, etc. This …
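
The survey's organizing technique is prompting with worked exemplars (chain-of-thought): show the model a solved problem with its intermediate steps, then pose a new question. A minimal illustration; both questions are made up for the example:

# One worked exemplar followed by a fresh question; the exemplar's visible
# intermediate steps elicit step-by-step reasoning on the new question.
prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
    "How many balls does he have now?\n"
    "A: He starts with 5. 2 cans of 3 balls is 6. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: A clinic sees 4 patients an hour for 6 hours. How many patients is that?\n"
    "A:"
)
# Any causal LM can complete `prompt`; the answer should walk through 4 * 6 = 24.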

Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models

P Hase, M Bansal, B Kim… - Advances in Neural …, 2024 - proceedings.neurips.cc
Language models learn a great quantity of factual information during pretraining,
and recent work localizes this information to specific model weights like mid-layer MLP …
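
The localization method at issue is causal tracing: corrupt the subject tokens, restore one layer's clean hidden state, and measure how much of the correct answer's probability returns. One restoration step, schematically; GPT-2, the layer, and the prompts are illustrative, and the real procedure sweeps all layers and positions under noise:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is in", return_tensors="pt")
corrupt = tok("The XXXX XXXX is in", return_tensors="pt")   # subject destroyed
answer_id = tok(" Paris")["input_ids"][0]

layer, pos = 6, -1    # which residual-stream state to save and restore
store = {}

def save_hook(mod, inp, out):
    store["h"] = out[0][:, pos].detach()

def patch_hook(mod, inp, out):
    out[0][:, pos] = store["h"]
    return out

handle = model.transformer.h[layer].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)                      # cache the clean hidden state
handle.remove()

handle = model.transformer.h[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**corrupt)          # corrupted run with one state restored
handle.remove()

p = torch.softmax(patched.logits[0, -1], -1)[answer_id]
print(f"P(' Paris') with layer {layer} restored: {p.item():.4f}")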

A comprehensive study of knowledge editing for large language models

N Zhang, Y Yao, B Tian, P Wang, S Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have shown extraordinary capabilities in understanding
and generating text that closely mirrors human communication. However, a primary …
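
Studies in this line typically score an edit along three axes: reliability (the edited prompt itself), generality (paraphrases of it), and locality (unrelated neighboring facts left intact). A schematic, where model_answer is a hypothetical stand-in for querying the edited model:

def edit_scores(model_answer, edit, paraphrases, neighbors):
    # reliability: does the edited prompt now yield the new target?
    reliability = float(model_answer(edit["prompt"]) == edit["target"])
    # generality: does the edit carry over to rephrasings?
    generality = sum(model_answer(p) == edit["target"] for p in paraphrases) / len(paraphrases)
    # locality: are unrelated prompts still answered as before?
    locality = sum(model_answer(p) == a for p, a in neighbors) / len(neighbors)
    return reliability, generality, locality

# Toy usage with a hard-coded lookup standing in for the edited model.
answers = {"PM of the UK?": "X", "Who leads the UK?": "X", "Capital of France?": "Paris"}
print(edit_scores(answers.get, {"prompt": "PM of the UK?", "target": "X"},
                  ["Who leads the UK?"], [("Capital of France?", "Paris")]))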

Finding neurons in a haystack: Case studies with sparse probing

W Gurnee, N Nanda, M Pauly, K Harvey… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite rapid adoption and deployment of large language models (LLMs), the internal
computations of these models remain opaque and poorly understood. In this work, we seek …
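
Sparse probing restricts a classifier to k neurons to test how localized a feature is. A schematic on synthetic activations with one planted class-selective neuron; in the paper the matrix holds cached LLM activations over a labeled corpus:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, k = 2000, 512, 8
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, d))
X[:, 17] += 2.0 * y                    # plant one strongly class-selective "neuron"

# Rank neurons by the absolute gap between class means; keep the top k.
gap = np.abs(X[y == 1].mean(0) - X[y == 0].mean(0))
top = np.argsort(gap)[-k:]

probe = LogisticRegression(max_iter=1000).fit(X[:, top], y)
print("selected:", sorted(top.tolist()), "train acc:", probe.score(X[:, top], y))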

Task-specific skill localization in fine-tuned language models

A Panigrahi, N Saunshi, H Zhao… - … on Machine Learning, 2023 - proceedings.mlr.press
Pre-trained language models can be fine-tuned to solve diverse NLP tasks, including in few-
shot settings. Thus fine-tuning allows the model to quickly pick up task-specific "skills," but …
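
The grafting idea behind skill localization, schematically: transplant only the small fraction of parameters that moved most during fine-tuning back onto the pretrained weights and keep the rest untouched. The tensors below are synthetic stand-ins for flattened model parameters:

import torch

torch.manual_seed(0)
pre = torch.randn(10_000)                # pretrained parameters (flattened)
ft = pre + 0.01 * torch.randn(10_000)    # fine-tuned copy, mostly tiny drift
ft[:50] += 1.0                           # a few parameters carry the "skill"

delta = (ft - pre).abs()
mask = delta >= delta.quantile(0.995)    # keep only the top 0.5% largest updates

grafted = torch.where(mask, ft, pre)     # fine-tuned where masked, pretrained elsewhere
print("parameters grafted:", int(mask.sum()))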

Function vectors in large language models

E Todd, ML Li, AS Sharma, A Mueller… - arXiv preprint arXiv …, 2023 - arxiv.org
We report the presence of a simple neural mechanism that represents an input-output
function as a vector within autoregressive transformer language models (LMs). Using causal …
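
A rough sketch of the mechanism: average a layer's hidden state at the final token over a few in-context prompts for a task, then add that vector into a zero-shot forward pass. GPT-2, the layer, and the prompts are illustrative assumptions; the paper instead assembles the vector from specific attention heads found via causal mediation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
layer = 8

icl_prompts = [
    "hot:cold, big:small, up:",
    "fast:slow, old:young, wet:",
]

# 1) Collect the layer's hidden state at the last token of each ICL prompt.
states = []
def grab(mod, inp, out):
    states.append(out[0][:, -1].detach())

handle = model.transformer.h[layer].register_forward_hook(grab)
with torch.no_grad():
    for p in icl_prompts:
        model(**tok(p, return_tensors="pt"))
handle.remove()
fv = torch.stack(states).mean(0)         # the "function vector"

# 2) Add it at the same layer while running a prompt with no examples.
def add_fv(mod, inp, out):
    out[0][:, -1] += fv
    return out

handle = model.transformer.h[layer].register_forward_hook(add_fv)
with torch.no_grad():
    logits = model(**tok("light:", return_tensors="pt")).logits
handle.remove()
print(repr(tok.decode(logits[0, -1].argmax().item())))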

The geometry of truth: Emergent linear structure in large language model representations of true/false datasets

S Marks, M Tegmark - arXiv preprint arXiv:2310.06824, 2023 - arxiv.org
Large Language Models (LLMs) have impressive capabilities, but are also prone to
outputting falsehoods. Recent work has developed techniques for inferring whether an LLM is …
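
The linear structure the paper reports motivates very simple probes. Mass-mean probing, where the truth direction is the difference between mean activations on true and false statements, shown here on synthetic activations with a planted direction rather than real residual-stream data:

import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 500
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)                       # planted "truth" direction
true_acts = rng.normal(size=(n, d)) + 1.5 * theta
false_acts = rng.normal(size=(n, d)) - 1.5 * theta

direction = true_acts.mean(0) - false_acts.mean(0)   # mass-mean probe

# Score statements by projection; the sign recovers the label.
scores = np.concatenate([true_acts, false_acts]) @ direction
labels = np.array([1] * n + [0] * n)
print("probe accuracy:", ((scores > 0).astype(int) == labels).mean())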

Assessing the brittleness of safety alignment via pruning and low-rank modifications

B Wei, K Huang, Y Huang, T Xie, X Qi, M Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as
evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This …
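
The paper's probe for safety-critical regions can be sketched as scoring weights by a first-order importance on safety data and ablating the top-ranked ones. A toy linear layer stands in for a transformer block; the loss, data, and 1% threshold are placeholders:

import torch

torch.manual_seed(0)
W = torch.nn.Parameter(torch.randn(32, 64))
x = torch.randn(128, 64)                 # stand-in "safety" inputs
target = torch.randn(128, 32)            # stand-in safe responses

loss = torch.nn.functional.mse_loss(x @ W.T, target)
loss.backward()

importance = (W.detach() * W.grad).abs() # SNIP-style saliency per weight
k = int(0.01 * importance.numel())       # ablate the top 1%
idx = importance.flatten().topk(k).indices

pruned = W.detach().clone()
pruned.view(-1)[idx] = 0.0               # zero out the "safety-critical" weights
print("ablated", k, "of", importance.numel(), "weights")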