Interpretability at scale: Identifying causal mechanisms in Alpaca

J Chen, Z Liu, X Huang, C Wu, Q Liu, G Jiang, Y Pu… - World Wide Web, 2024 - Springer

The advent of large language models marks a revolutionary breakthrough in artificial
intelligence. With the unprecedented scale of training and model parameters, the capability …

Cited by 62 · Related articles · All 2 versions

[PDF] arxiv.org

Large language models and causal inference in collaboration: A comprehensive survey

X Liu, P Xu, J Wu, J Yuan, Y Yang, Y Zhou, F Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

Causal inference has shown potential in enhancing the predictive accuracy, fairness,
robustness, and explainability of Natural Language Processing (NLP) models by capturing …

Cited by 19 · Related articles · All 2 versions

[PDF] neurips.cc

Towards automated circuit discovery for mechanistic interpretability

A Conmy, A Mavor-Parker, A Lynch… - Advances in …, 2023 - proceedings.neurips.cc

Through considerable effort and intuition, several recent works have reverse-engineered
nontrivial behaviors of transformer models. This paper systematizes the mechanistic …

Cited by 127 · Related articles · All 6 versions

[PDF] neurips.cc

LEACE: Perfect linear concept erasure in closed form

N Belrose, D Schneider-Joseph… - Advances in …, 2024 - proceedings.neurips.cc

Concept erasure aims to remove specified features from a representation. It can
improve fairness (e.g., preventing a classifier from using gender or race) and interpretability …

Cited by 67 · Related articles · All 5 versions

[PDF] arxiv.org

Radiology-Llama2: Best-in-class large language model for radiology

Z Liu, Y Li, P Shu, A Zhong, L Yang, C Ju, Z Wu… - arXiv preprint arXiv …, 2023 - arxiv.org

This paper introduces Radiology-Llama2, a large language model specialized for radiology
through a process known as instruction tuning. Radiology-Llama2 is based on the Llama2 …

Cited by 61 · Related articles · All 4 versions

[PDF] arxiv.org

Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla

T Lieberum, M Rahtz, J Kramár, N Nanda… - arXiv preprint arXiv …, 2023 - arxiv.org

Circuit analysis is a promising technique for understanding the internal mechanisms
of language models. However, existing analyses are done in small models far from the state …

Cited by 42 · Related articles · All 2 versions

[PDF] arxiv.org

A survey on interpretable reinforcement learning

C Glanois, P Weng, M Zimmer, D Li, T Yang, J Hao… - Machine Learning, 2024 - Springer

Although deep reinforcement learning has become a promising machine learning approach
for sequential decision-making problems, it is still not mature enough for high-stake domains …

Cited by 70 · Related articles · All 3 versions

[PDF] arxiv.org

Rethinking interpretability in the era of large language models

C Singh, JP Inala, M Galley, R Caruana… - arXiv preprint arXiv …, 2024 - arxiv.org

Interpretable machine learning has exploded as an area of interest over the last decade,
sparked by the rise of increasingly large datasets and deep neural networks …

Cited by 31 · Related articles · All 2 versions

[PDF] arxiv.org

Towards best practices of activation patching in language models: Metrics and methods

F Zhang, N Nanda - arXiv preprint arXiv:2309.16042, 2023 - arxiv.org

Mechanistic interpretability seeks to understand the internal mechanisms of machine
learning models, where localization--identifying the important model components--is a key …

Cited by 35 · Related articles · All 4 versions

[PDF] pnas.org

Evaluating language models for mathematics through interactions

KM Collins, AQ Jiang, S Frieder… - Proceedings of the …, 2024 - National Acad Sciences

There is much excitement about the opportunity to harness the power of large language
models (LLMs) when building problem-solving assistants. However, the standard …

Cited by 26 · Related articles · All 9 versions