Challenges and applications of large language models

J Kaddour, J Harris, M Mozes, H Bradley… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

The clock and the pizza: Two stories in mechanistic explanation of neural networks

Z Zhong, Z Liu, M Tegmark… - Advances in Neural …, 2024 - proceedings.neurips.cc
Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known
algorithms? Several recent studies, on tasks ranging from group operations to in-context …
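
The "group operations" this entry refers to include the canonical modular-addition task. A minimal sketch of that task family follows; the modulus, split fraction, and seed here are illustrative stand-ins, not the paper's exact configuration.

```python
# Modular addition (a + b) mod p: the kind of well-understood algorithmic
# task on which trained networks can be checked against known algorithms.
import itertools
import random

p = 59  # small prime modulus (assumed value, not the paper's)

# Every input pair (a, b) with its ground-truth label (a + b) mod p.
pairs = list(itertools.product(range(p), repeat=2))
dataset = [((a, b), (a + b) % p) for a, b in pairs]

random.seed(0)
random.shuffle(dataset)
split = int(0.3 * len(dataset))  # train on a fraction, test generalization
train, test = dataset[:split], dataset[split:]

print(f"{len(train)} train / {len(test)} test examples, e.g. {train[0]}")
```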

Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods

T Hagendorff - arXiv preprint arXiv:2303.13988, 2023 - cybershafarat.com
Large language models (LLMs) are currently at the forefront of intertwining AI systems with
human communication and everyday life. Due to rapid technological advances and their …

Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
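
A minimal sparse autoencoder sketch in PyTorch may help fix ideas. Gemma Scope itself trains JumpReLU SAEs on Gemma 2 activations; the simplified variant below uses a plain ReLU encoder with an L1 sparsity penalty, and all sizes and coefficients are made up for illustration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # activations -> features
        self.decoder = nn.Linear(d_hidden, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse, non-negative codes
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_model=256, d_hidden=4096)  # overcomplete dictionary
x = torch.randn(32, 256)  # stand-in for a batch of LLM activations

recon, features = sae(x)
l1_coeff = 1e-3  # sparsity/reconstruction trade-off (assumed value)
loss = ((recon - x) ** 2).mean() + l1_coeff * features.abs().sum(-1).mean()
loss.backward()
```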

Tracr: Compiled transformers as a laboratory for interpretability

D Lindner, J Kramár, S Farquhar… - Advances in …, 2024 - proceedings.neurips.cc
We show how to "compile" human-readable programs into standard decoder-only
transformer models. Our compiler, Tracr, generates models with known structure. This …
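
A usage sketch based on Tracr's public README (exact signatures may differ across versions): write a RASP program, then compile it into a concrete transformer with known weights.

```python
from tracr.rasp import rasp
from tracr.compiler import compiling

# RASP program computing sequence length: select every position, then count.
all_true = rasp.Select(rasp.tokens, rasp.tokens, rasp.Comparison.TRUE)
length = rasp.SelectorWidth(all_true)

# Compile to a standard decoder-only transformer with known structure.
model = compiling.compile_rasp_to_model(
    length,
    vocab={"a", "b", "c"},
    max_seq_len=5,
    compiler_bos="BOS",
)

out = model.apply(["BOS", "a", "b", "b"])
print(out.decoded)  # per-position outputs of the compiled program
```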

Applying interpretable machine learning in computational biology—pitfalls, recommendations and opportunities for new developments

V Chen, M Yang, W Cui, JS Kim, A Talwalkar, J Ma - Nature methods, 2024 - nature.com
Recent advances in machine learning have enabled the development of next-generation
predictive models for complex computational biology problems, thereby spurring the use of …

Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla

T Lieberum, M Rahtz, J Kramár, N Nanda… - arXiv preprint arXiv …, 2023 - arxiv.org
Circuit analysis is a promising technique for understanding the internal mechanisms
of language models. However, existing analyses are done in small models far from the state …
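
Circuit analysis commonly relies on activation patching: cache an intermediate activation from a "clean" run, then overwrite the corresponding activation during a "corrupted" run to measure its causal effect. A self-contained sketch follows; the toy MLP and random inputs are stand-ins, not Chinchilla or the paper's multiple-choice setup.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
layer = model[1]  # patch at the post-ReLU hidden activation

clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)

cache = {}
def save_hook(mod, inp, out):
    cache["act"] = out.detach()  # cache the clean activation

def patch_hook(mod, inp, out):
    return cache["act"]  # replace the corrupted activation with the clean one

h = layer.register_forward_hook(save_hook)
model(clean)
h.remove()

h = layer.register_forward_hook(patch_hook)
patched_out = model(corrupted)
h.remove()

# The gap between the corrupted and patched outputs estimates how much this
# activation mediates the behavior under study.
print(model(corrupted) - patched_out)
```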

Sparse autoencoders find highly interpretable features in language models

H Cunningham, A Ewart, L Riggs, R Huben… - arXiv preprint arXiv …, 2023 - arxiv.org
One of the roadblocks to a better understanding of neural networks' internals is
polysemanticity, where neurons appear to activate in multiple, semantically distinct …