Interpretability research of deep learning: A literature survey

B Xu, G Yang - Information Fusion, 2024 - Elsevier
Deep learning (DL) has been widely used in various fields. However, its black-box nature
limits people's understanding and trust in its decision-making process. Therefore, it becomes …

Not all language model features are linear

J Engels, EJ Michaud, I Liao, W Gurnee… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent work has proposed that language models perform computation by manipulating
one-dimensional representations of concepts ("features") in activation space. In contrast, we …

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …

International Scientific Report on the Safety of Advanced AI (Interim Report)

Y Bengio, S Mindermann, D Privitera… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …

TracrBench: Generating interpretability testbeds with large language models

H Thurnherr, J Scheurer - arXiv preprint arXiv:2409.13714, 2024 - arxiv.org
Achieving a mechanistic understanding of transformer-based language models is an open
challenge, especially due to their large number of parameters. Moreover, the lack of ground …

Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning

DD Baek, Y Li, M Tegmark - arXiv preprint arXiv:2410.08255, 2024 - arxiv.org
Motivated by interpretability and reliability, we investigate how neural networks represent
knowledge during graph learning. We find hints of universality, where equivalent …

Meta-Designing Quantum Experiments with Language Models

S Arlt, H Duan, F Li, SM Xie, Y Wu, M Krenn - arXiv preprint arXiv …, 2024 - arxiv.org
Artificial Intelligence (AI) has the potential to significantly advance scientific discovery by
finding solutions beyond human capabilities. However, these super-human solutions are …

Rethinking the Relationship between Recurrent and Non-Recurrent Neural Networks: A Study in Sparsity

Q Hershey, R Paffenroth, H Pathak… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural networks (NN) can be divided into two broad categories, recurrent and non-recurrent.
Both types of neural networks are popular and extensively studied, but they are often treated …

Weight-based Decomposition: A Case for Bilinear MLPs

MT Pearce, T Dooms, A Rigg - arXiv preprint arXiv:2406.03947, 2024 - arxiv.org
Gated Linear Units (GLUs) have become a common building block in modern foundation
models. Bilinear layers drop the non-linearity in the "gate" but still have comparable …

International Scientific Report on the Safety of Advanced AI

B Yohsua, P Daniel, B Tamay, B Rishi, C Stephen… - 2024 - hal.science
We are in the midst of a technological revolution that will fundamentally alter the way we live,
work, and relate to one another. Artificial Intelligence (AI) promises to transform many …