Interpretability research of deep learning: A literature survey
B Xu, G Yang - Information Fusion, 2024 - Elsevier
Deep learning (DL) has been widely used in various fields. However, its black-box nature
limits people's understanding and trust in its decision-making process. Therefore, it becomes …
Not all language model features are linear
Recent work has proposed that language models perform computation by manipulating one-dimensional
representations of concepts ("features") in activation space. In contrast, we …
Mechanistic Interpretability for AI Safety--A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …
International Scientific Report on the Safety of Advanced AI (Interim Report)
Y Bengio, S Mindermann, D Privitera… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …
Tracrbench: Generating interpretability testbeds with large language models
H Thurnherr, J Scheurer - arXiv preprint arXiv:2409.13714, 2024 - arxiv.org
Achieving a mechanistic understanding of transformer-based language models is an open
challenge, especially due to their large number of parameters. Moreover, the lack of ground …
Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning
Motivated by interpretability and reliability, we investigate how neural networks represent
knowledge during graph learning. We find hints of universality, where equivalent …
Meta-Designing Quantum Experiments with Language Models
Artificial Intelligence (AI) has the potential to significantly advance scientific discovery by
finding solutions beyond human capabilities. However, these super-human solutions are …
Rethinking the Relationship between Recurrent and Non-Recurrent Neural Networks: A Study in Sparsity
Q Hershey, R Paffenroth, H Pathak… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural networks (NN) can be divided into two broad categories, recurrent and non-recurrent.
Both types of neural networks are popular and extensively studied, but they are often treated …
Weight-based Decomposition: A Case for Bilinear MLPs
Gated Linear Units (GLUs) have become a common building block in modern foundation
models. Bilinear layers drop the non-linearity in the "gate" but still have comparable …
International Scientific Report on the Safety of Advanced AI
We are in the midst of a technological revolution that will fundamentally alter the way we live,
work, and relate to one another. Artificial Intelligence (AI) promises to transform many …