Opening the AI black box: program synthesis via mechanistic interpretability

B Xua, G Yang - Information Fusion, 2024 - Elsevier

Deep learning (DL) has been widely used in various fields. However, its black-box nature
limits people's understanding and trust in its decision-making process. Therefore, it becomes …

Cited by 5 · Related articles

[PDF] arxiv.org

Not all language model features are linear

J Engels, EJ Michaud, I Liao, W Gurnee… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent work has proposed that language models perform computation by manipulating
one-dimensional representations of concepts ("features") in activation space. In contrast, we …

Cited by 27 · Related articles · All 2 versions

[PDF] arxiv.org

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org

Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …

Cited by 63 · Related articles · All 2 versions

[PDF] arxiv.org

International Scientific Report on the Safety of Advanced AI (Interim Report)

Y Bengio, S Mindermann, D Privitera… - arXiv preprint arXiv …, 2024 - arxiv.org

This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …

Cited by 19 · Related articles · All 3 versions

[PDF] arxiv.org

Tracrbench: Generating interpretability testbeds with large language models

H Thurnherr, J Scheurer - arXiv preprint arXiv:2409.13714, 2024 - arxiv.org

Achieving a mechanistic understanding of transformer-based language models is an open
challenge, especially due to their large number of parameters. Moreover, the lack of ground …

Cited by 1 · Related articles · All 3 versions

[PDF] arxiv.org

Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning

DD Baek, Y Li, M Tegmark - arXiv preprint arXiv:2410.08255, 2024 - arxiv.org

Motivated by interpretability and reliability, we investigate how neural networks represent
knowledge during graph learning. We find hints of universality, where equivalent …

Meta-Designing Quantum Experiments with Language Models

S Arlt, H Duan, F Li, SM Xie, Y Wu, M Krenn - arXiv preprint arXiv …, 2024 - arxiv.org

Artificial Intelligence (AI) has the potential to significantly advance scientific discovery by
finding solutions beyond human capabilities. However, these super-human solutions are …

Cited by 1 · Related articles · All 3 versions

[PDF] arxiv.org

Cited by 4 · Related articles · All 2 versions