One fits all: Power general time series analysis by pretrained LM

T Zhou, P Niu, L Sun, R Jin - Advances in neural …, 2023 - proceedings.neurips.cc
Although we have witnessed great success of pre-trained models in natural language
processing (NLP) and computer vision (CV), limited progress has been made for general …

Explainability for large language models: A survey

H Zhao, H Chen, F Yang, N Liu, H Deng, H Cai… - ACM Transactions on …, 2024 - dl.acm.org
Large language models (LLMs) have demonstrated impressive capabilities in natural
language processing. However, their internal mechanisms are still unclear, and this lack of …

Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

C Burns, P Izmailov, JH Kirchner, B Baker… - arXiv preprint arXiv …, 2023 - arxiv.org
Widely used alignment techniques, such as reinforcement learning from human feedback
(RLHF), rely on the ability of humans to supervise model behavior, for example, to evaluate …

How do transformers learn topic structure: Towards a mechanistic understanding

Y Li, Y Li, A Risteski - International Conference on Machine …, 2023 - proceedings.mlr.press
While the successes of transformers across many domains are indisputable, accurate
understanding of the learning mechanics is still largely lacking. Their capabilities have been …

Consciousness in artificial intelligence: Insights from the science of consciousness

P Butlin, R Long, E Elmoznino, Y Bengio… - arXiv preprint arXiv …, 2023 - arxiv.org
Whether current or near-term AI systems could be conscious is a topic of scientific interest
and increasing public concern. This report argues for, and exemplifies, a rigorous and …

Inductive biases and variable creation in self-attention mechanisms

BL Edelman, S Goel, S Kakade… - … on Machine Learning, 2022 - proceedings.mlr.press
Self-attention, an architectural motif designed to model long-range interactions in sequential
data, has driven numerous recent breakthroughs in natural language processing and …

AttentionViz: A global view of transformer attention

C Yeh, Y Chen, A Wu, C Chen, F Viégas… - … on Visualization and …, 2023 - ieeexplore.ieee.org
Transformer models are revolutionizing machine learning, but their inner workings remain
mysterious. In this work, we present a new visualization technique designed to help …

A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity

A Lee, X Bai, I Pres, M Wattenberg… - arXiv preprint arXiv …, 2024 - arxiv.org
While alignment algorithms are now commonly used to tune pre-trained language models
towards a user's preferences, we lack explanations for the underlying mechanisms in which …

MoE-Mamba: Efficient selective state space models with mixture of experts

M Pióro, K Ciebiera, K Król, J Ludziejewski… - arXiv preprint arXiv …, 2024 - arxiv.org
State Space Models (SSMs) have become serious contenders in the field of sequential
modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts …

Scaling laws and interpretability of learning from repeated data

D Hernandez, T Brown, T Conerly, N DasSarma… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent large language models have been trained on vast datasets, but also often on
repeated data, either intentionally for the purpose of upweighting higher quality data, or …