Priors in Bayesian deep learning: A review

V Fortuin - International Statistical Review, 2022 - Wiley Online Library
While the choice of prior is one of the most critical parts of the Bayesian inference workflow,
recent Bayesian deep learning models have often fallen back on vague priors, such as …
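For context, the prior enters the Bayesian workflow through the posterior update; a minimal statement of the role it plays (our notation, with an isotropic Gaussian shown only as a common default choice, not necessarily the review's example):

\[
  p(\theta \mid \mathcal{D}) \;\propto\; p(\mathcal{D} \mid \theta)\, p(\theta),
  \qquad \text{e.g. } p(\theta) = \mathcal{N}(\theta;\, 0,\, \sigma^2 I),
\]

where \(\theta\) collects the network weights and the choice of \(p(\theta)\) is the modelling decision the review surveys.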

Rigor with machine learning from field theory to the Poincaré conjecture

S Gukov, J Halverson, F Ruehle - Nature Reviews Physics, 2024 - nature.com
Despite their successes, machine learning techniques are often stochastic, error-prone and
black-box. How could they then be used in fields such as theoretical physics and pure …

Max-margin token selection in attention mechanism

D Ataee Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
The attention mechanism is a central component of the transformer architecture, which led to the
phenomenal success of large language models. However, the theoretical principles …

The shaped transformer: Attention models in the infinite depth-and-width limit

L Noci, C Li, M Li, B He, T Hofmann… - Advances in …, 2024 - proceedings.neurips.cc
In deep learning theory, the covariance matrix of the representations serves as a proxy to
examine the network's trainability. Motivated by the success of Transformers, we study the …
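As a pointer to the object this snippet refers to, one common definition of the representation covariance used as a trainability proxy in infinite-width/depth analyses (our notation, not necessarily the paper's exact setup):

\[
  \Sigma^{(\ell)}_{\alpha\beta} \;=\; \frac{1}{n}\sum_{i=1}^{n} h^{(\ell)}_i(x_\alpha)\, h^{(\ell)}_i(x_\beta),
\]

where \(h^{(\ell)}(x_\alpha) \in \mathbb{R}^n\) is the width-\(n\) representation of input \(x_\alpha\) at depth \(\ell\); if \(\Sigma^{(\ell)}\) degenerates as \(\ell\) grows (e.g. all inputs collapse onto the same direction), gradients carry little signal and training stalls.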

What can a single attention layer learn? A study through the random features lens

H Fu, T Guo, Y Bai, S Mei - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Attention layers, which map a sequence of inputs to a sequence of outputs, are core
building blocks of the Transformer architecture, which has achieved significant …
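For readers unfamiliar with the object under study, a minimal sketch of the sequence-to-sequence map computed by one softmax attention layer; the dimensions and Gaussian weight initialisation below are illustrative, not the paper's exact random-features construction:

```python
import torch
import torch.nn.functional as F

def single_attention_layer(X, Wq, Wk, Wv):
    """Map a sequence X of shape (N, d) to an output sequence of shape (N, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # per-token query/key/value projections
    scores = (Q @ K.T) / (K.shape[-1] ** 0.5)  # (N, N) scaled dot-product similarities
    A = F.softmax(scores, dim=-1)              # each output token attends over all inputs
    return A @ V                               # convex combinations of value vectors

N, d = 16, 32                                  # illustrative sequence length and width
X = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
out = single_attention_layer(X, Wq, Wk, Wv)    # shape (16, 32)
```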

Inductive biases and variable creation in self-attention mechanisms

BL Edelman, S Goel, S Kakade… - … on Machine Learning, 2022 - proceedings.mlr.press
Self-attention, an architectural motif designed to model long-range interactions in sequential
data, has driven numerous recent breakthroughs in natural language processing and …

A kernel-based view of language model fine-tuning

S Malladi, A Wettig, D Yu, D Chen… - … on Machine Learning, 2023 - proceedings.mlr.press
It has become standard to solve NLP tasks by fine-tuning pre-trained language models
(LMs), especially in low-data settings. There is minimal theoretical understanding of …
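The kernel in question is typically the empirical neural tangent kernel of the pre-trained model; one standard formulation (our notation, not necessarily the paper's exact variant):

\[
  K(x, x') \;=\; \nabla_\theta f(x; \theta_0)^{\top}\, \nabla_\theta f(x'; \theta_0),
\]

where \(f(\cdot;\theta_0)\) is the pre-trained LM's output at the pre-trained parameters \(\theta_0\); under this view, fine-tuning behaves approximately like kernel regression with \(K\) on the downstream task.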

Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer

G Yang, EJ Hu, I Babuschkin, S Sidor, X Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for
neural networks (NNs) with billions of parameters. We show that, in the recently discovered …
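A minimal sketch of the zero-shot transfer recipe, assuming the authors' open-source `mup` package; the architecture, widths and learning rate below are illustrative only:

```python
import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

def make_mlp(width: int) -> nn.Sequential:
    # Output layer is a MuReadout so its scaling follows the muP parametrization.
    return nn.Sequential(
        nn.Linear(784, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        MuReadout(width, 10),
    )

# Base and delta models only tell mup how shapes scale with width; they are never trained.
base, delta, target = make_mlp(64), make_mlp(128), make_mlp(4096)
set_base_shapes(target, base, delta=delta)

# Under muP, hyperparameters tuned on a narrow proxy model (e.g. this learning rate)
# can be reused on the wide target model without re-tuning.
optimizer = MuAdam(target.parameters(), lr=1e-3)
```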

What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization

Y Zhang, F Zhang, Z Yang, Z Wang - arXiv preprint arXiv:2305.19420, 2023 - arxiv.org
In this paper, we conduct a comprehensive study of In-Context Learning (ICL) by addressing
several open questions: (a) What type of ICL estimator is learned by large language …
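The Bayesian model averaging view referenced in the title can be summarised by the predictive rule below (our notation; here \(\mathcal{D}\) denotes the in-context demonstrations and \(m\) a latent task or model):

\[
  p(y \mid x, \mathcal{D}) \;=\; \sum_{m} p(y \mid x, m)\, p(m \mid \mathcal{D}),
\]

i.e. the prediction for a query \(x\) averages over latent models \(m\), reweighted by how well each explains the demonstrations in the prompt.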