Priors in Bayesian deep learning: A review

V Fortuin - International Statistical Review, 2022 - Wiley Online Library
While the choice of prior is one of the most critical parts of the Bayesian inference workflow,
recent Bayesian deep learning models have often fallen back on vague priors, such as …
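For context, the prior enters the Bayesian workflow through the posterior update; a minimal statement of the role it plays (our notation, with an isotropic Gaussian shown only as a common default choice, not necessarily the review's example):

\[
  p(\theta \mid \mathcal{D}) \;\propto\; p(\mathcal{D} \mid \theta)\, p(\theta),
  \qquad \text{e.g. } p(\theta) = \mathcal{N}(\theta;\, 0,\, \sigma^2 I),
\]

where \(\theta\) collects the network weights and the choice of \(p(\theta)\) is the modelling decision the review surveys.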

Rigor with machine learning from field theory to the Poincaré conjecture

S Gukov, J Halverson, F Ruehle - Nature Reviews Physics, 2024 - nature.com
Despite their successes, machine learning techniques are often stochastic, error-prone and
black-box. How could they then be used in fields such as theoretical physics and pure …

Max-margin token selection in attention mechanism

D Ataee Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
The attention mechanism is a central component of the transformer architecture, which led to the
phenomenal success of large language models. However, the theoretical principles …

The shaped transformer: Attention models in the infinite depth-and-width limit

L Noci, C Li, M Li, B He, T Hofmann… - Advances in …, 2024 - proceedings.neurips.cc
In deep learning theory, the covariance matrix of the representations serves as a proxy to
examine the network's trainability. Motivated by the success of Transformers, we study the …
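As a pointer to the object this snippet refers to, one common definition of the representation covariance used as a trainability proxy in infinite-width/depth analyses (our notation, not necessarily the paper's exact setup):

\[
  \Sigma^{(\ell)}_{\alpha\beta} \;=\; \frac{1}{n}\sum_{i=1}^{n} h^{(\ell)}_i(x_\alpha)\, h^{(\ell)}_i(x_\beta),
\]

where \(h^{(\ell)}(x_\alpha) \in \mathbb{R}^n\) is the width-\(n\) representation of input \(x_\alpha\) at depth \(\ell\); if \(\Sigma^{(\ell)}\) degenerates as \(\ell\) grows (e.g. all inputs collapse onto the same direction), gradients carry little signal and training stalls.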

What can a single attention layer learn? A study through the random features lens

H Fu, T Guo, Y Bai, S Mei - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Attention layers, which map a sequence of inputs to a sequence of outputs, are core
building blocks of the Transformer architecture, which has achieved significant …
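For readers unfamiliar with the object under study, a minimal sketch of the sequence-to-sequence map computed by one softmax attention layer; the dimensions and Gaussian weight initialisation below are illustrative, not the paper's exact random-features construction:

```python
import torch
import torch.nn.functional as F

def single_attention_layer(X, Wq, Wk, Wv):
    """Map a sequence X of shape (N, d) to an output sequence of shape (N, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # per-token query/key/value projections
    scores = (Q @ K.T) / (K.shape[-1] ** 0.5)  # (N, N) scaled dot-product similarities
    A = F.softmax(scores, dim=-1)              # each output token attends over all inputs
    return A @ V                               # convex combinations of value vectors

N, d = 16, 32                                  # illustrative sequence length and width
X = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
out = single_attention_layer(X, Wq, Wk, Wv)    # shape (16, 32)
```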

Inductive biases and variable creation in self-attention mechanisms

BL Edelman, S Goel, S Kakade… - … on Machine Learning, 2022 - proceedings.mlr.press
Self-attention, an architectural motif designed to model long-range interactions in sequential
data, has driven numerous recent breakthroughs in natural language processing and …

A kernel-based view of language model fine-tuning

S Malladi, A Wettig, D Yu, D Chen… - … on Machine Learning, 2023 - proceedings.mlr.press
It has become standard to solve NLP tasks by fine-tuning pre-trained language models
(LMs), especially in low-data settings. There is minimal theoretical understanding of …
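The kernel in question is typically the empirical neural tangent kernel of the pre-trained model; one standard formulation (our notation, not necessarily the paper's exact variant):

\[
  K(x, x') \;=\; \nabla_\theta f(x; \theta_0)^{\top}\, \nabla_\theta f(x'; \theta_0),
\]

where \(f(\cdot;\theta_0)\) is the pre-trained LM's output at the pre-trained parameters \(\theta_0\); under this view, fine-tuning behaves approximately like kernel regression with \(K\) on the downstream task.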

Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer

G Yang, EJ Hu, I Babuschkin, S Sidor, X Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for
neural networks (NNs) with billions of parameters. We show that, in the recently discovered …
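A minimal sketch of the zero-shot transfer recipe, assuming the authors' open-source `mup` package; the architecture, widths and learning rate below are illustrative only:

```python
import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

def make_mlp(width: int) -> nn.Sequential:
    # Output layer is a MuReadout so its scaling follows the muP parametrization.
    return nn.Sequential(
        nn.Linear(784, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        MuReadout(width, 10),
    )

# Base and delta models only tell mup how shapes scale with width; they are never trained.
base, delta, target = make_mlp(64), make_mlp(128), make_mlp(4096)
set_base_shapes(target, base, delta=delta)

# Under muP, hyperparameters tuned on a narrow proxy model (e.g. this learning rate)
# can be reused on the wide target model without re-tuning.
optimizer = MuAdam(target.parameters(), lr=1e-3)
```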

What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization

Y Zhang, F Zhang, Z Yang, Z Wang - arXiv preprint arXiv:2305.19420, 2023 - arxiv.org
In this paper, we conduct a comprehensive study of In-Context Learning (ICL) by addressing
several open questions: (a) What type of ICL estimator is learned by large language …
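The Bayesian model averaging view referenced in the title can be summarised by the predictive rule below (our notation; here \(\mathcal{D}\) denotes the in-context demonstrations and \(m\) a latent task or model):

\[
  p(y \mid x, \mathcal{D}) \;=\; \sum_{m} p(y \mid x, m)\, p(m \mid \mathcal{D}),
\]

i.e. the prediction for a query \(x\) averages over latent models \(m\), reweighted by how well each explains the demonstrations in the prompt.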