Transformers learn in-context by gradient descent
J Von Oswald, E Niklasson… - International …, 2023 - proceedings.mlr.press
At present, the mechanisms of in-context learning in Transformers are not well understood
and remain mostly an intuition. In this paper, we suggest that training Transformers on auto …
Ties-merging: Resolving interference when merging models
Transfer learning, i.e., further fine-tuning a pre-trained model on a downstream task, can
confer significant advantages, including improved downstream performance, faster …
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned
on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further …
Patching open-vocabulary models by interpolating weights
Open-vocabulary models like CLIP achieve high accuracy across many image classification
tasks. However, there are still settings where their zero-shot performance is far from optimal …
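The title points to patching by interpolating in weight space between the original zero-shot model and a task-specific fine-tuned copy. Below is a minimal sketch of that idea, not the paper's exact procedure; the names zero_shot_sd, finetuned_sd, and alpha are illustrative assumptions.

```python
def interpolate_weights(zero_shot_sd, finetuned_sd, alpha=0.5):
    """Sketch of weight-space patching: theta = (1 - alpha) * theta_zs + alpha * theta_ft.

    Both arguments are assumed to be state dicts (parameter name -> tensor/array)
    of the same architecture; alpha would typically be chosen on held-out data.
    """
    return {
        name: (1.0 - alpha) * zero_shot_sd[name] + alpha * finetuned_sd[name]
        for name in zero_shot_sd
    }

# Hypothetical usage with a PyTorch model class `Model`:
# patched = Model()
# patched.load_state_dict(interpolate_weights(zs.state_dict(), ft.state_dict(), alpha=0.8))
```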
Branch-train-merge: Embarrassingly parallel training of expert language models
We present Branch-Train-Merge (BTM), a communication-efficient algorithm for
embarrassingly parallel training of large language models (LLMs). We show it is possible to …
Revisiting weighted aggregation in federated learning with neural networks
In federated learning (FL), weighted aggregation of local models is conducted to generate a
global model, and the aggregation weights are normalized (the sum of weights is 1) and …
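The snippet describes building the global model as a weighted aggregate of local client models, with the weights normalized to sum to 1. A minimal FedAvg-style sketch under that reading follows; the function name and arguments are illustrative, not the paper's API.

```python
def aggregate(client_state_dicts, client_sizes):
    """Sketch of weighted aggregation: average client parameters with normalized weights.

    client_state_dicts: list of state dicts (parameter name -> tensor/array), one per client.
    client_sizes: e.g. local dataset sizes, used here to derive the aggregation weights.
    """
    total = float(sum(client_sizes))
    weights = [n / total for n in client_sizes]  # normalized so that sum(weights) == 1
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, client_state_dicts))
        for name in client_state_dicts[0]
    }
```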
Model ratatouille: Recycling diverse models for out-of-distribution generalization
Foundation models are redefining how AI systems are built. Practitioners now follow a
standard procedure to build their machine learning solutions: from a pre-trained foundation …
Permutation equivariant neural functionals
This work studies the design of neural networks that can process the weights or gradients of
other neural networks, which we refer to as neural functional networks (NFNs). Despite a …
Equivariant architectures for learning in deep weight spaces
Designing machine learning architectures for processing neural networks in their raw weight
matrix form is a newly introduced research direction. Unfortunately, the unique symmetry …
Mechanistic mode connectivity
We study neural network loss landscapes through the lens of mode connectivity, the
observation that minimizers of neural networks retrieved via training on a dataset are …