Transformers learn in-context by gradient descent

J Von Oswald, E Niklasson… - International …, 2023 - proceedings.mlr.press
At present, the mechanisms of in-context learning in Transformers are not well understood
and remain largely a matter of intuition. In this paper, we suggest that training Transformers on auto …
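
A minimal sketch of the paper's headline claim, for the special case of in-context linear regression: one gradient-descent step from zero weights is exactly a softmax-free (linear) attention read-out with the context labels as values. Variable names below are ours, not the paper's.

```python
import numpy as np

# One GD step on L(theta) = 0.5 * sum_i (theta @ x_i - y_i)^2, starting
# from theta = 0, equals a linear-attention prediction with values y_i,
# keys x_i, and query x_test (the construction the paper builds on).
rng = np.random.default_rng(0)
d, n, eta = 4, 32, 0.1

X = rng.normal(size=(n, d))                  # in-context inputs
y = X @ rng.normal(size=d)                   # in-context targets
x_test = rng.normal(size=d)

theta = eta * (y @ X)                        # theta after one GD step from 0
pred_gd = theta @ x_test
pred_attn = eta * np.sum(y * (X @ x_test))   # linear attention read-out

assert np.allclose(pred_gd, pred_attn)
```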

TIES-Merging: Resolving interference when merging models

P Yadav, D Tam, L Choshen… - Advances in Neural …, 2024 - proceedings.neurips.cc
Transfer learning, i.e., further fine-tuning a pre-trained model on a downstream task, can
confer significant advantages, including improved downstream performance, faster …
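
A hedged sketch of the TIES recipe on flat parameter vectors (trim small task-vector entries, elect a per-parameter sign, then merge only the agreeing values); `ties_merge` and its arguments are illustrative names, not the authors' reference code.

```python
import numpy as np

def ties_merge(pretrained, finetuned_list, density=0.2, lam=1.0):
    """Trim-Elect-Sign-Merge on flat parameter vectors (illustrative)."""
    tvs = [ft - pretrained for ft in finetuned_list]       # task vectors
    trimmed = []
    for tv in tvs:
        k = max(1, int(density * tv.size))                 # keep top-k by magnitude
        thresh = np.sort(np.abs(tv))[-k]
        trimmed.append(np.where(np.abs(tv) >= thresh, tv, 0.0))
    stack = np.stack(trimmed)
    elected = np.sign(stack.sum(axis=0))                   # elect per-parameter sign
    agree = (np.sign(stack) == elected) & (stack != 0)     # drop disagreeing values
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_tv = (stack * agree).sum(axis=0) / counts       # mean of agreeing values
    return pretrained + lam * merged_tv
```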

Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

A Rame, G Couairon, C Dancette… - Advances in …, 2024 - proceedings.neurips.cc
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned
on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further …
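
The interpolation at the heart of the approach fits in a few lines; the sketch below assumes each reward-specific policy's weights are already flattened into one vector, and `rewarded_soup` is our name for it.

```python
import numpy as np

def rewarded_soup(thetas, lambdas):
    """Interpolate K reward-specific fine-tunes with simplex coefficients.
    Sweeping lambdas traces an approximate Pareto front between the
    rewards at inference time, with no retraining."""
    lambdas = np.asarray(lambdas, dtype=float)
    assert np.isclose(lambdas.sum(), 1.0) and (lambdas >= 0).all()
    return sum(l * th for l, th in zip(lambdas, thetas))
```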

Patching open-vocabulary models by interpolating weights

G Ilharco, M Wortsman, SY Gadre… - Advances in …, 2022 - proceedings.neurips.cc
Open-vocabulary models like CLIP achieve high accuracy across many image classification
tasks. However, there are still settings where their zero-shot performance is far from optimal …
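
The patching operation itself is a one-line interpolation between zero-shot and fine-tuned weights; the sketch below uses our naming, with alpha to be selected on held-out data.

```python
def patch(theta_zeroshot, theta_finetuned, alpha):
    """Weight-space patching: a single coefficient alpha, chosen on
    held-out data, trades accuracy on the patching task against
    accuracy on tasks where zero-shot performance is already good."""
    return (1 - alpha) * theta_zeroshot + alpha * theta_finetuned
```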

Branch-train-merge: Embarrassingly parallel training of expert language models

M Li, S Gururangan, T Dettmers, M Lewis… - arXiv preprint arXiv …, 2022 - arxiv.org
We present Branch-Train-Merge (BTM), a communication-efficient algorithm for
embarrassingly parallel training of large language models (LLMs). We show it is possible to …
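
A toy sketch of the branch-train-merge loop under strong simplifying assumptions: flat weight vectors, a hypothetical `grad_fn` standing in for each expert's training signal, and plain parameter averaging as the merge step (the paper also considers ensembling the experts).

```python
import numpy as np

def branch_train_merge(seed, domains, grad_fn, lr=0.1, steps=100):
    """Branch a seed model's weights, train one expert per domain with
    no cross-worker communication, then merge by parameter averaging."""
    experts = []
    for dom in domains:                      # embarrassingly parallel in practice
        w = seed.copy()                      # branch
        for _ in range(steps):               # train
            w = w - lr * grad_fn(dom, w)
        experts.append(w)
    return np.mean(np.stack(experts), axis=0)   # merge
```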

Revisiting weighted aggregation in federated learning with neural networks

Z Li, T Lin, X Shang, C Wu - International Conference on …, 2023 - proceedings.mlr.press
In federated learning (FL), weighted aggregation of local models is conducted to generate a
global model, and the aggregation weights are normalized (the sum of weights is 1) and …
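
A minimal FedAvg-style sketch matching the description above: the global model is a convex combination of local models with normalized weights, here taken proportional to client dataset sizes (our simplification).

```python
import numpy as np

def aggregate(local_models, num_examples):
    """Weighted aggregation of local models into a global model; the
    aggregation weights are normalized so they sum to 1."""
    w = np.asarray(num_examples, dtype=float)
    w = w / w.sum()                          # normalized aggregation weights
    return sum(wi * theta for wi, theta in zip(w, local_models))
```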

Model ratatouille: Recycling diverse models for out-of-distribution generalization

A Ramé, K Ahuja, J Zhang, M Cord… - International …, 2023 - proceedings.mlr.press
Foundation models are redefining how AI systems are built. Practitioners now follow a
standard procedure to build their machine learning solutions: from a pre-trained foundation …

Permutation equivariant neural functionals

A Zhou, K Yang, K Burns, A Cardace… - Advances in Neural …, 2024 - proceedings.neurips.cc
This work studies the design of neural networks that can process the weights or gradients of
other neural networks, which we refer to as neural functional networks (NFNs). Despite a …
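
The symmetry these architectures are designed around is easy to verify directly: permuting an MLP's hidden units reshuffles the raw weights without changing the computed function, so a network that consumes weights should respect that permutation. A self-contained check (names ours):

```python
import numpy as np

# Permuting the hidden neurons of a 2-layer MLP (rows of W1 and b1,
# columns of W2) changes the raw weights but not the function; NFN
# layers are constrained to be equivariant to exactly this symmetry.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2 = rng.normal(size=(3, 8))
x = rng.normal(size=4)

P = np.eye(8)[rng.permutation(8)]          # random permutation matrix

f = W2 @ np.tanh(W1 @ x + b1)
f_perm = (W2 @ P.T) @ np.tanh(P @ W1 @ x + P @ b1)

assert np.allclose(f, f_perm)              # same function, permuted weights
```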

Equivariant architectures for learning in deep weight spaces

A Navon, A Shamsian, I Achituve… - International …, 2023 - proceedings.mlr.press
Designing machine learning architectures for processing neural networks in their raw weight
matrix form is a newly introduced research direction. Unfortunately, the unique symmetry …

Mechanistic mode connectivity

ES Lubana, EJ Bigelow, RP Dick… - International …, 2023 - proceedings.mlr.press
We study neural network loss landscapes through the lens of mode connectivity, the
observation that minimizers of neural networks retrieved via training on a dataset are …
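
A common probe in this literature, sketched under our own naming: evaluate the loss along the straight line between two trained solutions and report the height of the barrier above the worse endpoint.

```python
import numpy as np

def linear_barrier(theta_a, theta_b, loss_fn, num_points=11):
    """Loss along the straight line between two trained solutions.
    A near-zero barrier suggests the two modes are linearly connected."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = [loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas]
    return max(losses) - max(losses[0], losses[-1]), losses
```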