Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

C Burns, P Izmailov, JH Kirchner, B Baker… - arXiv preprint arXiv …, 2023 - arxiv.org
Widely used alignment techniques, such as reinforcement learning from human feedback
(RLHF), rely on the ability of humans to supervise model behavior, for example, to evaluate …
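
The title describes eliciting strong capabilities with weak supervision. A minimal sketch of that setup, under the assumption that it amounts to fine-tuning a large pretrained ("strong") model on labels produced by a smaller ("weak") supervisor; the function and the classification framing are illustrative, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def weak_to_strong_finetune(weak_model, strong_model, unlabeled_batches, optimizer, steps=1000):
    """Train the strong model to imitate (possibly noisy) weak-model labels.

    `optimizer` is assumed to hold the strong model's parameters.
    """
    weak_model.eval()
    strong_model.train()
    for step in range(steps):
        x = unlabeled_batches[step % len(unlabeled_batches)]
        with torch.no_grad():
            weak_labels = weak_model(x).argmax(dim=-1)  # hard labels from the weak supervisor
        logits = strong_model(x)
        loss = F.cross_entropy(logits, weak_labels)     # imitate the weak labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```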

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

W Fedus, B Zoph, N Shazeer - Journal of Machine Learning Research, 2022 - jmlr.org
In deep learning, models typically reuse the same parameters for all inputs. Mixture of
Experts (MoE) models defy this and instead select different parameters for each incoming …
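
A minimal sketch of the per-token parameter selection the snippet describes, in the Switch style of top-1 routing: a learned router picks one feed-forward "expert" per token, so different inputs exercise different parameters. Sizes and the dense loop over experts are illustrative simplifications, not the paper's implementation:

```python
import torch
import torch.nn as nn

class Top1MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (num_tokens, d_model)
        gates = torch.softmax(self.router(x), dim=-1)
        top_gate, top_idx = gates.max(dim=-1)   # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # scale each token's expert output by its gate value
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```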

Mixture-of-experts with expert choice routing

Y Zhou, T Lei, H Liu, N Du, Y Huang… - Advances in …, 2022 - proceedings.neurips.cc
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to
greatly increase while keeping the amount of computation for a given token or a given …
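
A hedged sketch of the routing idea the title names: rather than each token choosing an expert, each expert selects its own top-k tokens from the batch, which keeps per-expert load (and thus compute per token budget) fixed. Shapes and the capacity parameter are assumptions for illustration:

```python
import torch

def expert_choice_assignment(token_reprs, router_weights, capacity):
    # token_reprs: (num_tokens, d_model); router_weights: (d_model, num_experts)
    scores = torch.softmax(token_reprs @ router_weights, dim=-1)  # token-to-expert affinities
    topk_scores, topk_tokens = scores.topk(capacity, dim=0)       # each expert picks `capacity` tokens
    return topk_scores, topk_tokens                               # both (capacity, num_experts)
```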

Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models

J Ni, GH Abrego, N Constant, J Ma, KB Hall… - arXiv preprint arXiv …, 2021 - arxiv.org
We provide the first exploration of sentence embeddings from text-to-text transformers (T5).
Sentence embeddings are broadly useful for language processing tasks. While T5 achieves …
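
One plausible way to turn a text-to-text (T5-style) model into a sentence encoder, in the spirit of the snippet, is to mean-pool the encoder's token representations. Whether this matches the paper's best-performing variant is not claimed here; the checkpoint name and pooling choice are assumptions:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")

def sentence_embedding(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state         # (1, seq_len, d_model)
    mask = inputs["attention_mask"].unsqueeze(-1).float()    # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # mean over real tokens
```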

CodeXGLUE: A machine learning benchmark dataset for code understanding and generation

S Lu, D Guo, S Ren, J Huang, A Svyatkovskiy… - arXiv preprint arXiv …, 2021 - arxiv.org
Benchmark datasets have a significant impact on accelerating research in programming
language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster …

DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

S Rajbhandari, C Li, Z Yao, M Zhang… - International …, 2022 - proceedings.mlr.press
As the training of giant dense models hits the limits of today's hardware availability and
capability, Mixture-of-Experts (MoE) models have become one of the …

Compacter: Efficient low-rank hypercomplex adapter layers

R Karimi Mahabadi, J Henderson… - Advances in Neural …, 2021 - proceedings.neurips.cc
Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the
standard method for achieving state-of-the-art performance on NLP benchmarks. However …
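
The title points to low-rank hypercomplex adapter layers as a parameter-efficient alternative to full fine-tuning. A simplified sketch under that reading: small bottleneck modules inserted into a frozen model, with adapter weights built from Kronecker products of a shared small factor and low-rank factors. Hyperparameters and the exact parameterization are illustrative assumptions, not the paper's full method:

```python
import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    """Weight = sum_i A_i ⊗ (s_i @ t_i^T): a few small factors span a large matrix."""
    def __init__(self, in_features, out_features, n=4, rank=1):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.01)
        self.s = nn.Parameter(torch.randn(n, out_features // n, rank) * 0.01)
        self.t = nn.Parameter(torch.randn(n, rank, in_features // n) * 0.01)

    def forward(self, x):
        W = sum(torch.kron(self.A[i], self.s[i] @ self.t[i]) for i in range(self.A.shape[0]))
        return x @ W.T                          # (..., out_features)

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual connection."""
    def __init__(self, d_model=768, bottleneck=48, n=4):
        super().__init__()
        self.down = KroneckerLinear(d_model, bottleneck, n=n)
        self.up = KroneckerLinear(bottleneck, d_model, n=n)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))
```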

Do prompt-based models really understand the meaning of their prompts?

A Webson, E Pavlick - arXiv preprint arXiv:2109.01247, 2021 - arxiv.org
Recently, a boom of papers has shown extraordinary progress in zero-shot and few-shot
learning with various prompt-based models. It is commonly argued that prompts help models …

True few-shot learning with language models

E Perez, D Kiela, K Cho - Advances in neural information …, 2021 - proceedings.neurips.cc
Pretrained language models (LMs) perform well on many tasks even when learning from a
few examples, but prior work uses many held-out examples to tune various aspects of …
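
The snippet's concern is that prompt and hyperparameter choices are tuned on large held-out sets that a true few-shot practitioner would not have. A minimal sketch of the alternative the title suggests, selecting among candidate prompts using only cross-validation within the few labeled examples; `score_prompt` is a hypothetical helper (accuracy of a prompt given train/test splits), not an API from the paper:

```python
from itertools import combinations

def true_few_shot_select(prompts, few_shot_examples, score_prompt, k_held_out=1):
    """Pick the prompt with the best average score over leave-k-out splits."""
    best_prompt, best_score = None, float("-inf")
    for prompt in prompts:
        scores = []
        for held_out in combinations(range(len(few_shot_examples)), k_held_out):
            train = [ex for i, ex in enumerate(few_shot_examples) if i not in held_out]
            test = [few_shot_examples[i] for i in held_out]
            scores.append(score_prompt(prompt, train, test))
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_prompt, best_score = prompt, avg
    return best_prompt
```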

ByT5: Towards a token-free future with pre-trained byte-to-byte models

L Xue, A Barua, N Constant, R Al-Rfou… - Transactions of the …, 2022 - direct.mit.edu
Most widely used pre-trained language models operate on sequences of tokens
corresponding to word or subword units. By comparison, token-free models that operate …
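
A tiny sketch of the token-free idea in the snippet: instead of a learned subword vocabulary, the model consumes raw UTF-8 bytes, so any string maps to integer IDs without a trained tokenizer. The offset that reserves a few IDs for special tokens is a common convention assumed here for illustration:

```python
def bytes_to_ids(text, num_special_tokens=3):
    # each UTF-8 byte becomes one ID, shifted past the reserved special-token IDs
    return [b + num_special_tokens for b in text.encode("utf-8")]

def ids_to_text(ids, num_special_tokens=3):
    return bytes(i - num_special_tokens for i in ids).decode("utf-8", errors="ignore")

print(bytes_to_ids("héllo"))  # 5 characters become 6 byte IDs ("é" is two bytes)
```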