Weak-to-strong generalization: Eliciting strong capabilities with weak supervision
Widely used alignment techniques, such as reinforcement learning from human feedback
(RLHF), rely on the ability of humans to supervise model behavior, for example, to evaluate …
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity
In deep learning, models typically reuse the same parameters for all inputs. Mixture of
Experts (MoE) models defy this and instead select different parameters for each incoming …
Mixture-of-experts with expert choice routing
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to
greatly increase while keeping the amount of computation for a given token or a given …
Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models
We provide the first exploration of sentence embeddings from text-to-text transformers (T5).
Sentence embeddings are broadly useful for language processing tasks. While T5 achieves …
CodeXGLUE: A machine learning benchmark dataset for code understanding and generation
Benchmark datasets have a significant impact on accelerating research in programming
language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster …
DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale
As the training of giant dense models hits the boundary on the availability and capability of
the hardware resources today, Mixture-of-Experts (MoE) models have become one of the …
Compacter: Efficient low-rank hypercomplex adapter layers
R Karimi Mahabadi, J Henderson… - Advances in Neural …, 2021 - proceedings.neurips.cc
Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the
standard method for achieving state-of-the-art performance on NLP benchmarks. However …
Do prompt-based models really understand the meaning of their prompts?
Recently, a boom of papers has shown extraordinary progress in zero-shot and few-shot
learning with various prompt-based models. It is commonly argued that prompts help models …
True few-shot learning with language models
Pretrained language models (LMs) perform well on many tasks even when learning from a
few examples, but prior work uses many held-out examples to tune various aspects of …
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Most widely used pre-trained language models operate on sequences of tokens
corresponding to word or subword units. By comparison, token-free models that operate …