Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

C Burns, P Izmailov, JH Kirchner, B Baker… - arXiv preprint arXiv …, 2023 - arxiv.org
Widely used alignment techniques, such as reinforcement learning from human feedback
(RLHF), rely on the ability of humans to supervise model behavior, for example, to evaluate …
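
The title describes eliciting strong capabilities with weak supervision. A minimal sketch of that setup, under the assumption that it amounts to fine-tuning a large pretrained ("strong") model on labels produced by a smaller ("weak") supervisor; the function and the classification framing are illustrative, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def weak_to_strong_finetune(weak_model, strong_model, unlabeled_batches, optimizer, steps=1000):
    """Train the strong model to imitate (possibly noisy) weak-model labels.

    `optimizer` is assumed to hold the strong model's parameters.
    """
    weak_model.eval()
    strong_model.train()
    for step in range(steps):
        x = unlabeled_batches[step % len(unlabeled_batches)]
        with torch.no_grad():
            weak_labels = weak_model(x).argmax(dim=-1)  # hard labels from the weak supervisor
        logits = strong_model(x)
        loss = F.cross_entropy(logits, weak_labels)     # imitate the weak labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```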

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

W Fedus, B Zoph, N Shazeer - Journal of Machine Learning Research, 2022 - jmlr.org
In deep learning, models typically reuse the same parameters for all inputs. Mixture of
Experts (MoE) models defy this and instead select different parameters for each incoming …
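
A minimal sketch of the per-token parameter selection the snippet describes, in the Switch style of top-1 routing: a learned router picks one feed-forward "expert" per token, so different inputs exercise different parameters. Sizes and the dense loop over experts are illustrative simplifications, not the paper's implementation:

```python
import torch
import torch.nn as nn

class Top1MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (num_tokens, d_model)
        gates = torch.softmax(self.router(x), dim=-1)
        top_gate, top_idx = gates.max(dim=-1)   # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # scale each token's expert output by its gate value
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```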

Mixture-of-experts with expert choice routing

Y Zhou, T Lei, H Liu, N Du, Y Huang… - Advances in …, 2022 - proceedings.neurips.cc
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to
greatly increase while keeping the amount of computation for a given token or a given …
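
A hedged sketch of the routing idea the title names: rather than each token choosing an expert, each expert selects its own top-k tokens from the batch, which keeps per-expert load (and thus compute per token budget) fixed. Shapes and the capacity parameter are assumptions for illustration:

```python
import torch

def expert_choice_assignment(token_reprs, router_weights, capacity):
    # token_reprs: (num_tokens, d_model); router_weights: (d_model, num_experts)
    scores = torch.softmax(token_reprs @ router_weights, dim=-1)  # token-to-expert affinities
    topk_scores, topk_tokens = scores.topk(capacity, dim=0)       # each expert picks `capacity` tokens
    return topk_scores, topk_tokens                               # both (capacity, num_experts)
```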

Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models

J Ni, GH Abrego, N Constant, J Ma, KB Hall… - arXiv preprint arXiv …, 2021 - arxiv.org
We provide the first exploration of sentence embeddings from text-to-text transformers (T5).
Sentence embeddings are broadly useful for language processing tasks. While T5 achieves …
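
One plausible way to turn a text-to-text (T5-style) model into a sentence encoder, in the spirit of the snippet, is to mean-pool the encoder's token representations. Whether this matches the paper's best-performing variant is not claimed here; the checkpoint name and pooling choice are assumptions:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")

def sentence_embedding(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state         # (1, seq_len, d_model)
    mask = inputs["attention_mask"].unsqueeze(-1).float()    # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # mean over real tokens
```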

CodeXGLUE: A machine learning benchmark dataset for code understanding and generation

S Lu, D Guo, S Ren, J Huang, A Svyatkovskiy… - arXiv preprint arXiv …, 2021 - arxiv.org
Benchmark datasets have a significant impact on accelerating research in programming
language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster …

DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

S Rajbhandari, C Li, Z Yao, M Zhang… - International …, 2022 - proceedings.mlr.press
As the training of giant dense models hits the limits of today's hardware availability and
capability, Mixture-of-Experts (MoE) models have become one of the …

Compacter: Efficient low-rank hypercomplex adapter layers

R Karimi Mahabadi, J Henderson… - Advances in Neural …, 2021 - proceedings.neurips.cc
Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the
standard method for achieving state-of-the-art performance on NLP benchmarks. However …
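
The title points to low-rank hypercomplex adapter layers as a parameter-efficient alternative to full fine-tuning. A simplified sketch under that reading: small bottleneck modules inserted into a frozen model, with adapter weights built from Kronecker products of a shared small factor and low-rank factors. Hyperparameters and the exact parameterization are illustrative assumptions, not the paper's full method:

```python
import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    """Weight = sum_i A_i ⊗ (s_i @ t_i^T): a few small factors span a large matrix."""
    def __init__(self, in_features, out_features, n=4, rank=1):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.01)
        self.s = nn.Parameter(torch.randn(n, out_features // n, rank) * 0.01)
        self.t = nn.Parameter(torch.randn(n, rank, in_features // n) * 0.01)

    def forward(self, x):
        W = sum(torch.kron(self.A[i], self.s[i] @ self.t[i]) for i in range(self.A.shape[0]))
        return x @ W.T                          # (..., out_features)

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual connection."""
    def __init__(self, d_model=768, bottleneck=48, n=4):
        super().__init__()
        self.down = KroneckerLinear(d_model, bottleneck, n=n)
        self.up = KroneckerLinear(bottleneck, d_model, n=n)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))
```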

Do prompt-based models really understand the meaning of their prompts?

A Webson, E Pavlick - arXiv preprint arXiv:2109.01247, 2021 - arxiv.org
Recently, a boom of papers has shown extraordinary progress in zero-shot and few-shot
learning with various prompt-based models. It is commonly argued that prompts help models …

True few-shot learning with language models

E Perez, D Kiela, K Cho - Advances in neural information …, 2021 - proceedings.neurips.cc
Pretrained language models (LMs) perform well on many tasks even when learning from a
few examples, but prior work uses many held-out examples to tune various aspects of …
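
The snippet's concern is that prompt and hyperparameter choices are tuned on large held-out sets that a true few-shot practitioner would not have. A minimal sketch of the alternative the title suggests, selecting among candidate prompts using only cross-validation within the few labeled examples; `score_prompt` is a hypothetical helper (accuracy of a prompt given train/test splits), not an API from the paper:

```python
from itertools import combinations

def true_few_shot_select(prompts, few_shot_examples, score_prompt, k_held_out=1):
    """Pick the prompt with the best average score over leave-k-out splits."""
    best_prompt, best_score = None, float("-inf")
    for prompt in prompts:
        scores = []
        for held_out in combinations(range(len(few_shot_examples)), k_held_out):
            train = [ex for i, ex in enumerate(few_shot_examples) if i not in held_out]
            test = [few_shot_examples[i] for i in held_out]
            scores.append(score_prompt(prompt, train, test))
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_prompt, best_score = prompt, avg
    return best_prompt
```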

ByT5: Towards a token-free future with pre-trained byte-to-byte models

L Xue, A Barua, N Constant, R Al-Rfou… - Transactions of the …, 2022 - direct.mit.edu
Most widely used pre-trained language models operate on sequences of tokens
corresponding to word or subword units. By comparison, token-free models that operate …
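
A tiny sketch of the token-free idea in the snippet: instead of a learned subword vocabulary, the model consumes raw UTF-8 bytes, so any string maps to integer IDs without a trained tokenizer. The offset that reserves a few IDs for special tokens is a common convention assumed here for illustration:

```python
def bytes_to_ids(text, num_special_tokens=3):
    # each UTF-8 byte becomes one ID, shifted past the reserved special-token IDs
    return [b + num_special_tokens for b in text.encode("utf-8")]

def ids_to_text(ids, num_special_tokens=3):
    return bytes(i - num_special_tokens for i in ids).decode("utf-8", errors="ignore")

print(bytes_to_ids("héllo"))  # 5 characters become 6 byte IDs ("é" is two bytes)
```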