Efficient large language models: A survey
Large Language Models (LLMs) have demonstrated remarkable capabilities in important
tasks such as natural language understanding and language generation, and thus have the …
Mixture-of-experts with expert choice routing
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to
greatly increase while keeping the amount of computation for a given token or a given …
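The snippet only hints at the routing mechanism, so below is a minimal NumPy sketch of the general expert-choice idea: each expert selects a fixed number of its highest-scoring tokens, so per-expert compute stays constant as more experts (and hence parameters) are added. Function names, shapes, and the capacity value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def expert_choice_route(token_reprs, router_weights, capacity):
    """token_reprs: [num_tokens, d_model]; router_weights: [d_model, num_experts]."""
    scores = token_reprs @ router_weights                 # [num_tokens, num_experts]
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    # Expert choice: each expert picks its `capacity` highest-scoring tokens,
    # instead of each token picking its top-k experts (token choice).
    chosen = np.argsort(-probs, axis=0)[:capacity]        # [capacity, num_experts]
    gates = np.take_along_axis(probs, chosen, axis=0)     # gating weights for the chosen tokens
    return chosen, gates

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))    # 16 tokens, hidden size 8 (toy values)
router = rng.standard_normal((8, 4))     # router for 4 experts
idx, gates = expert_choice_route(tokens, router, capacity=4)
print(idx.shape, gates.shape)            # (4, 4) (4, 4): every expert gets exactly 4 tokens
```

In contrast to token-choice top-k routing, the fixed per-expert capacity keeps the load across experts balanced by construction.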
DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale
As the training of giant dense models hits the boundary on the availability and capability of
the hardware resources today, Mixture-of-Experts (MoE) models have become one of the …
Modular deep learning
Transfer learning has recently become the dominant paradigm of machine learning. Pre-
trained models fine-tuned for downstream tasks achieve better performance with fewer …
AdaMV-MoE: Adaptive multi-task vision mixture-of-experts
Sparsely activated Mixture-of-Experts (MoE) is becoming a promising paradigm for
multi-task learning (MTL). Instead of compressing multiple tasks' knowledge into a single …
Mixture-of-experts meets instruction tuning: A winning combination for large language models
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add
learnable parameters to Large Language Models (LLMs) without increasing inference cost …
AdaMix: Mixture-of-adaptations for parameter-efficient model tuning
Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks
requires updating hundreds of millions to billions of parameters, and storing a large copy of …
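To make the scale of the problem concrete, here is a rough, assumed-numbers comparison of trainable parameters under full fine-tuning versus a generic bottleneck-adapter setup, the baseline that mixture-of-adaptations methods build on; AdaMix's own mechanism is not sketched here, and all sizes below are illustrative assumptions.

```python
# Back-of-the-envelope count (illustrative, assumed sizes; transformer block
# weights only, ignoring embeddings, biases, and layer norms).
d_model, n_layers, bottleneck = 1024, 24, 64

# Full fine-tuning touches every block weight: 4 attention projections plus a
# feed-forward network with 4x widening, i.e. roughly 12 * d_model^2 per layer.
full_ft_params = n_layers * (4 * d_model * d_model + 8 * d_model * d_model)

# A bottleneck adapter adds a down- and an up-projection; assume two adapters
# per layer (one after attention, one after the feed-forward block).
adapter_params = n_layers * 2 * (d_model * bottleneck + bottleneck * d_model)

print(f"full fine-tuning: ~{full_ft_params / 1e6:.0f}M trainable parameters")
print(f"adapter tuning:   ~{adapter_params / 1e6:.1f}M trainable parameters "
      f"({adapter_params / full_ft_params:.1%} of full)")
```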
Accelerating distributed MoE training and inference with Lina
Scaling model parameters improves model quality at the price of high computation
overhead. Sparsely activated models, usually in the form of Mixture of Experts (MoE) …
Is a modular architecture enough?
Inspired by human cognition, machine learning systems are gradually revealing
advantages of sparser and more modular architectures. Recent work demonstrates that not …