Efficient large language models: A survey
Large Language Models (LLMs) have demonstrated remarkable capabilities in important
tasks such as natural language understanding and language generation, and thus have the …
Mixture-of-experts with expert choice routing
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to
greatly increase while keeping the amount of computation for a given token or a given …
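The snippet only hints at the routing mechanism, so below is a minimal NumPy sketch of the general expert-choice idea: each expert selects a fixed number of its highest-scoring tokens, so per-expert compute stays constant as more experts (and hence parameters) are added. Function names, shapes, and the capacity value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def expert_choice_route(token_reprs, router_weights, capacity):
    """token_reprs: [num_tokens, d_model]; router_weights: [d_model, num_experts]."""
    scores = token_reprs @ router_weights                 # [num_tokens, num_experts]
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    # Expert choice: each expert picks its `capacity` highest-scoring tokens,
    # instead of each token picking its top-k experts (token choice).
    chosen = np.argsort(-probs, axis=0)[:capacity]        # [capacity, num_experts]
    gates = np.take_along_axis(probs, chosen, axis=0)     # gating weights for the chosen tokens
    return chosen, gates

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))    # 16 tokens, hidden size 8 (toy values)
router = rng.standard_normal((8, 4))     # router for 4 experts
idx, gates = expert_choice_route(tokens, router, capacity=4)
print(idx.shape, gates.shape)            # (4, 4) (4, 4): every expert gets exactly 4 tokens
```

In contrast to token-choice top-k routing, the fixed per-expert capacity keeps the load across experts balanced by construction.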
DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale
As the training of giant dense models hits the boundary on the availability and capability of
the hardware resources today, Mixture-of-Experts (MoE) models have become one of the …
Modular deep learning
Transfer learning has recently become the dominant paradigm of machine learning. Pre-
trained models fine-tuned for downstream tasks achieve better performance with fewer …
AdaMV-MoE: Adaptive multi-task vision mixture-of-experts
Sparsely activated Mixture-of-Experts (MoE) is becoming a promising paradigm for
multi-task learning (MTL). Instead of compressing multiple tasks' knowledge into a single …
Mixture-of-experts meets instruction tuning: A winning combination for large language models
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add
learnable parameters to Large Language Models (LLMs) without increasing inference cost …
AdaMix: Mixture-of-adaptations for parameter-efficient model tuning
Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks
requires updating hundreds of millions to billions of parameters, and storing a large copy of …
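To make the scale of the problem concrete, here is a rough, assumed-numbers comparison of trainable parameters under full fine-tuning versus a generic bottleneck-adapter setup, the baseline that mixture-of-adaptations methods build on; AdaMix's own mechanism is not sketched here, and all sizes below are illustrative assumptions.

```python
# Back-of-the-envelope count (illustrative, assumed sizes; transformer block
# weights only, ignoring embeddings, biases, and layer norms).
d_model, n_layers, bottleneck = 1024, 24, 64

# Full fine-tuning touches every block weight: 4 attention projections plus a
# feed-forward network with 4x widening, i.e. roughly 12 * d_model^2 per layer.
full_ft_params = n_layers * (4 * d_model * d_model + 8 * d_model * d_model)

# A bottleneck adapter adds a down- and an up-projection; assume two adapters
# per layer (one after attention, one after the feed-forward block).
adapter_params = n_layers * 2 * (d_model * bottleneck + bottleneck * d_model)

print(f"full fine-tuning: ~{full_ft_params / 1e6:.0f}M trainable parameters")
print(f"adapter tuning:   ~{adapter_params / 1e6:.1f}M trainable parameters "
      f"({adapter_params / full_ft_params:.1%} of full)")
```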
Accelerating distributed MoE training and inference with Lina
Scaling model parameters improves model quality at the price of high computation
overhead. Sparsely activated models, usually in the form of Mixture of Experts (MoE) …
Is a modular architecture enough?
Inspired by human cognition, machine learning systems are gradually revealing
advantages of sparser and more modular architectures. Recent work demonstrates that not …