LLaMA-MoE: Building mixture-of-experts from LLaMA with continual pre-training
Mixture-of-Experts (MoE) has gained increasing popularity as a promising
framework for scaling up large language models (LLMs). However, training MoE from …
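The routing mechanism behind such MoE layers can be summarized in a few lines. Below is a minimal sketch of a generic top-k routed MoE feed-forward layer, assuming a simple linear router, ReLU experts, and top-2 selection; it is an illustration of the general idea, not the specific LLaMA-MoE construction.

```python
# Minimal sketch of a top-k routed MoE feed-forward layer (generic, not the exact
# LLaMA-MoE construction). Shapes and the top-2 choice are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_ffn(tokens, router_w, experts, top_k=2):
    """tokens: [n, d]; router_w: [d, n_experts]; experts: list of (W1, W2)."""
    logits = tokens @ router_w                      # [n, n_experts] routing scores
    probs = softmax(logits)
    top = np.argsort(-probs, axis=-1)[:, :top_k]    # indices of the k best experts per token
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        weights = probs[i, top[i]]
        weights = weights / weights.sum()           # renormalize over the selected experts
        for w, e_idx in zip(weights, top[i]):
            W1, W2 = experts[e_idx]
            out[i] += w * (np.maximum(tok @ W1, 0.0) @ W2)  # ReLU expert FFN
    return out

d, n_experts, d_ff = 16, 4, 32
rng = np.random.default_rng(0)
experts = [(rng.standard_normal((d, d_ff)) * 0.1, rng.standard_normal((d_ff, d)) * 0.1)
           for _ in range(n_experts)]
y = moe_ffn(rng.standard_normal((5, d)), rng.standard_normal((d, n_experts)) * 0.1, experts)
print(y.shape)  # (5, 16)
```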
Less is more: Task-aware layer-wise distillation for language model compression
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (i.e., student models). The student distills knowledge from the teacher by …
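As a rough illustration of the mechanism the snippet above begins to describe, the sketch below computes a plain layer-wise distillation loss: the student's hidden states are projected and matched against the teacher's with an MSE term, one layer at a time. The equal layer count, the single linear projection, and the uniform layer weights are simplifying assumptions; the task-aware weighting proposed in the paper is not reproduced here.

```python
# Minimal sketch of a layer-wise distillation objective: the student matches the
# teacher's hidden states layer by layer (MSE), typically alongside its own task loss.
# Equal layer counts, one shared projection, and uniform weights are assumptions.
import numpy as np

def layerwise_distill_loss(student_hiddens, teacher_hiddens, proj):
    """student_hiddens: list of [n, d_s]; teacher_hiddens: list of [n, d_t];
    proj: [d_s, d_t] maps student states into the teacher's hidden width."""
    loss = 0.0
    for h_s, h_t in zip(student_hiddens, teacher_hiddens):
        diff = h_s @ proj - h_t
        loss += np.mean(diff ** 2)                  # per-layer MSE between hidden states
    return loss / len(student_hiddens)

rng = np.random.default_rng(0)
d_s, d_t, n = 8, 16, 4
student = [rng.standard_normal((n, d_s)) for _ in range(3)]
teacher = [rng.standard_normal((n, d_t)) for _ in range(3)]
print(layerwise_distill_loss(student, teacher, rng.standard_normal((d_s, d_t)) * 0.1))
```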
Sparse upcycling: Training mixture-of-experts from dense checkpoints
Training large, deep neural networks to convergence can be prohibitively expensive. As a
result, often only a small selection of popular, dense models are reused across different …
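The core idea named in the title, upcycling a dense checkpoint into an MoE model, can be sketched as an initialization step: every expert starts as a copy of the dense FFN and a fresh router is added, after which training continues. The sketch below assumes a two-matrix FFN and a randomly initialized router; it illustrates the general recipe, not the paper's exact procedure.

```python
# Minimal sketch of the upcycling idea: initialize every expert of a new MoE layer
# by copying the dense checkpoint's FFN weights, then continue training.
# The parameter names and the small random router are illustrative assumptions.
import numpy as np

def upcycle_ffn(dense_W1, dense_W2, n_experts, d_model, seed=0):
    """Return (experts, router_w) where each expert starts as a copy of the dense FFN."""
    rng = np.random.default_rng(seed)
    experts = [(dense_W1.copy(), dense_W2.copy()) for _ in range(n_experts)]
    router_w = rng.standard_normal((d_model, n_experts)) * 0.02  # fresh, small router
    return experts, router_w

d, d_ff = 16, 64
rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((d, d_ff)) * 0.1, rng.standard_normal((d_ff, d)) * 0.1
experts, router_w = upcycle_ffn(W1, W2, n_experts=8, d_model=d)
assert all(np.array_equal(e[0], W1) for e in experts)  # every expert starts identical
```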
A survey on knowledge distillation of large language models
This survey presents an in-depth exploration of knowledge distillation (KD) techniques
within the realm of Large Language Models (LLMs), spotlighting the pivotal role of KD in …
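For concreteness, one of the classic white-box objectives such surveys cover is logit distillation, where the student is trained toward the teacher's temperature-softened output distribution. The sketch below assumes a fixed temperature and batched logits; it is a generic formulation, not any particular method from the survey.

```python
# Minimal sketch of classic logit distillation: KL divergence between the
# temperature-softened teacher and student distributions. Temperature and
# shapes are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-scaled distributions, averaged over tokens."""
    p_t = softmax(teacher_logits / temperature)
    log_p_s = np.log(softmax(student_logits / temperature) + 1e-12)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - log_p_s), axis=-1)
    return (temperature ** 2) * np.mean(kl)        # T^2 keeps gradient scale comparable

rng = np.random.default_rng(0)
print(kd_loss(rng.standard_normal((4, 100)), rng.standard_normal((4, 100))))
```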
A survey on mixture of experts
Large language models (LLMs) have garnered unprecedented advancements across
diverse fields, ranging from natural language processing to computer vision and beyond …
Blockwise parallel transformers for large context models
Transformers have emerged as the cornerstone of state-of-the-art natural language
processing models, showcasing exceptional performance across a wide range of AI …
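The blockwise idea can be illustrated with attention alone: process queries and keys/values in chunks and keep a running (online) softmax so the full n-by-n score matrix is never materialized. The sketch below shows only this memory trick with illustrative block sizes; the paper additionally fuses the feed-forward computation into the same blockwise loop.

```python
# Minimal sketch of blockwise attention with an online softmax: the n-by-n score
# matrix is never materialized. Block sizes are illustrative; the full method also
# folds the feed-forward network into the same blockwise loop.
import numpy as np

def blockwise_attention(Q, K, V, q_block=64, kv_block=64):
    n, d = Q.shape
    out = np.zeros_like(Q)
    scale = 1.0 / np.sqrt(d)
    for qs in range(0, n, q_block):
        q = Q[qs:qs + q_block]                           # [bq, d] query block
        m = np.full(q.shape[0], -np.inf)                 # running max of scores
        l = np.zeros(q.shape[0])                         # running softmax denominator
        acc = np.zeros_like(q)                           # running weighted sum of values
        for ks in range(0, n, kv_block):
            k, v = K[ks:ks + kv_block], V[ks:ks + kv_block]
            s = (q @ k.T) * scale                        # [bq, bk] scores for this block
            m_new = np.maximum(m, s.max(axis=-1))
            correction = np.exp(m - m_new)               # rescale previous partial results
            p = np.exp(s - m_new[:, None])
            l = l * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ v
            m = m_new
        out[qs:qs + q_block] = acc / l[:, None]
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
s = (Q @ K.T) / np.sqrt(32)                              # naive reference attention
p = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (p / p.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(blockwise_attention(Q, K, V), ref))    # True
```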
Configurable foundation models: Building LLMs from a modular perspective
Advancements in LLMs have recently unveiled challenges tied to computational efficiency
and continual scalability due to their huge parameter counts, making the …
Language-driven All-in-one Adverse Weather Removal
All-in-one (AiO) frameworks restore various adverse weather degradations with a
single set of networks jointly. To handle various weather conditions, an AiO framework is …
Exploring the Benefit of Activation Sparsity in Pre-training
Pre-trained Transformers inherently possess the characteristic of sparse activation, where
only a small fraction of the neurons are activated for each token. While sparse activation has …
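What "sparse activation" means can be made concrete with a small measurement: for a ReLU feed-forward layer, count the fraction of intermediate neurons that are exactly zero per token. The sketch below uses random weights and inputs purely for illustration; in practice the statistic is measured on a trained model over real data.

```python
# Minimal sketch of measuring activation sparsity: the fraction of post-ReLU
# neurons that are inactive (exactly zero) per token. Weights and inputs here
# are random and purely illustrative.
import numpy as np

def activation_sparsity(tokens, W1, b1):
    """Fraction of post-ReLU intermediate neurons that are zero, averaged over tokens."""
    h = np.maximum(tokens @ W1 + b1, 0.0)          # [n, d_ff] intermediate activations
    return float(np.mean(h == 0.0))

rng = np.random.default_rng(0)
d, d_ff = 64, 256
W1 = rng.standard_normal((d, d_ff)) * 0.05
b1 = -0.1 * np.ones(d_ff)                          # a negative bias pushes many neurons to zero
print(activation_sparsity(rng.standard_normal((32, d)), W1, b1))
```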
Model compression and efficient inference for large language models: A survey
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …
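As one concrete example of the compression techniques such surveys cover, the sketch below performs symmetric per-output-channel int8 weight quantization with a dequantization round trip. The scales, shapes, and clipping range are illustrative assumptions; production schemes add calibration, grouping, and outlier handling.

```python
# Minimal sketch of symmetric per-output-channel int8 weight quantization and
# dequantization. Shapes and the clipping range are illustrative assumptions.
import numpy as np

def quantize_int8(W):
    """W: [out, in] float weights -> (int8 weights, per-row scales)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)                # guard against all-zero rows
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)
q, scale = quantize_int8(W)
err = np.abs(W - dequantize(q, scale)).max()
print(q.dtype, float(err))                         # int8 and a small reconstruction error
```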