LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

T Zhu, X Qu, D Dong, J Ruan, J Tong… - Proceedings of the …, 2024 - aclanthology.org
Mixture-of-Experts (MoE) has gained increasing popularity as a promising
framework for scaling up large language models (LLMs). However, training MoE from …

Less is more: Task-aware layer-wise distillation for language model compression

C Liang, S Zuo, Q Zhang, P He… - … on Machine Learning, 2023 - proceedings.mlr.press
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (i.e., student models). The student distills knowledge from the teacher by …
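
The snippet above describes the general recipe of layer-wise distillation: aligning the student's intermediate hidden states with the teacher's. Below is a minimal, generic PyTorch sketch of such a hidden-state matching loss; the linear projections, uniform layer mapping, and plain MSE objective are illustrative assumptions and do not reproduce the paper's task-aware filtering.

```python
import torch
import torch.nn as nn

class LayerwiseDistillLoss(nn.Module):
    """Match each student hidden state to a teacher hidden state via a learned projection."""

    def __init__(self, student_dim: int, teacher_dim: int, num_student_layers: int):
        super().__init__()
        # One projection per student layer so hidden sizes may differ between models.
        self.proj = nn.ModuleList(
            [nn.Linear(student_dim, teacher_dim) for _ in range(num_student_layers)]
        )
        self.mse = nn.MSELoss()

    def forward(self, student_hiddens, teacher_hiddens):
        # student_hiddens: list of [batch, seq, student_dim], one per student layer
        # teacher_hiddens: list of [batch, seq, teacher_dim], one per teacher layer
        # Map student layer i to a teacher layer by uniform spacing (an assumption).
        stride = len(teacher_hiddens) // len(student_hiddens)
        loss = torch.zeros((), device=student_hiddens[0].device)
        for i, (proj, h_s) in enumerate(zip(self.proj, student_hiddens)):
            h_t = teacher_hiddens[(i + 1) * stride - 1]
            loss = loss + self.mse(proj(h_s), h_t.detach())
        return loss / len(student_hiddens)
```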

Sparse upcycling: Training mixture-of-experts from dense checkpoints

A Komatsuzaki, J Puigcerver, J Lee-Thorp… - arXiv preprint arXiv …, 2022 - arxiv.org
Training large, deep neural networks to convergence can be prohibitively expensive. As a
result, often only a small selection of popular, dense models are reused across different …
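
Sparse upcycling refers to initializing a Mixture-of-Experts model from an already trained dense checkpoint instead of training from scratch. The sketch below shows the core idea: copy a dense FFN into every expert and add a freshly initialized router with top-1 routing. The FFN structure and routing details are simplifying assumptions for illustration, not the paper's exact recipe.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class UpcycledMoE(nn.Module):
    def __init__(self, dense_ffn: DenseFFN, num_experts: int, d_model: int):
        super().__init__()
        # Each expert starts as an exact copy of the trained dense FFN's weights.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
        )
        # The router is new and is learned during the continued training phase.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: [tokens, d_model]
        gates = F.softmax(self.router(x), dim=-1)     # [tokens, num_experts]
        top_gate, top_idx = gates.max(dim=-1)          # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```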

A survey on knowledge distillation of large language models

X Xu, M Li, C Tao, T Shen, R Cheng, J Li, C Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an in-depth exploration of knowledge distillation (KD) techniques
within the realm of Large Language Models (LLMs), spotlighting the pivotal role of KD in …

A survey on mixture of experts

W Cai, J Jiang, F Wang, J Tang, S Kim, J Huang - Authorea Preprints, 2024 - techrxiv.org
Large language models (LLMs) have garnered unprecedented advancements across
diverse fields, ranging from natural language processing to computer vision and beyond …

Blockwise parallel transformers for large context models

H Liu, P Abbeel - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Transformers have emerged as the cornerstone of state-of-the-art natural language
processing models, showcasing exceptional performance across a wide range of AI …

Configurable foundation models: Building LLMs from a modular perspective

C Xiao, Z Zhang, C Song, D Jiang, F Yao, X Han… - arXiv preprint arXiv …, 2024 - arxiv.org
Advancements in LLMs have recently unveiled challenges tied to computational efficiency
and continual scalability due to their huge parameter requirements, making the …

Language-driven All-in-one Adverse Weather Removal

H Yang, L Pan, Y Yang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
All-in-one (AiO) frameworks restore various adverse weather degradations with a
single set of networks jointly. To handle various weather conditions, an AiO framework is …

Exploring the Benefit of Activation Sparsity in Pre-training

Z Zhang, C Xiao, Q Qin, Y Lin, Z Zeng, X Han… - arXiv preprint arXiv …, 2024 - arxiv.org
Pre-trained Transformers inherently possess the characteristic of sparse activation, where
only a small fraction of the neurons are activated for each token. While sparse activation has …
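
Activation sparsity here means that, for a given token, most intermediate FFN neurons output (near-)zero values. The short sketch below measures that fraction for a toy ReLU FFN; the threshold and the synthetic input are illustrative assumptions.

```python
import torch
import torch.nn as nn

def activation_sparsity(hidden: torch.Tensor, threshold: float = 0.0) -> float:
    """hidden: post-activation FFN values, e.g. [batch, seq, d_ff]."""
    inactive = (hidden.abs() <= threshold).float()
    return inactive.mean().item()  # fraction of (token, neuron) pairs that are inactive

# Toy example: a ReLU FFN zeroes out roughly half of its neurons on random input.
x = torch.randn(2, 8, 1024)             # [batch, seq, d_model]
w_up = torch.randn(1024, 4096) / 32     # [d_model, d_ff], scaled for stability
ffn_act = nn.functional.relu(x @ w_up)  # [batch, seq, d_ff]
print(f"activation sparsity = {activation_sparsity(ffn_act):.2%}")
```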

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …