LLaMA-MoE: Building mixture-of-experts from LLaMA with continual pre-training
Mixture-of-Experts (MoE) has gained increasing popularity as a promising
framework for scaling up large language models (LLMs). However, training MoE from …
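The routing mechanism behind such MoE layers can be summarized in a few lines. Below is a minimal sketch of a generic top-k routed MoE feed-forward layer, assuming a simple linear router, ReLU experts, and top-2 selection; it is an illustration of the general idea, not the specific LLaMA-MoE construction.

```python
# Minimal sketch of a top-k routed MoE feed-forward layer (generic, not the exact
# LLaMA-MoE construction). Shapes and the top-2 choice are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_ffn(tokens, router_w, experts, top_k=2):
    """tokens: [n, d]; router_w: [d, n_experts]; experts: list of (W1, W2)."""
    logits = tokens @ router_w                      # [n, n_experts] routing scores
    probs = softmax(logits)
    top = np.argsort(-probs, axis=-1)[:, :top_k]    # indices of the k best experts per token
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        weights = probs[i, top[i]]
        weights = weights / weights.sum()           # renormalize over the selected experts
        for w, e_idx in zip(weights, top[i]):
            W1, W2 = experts[e_idx]
            out[i] += w * (np.maximum(tok @ W1, 0.0) @ W2)  # ReLU expert FFN
    return out

d, n_experts, d_ff = 16, 4, 32
rng = np.random.default_rng(0)
experts = [(rng.standard_normal((d, d_ff)) * 0.1, rng.standard_normal((d_ff, d)) * 0.1)
           for _ in range(n_experts)]
y = moe_ffn(rng.standard_normal((5, d)), rng.standard_normal((d, n_experts)) * 0.1, experts)
print(y.shape)  # (5, 16)
```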
Less is more: Task-aware layer-wise distillation for language model compression
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (i.e., student models). The student distills knowledge from the teacher by …
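As a rough illustration of the mechanism the snippet above begins to describe, the sketch below computes a plain layer-wise distillation loss: the student's hidden states are projected and matched against the teacher's with an MSE term, one layer at a time. The equal layer count, the single linear projection, and the uniform layer weights are simplifying assumptions; the task-aware weighting proposed in the paper is not reproduced here.

```python
# Minimal sketch of a layer-wise distillation objective: the student matches the
# teacher's hidden states layer by layer (MSE), typically alongside its own task loss.
# Equal layer counts, one shared projection, and uniform weights are assumptions.
import numpy as np

def layerwise_distill_loss(student_hiddens, teacher_hiddens, proj):
    """student_hiddens: list of [n, d_s]; teacher_hiddens: list of [n, d_t];
    proj: [d_s, d_t] maps student states into the teacher's hidden width."""
    loss = 0.0
    for h_s, h_t in zip(student_hiddens, teacher_hiddens):
        diff = h_s @ proj - h_t
        loss += np.mean(diff ** 2)                  # per-layer MSE between hidden states
    return loss / len(student_hiddens)

rng = np.random.default_rng(0)
d_s, d_t, n = 8, 16, 4
student = [rng.standard_normal((n, d_s)) for _ in range(3)]
teacher = [rng.standard_normal((n, d_t)) for _ in range(3)]
print(layerwise_distill_loss(student, teacher, rng.standard_normal((d_s, d_t)) * 0.1))
```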
Sparse upcycling: Training mixture-of-experts from dense checkpoints
Training large, deep neural networks to convergence can be prohibitively expensive. As a
result, often only a small selection of popular, dense models are reused across different …
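The core idea named in the title, upcycling a dense checkpoint into an MoE model, can be sketched as an initialization step: every expert starts as a copy of the dense FFN and a fresh router is added, after which training continues. The sketch below assumes a two-matrix FFN and a randomly initialized router; it illustrates the general recipe, not the paper's exact procedure.

```python
# Minimal sketch of the upcycling idea: initialize every expert of a new MoE layer
# by copying the dense checkpoint's FFN weights, then continue training.
# The parameter names and the small random router are illustrative assumptions.
import numpy as np

def upcycle_ffn(dense_W1, dense_W2, n_experts, d_model, seed=0):
    """Return (experts, router_w) where each expert starts as a copy of the dense FFN."""
    rng = np.random.default_rng(seed)
    experts = [(dense_W1.copy(), dense_W2.copy()) for _ in range(n_experts)]
    router_w = rng.standard_normal((d_model, n_experts)) * 0.02  # fresh, small router
    return experts, router_w

d, d_ff = 16, 64
rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((d, d_ff)) * 0.1, rng.standard_normal((d_ff, d)) * 0.1
experts, router_w = upcycle_ffn(W1, W2, n_experts=8, d_model=d)
assert all(np.array_equal(e[0], W1) for e in experts)  # every expert starts identical
```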
A survey on knowledge distillation of large language models
This survey presents an in-depth exploration of knowledge distillation (KD) techniques
within the realm of Large Language Models (LLMs), spotlighting the pivotal role of KD in …
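For concreteness, one of the classic white-box objectives such surveys cover is logit distillation, where the student is trained toward the teacher's temperature-softened output distribution. The sketch below assumes a fixed temperature and batched logits; it is a generic formulation, not any particular method from the survey.

```python
# Minimal sketch of classic logit distillation: KL divergence between the
# temperature-softened teacher and student distributions. Temperature and
# shapes are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-scaled distributions, averaged over tokens."""
    p_t = softmax(teacher_logits / temperature)
    log_p_s = np.log(softmax(student_logits / temperature) + 1e-12)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - log_p_s), axis=-1)
    return (temperature ** 2) * np.mean(kl)        # T^2 keeps gradient scale comparable

rng = np.random.default_rng(0)
print(kd_loss(rng.standard_normal((4, 100)), rng.standard_normal((4, 100))))
```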
A survey on mixture of experts
Large language models (LLMs) have garnered unprecedented advancements across
diverse fields, ranging from natural language processing to computer vision and beyond …
Blockwise parallel transformers for large context models
Transformers have emerged as the cornerstone of state-of-the-art natural language
processing models, showcasing exceptional performance across a wide range of AI …
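The blockwise idea can be illustrated with attention alone: process queries and keys/values in chunks and keep a running (online) softmax so the full n-by-n score matrix is never materialized. The sketch below shows only this memory trick with illustrative block sizes; the paper additionally fuses the feed-forward computation into the same blockwise loop.

```python
# Minimal sketch of blockwise attention with an online softmax: the n-by-n score
# matrix is never materialized. Block sizes are illustrative; the full method also
# folds the feed-forward network into the same blockwise loop.
import numpy as np

def blockwise_attention(Q, K, V, q_block=64, kv_block=64):
    n, d = Q.shape
    out = np.zeros_like(Q)
    scale = 1.0 / np.sqrt(d)
    for qs in range(0, n, q_block):
        q = Q[qs:qs + q_block]                           # [bq, d] query block
        m = np.full(q.shape[0], -np.inf)                 # running max of scores
        l = np.zeros(q.shape[0])                         # running softmax denominator
        acc = np.zeros_like(q)                           # running weighted sum of values
        for ks in range(0, n, kv_block):
            k, v = K[ks:ks + kv_block], V[ks:ks + kv_block]
            s = (q @ k.T) * scale                        # [bq, bk] scores for this block
            m_new = np.maximum(m, s.max(axis=-1))
            correction = np.exp(m - m_new)               # rescale previous partial results
            p = np.exp(s - m_new[:, None])
            l = l * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ v
            m = m_new
        out[qs:qs + q_block] = acc / l[:, None]
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
s = (Q @ K.T) / np.sqrt(32)                              # naive reference attention
p = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (p / p.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(blockwise_attention(Q, K, V), ref))    # True
```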
Configurable foundation models: Building LLMs from a modular perspective
Advancements in LLMs have recently unveiled challenges tied to computational efficiency
and continual scalability due to their huge parameter counts, making the …
Language-driven All-in-one Adverse Weather Removal
All-in-one (AiO) frameworks restore various adverse weather degradations with a
single set of networks jointly. To handle various weather conditions, an AiO framework is …
Exploring the Benefit of Activation Sparsity in Pre-training
Pre-trained Transformers inherently possess the characteristic of sparse activation, where
only a small fraction of the neurons are activated for each token. While sparse activation has …
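What "sparse activation" means can be made concrete with a small measurement: for a ReLU feed-forward layer, count the fraction of intermediate neurons that are exactly zero per token. The sketch below uses random weights and inputs purely for illustration; in practice the statistic is measured on a trained model over real data.

```python
# Minimal sketch of measuring activation sparsity: the fraction of post-ReLU
# neurons that are inactive (exactly zero) per token. Weights and inputs here
# are random and purely illustrative.
import numpy as np

def activation_sparsity(tokens, W1, b1):
    """Fraction of post-ReLU intermediate neurons that are zero, averaged over tokens."""
    h = np.maximum(tokens @ W1 + b1, 0.0)          # [n, d_ff] intermediate activations
    return float(np.mean(h == 0.0))

rng = np.random.default_rng(0)
d, d_ff = 64, 256
W1 = rng.standard_normal((d, d_ff)) * 0.05
b1 = -0.1 * np.ones(d_ff)                          # a negative bias pushes many neurons to zero
print(activation_sparsity(rng.standard_normal((32, d)), W1, b1))
```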
Model compression and efficient inference for large language models: A survey
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …
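As one concrete example of the compression techniques such surveys cover, the sketch below performs symmetric per-output-channel int8 weight quantization with a dequantization round trip. The scales, shapes, and clipping range are illustrative assumptions; production schemes add calibration, grouping, and outlier handling.

```python
# Minimal sketch of symmetric per-output-channel int8 weight quantization and
# dequantization. Shapes and the clipping range are illustrative assumptions.
import numpy as np

def quantize_int8(W):
    """W: [out, in] float weights -> (int8 weights, per-row scales)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)                # guard against all-zero rows
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)
q, scale = quantize_int8(W)
err = np.abs(W - dequantize(q, scale)).max()
print(q.dtype, float(err))                         # int8 and a small reconstruction error
```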