Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes

CY Hsieh, CL Li, CK Yeh, H Nakhost, Y Fujii… - arXiv preprint arXiv …, 2023 - arxiv.org
Deploying large language models (LLMs) is challenging because they are memory
inefficient and compute-intensive for practical applications. In reaction, researchers train …
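
A minimal sketch of the multi-task training idea this line of work motivates: a small seq2seq student is trained to predict both the task label and an LLM-generated rationale, with the rationale loss down-weighted. The model name, task prefixes, and the lambda weight are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: "distill step-by-step"-style multi-task training of a small
# student on labels plus LLM-written rationales. Model, prefixes, and the
# lambda weight are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
student = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)
lam = 0.5  # weight on the rationale-generation loss (assumed)

def train_step(question, label, rationale):
    # Two task prefixes steer the same student toward two outputs.
    enc_label = tok("predict: " + question, return_tensors="pt")
    enc_rat = tok("explain: " + question, return_tensors="pt")
    y_label = tok(label, return_tensors="pt").input_ids
    y_rat = tok(rationale, return_tensors="pt").input_ids

    loss_label = student(**enc_label, labels=y_label).loss
    loss_rat = student(**enc_rat, labels=y_rat).loss
    loss = loss_label + lam * loss_rat

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step("Is 17 prime?", "yes",
                 "17 has no divisors other than 1 and 17."))
```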

A metaverse: Taxonomy, components, applications, and open challenges

SM Park, YG Kim - IEEE access, 2022 - ieeexplore.ieee.org
Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z, for whom online and offline selves are not different …

R-drop: Regularized dropout for neural networks

L Wu, J Li, Y Wang, Q Meng, T Qin… - Advances in …, 2021 - proceedings.neurips.cc
Dropout is a powerful and widely used technique to regularize the training of deep neural
networks. Though effective and performing well, the randomness introduced by dropout …
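
A minimal sketch of an R-Drop-style consistency regularizer: the same input is passed through the network twice so dropout samples two sub-networks, and a symmetric KL penalty between the two predictive distributions is added to the usual cross-entropy. The toy classifier and the alpha weight are assumptions.

```python
# Hedged sketch of R-Drop-style regularization: two forward passes with
# different dropout masks, plus a symmetric KL consistency term.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3), torch.nn.Linear(64, 10),
)
alpha = 1.0  # strength of the consistency term (assumed)

def r_drop_loss(x, y):
    logits1, logits2 = model(x), model(x)  # two dropout masks
    ce = 0.5 * (F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y))
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(p1, p2, log_target=True, reduction="batchmean")
        + F.kl_div(p2, p1, log_target=True, reduction="batchmean")
    )
    return ce + alpha * kl

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
print(r_drop_loss(x, y))
```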

LoSparse: Structured compression of large language models based on low-rank and sparse approximation

Y Li, Y Yu, Q Zhang, C Liang, P He… - International …, 2023 - proceedings.mlr.press
Transformer models have achieved remarkable results in various natural language tasks,
but they are often prohibitively large, requiring massive memories and computational …
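
A minimal sketch of the low-rank-plus-sparse idea in the title: a weight matrix is approximated by a truncated-SVD low-rank factor plus a sparse residual that keeps only the largest-magnitude entries. The rank, sparsity level, and decomposition recipe are assumptions, not the cited method's exact procedure.

```python
# Hedged sketch: approximate W as (low-rank part) + (sparse residual).
import torch

def low_rank_plus_sparse(W, rank=8, keep_ratio=0.05):
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    L = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]  # low-rank part
    residual = W - L
    k = max(1, int(keep_ratio * residual.numel()))
    threshold = residual.abs().flatten().topk(k).values.min()
    sparse = torch.where(residual.abs() >= threshold, residual,
                         torch.zeros_like(residual))
    return L, sparse

W = torch.randn(256, 256)
L, Ssp = low_rank_plus_sparse(W)
err = torch.norm(W - (L + Ssp)) / torch.norm(W)
print(f"relative error: {err:.3f}, nonzeros in sparse part: {int((Ssp != 0).sum())}")
```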

A survey on transformer compression

Y Tang, Y Wang, J Guo, Z Tu, K Han, H Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large models based on the Transformer architecture play increasingly vital roles in artificial
intelligence, particularly within the realms of natural language processing (NLP) and …

MiniLLM: Knowledge distillation of large language models

Y Gu, L Dong, F Wei, M Huang - The Twelfth International …, 2024 - openreview.net
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …
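
A minimal sketch of a token-level distillation loss between a frozen teacher LM and a smaller student: KL divergence between their next-token distributions, with a switch between the forward and reverse directions. The temperature, tensor shapes, and choice of direction here are assumed baselines, not necessarily the cited method.

```python
# Hedged sketch of token-level LM distillation with a forward/reverse KL switch.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0, reverse=False):
    """KL between teacher and student next-token distributions.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    reverse=True uses KL(student || teacher) instead of KL(teacher || student).
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.log_softmax(teacher_logits / temperature, dim=-1)
    if reverse:
        # Penalize mass the student places where the teacher assigns little.
        s, t = t, s
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2

student_logits = torch.randn(2, 16, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 32000)
loss = distill_loss(student_logits, teacher_logits, reverse=True)
loss.backward()
print(loss.item())
```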

Less is more: Task-aware layer-wise distillation for language model compression

C Liang, S Zuo, Q Zhang, P He… - … on Machine Learning, 2023 - proceedings.mlr.press
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (i.e., student models). The student distills knowledge from the teacher by …
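
A minimal sketch of plain layer-wise distillation: selected student hidden states are projected to the teacher's width and matched to chosen teacher layers with an MSE term on top of the task loss. The layer mapping, projection, and weights are assumptions; the cited paper's task-aware filtering is not reproduced here.

```python
# Hedged sketch of layer-wise hidden-state matching for distillation.
import torch
import torch.nn as nn

teacher_dim, student_dim = 768, 384
proj = nn.Linear(student_dim, teacher_dim)  # align student width to teacher's
layer_map = {0: 1, 1: 3, 2: 5}  # student layer -> teacher layer (assumed)
beta = 0.1  # weight on the hidden-state matching loss (assumed)

def layerwise_distill_loss(student_hiddens, teacher_hiddens, task_loss):
    match = 0.0
    for s_idx, t_idx in layer_map.items():
        match = match + nn.functional.mse_loss(
            proj(student_hiddens[s_idx]), teacher_hiddens[t_idx].detach()
        )
    return task_loss + beta * match

student_hiddens = [torch.randn(4, 16, student_dim, requires_grad=True)
                   for _ in range(3)]
teacher_hiddens = [torch.randn(4, 16, teacher_dim) for _ in range(6)]
loss = layerwise_distill_loss(student_hiddens, teacher_hiddens, torch.tensor(1.0))
loss.backward()
print(loss.item())
```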

Towards fair federated learning with zero-shot data augmentation

W Hao, M El-Khamy, J Lee, J Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
Federated learning has emerged as an important distributed learning paradigm, where a
server aggregates a global model from many client-trained models, while having no access …
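
A minimal sketch of the server-side aggregation the snippet alludes to: the server averages client-trained weights FedAvg-style, weighted by client dataset size, without ever seeing client data. The zero-shot augmentation for fairness in the cited paper is not shown; the toy model and client weights are assumptions.

```python
# Hedged sketch of FedAvg-style aggregation of client model weights.
import copy
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client state_dicts, proportional to data sizes."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Toy example: three clients fine-tune copies of the same small model locally.
base = torch.nn.Linear(10, 2)
clients = [copy.deepcopy(base) for _ in range(3)]
for c in clients:  # stand-in for local training
    with torch.no_grad():
        for p in c.parameters():
            p.add_(0.01 * torch.randn_like(p))

new_state = fedavg([c.state_dict() for c in clients], client_sizes=[100, 50, 50])
base.load_state_dict(new_state)
```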

An empirical study on fine-tuning large language models of code for automated program repair

K Huang, X Meng, J Zhang, Y Liu… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
The advent of large language models (LLMs) has opened up new opportunities for
automated program repair (APR). In particular, some recent studies have explored how to …

Few-shot learning with noisy labels

KJ Liang, SB Rangrej, V Petrovic… - Proceedings of the …, 2022 - openaccess.thecvf.com
Few-shot learning (FSL) methods typically assume clean support sets with accurately
labeled samples when training on novel classes. This assumption can often be unrealistic …
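
A minimal sketch of why noisy support labels matter in this setting: a prototypical-network-style classifier builds one prototype per class by averaging support embeddings, so a single mislabeled support example shifts the prototype. Swapping the mean for a per-dimension median is one simple, assumed mitigation shown here; it is not the cited paper's method.

```python
# Hedged sketch: nearest-prototype few-shot classification with an optional
# robust (median) prototype to dampen the effect of mislabeled support samples.
import torch

def prototypes(support_emb, support_labels, num_classes, robust=False):
    protos = []
    for c in range(num_classes):
        emb_c = support_emb[support_labels == c]
        protos.append(emb_c.median(dim=0).values if robust else emb_c.mean(dim=0))
    return torch.stack(protos)

def classify(query_emb, protos):
    # Assign each query to the nearest prototype by Euclidean distance.
    dists = torch.cdist(query_emb, protos)
    return dists.argmin(dim=1)

emb_dim, num_classes = 64, 5
support_emb = torch.randn(25, emb_dim)          # 5-way 5-shot support set
support_labels = torch.arange(num_classes).repeat_interleave(5)
query_emb = torch.randn(10, emb_dim)

protos = prototypes(support_emb, support_labels, num_classes, robust=True)
print(classify(query_emb, protos))
```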