Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes

CY Hsieh, CL Li, CK Yeh, H Nakhost, Y Fujii… - arXiv preprint arXiv …, 2023 - arxiv.org
Deploying large language models (LLMs) is challenging because they are memory
inefficient and compute-intensive for practical applications. In reaction, researchers train …
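
A minimal sketch of the multi-task training idea this line of work motivates: a small seq2seq student is trained to predict both the task label and an LLM-generated rationale, with the rationale loss down-weighted. The model name, task prefixes, and the lambda weight are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: "distill step-by-step"-style multi-task training of a small
# student on labels plus LLM-written rationales. Model, prefixes, and the
# lambda weight are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
student = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)
lam = 0.5  # weight on the rationale-generation loss (assumed)

def train_step(question, label, rationale):
    # Two task prefixes steer the same student toward two outputs.
    enc_label = tok("predict: " + question, return_tensors="pt")
    enc_rat = tok("explain: " + question, return_tensors="pt")
    y_label = tok(label, return_tensors="pt").input_ids
    y_rat = tok(rationale, return_tensors="pt").input_ids

    loss_label = student(**enc_label, labels=y_label).loss
    loss_rat = student(**enc_rat, labels=y_rat).loss
    loss = loss_label + lam * loss_rat

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step("Is 17 prime?", "yes",
                 "17 has no divisors other than 1 and 17."))
```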

A metaverse: Taxonomy, components, applications, and open challenges

SM Park, YG Kim - IEEE access, 2022 - ieeexplore.ieee.org
Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z, for whom online and offline selves are not different …

R-drop: Regularized dropout for neural networks

L Wu, J Li, Y Wang, Q Meng, T Qin… - Advances in …, 2021 - proceedings.neurips.cc
Dropout is a powerful and widely used technique to regularize the training of deep neural
networks. Though effective and performing well, the randomness introduced by dropout …
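
A minimal sketch of an R-Drop-style consistency regularizer: the same input is passed through the network twice so dropout samples two sub-networks, and a symmetric KL penalty between the two predictive distributions is added to the usual cross-entropy. The toy classifier and the alpha weight are assumptions.

```python
# Hedged sketch of R-Drop-style regularization: two forward passes with
# different dropout masks, plus a symmetric KL consistency term.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3), torch.nn.Linear(64, 10),
)
alpha = 1.0  # strength of the consistency term (assumed)

def r_drop_loss(x, y):
    logits1, logits2 = model(x), model(x)  # two dropout masks
    ce = 0.5 * (F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y))
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(p1, p2, log_target=True, reduction="batchmean")
        + F.kl_div(p2, p1, log_target=True, reduction="batchmean")
    )
    return ce + alpha * kl

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
print(r_drop_loss(x, y))
```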

LoSparse: Structured compression of large language models based on low-rank and sparse approximation

Y Li, Y Yu, Q Zhang, C Liang, P He… - International …, 2023 - proceedings.mlr.press
Transformer models have achieved remarkable results in various natural language tasks,
but they are often prohibitively large, requiring massive memories and computational …
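
A minimal sketch of the low-rank-plus-sparse idea in the title: a weight matrix is approximated by a truncated-SVD low-rank factor plus a sparse residual that keeps only the largest-magnitude entries. The rank, sparsity level, and decomposition recipe are assumptions, not the cited method's exact procedure.

```python
# Hedged sketch: approximate W as (low-rank part) + (sparse residual).
import torch

def low_rank_plus_sparse(W, rank=8, keep_ratio=0.05):
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    L = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]  # low-rank part
    residual = W - L
    k = max(1, int(keep_ratio * residual.numel()))
    threshold = residual.abs().flatten().topk(k).values.min()
    sparse = torch.where(residual.abs() >= threshold, residual,
                         torch.zeros_like(residual))
    return L, sparse

W = torch.randn(256, 256)
L, Ssp = low_rank_plus_sparse(W)
err = torch.norm(W - (L + Ssp)) / torch.norm(W)
print(f"relative error: {err:.3f}, nonzeros in sparse part: {int((Ssp != 0).sum())}")
```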

A survey on transformer compression

Y Tang, Y Wang, J Guo, Z Tu, K Han, H Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large models based on the Transformer architecture play increasingly vital roles in artificial
intelligence, particularly within the realms of natural language processing (NLP) and …

MiniLLM: Knowledge distillation of large language models

Y Gu, L Dong, F Wei, M Huang - The Twelfth International …, 2024 - openreview.net
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …
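
A minimal sketch of a token-level distillation loss between a frozen teacher LM and a smaller student: KL divergence between their next-token distributions, with a switch between the forward and reverse directions. The temperature, tensor shapes, and choice of direction here are assumed baselines, not necessarily the cited method.

```python
# Hedged sketch of token-level LM distillation with a forward/reverse KL switch.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0, reverse=False):
    """KL between teacher and student next-token distributions.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    reverse=True uses KL(student || teacher) instead of KL(teacher || student).
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.log_softmax(teacher_logits / temperature, dim=-1)
    if reverse:
        # Penalize mass the student places where the teacher assigns little.
        s, t = t, s
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2

student_logits = torch.randn(2, 16, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 32000)
loss = distill_loss(student_logits, teacher_logits, reverse=True)
loss.backward()
print(loss.item())
```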

Less is more: Task-aware layer-wise distillation for language model compression

C Liang, S Zuo, Q Zhang, P He… - … on Machine Learning, 2023 - proceedings.mlr.press
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (i.e., student models). The student distills knowledge from the teacher by …
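
A minimal sketch of plain layer-wise distillation: selected student hidden states are projected to the teacher's width and matched to chosen teacher layers with an MSE term on top of the task loss. The layer mapping, projection, and weights are assumptions; the cited paper's task-aware filtering is not reproduced here.

```python
# Hedged sketch of layer-wise hidden-state matching for distillation.
import torch
import torch.nn as nn

teacher_dim, student_dim = 768, 384
proj = nn.Linear(student_dim, teacher_dim)  # align student width to teacher's
layer_map = {0: 1, 1: 3, 2: 5}  # student layer -> teacher layer (assumed)
beta = 0.1  # weight on the hidden-state matching loss (assumed)

def layerwise_distill_loss(student_hiddens, teacher_hiddens, task_loss):
    match = 0.0
    for s_idx, t_idx in layer_map.items():
        match = match + nn.functional.mse_loss(
            proj(student_hiddens[s_idx]), teacher_hiddens[t_idx].detach()
        )
    return task_loss + beta * match

student_hiddens = [torch.randn(4, 16, student_dim, requires_grad=True)
                   for _ in range(3)]
teacher_hiddens = [torch.randn(4, 16, teacher_dim) for _ in range(6)]
loss = layerwise_distill_loss(student_hiddens, teacher_hiddens, torch.tensor(1.0))
loss.backward()
print(loss.item())
```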

Towards fair federated learning with zero-shot data augmentation

W Hao, M El-Khamy, J Lee, J Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
Federated learning has emerged as an important distributed learning paradigm, where a
server aggregates a global model from many client-trained models, while having no access …
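
A minimal sketch of the server-side aggregation the snippet alludes to: the server averages client-trained weights FedAvg-style, weighted by client dataset size, without ever seeing client data. The zero-shot augmentation for fairness in the cited paper is not shown; the toy model and client weights are assumptions.

```python
# Hedged sketch of FedAvg-style aggregation of client model weights.
import copy
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client state_dicts, proportional to data sizes."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Toy example: three clients fine-tune copies of the same small model locally.
base = torch.nn.Linear(10, 2)
clients = [copy.deepcopy(base) for _ in range(3)]
for c in clients:  # stand-in for local training
    with torch.no_grad():
        for p in c.parameters():
            p.add_(0.01 * torch.randn_like(p))

new_state = fedavg([c.state_dict() for c in clients], client_sizes=[100, 50, 50])
base.load_state_dict(new_state)
```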

An empirical study on fine-tuning large language models of code for automated program repair

K Huang, X Meng, J Zhang, Y Liu… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
The advent of large language models (LLMs) has opened up new opportunities for
automated program repair (APR). In particular, some recent studies have explored how to …

Few-shot learning with noisy labels

KJ Liang, SB Rangrej, V Petrovic… - Proceedings of the …, 2022 - openaccess.thecvf.com
Few-shot learning (FSL) methods typically assume clean support sets with accurately
labeled samples when training on novel classes. This assumption can often be unrealistic …
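
A minimal sketch of why noisy support labels matter in this setting: a prototypical-network-style classifier builds one prototype per class by averaging support embeddings, so a single mislabeled support example shifts the prototype. Swapping the mean for a per-dimension median is one simple, assumed mitigation shown here; it is not the cited paper's method.

```python
# Hedged sketch: nearest-prototype few-shot classification with an optional
# robust (median) prototype to dampen the effect of mislabeled support samples.
import torch

def prototypes(support_emb, support_labels, num_classes, robust=False):
    protos = []
    for c in range(num_classes):
        emb_c = support_emb[support_labels == c]
        protos.append(emb_c.median(dim=0).values if robust else emb_c.mean(dim=0))
    return torch.stack(protos)

def classify(query_emb, protos):
    # Assign each query to the nearest prototype by Euclidean distance.
    dists = torch.cdist(query_emb, protos)
    return dists.argmin(dim=1)

emb_dim, num_classes = 64, 5
support_emb = torch.randn(25, emb_dim)          # 5-way 5-shot support set
support_labels = torch.arange(num_classes).repeat_interleave(5)
query_emb = torch.randn(10, emb_dim)

protos = prototypes(support_emb, support_labels, num_classes, robust=True)
print(classify(query_emb, protos))
```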