Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes
Deploying large language models (LLMs) is challenging because they are memory
inefficient and compute-intensive for practical applications. In reaction, researchers train …
A metaverse: Taxonomy, components, applications, and open challenges
SM Park, YG Kim - IEEE access, 2022 - ieeexplore.ieee.org
Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z that online and offline selves are not different …
R-drop: Regularized dropout for neural networks
Dropout is a powerful and widely used technique to regularize the training of deep neural
networks. Though effective and performing well, the randomness introduced by dropout …
LoSparse: Structured compression of large language models based on low-rank and sparse approximation
Transformer models have achieved remarkable results in various natural language tasks,
but they are often prohibitively large, requiring massive memories and computational …
A survey on transformer compression
Large models based on the Transformer architecture play increasingly vital roles in artificial
intelligence, particularly within the realms of natural language processing (NLP) and …
MiniLLM: Knowledge distillation of large language models
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …
Less is more: Task-aware layer-wise distillation for language model compression
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (i.e., student models). The student distills knowledge from the teacher by …
Towards fair federated learning with zero-shot data augmentation
Federated learning has emerged as an important distributed learning paradigm, where a
server aggregates a global model from many client-trained models, while having no access …
An empirical study on fine-tuning large language models of code for automated program repair
The advent of large language models (LLMs) has opened up new opportunities for
automated program repair (APR). In particular, some recent studies have explored how to …
Few-shot learning with noisy labels
Few-shot learning (FSL) methods typically assume clean support sets with accurately
labeled samples when training on novel classes. This assumption can often be unrealistic …