Challenges and applications of large language models
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …
QLoRA: Efficient finetuning of quantized LLMs
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to
finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit …
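As a concrete illustration, here is a minimal sketch of the QLoRA recipe on the Hugging Face transformers, peft, and bitsandbytes stack: the base model is loaded as frozen 4-bit NF4 weights with double quantization, and only small LoRA adapters are trained. The model name and LoRA hyperparameters below are illustrative, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Frozen base model in 4-bit NF4 with double quantization,
# computing in bfloat16 (the QLoRA recipe). Model name is illustrative.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b", quantization_config=bnb_config, device_map="auto"
)

# Attach small trainable LoRA adapters; only these are updated during finetuning.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```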
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
Large language models (LLMs) have shown excellent performance on various tasks, but the
astronomical model size raises the hardware barrier for serving (memory size) and slows …
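AWQ's core observation is that a small fraction of weight channels matters disproportionately, that these salient channels can be identified from activation magnitudes, and that they can be protected by per-channel scaling before quantization. Below is a toy sketch of that scaling idea; the function name, the alpha parameter, and the plain round-to-nearest quantizer are simplifications, not the released AWQ implementation.

```python
import torch

def awq_style_scale_and_quantize(w, act_abs_mean, alpha=0.5, n_bits=4):
    """Toy illustration of activation-aware scaling before weight quantization.

    w: (out_features, in_features) weight; act_abs_mean: (in_features,)
    per-channel mean |activation| from a calibration set.
    """
    # Channels that see large activations are scaled up before quantization
    # (the inverse scale is folded into the preceding activations),
    # which reduces their relative rounding error.
    s = act_abs_mean.clamp(min=1e-5) ** alpha
    w_scaled = w * s  # broadcasts over input channels

    # Plain per-output-channel symmetric round-to-nearest quantization.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w_scaled.abs().amax(dim=1, keepdim=True) / qmax
    w_q = torch.clamp(torch.round(w_scaled / scale), -qmax - 1, qmax)

    # Dequantize and undo the channel scaling to get the effective weights.
    return (w_q * scale) / s
```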
SmoothQuant: Accurate and efficient post-training quantization for large language models
Large language models (LLMs) show excellent performance but are compute- and memory-
intensive. Quantization can reduce memory and accelerate inference. However, existing …
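SmoothQuant's central transform migrates quantization difficulty from activations (which have outlier channels) to weights via a per-channel scale s_j = max|x_j|^alpha / max|w_j|^(1-alpha), leaving the layer's output mathematically unchanged. A minimal sketch, assuming calibration statistics have already been collected:

```python
import torch

def smooth(x_absmax, w, alpha=0.5):
    """Sketch of the SmoothQuant smoothing transform.

    x_absmax: (in_features,) per-channel activation |max| from calibration;
    w: (out_features, in_features) weight of the following linear layer.
    Returns the per-channel scale s and the smoothed weight; activations
    must be divided by s (typically folded into the preceding LayerNorm).
    """
    w_absmax = w.abs().amax(dim=0)  # per input channel
    # Migrate quantization difficulty from activations to weights:
    # s_j = max|x_j|^alpha / max|w_j|^(1 - alpha)
    s = (x_absmax.clamp(min=1e-5) ** alpha
         / w_absmax.clamp(min=1e-5) ** (1 - alpha))
    return s, w * s  # y = (w * s) @ (x / s) is mathematically unchanged
```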
GPT3.int8(): 8-bit matrix multiplication for transformers at scale
Large language models have been widely adopted but require significant GPU memory for
inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …
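The key mechanism is a mixed-precision decomposition: the few feature dimensions containing outlier magnitudes are computed in higher precision, while everything else goes through int8 with vector-wise absmax scaling. A simplified sketch of that decomposition follows; the function name and pure-PyTorch arithmetic are illustrative, whereas the paper ships fused CUDA kernels.

```python
import torch

def int8_matmul_with_outliers(x, w, threshold=6.0):
    """Sketch of the GPT3.int8() mixed-precision decomposition.

    x: (tokens, features) float activations; w: (features, out) float
    weights. Feature dimensions with any entry above `threshold` (the
    paper's outlier criterion) stay in high precision; the rest use
    int8 with vector-wise absmax scaling.
    """
    outlier = (x.abs() > threshold).any(dim=0)  # outlier feature dims

    # High-precision path for the few outlier dimensions.
    y_hi = x[:, outlier] @ w[outlier, :]

    # Int8 path: per-row scale for x, per-column scale for w.
    xr, wr = x[:, ~outlier], w[~outlier, :]
    sx = xr.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / 127.0
    sw = wr.abs().amax(dim=0, keepdim=True).clamp(min=1e-5) / 127.0
    xq = torch.round(xr / sx).to(torch.int8)
    wq = torch.round(wr / sw).to(torch.int8)
    y_lo = (xq.long() @ wq.long()).to(x.dtype) * (sx * sw)

    return y_hi + y_lo
```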
SparseGPT: Massive language models can be accurately pruned in one-shot
E. Frantar and D. Alistarh. International Conference on Machine Learning (ICML), 2023.
We show for the first time that large-scale generative pretrained transformer (GPT) family
models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal …
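For contrast with the paper's method, the sketch below shows the naive one-shot baseline: plain magnitude pruning to the stated 50% sparsity. SparseGPT's contribution is replacing this with a Hessian-based (OBS-style), layer-wise weight reconstruction, which is what keeps accuracy at GPT scale without any retraining.

```python
import torch

def one_shot_magnitude_prune(w, sparsity=0.5):
    """Baseline one-shot pruning to a target unstructured sparsity.

    Plain magnitude pruning for illustration only; SparseGPT itself
    selects and updates weights with second-order, layer-wise
    reconstruction rather than thresholding magnitudes.
    """
    k = int(w.numel() * sparsity)
    # Zero out the k smallest-magnitude weights.
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)
```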
FlexGen: High-throughput generative inference of large language models with a single GPU
The high computational and memory requirements of large language model (LLM) inference
make it feasible only with multiple high-end accelerators. Motivated by the emerging …
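FlexGen makes single-GPU inference feasible by offloading weights and KV cache to CPU RAM and disk and scheduling I/O against compute. The sketch below shows only the basic offloading loop, streaming one layer at a time onto the GPU; the real system's block schedule, KV-cache offloading, and overlap of transfer with compute are omitted.

```python
import torch

@torch.no_grad()
def offloaded_forward(layers, x, device="cuda"):
    """Sketch of weight offloading for single-GPU inference.

    layers: a list of nn.Module transformer blocks resident in CPU RAM.
    Streams one layer at a time onto the GPU so peak GPU memory stays
    at roughly one layer's worth of weights plus activations.
    """
    x = x.to(device)
    for layer in layers:      # layers live on CPU
        layer.to(device)      # stream weights in
        x = layer(x)          # compute on GPU
        layer.to("cpu")       # evict to free memory for the next layer
    return x
```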
GPTQ: Accurate post-training quantization for generative pre-trained transformers
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart
through breakthrough performance across complex language modelling tasks, but also by …
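GPTQ quantizes each weight matrix one input-channel column at a time, using approximate second-order information to spread each column's rounding error over the not-yet-quantized columns. A simplified sketch of that loop follows; the damping constant and direct matrix inverse are simplifications, as the released implementation uses Cholesky factors and lazy batched updates.

```python
import torch

def gptq_style_quantize(w, x_cal, n_bits=4, damp=0.01):
    """Simplified sketch of GPTQ's column-wise quantization loop.

    w: (out, in) weight; x_cal: (samples, in) calibration activations.
    Quantizes one column at a time and propagates the rounding error to
    the remaining columns via the inverse Hessian H = 2 X^T X (the
    OBS-style update).
    """
    H = 2 * x_cal.T @ x_cal
    H += damp * torch.mean(torch.diag(H)) * torch.eye(H.shape[0])
    Hinv = torch.linalg.inv(H)

    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / qmax
    wq = w.clone()
    for j in range(w.shape[1]):
        q = torch.clamp(torch.round(wq[:, j] / scale[:, 0]), -qmax - 1, qmax)
        q = q * scale[:, 0]
        err = (wq[:, j] - q) / Hinv[j, j]
        # Compensate the error on the not-yet-quantized columns.
        wq[:, j + 1:] -= err[:, None] * Hinv[j, j + 1:][None, :]
        wq[:, j] = q
    return wq
```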
Deja Vu: Contextual sparsity for efficient LLMs at inference time
Large language models (LLMs) with hundreds of billions of parameters have sparked a new
wave of exciting AI applications. However, they are computationally expensive at inference …
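Contextual sparsity means that, for a given input, only a small input-dependent subset of attention heads and MLP neurons is needed. A toy sketch of the MLP side, where a cheap learned predictor selects which neurons to compute; the predictor module and top-k size are illustrative assumptions, not the paper's exact architecture.

```python
import torch

def contextual_sparse_mlp(x, w1, w2, predictor, k=512):
    """Sketch of contextual sparsity in an MLP block.

    x: (hidden,) token representation; w1: (neurons, hidden);
    w2: (hidden, neurons); predictor: a cheap module that scores which
    neurons are likely to fire for this input. Only the top-k predicted
    neurons are computed, skipping the rest of the matmul entirely.
    """
    scores = predictor(x)            # (neurons,) relevance scores
    idx = scores.topk(k).indices     # neurons predicted to be active
    h = torch.relu(w1[idx] @ x)      # compute only the selected rows
    return w2[:, idx] @ h            # and the matching output columns
```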
Sparks of large audio models: A survey and outlook
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …