Conversational agents in therapeutic interventions for neurodevelopmental disorders: a survey
Neurodevelopmental Disorders (NDD) are a group of conditions with onset in the
developmental period characterized by deficits in the cognitive and social areas …
FlashAttention: Fast and memory-efficient exact attention with IO-awareness
Transformers are slow and memory-hungry on long sequences, since the time and memory
complexity of self-attention are quadratic in sequence length. Approximate attention …
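To make the quadratic cost concrete, here is a minimal NumPy sketch of standard attention (illustrative only, with assumed toy shapes; not code from the paper). It materializes the full n-by-n score matrix, which is exactly what FlashAttention avoids writing to slow GPU memory by tiling the computation.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard softmax attention; materializes an (n, n) score matrix,
    so time and memory grow quadratically with sequence length n."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # (n, d)

# Assumed toy sizes: doubling n quadruples the score-matrix footprint.
d = 64
for n in (1024, 2048):
    Q = K = V = np.random.randn(n, d).astype(np.float32)
    out = naive_attention(Q, K, V)
    print(f"n={n}: score matrix holds {n * n:,} floats")
```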
FlashAttention-2: Faster attention with better parallelism and work partitioning
T Dao - arXiv preprint arXiv:2307.08691, 2023 - arxiv.org
Scaling Transformers to longer sequence lengths has been a major problem in the last
several years, promising to improve performance in language modeling and high-resolution …
Larger language models do in-context learning differently
We study how in-context learning (ICL) in language models is affected by semantic priors
versus input-label mappings. We investigate two setups: ICL with flipped labels and ICL with …
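To make the flipped-label setup concrete, here is a hypothetical prompt builder (the task, names, and label format are assumptions, not the authors' code): the demonstrations pair each example with the opposite of its true label, so a model that learns the input-label mapping from context, rather than leaning on semantic priors, should follow the flipped convention.

```python
# Hypothetical illustration of the flipped-label in-context-learning setup.
demos = [
    ("The movie was wonderful.", "positive"),
    ("The food was awful.", "negative"),
]

def build_prompt(demos, query, flip=False):
    """Assemble a few-shot sentiment prompt; flip=True inverts the
    demonstration labels relative to their true values."""
    swap = {"positive": "negative", "negative": "positive"}
    lines = []
    for text, label in demos:
        shown = swap[label] if flip else label
        lines.append(f"Review: {text}\nSentiment: {shown}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

print(build_prompt(demos, "A delightful, moving film.", flip=True))
```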
Deja Vu: Contextual sparsity for efficient LLMs at inference time
Large language models (LLMs) with hundreds of billions of parameters have sparked a new
wave of exciting AI applications. However, they are computationally expensive at inference …
On efficient training of large-scale deep learning models: A literature review
The field of deep learning has witnessed significant progress, particularly in computer vision
(CV), natural language processing (NLP), and speech. The use of large-scale models …
Fast attention requires bounded entries
In modern machine learning, inner product attention computation is a fundamental task for
training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and …
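For reference, the inner-product attention computation referred to above can be written in the standard matrix form (notation assumed here, not copied from the paper):

```latex
\mathrm{Att}(Q, K, V) = D^{-1} \exp\!\left(QK^{\top}\right) V,
\qquad
D = \mathrm{diag}\!\left(\exp\!\left(QK^{\top}\right)\mathbf{1}_n\right),
```

where Q, K, V are n-by-d matrices, exp is applied entrywise, and 1_n is the all-ones vector. Forming the n-by-n product QK^T exactly takes time proportional to n^2 d; the title's claim is that sub-quadratic approximation of this computation is possible only when the entries of Q and K are suitably bounded.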
Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time
Large language models (LLMs) have sparked a new wave of exciting AI applications.
Hosting these models at scale requires significant memory resources. One crucial memory …
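The memory referred to above is dominated at serving time by the key-value (KV) cache kept for every generated token. A back-of-the-envelope sketch, using an assumed 7B-class configuration rather than figures from the paper, shows how it grows linearly in batch size and sequence length.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Size of the key-value cache: 2 tensors (K and V) per layer,
    each of shape (batch, n_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B-class configuration: 32 layers, 32 heads of dim 128, fp16 values.
gib = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128,
                     seq_len=4096, batch=8) / 2**30
print(f"KV cache: {gib:.1f} GiB")   # 16.0 GiB before any compression
```

Scissorhands' premise, per the title, is that only a persistent subset of tokens stays important for attention, so most of this cache can be dropped at test time.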
Monarch Mixer: A simple sub-quadratic GEMM-based architecture
Machine learning models are increasingly being scaled in both sequence length
and model dimension to reach longer contexts and better performance. However, existing …
SqueezeLLM: Dense-and-sparse quantization
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
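As a rough sketch of what a dense-and-sparse decomposition generally means (thresholds and uniform quantization are assumptions for illustration; the paper itself pairs the split with a sensitivity-based non-uniform scheme): the largest-magnitude outlier weights are kept in full precision as a sparse matrix, and the dense remainder is quantized to a low bit width.

```python
import numpy as np

def dense_and_sparse_quantize(W, outlier_pct=0.5, n_bits=4):
    """Generic dense-and-sparse split (illustrative, assumed parameters):
    keep the largest-magnitude weights full precision in a sparse matrix,
    and uniformly quantize the dense remainder to n_bits."""
    thresh = np.percentile(np.abs(W), 100 - outlier_pct)
    outlier_mask = np.abs(W) >= thresh
    sparse = np.where(outlier_mask, W, 0.0)      # few full-precision outliers
    dense = np.where(outlier_mask, 0.0, W)       # the rest, to be quantized
    scale = np.abs(dense).max() / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(dense / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale + sparse                    # dequantized dense + sparse outliers

W = np.random.randn(512, 512).astype(np.float32)
W_hat = dense_and_sparse_quantize(W)
print("mean abs reconstruction error:", np.abs(W - W_hat).mean())
```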