Conversational agents in therapeutic interventions for neurodevelopmental disorders: a survey
Neurodevelopmental Disorders (NDD) are a group of conditions with onset in the
developmental period characterized by deficits in the cognitive and social areas …
FlashAttention: Fast and memory-efficient exact attention with IO-awareness
Transformers are slow and memory-hungry on long sequences, since the time and memory
complexity of self-attention are quadratic in sequence length. Approximate attention …
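To make the quadratic cost concrete, here is a minimal NumPy sketch of standard attention (illustrative only, with assumed toy shapes; not code from the paper). It materializes the full n-by-n score matrix, which is exactly what FlashAttention avoids writing to slow GPU memory by tiling the computation.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard softmax attention; materializes an (n, n) score matrix,
    so time and memory grow quadratically with sequence length n."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # (n, d)

# Assumed toy sizes: doubling n quadruples the score-matrix footprint.
d = 64
for n in (1024, 2048):
    Q = K = V = np.random.randn(n, d).astype(np.float32)
    out = naive_attention(Q, K, V)
    print(f"n={n}: score matrix holds {n * n:,} floats")
```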
FlashAttention-2: Faster attention with better parallelism and work partitioning
T Dao - arXiv preprint arXiv:2307.08691, 2023 - arxiv.org
Scaling Transformers to longer sequence lengths has been a major problem in the last
several years, promising to improve performance in language modeling and high-resolution …
Larger language models do in-context learning differently
We study how in-context learning (ICL) in language models is affected by semantic priors
versus input-label mappings. We investigate two setups: ICL with flipped labels and ICL with …
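To make the flipped-label setup concrete, here is a hypothetical prompt builder (the task, names, and label format are assumptions, not the authors' code): the demonstrations pair each example with the opposite of its true label, so a model that learns the input-label mapping from context, rather than leaning on semantic priors, should follow the flipped convention.

```python
# Hypothetical illustration of the flipped-label in-context-learning setup.
demos = [
    ("The movie was wonderful.", "positive"),
    ("The food was awful.", "negative"),
]

def build_prompt(demos, query, flip=False):
    """Assemble a few-shot sentiment prompt; flip=True inverts the
    demonstration labels relative to their true values."""
    swap = {"positive": "negative", "negative": "positive"}
    lines = []
    for text, label in demos:
        shown = swap[label] if flip else label
        lines.append(f"Review: {text}\nSentiment: {shown}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

print(build_prompt(demos, "A delightful, moving film.", flip=True))
```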
Deja Vu: Contextual sparsity for efficient LLMs at inference time
Large language models (LLMs) with hundreds of billions of parameters have sparked a new
wave of exciting AI applications. However, they are computationally expensive at inference …
On efficient training of large-scale deep learning models: A literature review
The field of deep learning has witnessed significant progress, particularly in computer vision
(CV), natural language processing (NLP), and speech. The use of large-scale models …
Fast attention requires bounded entries
In modern machine learning, inner product attention computation is a fundamental task for
training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and …
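For reference, the inner-product attention computation referred to above can be written in the standard matrix form (notation assumed here, not copied from the paper):

```latex
\mathrm{Att}(Q, K, V) = D^{-1} \exp\!\left(QK^{\top}\right) V,
\qquad
D = \mathrm{diag}\!\left(\exp\!\left(QK^{\top}\right)\mathbf{1}_n\right),
```

where Q, K, V are n-by-d matrices, exp is applied entrywise, and 1_n is the all-ones vector. Forming the n-by-n product QK^T exactly takes time proportional to n^2 d; the title's claim is that sub-quadratic approximation of this computation is possible only when the entries of Q and K are suitably bounded.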
Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time
Large language models (LLMs) have sparked a new wave of exciting AI applications.
Hosting these models at scale requires significant memory resources. One crucial memory …
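The memory referred to above is dominated at serving time by the key-value (KV) cache kept for every generated token. A back-of-the-envelope sketch, using an assumed 7B-class configuration rather than figures from the paper, shows how it grows linearly in batch size and sequence length.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Size of the key-value cache: 2 tensors (K and V) per layer,
    each of shape (batch, n_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B-class configuration: 32 layers, 32 heads of dim 128, fp16 values.
gib = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128,
                     seq_len=4096, batch=8) / 2**30
print(f"KV cache: {gib:.1f} GiB")   # 16.0 GiB before any compression
```

Scissorhands' premise, per the title, is that only a persistent subset of tokens stays important for attention, so most of this cache can be dropped at test time.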
Monarch Mixer: A simple sub-quadratic GEMM-based architecture
Machine learning models are increasingly being scaled in both sequence length
and model dimension to reach longer contexts and better performance. However, existing …
SqueezeLLM: Dense-and-sparse quantization
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
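As a rough sketch of what a dense-and-sparse decomposition generally means (thresholds and uniform quantization are assumptions for illustration; the paper itself pairs the split with a sensitivity-based non-uniform scheme): the largest-magnitude outlier weights are kept in full precision as a sparse matrix, and the dense remainder is quantized to a low bit width.

```python
import numpy as np

def dense_and_sparse_quantize(W, outlier_pct=0.5, n_bits=4):
    """Generic dense-and-sparse split (illustrative, assumed parameters):
    keep the largest-magnitude weights full precision in a sparse matrix,
    and uniformly quantize the dense remainder to n_bits."""
    thresh = np.percentile(np.abs(W), 100 - outlier_pct)
    outlier_mask = np.abs(W) >= thresh
    sparse = np.where(outlier_mask, W, 0.0)      # few full-precision outliers
    dense = np.where(outlier_mask, 0.0, W)       # the rest, to be quantized
    scale = np.abs(dense).max() / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(dense / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale + sparse                    # dequantized dense + sparse outliers

W = np.random.randn(512, 512).astype(np.float32)
W_hat = dense_and_sparse_quantize(W)
print("mean abs reconstruction error:", np.abs(W - W_hat).mean())
```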