Larger language models do in-context learning differently

J Wei, J Wei, Y Tay, D Tran, A Webson, Y Lu… - arXiv preprint arXiv …, 2023 - arxiv.org
We study how in-context learning (ICL) in language models is affected by semantic priors
versus input-label mappings. We investigate two setups: ICL with flipped labels and ICL with …
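
To make the flipped-label setup concrete, the sketch below builds a few-shot prompt whose demonstration labels are inverted before the query is appended. The sentiment task, label names, and prompt template are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of a flipped-label in-context learning (ICL) prompt.
# Task, label names, and template are hypothetical, for illustration only.

demos = [
    ("The movie was wonderful.", "positive"),
    ("I hated every minute of it.", "negative"),
    ("A truly delightful experience.", "positive"),
    ("The plot made no sense at all.", "negative"),
]

FLIP = {"positive": "negative", "negative": "positive"}

def build_prompt(demos, query, flip_labels=False):
    """Concatenate demonstrations (optionally with flipped labels) and a query."""
    lines = []
    for text, label in demos:
        shown = FLIP[label] if flip_labels else label
        lines.append(f"Review: {text}\nSentiment: {shown}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

print(build_prompt(demos, "An instant classic.", flip_labels=True))
```

A model that relies on semantic priors should still answer "positive" for the query; a model that follows the in-context input-label mapping should answer "negative".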

Tensor attention training: Provably efficient learning of higher-order transformers

Y Liang, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2405.16411, 2024 - arxiv.org
Tensor Attention, a multi-view attention that is able to capture high-order correlations among
multiple modalities, can overcome the representational limitations of classical matrix …
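
For intuition about what a higher-order, multi-view attention can look like, the sketch below scores every (query, key1, key2) triple with a trilinear form and normalizes jointly over the two key axes. This is one plausible formulation written for illustration; the paper's precise definition of Tensor Attention may differ.

```python
import numpy as np

def tensor_attention(Q, K1, K2, V1, V2):
    """Illustrative third-order attention over two 'views' (e.g. two modalities).

    Q: (n, d) queries; K1, V1: (m, d) first view; K2, V2: (p, d) second view.
    Scores s[i, j, k] = sum_d Q[i, d] * K1[j, d] * K2[k, d], softmax over (j, k),
    values combined elementwise across the two views.
    A hedged sketch, not the paper's exact construction.
    """
    d = Q.shape[1]
    scores = np.einsum("id,jd,kd->ijk", Q, K1, K2) / np.sqrt(d)  # (n, m, p)
    flat = scores.reshape(scores.shape[0], -1)
    flat = flat - flat.max(axis=1, keepdims=True)                # stable softmax
    probs = (np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)).reshape(scores.shape)
    # Weighted sum of pairwise value interactions V1[j] * V2[k] -> (n, d)
    return np.einsum("ijk,jd,kd->id", probs, V1, V2)

rng = np.random.default_rng(0)
n, m, p, d = 4, 5, 6, 8
out = tensor_attention(rng.normal(size=(n, d)), rng.normal(size=(m, d)),
                       rng.normal(size=(p, d)), rng.normal(size=(m, d)),
                       rng.normal(size=(p, d)))
print(out.shape)  # (4, 8)
```

The (n, m, p) score tensor already hints at the cubic cost that motivates the paper's efficiency analysis.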

Algorithm and hardness for dynamic attention maintenance in large language models

J Brand, Z Song, T Zhou - arXiv preprint arXiv:2304.02207, 2023 - arxiv.org
Large language models (LLMs) have brought fundamental changes to human life. The
attention scheme is one of the key components across all LLMs, such as BERT, GPT-1 …

Multi-layer transformers gradient can be approximated in almost linear time

Y Liang, Z Sha, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2408.13233, 2024 - arxiv.org
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …

Is a picture worth a thousand words? Delving into spatial reasoning for vision language models

J Wang, Y Ming, Z Shi, V Vineet, X Wang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) and vision-language models (VLMs) have demonstrated
remarkable performance across a wide range of tasks and domains. Despite this promise …

Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent

B Chen, X Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2410.11268, 2024 - arxiv.org
In-context learning has been recognized as a key factor in the success of Large Language
Models (LLMs). It refers to the model's ability to learn patterns on the fly from provided in …
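
Since the result concerns looped transformers emulating multi-step gradient descent in context, the sketch below runs plain multi-step gradient descent on in-context (x, y) examples for linear regression, the common toy setting in this line of work. It is a stand-in for intuition only; the paper's looped-transformer construction is more involved.

```python
import numpy as np

def icl_multistep_gd(X, y, steps=200, lr=0.1):
    """Multi-step gradient descent on least squares over in-context examples.

    X: (n, d) in-context inputs, y: (n,) targets. Returns the weights after
    `steps` gradient steps; a plain-GD proxy for what a looped transformer
    is argued to emulate loop by loop.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n   # gradient of 0.5 * mean squared error
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X = rng.normal(size=(32, 5))
y = X @ w_true
w_hat = icl_multistep_gd(X, y)
print(np.linalg.norm(w_hat - w_true))  # small after enough steps
```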

Differentially private attention computation

Y Gao, Z Song, X Yang, Y Zhou - arXiv preprint arXiv:2305.04701, 2023 - arxiv.org
Large language models (LLMs) have had a profound impact on numerous aspects of daily
life including natural language processing, content generation, research methodologies and …

Differential privacy mechanisms in neural tangent kernel regression

J Gu, Y Liang, Z Sha, Z Shi, Z Song - arXiv preprint arXiv:2407.13621, 2024 - arxiv.org
Training data privacy is a fundamental problem in modern Artificial Intelligence (AI)
applications, such as face recognition, recommendation systems, language generation, and …

Exploring the frontiers of softmax: Provable optimization, applications in diffusion model, and beyond

J Gu, C Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2405.03251, 2024 - arxiv.org
The softmax activation function plays a crucial role in the success of large language models
(LLMs), particularly in the self-attention mechanism of the widely adopted Transformer …
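
For reference, the softmax in question sits inside the standard single-head self-attention computation, softmax(QK^T / sqrt(d)) V, written out below. This is the generic Transformer attention, not anything specific to this paper's analysis.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Standard single-head softmax self-attention: softmax(QK^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (n, n) attention weights
    return A @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (6, 16)
```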

Do large language models have compositional ability? An investigation into limitations and scalability

Z Xu, Z Shi, Y Liang - ICLR 2024 Workshop on Mathematical and …, 2024 - openreview.net
Large language models (LLMs) have emerged as powerful tools exhibiting remarkable in-
context learning (ICL) capabilities. In this study, we delve into the ICL capabilities of LLMs on …