Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task

M Okawa, ES Lubana, R Dick… - Advances in Neural …, 2024 - proceedings.neurips.cc
Modern generative models exhibit unprecedented capabilities to generate extremely
realistic data. However, given the inherent compositionality of the real world, reliable use of …

What makes and breaks safety fine-tuning? A mechanistic study

S Jain, ES Lubana, K Oksuz, T Joy, PHS Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for
their safe deployment. To better understand the underlying factors that make models safe via …

Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing

Z Zhang, P Lin, Z Wang, Y Zhang, ZQJ Xu - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have shown impressive capabilities across various tasks, but their
performance on compositional problems remains a topic of debate. In this work, we …

Towards an Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model

M Khona, M Okawa, J Hula, R Ramesh, K Nishi… - arXiv preprint arXiv …, 2024 - arxiv.org
Stepwise inference protocols, such as scratchpads and chain-of-thought, help language
models solve complex problems by decomposing them into a sequence of simpler …

Stress-Testing Capability Elicitation With Password-Locked Models

R Greenblatt, F Roger, D Krasheninnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
To determine the safety of large language models (LLMs), AI developers must be able to
assess their dangerous capabilities. But simple prompting strategies often fail to elicit an …