Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task

M Okawa, ES Lubana, R Dick… - Advances in Neural …, 2024 - proceedings.neurips.cc
Modern generative models exhibit unprecedented capabilities to generate extremely
realistic data. However, given the inherent compositionality of the real world, reliable use of …

What makes and breaks safety fine-tuning? A mechanistic study

S Jain, ES Lubana, K Oksuz, T Joy, PHS Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for
their safe deployment. To better understand the underlying factors that make models safe via …

Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing

Z Zhang, P Lin, Z Wang, Y Zhang, ZQJ Xu - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have shown impressive capabilities across various tasks, but their
performance on compositional problems remains a topic of debate. In this work, we …

Towards an Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model

M Khona, M Okawa, J Hula, R Ramesh, K Nishi… - arXiv preprint arXiv …, 2024 - arxiv.org
Stepwise inference protocols, such as scratchpads and chain-of-thought, help language
models solve complex problems by decomposing them into a sequence of simpler …

Stress-Testing Capability Elicitation With Password-Locked Models

R Greenblatt, F Roger, D Krasheninnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
To determine the safety of large language models (LLMs), AI developers must be able to
assess their dangerous capabilities. But simple prompting strategies often fail to elicit an …