Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task
Modern generative models exhibit unprecedented capabilities to generate extremely
realistic data. However, given the inherent compositionality of the real world, reliable use of …
What makes and breaks safety fine-tuning? A mechanistic study
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for
their safe deployment. To better understand the underlying factors that make models safe via …
Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing
Transformers have shown impressive capabilities across various tasks, but their
performance on compositional problems remains a topic of debate. In this work, we …
Towards an Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model
Stepwise inference protocols, such as scratchpads and chain-of-thought, help language
models solve complex problems by decomposing them into a sequence of simpler …
Stress-Testing Capability Elicitation With Password-Locked Models
R Greenblatt, F Roger, D Krasheninnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
To determine the safety of large language models (LLMs), AI developers must be able to
assess their dangerous capabilities. But simple prompting strategies often fail to elicit an …