Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
Rethinking machine unlearning for large language models
We explore machine unlearning (MU) in the domain of large language models (LLMs),
referred to as LLM unlearning. This initiative aims to eliminate undesirable data influence …
Privacy in large language models: Attacks, defenses and future directions
The advancement of large language models (LLMs) has significantly enhanced the ability to
effectively tackle various downstream NLP tasks and unify these tasks into generative …
JailbreakZoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models
The rapid evolution of artificial intelligence (AI) through developments in Large Language
Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements …
The instruction hierarchy: Training LLMs to prioritize privileged instructions
Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow
adversaries to overwrite a model's original instructions with their own malicious prompts. In …
StruQ: Defending against prompt injection with structured queries
Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated
applications, which perform text-based tasks by utilizing their advanced language …
Enhancing jailbreak attack against large language models through silent tokens
Along with the remarkable successes of large language models (LLMs), recent research has also
begun to explore the security threats of LLMs, including jailbreaking attacks. Attackers …
InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents
Recent work has embodied LLMs as agents, allowing them to access tools, perform actions,
and interact with external content (e.g., emails or websites). However, external content …
Cheating automatic LLM benchmarks: Null models achieve high win rates
Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench,
have become popular for evaluating language models due to their cost-effectiveness and …
Coercing LLMs to do and reveal (almost) anything
It has recently been shown that adversarial attacks on large language models (LLMs) can
"jailbreak" the model into making harmful statements. In this work, we argue that the spectrum …