Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Rethinking machine unlearning for large language models

S Liu, Y Yao, J Jia, S Casper, N Baracaldo… - arXiv preprint arXiv …, 2024 - arxiv.org
We explore machine unlearning (MU) in the domain of large language models (LLMs),
referred to as LLM unlearning. This initiative aims to eliminate undesirable data influence …

Privacy in large language models: Attacks, defenses and future directions

H Li, Y Chen, J Luo, J Wang, H Peng, Y Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
The advancement of large language models (LLMs) has significantly enhanced the ability to
effectively tackle various downstream NLP tasks and unify these tasks into generative …

Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models

H Jin, L Hu, X Li, P Zhang, C Chen, J Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of artificial intelligence (AI) through developments in Large Language
Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements …

The instruction hierarchy: Training llms to prioritize privileged instructions

E Wallace, K Xiao, R Leike, L Weng, J Heidecke… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow
adversaries to overwrite a model's original instructions with their own malicious prompts. In …

StruQ: Defending against prompt injection with structured queries

S Chen, J Piet, C Sitawarin, D Wagner - arXiv preprint arXiv:2402.06363, 2024 - arxiv.org
Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated
applications, which perform text-based tasks by utilizing their advanced language …

Enhancing jailbreak attack against large language models through silent tokens

J Yu, H Luo, JYC Hu, W Guo, H Liu, X Xing - arXiv preprint arXiv …, 2024 - arxiv.org
Along with the remarkable successes of Language language models, recent research also
started to explore the security threats of LLMs, including jailbreaking attacks. Attackers …

Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents

Q Zhan, Z Liang, Z Ying, D Kang - arXiv preprint arXiv:2403.02691, 2024 - arxiv.org
Recent work has embodied LLMs as agents, allowing them to access tools, perform actions,
and interact with external content (eg, emails or websites). However, external content …

Cheating automatic llm benchmarks: Null models achieve high win rates

X Zheng, T Pang, C Du, Q Liu, J Jiang, M Lin - arXiv preprint arXiv …, 2024 - arxiv.org
Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench,
have become popular for evaluating language models due to their cost-effectiveness and …

Coercing LLMs to do and reveal (almost) anything

J Geiping, A Stein, M Shu, K Saifullah, Y Wen… - arXiv preprint arXiv …, 2024 - arxiv.org
It has recently been shown that adversarial attacks on large language models (LLMs) can"
jailbreak" the model into making harmful statements. In this work, we argue that the spectrum …