HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu… - arXiv preprint arXiv …, 2024 - arxiv.org
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …
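
A standardized evaluation of automated red teaming comes down to wiring an attack, a target model, and a judge into one scoring loop. The sketch below is a minimal, generic picture of such a harness computing an attack success rate; the class and function names are illustrative placeholders, not HarmBench's actual interface.

```python
# Minimal sketch of a red-teaming evaluation harness: an attack proposes test
# cases for each harmful behavior, the target model completes them, and a
# judge labels each completion. Attack/target/judge are plain callables;
# nothing here is HarmBench's API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    behavior: str    # harmful behavior the attack tries to elicit
    test_case: str   # adversarial prompt produced by the attack
    completion: str  # target model's response
    harmful: bool    # judge's verdict on the completion


def attack_success_rate(
    behaviors: List[str],
    attack: Callable[[str], List[str]],   # behavior -> candidate prompts
    target: Callable[[str], str],         # prompt -> completion
    judge: Callable[[str, str], bool],    # (behavior, completion) -> harmful?
) -> float:
    """Fraction of behaviors for which at least one test case succeeded."""
    results: List[Result] = []
    for behavior in behaviors:
        for prompt in attack(behavior):
            completion = target(prompt)
            results.append(Result(behavior, prompt, completion,
                                  judge(behavior, completion)))
    elicited = {r.behavior for r in results if r.harmful}
    return len(elicited) / len(behaviors) if behaviors else 0.0
```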

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

M Samvelyan, SC Raparthy, A Lupu, E Hambro… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become increasingly prevalent across many real-world
applications, understanding and enhancing their robustness to user inputs is of paramount …
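
The title points to open-ended, diversity-driven prompt generation. One common way to frame that kind of search is a quality-diversity archive that keeps the strongest prompt found in each descriptor cell; the sketch below is a generic MAP-Elites-style loop under that assumption, with the descriptor dimensions, mutation operator, and scoring left as placeholders rather than the paper's actual components.

```python
# Generic MAP-Elites-style archive over prompt descriptors, illustrating a
# quality-diversity search for adversarial prompts. Descriptors, mutation,
# and scoring are placeholders (e.g. the mutator could be an LLM asked to
# rewrite a parent prompt).
import random
from typing import Callable, Dict, List, Tuple

Cell = Tuple[str, str]  # assumed descriptors, e.g. (risk_category, attack_style)


def diversity_search(
    seed_prompts: List[str],
    mutate: Callable[[str], str],     # rewrite a parent prompt
    describe: Callable[[str], Cell],  # map a prompt to its descriptor cell
    score: Callable[[str], float],    # higher = more effective attack prompt
    iterations: int = 1000,
) -> Dict[Cell, Tuple[str, float]]:
    """Keep the best-scoring prompt found so far in each descriptor cell."""
    archive: Dict[Cell, Tuple[str, float]] = {}
    for prompt in seed_prompts:
        archive[describe(prompt)] = (prompt, score(prompt))
    for _ in range(iterations):
        parent, _ = random.choice(list(archive.values()))
        child = mutate(parent)
        cell, child_score = describe(child), score(child)
        if cell not in archive or child_score > archive[cell][1]:
            archive[cell] = (child, child_score)  # replace the weaker elite
    return archive
```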

Red-Teaming for Generative AI: Silver Bullet or Security Theater?

M Feffer, A Sinha, ZC Lipton, H Heidari - arXiv preprint arXiv:2401.15897, 2024 - arxiv.org
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …

ArtPrompt: ASCII Art-Based Jailbreak Attacks Against Aligned LLMs

F Jiang, Z Xu, L Niu, Z Xiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety is critical to the use of large language models (LLMs). Multiple techniques such as
data filtering and supervised fine-tuning have been developed to strengthen LLM safety …
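
The title suggests the core idea: a keyword is conveyed as ASCII art rather than plain text. The toy snippet below shows what such a prompt transformation could look like on a benign example using the pyfiglet library; it is a sketch of the general concept, not the paper's attack pipeline, and the prompt template is invented here.

```python
# Toy illustration of ASCII-art keyword masking on a benign example.
# Requires `pip install pyfiglet`; the template text is an assumption.
import pyfiglet


def mask_with_ascii_art(instruction: str, keyword: str) -> str:
    """Replace `keyword` with [MASK] and append its ASCII-art rendering."""
    art = pyfiglet.figlet_format(keyword, font="standard")
    masked = instruction.replace(keyword, "[MASK]")
    return (
        f"{masked}\n\n"
        "The word replaced by [MASK] is written below as ASCII art; "
        "decode it before answering.\n\n"
        f"{art}"
    )


print(mask_with_ascii_art("Explain how a firewall filters traffic", "firewall"))
```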

A Safe Harbor for AI Evaluation and Red Teaming

S Longpre, S Kapoor, K Klyman, A Ramaswami… - arXiv preprint arXiv …, 2024 - arxiv.org
Independent evaluation and red teaming are critical for identifying the risks posed by
generative AI systems. However, the terms of service and enforcement strategies used by …

JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models

H Jin, L Hu, X Li, P Zhang, C Chen, J Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of artificial intelligence (AI) through developments in Large Language
Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements …

Securing Large Language Models: Threats, Vulnerabilities and Responsible Practices

S Abdali, R Anarfi, CJ Barberan, J He - arXiv preprint arXiv:2403.12503, 2024 - arxiv.org
Large language models (LLMs) have significantly transformed the landscape of Natural
Language Processing (NLP). Their impact extends across a diverse spectrum of tasks …

Safeguarding Large Language Models: A Survey

Y Dong, R Mu, Y Zhang, S Sun, T Zhang, C Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the burgeoning field of Large Language Models (LLMs), developing a robust safety
mechanism, colloquially known as "safeguards" or "guardrails", has become imperative to …
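
In its simplest form, a guardrail is a filter wrapped around the model call. The sketch below shows one minimal way to structure such a wrapper; the input filter, output filter, and refusal message are assumed placeholders rather than any particular guardrail system.

```python
# Minimal guardrail wrapper: screen the prompt, call the model, screen the
# response. The filters and refusal text are assumed placeholders.
from typing import Callable


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_filter: Callable[[str], bool],   # True if the prompt is allowed
    output_filter: Callable[[str], bool],  # True if the response is allowed
    refusal: str = "Sorry, I can't help with that request.",
) -> str:
    if not input_filter(prompt):
        return refusal
    response = model(prompt)
    return response if output_filter(response) else refusal
```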

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

S Yi, Y Liu, Z Sun, T Cong, X He, J Song, K Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have performed exceptionally well in various text-generation
tasks, including question answering, translation, and code completion. However, the over …

DART: Deep Adversarial Automated Red Teaming for LLM Safety

B Jiang, Y Jing, T Shen, Q Yang, D Xiong - arXiv preprint arXiv …, 2024 - arxiv.org
Manual red teaming is a commonly used method to identify vulnerabilities in large
language models (LLMs), which is costly and unscalable. In contrast, automated red …
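
Adversarial automated red teaming is usually described as an iterative loop between an attacker model and the target being hardened. The sketch below shows a generic version of that loop with assumed placeholder functions for attack generation, judging, and model updates; it is not DART's training procedure.

```python
# Generic adversarial red-teaming loop: the attacker proposes prompts, a judge
# flags the ones that elicit unsafe responses, and both models are updated
# before the next round. All update functions are assumed placeholders.
from typing import Callable, List


def adversarial_loop(
    propose_attacks: Callable[[], List[str]],      # red-team model
    target_respond: Callable[[str], str],          # current target model
    is_unsafe: Callable[[str, str], bool],         # judge(prompt, response)
    update_target: Callable[[List[str]], None],    # harden on flagged prompts
    update_attacker: Callable[[List[str]], None],  # steer attacker to new gaps
    rounds: int = 5,
) -> None:
    for _ in range(rounds):
        prompts = propose_attacks()
        flagged = [p for p in prompts if is_unsafe(p, target_respond(p))]
        update_target(flagged)    # patch the vulnerabilities found this round
        update_attacker(flagged)  # keep the attacker adaptive
```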