HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu… - arXiv preprint arXiv …, 2024 - arxiv.org
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …
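
A standardized evaluation of automated red teaming comes down to wiring an attack, a target model, and a judge into one scoring loop. The sketch below is a minimal, generic picture of such a harness computing an attack success rate; the class and function names are illustrative placeholders, not HarmBench's actual interface.

```python
# Minimal sketch of a red-teaming evaluation harness: an attack proposes test
# cases for each harmful behavior, the target model completes them, and a
# judge labels each completion. Attack/target/judge are plain callables;
# nothing here is HarmBench's API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    behavior: str    # harmful behavior the attack tries to elicit
    test_case: str   # adversarial prompt produced by the attack
    completion: str  # target model's response
    harmful: bool    # judge's verdict on the completion


def attack_success_rate(
    behaviors: List[str],
    attack: Callable[[str], List[str]],   # behavior -> candidate prompts
    target: Callable[[str], str],         # prompt -> completion
    judge: Callable[[str, str], bool],    # (behavior, completion) -> harmful?
) -> float:
    """Fraction of behaviors for which at least one test case succeeded."""
    results: List[Result] = []
    for behavior in behaviors:
        for prompt in attack(behavior):
            completion = target(prompt)
            results.append(Result(behavior, prompt, completion,
                                  judge(behavior, completion)))
    elicited = {r.behavior for r in results if r.harmful}
    return len(elicited) / len(behaviors) if behaviors else 0.0
```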

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

M Samvelyan, SC Raparthy, A Lupu, E Hambro… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become increasingly prevalent across many real-world
applications, understanding and enhancing their robustness to user inputs is of paramount …
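
The title points to open-ended, diversity-driven prompt generation. One common way to frame that kind of search is a quality-diversity archive that keeps the strongest prompt found in each descriptor cell; the sketch below is a generic MAP-Elites-style loop under that assumption, with the descriptor dimensions, mutation operator, and scoring left as placeholders rather than the paper's actual components.

```python
# Generic MAP-Elites-style archive over prompt descriptors, illustrating a
# quality-diversity search for adversarial prompts. Descriptors, mutation,
# and scoring are placeholders (e.g. the mutator could be an LLM asked to
# rewrite a parent prompt).
import random
from typing import Callable, Dict, List, Tuple

Cell = Tuple[str, str]  # assumed descriptors, e.g. (risk_category, attack_style)


def diversity_search(
    seed_prompts: List[str],
    mutate: Callable[[str], str],     # rewrite a parent prompt
    describe: Callable[[str], Cell],  # map a prompt to its descriptor cell
    score: Callable[[str], float],    # higher = more effective attack prompt
    iterations: int = 1000,
) -> Dict[Cell, Tuple[str, float]]:
    """Keep the best-scoring prompt found so far in each descriptor cell."""
    archive: Dict[Cell, Tuple[str, float]] = {}
    for prompt in seed_prompts:
        archive[describe(prompt)] = (prompt, score(prompt))
    for _ in range(iterations):
        parent, _ = random.choice(list(archive.values()))
        child = mutate(parent)
        cell, child_score = describe(child), score(child)
        if cell not in archive or child_score > archive[cell][1]:
            archive[cell] = (child, child_score)  # replace the weaker elite
    return archive
```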

Red-Teaming for Generative AI: Silver Bullet or Security Theater?

M Feffer, A Sinha, ZC Lipton, H Heidari - arXiv preprint arXiv:2401.15897, 2024 - arxiv.org
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …

ArtPrompt: ASCII Art-Based Jailbreak Attacks Against Aligned LLMs

F Jiang, Z Xu, L Niu, Z Xiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety is critical to the use of large language models (LLMs). Multiple techniques such as
data filtering and supervised fine-tuning have been developed to strengthen LLM safety …
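
The title suggests the core idea: a keyword is conveyed as ASCII art rather than plain text. The toy snippet below shows what such a prompt transformation could look like on a benign example using the pyfiglet library; it is a sketch of the general concept, not the paper's attack pipeline, and the prompt template is invented here.

```python
# Toy illustration of ASCII-art keyword masking on a benign example.
# Requires `pip install pyfiglet`; the template text is an assumption.
import pyfiglet


def mask_with_ascii_art(instruction: str, keyword: str) -> str:
    """Replace `keyword` with [MASK] and append its ASCII-art rendering."""
    art = pyfiglet.figlet_format(keyword, font="standard")
    masked = instruction.replace(keyword, "[MASK]")
    return (
        f"{masked}\n\n"
        "The word replaced by [MASK] is written below as ASCII art; "
        "decode it before answering.\n\n"
        f"{art}"
    )


print(mask_with_ascii_art("Explain how a firewall filters traffic", "firewall"))
```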

A Safe Harbor for AI Evaluation and Red Teaming

S Longpre, S Kapoor, K Klyman, A Ramaswami… - arXiv preprint arXiv …, 2024 - arxiv.org
Independent evaluation and red teaming are critical for identifying the risks posed by
generative AI systems. However, the terms of service and enforcement strategies used by …

JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models

H Jin, L Hu, X Li, P Zhang, C Chen, J Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of artificial intelligence (AI) through developments in Large Language
Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements …

Securing Large Language Models: Threats, Vulnerabilities and Responsible Practices

S Abdali, R Anarfi, CJ Barberan, J He - arXiv preprint arXiv:2403.12503, 2024 - arxiv.org
Large language models (LLMs) have significantly transformed the landscape of Natural
Language Processing (NLP). Their impact extends across a diverse spectrum of tasks …

Safeguarding Large Language Models: A Survey

Y Dong, R Mu, Y Zhang, S Sun, T Zhang, C Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the burgeoning field of Large Language Models (LLMs), developing a robust safety
mechanism, colloquially known as "safeguards" or "guardrails", has become imperative to …
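
In its simplest form, a guardrail is a filter wrapped around the model call. The sketch below shows one minimal way to structure such a wrapper; the input filter, output filter, and refusal message are assumed placeholders rather than any particular guardrail system.

```python
# Minimal guardrail wrapper: screen the prompt, call the model, screen the
# response. The filters and refusal text are assumed placeholders.
from typing import Callable


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_filter: Callable[[str], bool],   # True if the prompt is allowed
    output_filter: Callable[[str], bool],  # True if the response is allowed
    refusal: str = "Sorry, I can't help with that request.",
) -> str:
    if not input_filter(prompt):
        return refusal
    response = model(prompt)
    return response if output_filter(response) else refusal
```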

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

S Yi, Y Liu, Z Sun, T Cong, X He, J Song, K Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have performed exceptionally well in various text-generation
tasks, including question answering, translation, and code completion. However, the over …

DART: Deep Adversarial Automated Red Teaming for LLM Safety

B Jiang, Y Jing, T Shen, Q Yang, D Xiong - arXiv preprint arXiv …, 2024 - arxiv.org
Manual red teaming is a commonly used method to identify vulnerabilities in large
language models (LLMs), which is costly and unscalable. In contrast, automated red …
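
Adversarial automated red teaming is usually described as an iterative loop between an attacker model and the target being hardened. The sketch below shows a generic version of that loop with assumed placeholder functions for attack generation, judging, and model updates; it is not DART's training procedure.

```python
# Generic adversarial red-teaming loop: the attacker proposes prompts, a judge
# flags the ones that elicit unsafe responses, and both models are updated
# before the next round. All update functions are assumed placeholders.
from typing import Callable, List


def adversarial_loop(
    propose_attacks: Callable[[], List[str]],      # red-team model
    target_respond: Callable[[str], str],          # current target model
    is_unsafe: Callable[[str, str], bool],         # judge(prompt, response)
    update_target: Callable[[List[str]], None],    # harden on flagged prompts
    update_attacker: Callable[[List[str]], None],  # steer attacker to new gaps
    rounds: int = 5,
) -> None:
    for _ in range(rounds):
        prompts = propose_attacks()
        flagged = [p for p in prompts if is_unsafe(p, target_respond(p))]
        update_target(flagged)    # patch the vulnerabilities found this round
        update_attacker(flagged)  # keep the attacker adaptive
```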