HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
As large language models (LLMs) become increasingly prevalent across many real-world
applications, understanding and enhancing their robustness to user inputs is of paramount …
Red-Teaming for Generative AI: Silver Bullet or Security Theater?
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …
ArtPrompt: ASCII Art-Based Jailbreak Attacks Against Aligned LLMs
Safety is critical to the usage of large language models (LLMs). Multiple techniques such as
data filtering and supervised fine-tuning have been developed to strengthen LLM safety …
A Safe Harbor for AI Evaluation and Red Teaming
Independent evaluation and red teaming are critical for identifying the risks posed by
generative AI systems. However, the terms of service and enforcement strategies used by …
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
The rapid evolution of artificial intelligence (AI) through developments in Large Language
Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements …
Securing Large Language Models: Threats, Vulnerabilities and Responsible Practices
Large language models (LLMs) have significantly transformed the landscape of Natural
Language Processing (NLP). Their impact extends across a diverse spectrum of tasks …
Safeguarding Large Language Models: A Survey
In the burgeoning field of Large Language Models (LLMs), developing a robust safety
mechanism, colloquially known as "safeguards" or "guardrails", has become imperative to …
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Large Language Models (LLMs) have performed exceptionally in various text-generative
tasks, including question answering, translation, code completion, etc. However, the over …
DART: Deep Adversarial Automated Red Teaming for LLM Safety
Manual red teaming is a commonly used method to identify vulnerabilities in large
language models (LLMs), which is costly and unscalable. In contrast, automated red …