TrustLLM: Trustworthiness in large language models
Large language models (LLMs), exemplified by ChatGPT, have gained considerable
attention for their excellent natural language processing capabilities. Nonetheless, these …
Jailbreak and guard aligned language models with only few in-context demonstrations
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their
safety and the risk of generating harmful content remain pressing concerns. In this paper, we …
Certifying LLM safety against adversarial prompting
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious
tokens to an input prompt to bypass the safety guardrails of an LLM and cause it to produce …
Position: TrustLLM: Trustworthiness in large language models
Large language models (LLMs) have gained considerable attention for their excellent
natural language processing capabilities. Nonetheless, these LLMs present many …
Red-Teaming for generative AI: Silver bullet or security theater?
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …
Defending against alignment-breaking attacks via robustly aligned LLM
Recently, Large Language Models (LLMs) have made significant advancements and are
now widely used across various domains. Unfortunately, there has been a rising concern …
Tree of attacks: Jailbreaking black-box LLMs automatically
Abstract While Large Language Models (LLMs) display versatile functionality, they continue
to generate harmful, biased, and toxic content, as demonstrated by the prevalence of …
Jailbreak attacks and defenses against large language models: A survey
Large Language Models (LLMs) have performed exceptionally in various text-generative
tasks, including question answering, translation, code completion, etc. However, the over …
Optimization-based prompt injection attack to LLM-as-a-Judge
LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set
of candidates for a given question. LLM-as-a-Judge has many applications such as LLM …
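The snippet above describes the LLM-as-a-Judge setup: one LLM is asked to pick the best response from a set of candidates for a given question. A minimal sketch of that selection loop in Python is shown below; the model name, prompt wording, and OpenAI client usage are illustrative assumptions, not details taken from the cited paper.

    # Hypothetical LLM-as-a-Judge sketch: ask a judge model to pick the best
    # candidate answer to a question. Model name and prompt are assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def judge_best_response(question: str, candidates: list[str],
                            model: str = "gpt-4o-mini") -> int:
        """Return the index of the candidate the judge model prefers."""
        numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
        prompt = (
            "You are a judge. Given the question and the numbered candidate "
            "answers, reply with only the number of the best answer.\n\n"
            f"Question: {question}\n\nCandidates:\n{numbered}"
        )
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        text = reply.choices[0].message.content.strip()
        digits = "".join(ch for ch in text if ch.isdigit())  # tolerate verbose replies
        return int(digits) if digits else 0

A prompt injection attack of the kind the paper studies would append adversarial text to one candidate so that the judge's selection is biased toward it, regardless of that candidate's actual quality.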
Assessing the brittleness of safety alignment via pruning and low-rank modifications
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as
evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This …