Combating misinformation in the age of LLMs: Opportunities and challenges

C Chen, K Shu - AI Magazine, 2023 - Wiley Online Library
Misinformation such as fake news and rumors is a serious threat to information ecosystems
and public trust. The emergence of large language models (LLMs) has great potential to …

A survey of GPT-3 family large language models including ChatGPT and GPT-4

KS Kalyan - Natural Language Processing Journal, 2023 - Elsevier
Large language models (LLMs) are a special class of pretrained language models (PLMs)
obtained by scaling model size, pretraining corpus and computation. LLMs, because of their …

Baseline defenses for adversarial attacks against aligned language models

N Jain, A Schwarzschild, Y Wen, G Somepalli… - arXiv preprint arXiv …, 2023 - arxiv.org
As Large Language Models quickly become ubiquitous, their security vulnerabilities are
critical to understand. Recent work shows that text optimizers can produce jailbreaking …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

MART: Improving LLM safety with multi-round automatic red-teaming

S Ge, C Zhou, R Hou, M Khabsa, YC Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Red-teaming is a common practice for mitigating unsafe behaviors in Large Language
Models (LLMs), which involves thoroughly assessing LLMs to identify potential flaws and …

Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies

L Pan, M Saxon, W Xu, D Nathani, X Wang… - Transactions of the …, 2024 - direct.mit.edu
While large language models (LLMs) have shown remarkable effectiveness in various NLP
tasks, they are still prone to issues such as hallucination, unfaithful reasoning, and toxicity. A …

Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?

YL Tsai, CY Hsu, C Xie, CH Lin, JY Chen, B Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Diffusion models for text-to-image (T2I) synthesis, such as Stable Diffusion (SD), have
recently demonstrated exceptional capabilities for generating high-quality content. However …

Red-Teaming for Generative AI: Silver Bullet or Security Theater?

M Feffer, A Sinha, ZC Lipton, H Heidari - arXiv preprint arXiv:2401.15897, 2024 - arxiv.org
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …

Confidence matters: Revisiting intrinsic self-correction capabilities of large language models

L Li, G Chen, Y Su, Z Chen, Y Zhang, E Xing… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent success of Large Language Models (LLMs) has catalyzed an increasing interest
in their self-correction capabilities. This paper presents a comprehensive investigation into …

Exploring safety generalization challenges of large language models via code

Q Ren, C Gao, J Shao, J Yan, X Tan, W Lam… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has brought about remarkable
capabilities in natural language processing but also raised concerns about their potential …