Language model behavior: A comprehensive survey

TA Chang, BK Bergen - Computational Linguistics, 2024 - direct.mit.edu
Transformer language models have received widespread public attention, yet their
generated text is often surprising even to NLP researchers. In this survey, we discuss over …

Visual adversarial examples jailbreak aligned large language models

X Qi, K Huang, A Panda, P Henderson… - Proceedings of the …, 2024 - ojs.aaai.org
Warning: this paper contains data, prompts, and model outputs that are offensive in nature.
Recently, there has been a surge of interest in integrating vision into Large Language …

Why so toxic? Measuring and triggering toxic behavior in open-domain chatbots

WM Si, M Backes, J Blackburn, E De Cristofaro… - Proceedings of the …, 2022 - dl.acm.org
Chatbots are used in many applications, e.g., automated agents, smart home assistants,
interactive characters in online games, etc. Therefore, it is crucial to ensure they do not …

Visual adversarial examples jailbreak large language models

X Qi, K Huang, A Panda, M Wang, P Mittal - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, there has been a surge of interest in introducing vision into Large Language
Models (LLMs). The proliferation of large Visual Language Models (VLMs), such as …

Flirt: Feedback loop in-context red teaming

N Mehrabi, P Goyal, C Dupuy, Q Hu, S Ghosh… - arXiv preprint arXiv …, 2023 - arxiv.org
Warning: this paper contains content that may be inappropriate or offensive. As generative
models become available for public use in various applications, testing and analyzing …

Robustness of models addressing Information Disorder: A comprehensive review and benchmarking study

G Fenza, V Loia, C Stanzione, M Di Gisi - Neurocomputing, 2024 - Elsevier
Machine learning and deep learning models are increasingly susceptible to
adversarial attacks, particularly in critical areas like cybersecurity and Information Disorder …

Beyond detection: a defend-and-summarize strategy for robust and interpretable rumor analysis on social media

YT Chang, YZ Song, YS Chen… - Proceedings of the 2023 …, 2023 - aclanthology.org
As the impact of social media gradually escalates, people are more likely to be exposed to
indistinguishable fake news. Therefore, numerous studies have attempted to detect rumors …

Run like a girl! Sports-related gender bias in language and vision

S Harrison, E Gualdoni, G Boleda - arXiv preprint arXiv:2305.14468, 2023 - arxiv.org
Gender bias in Language and Vision datasets and models has the potential to perpetuate
harmful stereotypes and discrimination. We analyze gender bias in two Language and …

Privacy preserving large language models: ChatGPT case study based vision and framework

I Ullah, N Hassan, SS Gill, B Suleiman… - arXiv preprint arXiv …, 2023 - arxiv.org
The generative Artificial Intelligence (AI) tools based on Large Language Models (LLMs) use
billions of parameters to extensively analyse large datasets and extract critical private …

Gradient-based language model red teaming

N Wichers, C Denison, A Beirami - arXiv preprint arXiv:2401.16656, 2024 - arxiv.org
Red teaming is a common strategy for identifying weaknesses in generative language
models (LMs), where adversarial prompts are produced that trigger an LM to generate …