Language model behavior: A comprehensive survey

TA Chang, BK Bergen - Computational Linguistics, 2024 - direct.mit.edu
Transformer language models have received widespread public attention, yet their
generated text is often surprising even to NLP researchers. In this survey, we discuss over …

Visual adversarial examples jailbreak aligned large language models

X Qi, K Huang, A Panda, P Henderson… - Proceedings of the …, 2024 - ojs.aaai.org
Warning: this paper contains data, prompts, and model outputs that are offensive in nature.
Recently, there has been a surge of interest in integrating vision into Large Language …

Why so toxic? Measuring and triggering toxic behavior in open-domain chatbots

WM Si, M Backes, J Blackburn, E De Cristofaro… - Proceedings of the …, 2022 - dl.acm.org
Chatbots are used in many applications, e.g., automated agents, smart home assistants,
interactive characters in online games, etc. Therefore, it is crucial to ensure they do not …

Visual adversarial examples jailbreak large language models

X Qi, K Huang, A Panda, M Wang, P Mittal - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, there has been a surge of interest in introducing vision into Large Language
Models (LLMs). The proliferation of large Visual Language Models (VLMs), such as …

Flirt: Feedback loop in-context red teaming

N Mehrabi, P Goyal, C Dupuy, Q Hu, S Ghosh… - arXiv preprint arXiv …, 2023 - arxiv.org
Warning: this paper contains content that may be inappropriate or offensive. As generative
models become available for public use in various applications, testing and analyzing …

Robustness of models addressing Information Disorder: A comprehensive review and benchmarking study

G Fenza, V Loia, C Stanzione, M Di Gisi - Neurocomputing, 2024 - Elsevier
Machine learning and deep learning models are increasingly susceptible to
adversarial attacks, particularly in critical areas like cybersecurity and Information Disorder …

Beyond detection: a defend-and-summarize strategy for robust and interpretable rumor analysis on social media

YT Chang, YZ Song, YS Chen… - Proceedings of the 2023 …, 2023 - aclanthology.org
As the impact of social media gradually escalates, people are more likely to be exposed to
indistinguishable fake news. Therefore, numerous studies have attempted to detect rumors …

Run like a girl! Sports-related gender bias in language and vision

S Harrison, E Gualdoni, G Boleda - arXiv preprint arXiv:2305.14468, 2023 - arxiv.org
Gender bias in Language and Vision datasets and models has the potential to perpetuate
harmful stereotypes and discrimination. We analyze gender bias in two Language and …

Privacy preserving large language models: ChatGPT case study based vision and framework

I Ullah, N Hassan, SS Gill, B Suleiman… - arXiv preprint arXiv …, 2023 - arxiv.org
The generative Artificial Intelligence (AI) tools based on Large Language Models (LLMs) use
billions of parameters to extensively analyse large datasets and extract critical private …

Gradient-based language model red teaming

N Wichers, C Denison, A Beirami - arXiv preprint arXiv:2401.16656, 2024 - arxiv.org
Red teaming is a common strategy for identifying weaknesses in generative language
models (LMs), where adversarial prompts are produced that trigger an LM to generate …