SmoothLLM: Defending large language models against jailbreaking attacks

A Robey, E Wong, H Hassani, GJ Pappas - arXiv preprint arXiv …, 2023 - arxiv.org
Despite efforts to align large language models (LLMs) with human values, widely used
LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks …
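
The defense named here queries the LLM on several randomly perturbed copies of an incoming prompt and aggregates the results, exploiting the fragility of adversarial suffixes to character-level noise. A minimal sketch of that idea, assuming a caller-supplied `query_llm` function and a crude keyword refusal test (both stand-ins, not the authors' implementation):

```python
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters (one of several perturbation types)."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters)
    return "".join(chars)

def looks_jailbroken(response: str) -> bool:
    """Crude keyword stand-in for a real jailbreak judge."""
    refusals = ("I'm sorry", "I cannot", "I can't")
    return not any(r in response for r in refusals)

def smoothllm_flag(prompt: str, query_llm, n_copies: int = 10) -> bool:
    """Majority vote over responses to randomly perturbed copies of the prompt."""
    votes = [looks_jailbroken(query_llm(perturb(prompt))) for _ in range(n_copies)]
    return sum(votes) > n_copies / 2  # True: treat the prompt as a jailbreak attempt
```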

Defending against alignment-breaking attacks via robustly aligned LLM

B Cao, Y Cao, L Lin, J Chen - arXiv preprint arXiv:2309.14348, 2023 - arxiv.org
Recently, Large Language Models (LLMs) have made significant advancements and are
now widely used across various domains. Unfortunately, there has been a rising concern …
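
This entry instead wraps an already-aligned LLM in a robust check: if the model refuses enough randomly perturbed (e.g., partially dropped) copies of a prompt, the original prompt is flagged as an alignment-breaking attack. A rough sketch under that reading, again with a hypothetical `query_llm` and a keyword refusal test as placeholders:

```python
import random

def random_drop(prompt: str, drop_rate: float = 0.3) -> str:
    """Randomly drop a fraction of the words in the prompt."""
    kept = [w for w in prompt.split() if random.random() > drop_rate]
    return " ".join(kept) if kept else prompt

def is_refusal(response: str) -> bool:
    """Placeholder refusal test; a real system would use a stronger judge."""
    return any(r in response for r in ("I'm sorry", "I cannot", "I can't"))

def robust_alignment_check(prompt: str, query_llm, n: int = 20,
                           threshold: float = 0.2) -> bool:
    """Flag the prompt as an attack if enough perturbed copies get refused."""
    refusal_rate = sum(is_refusal(query_llm(random_drop(prompt))) for _ in range(n)) / n
    return refusal_rate > threshold
```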

Adversarial attack and defense on natural language processing in deep learning: A survey and perspective

H Dong, J Dong, S Yuan, Z Guan - … on machine learning for cyber security, 2022 - Springer
Natural language processing (NLP) has recently become a new paradigm and enables a
variety of applications such as text classification, information retrieval, and natural language …

An investigation on the efficiency of some text attack algorithms

A Koley, P Satpati, I Choudhary… - 2024 IEEE North …, 2024 - ieeexplore.ieee.org
Machine learning models trained on human language, also known as Natural Language
Processing (NLP) models, are susceptible to adversarial manipulation. These attacks, called NLP …

Why should adversarial perturbations be imperceptible? Rethink the research paradigm in adversarial NLP

Y Chen, H Gao, G Cui, F Qi, L Huang, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Textual adversarial samples play important roles in multiple subfields of NLP research,
including security, evaluation, explainability, and data augmentation. However, most work …

Certified robustness to text adversarial attacks by randomized [MASK]

J Zeng, J Xu, X Zheng, X Huang - Computational Linguistics, 2023 - direct.mit.edu
Very recently, a few certified defense methods have been developed to provably guarantee
the robustness of a text classifier to adversarial synonym substitutions. However, all the …
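
The certified defense here classifies many randomly masked copies of the input and takes a majority vote; randomized-smoothing arguments then bound how many adversarial word substitutions that vote provably tolerates. A bare-bones sketch of the voting step, assuming a black-box `classifier` that maps text to a label:

```python
import random
from collections import Counter

def mask_words(text: str, mask_rate: float = 0.3, mask_token: str = "[MASK]") -> str:
    """Replace each word with [MASK] independently with probability mask_rate."""
    return " ".join(mask_token if random.random() < mask_rate else w
                    for w in text.split())

def smoothed_classify(text: str, classifier, n: int = 100) -> str:
    """Majority vote over masked copies; the certificate comes from the vote margin."""
    votes = Counter(classifier(mask_words(text)) for _ in range(n))
    return votes.most_common(1)[0][0]
```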

SSPAttack: A simple and sweet paradigm for black-box hard-label textual adversarial attack

H Liu, Z Xu, X Zhang, X Xu, F Zhang, F Ma… - Proceedings of the …, 2023 - ojs.aaai.org
Hard-label textual adversarial attacks are challenging, as only the predicted label
information is available and the text space is discrete and non-differentiable. Relevant …
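
In the hard-label setting the attacker sees only the predicted class, so a common two-stage recipe, which this paper refines, is to first substitute synonyms until the label flips and then shrink the substitution set. A schematic sketch of that generic recipe (not SSPAttack's exact procedure), with `classifier` and `synonyms` as assumed black boxes:

```python
def hard_label_attack(words, label, classifier, synonyms):
    """Generic hard-label word-substitution attack: initialize, then reduce."""
    adv = list(words)
    # 1) Initialization: substitute synonyms until the predicted label flips.
    for i, w in enumerate(words):
        cands = synonyms(w)
        if cands:
            adv[i] = cands[0]
        if classifier(" ".join(adv)) != label:
            break
    if classifier(" ".join(adv)) == label:
        return None  # initialization failed: no adversarial example found
    # 2) Reduction: revert any substitution not needed to keep the label flipped.
    for i, w in enumerate(words):
        if adv[i] != w:
            trial = list(adv)
            trial[i] = w
            if classifier(" ".join(trial)) != label:
                adv = trial
    return " ".join(adv)
```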

Detection of word adversarial examples in text classification: Benchmark and baseline via robust density estimation

KY Yoo, J Kim, J Jang, N Kwak - arXiv preprint arXiv:2203.01677, 2022 - arxiv.org
Word-level adversarial attacks have proven successful against NLP models, drastically decreasing
the performance of transformer-based models in recent years. As a countermeasure …
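
The detection baseline here fits a density model to features of clean examples and flags inputs that land in low-density regions. A simplified sketch using a shrinkage-regularized Gaussian and a Mahalanobis score (the paper studies more robust estimators):

```python
import numpy as np

def fit_gaussian(clean_feats: np.ndarray, shrinkage: float = 0.1):
    """Fit a regularized Gaussian to features of clean training examples."""
    mu = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False) + shrinkage * np.eye(clean_feats.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(x: np.ndarray, mu: np.ndarray, prec: np.ndarray) -> float:
    """Mahalanobis distance: large values mean low density, i.e. likely adversarial."""
    d = x - mu
    return float(d @ prec @ d)

# Usage: flag inputs whose score exceeds a threshold tuned on held-out clean data.
```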

Query-efficient black-box red teaming via Bayesian optimization

D Lee, JY Lee, JW Ha, JH Kim, SW Lee, H Lee… - arXiv preprint arXiv …, 2023 - arxiv.org
The deployment of large-scale generative models is often restricted by their potential risk of
causing harm to users in unpredictable ways. We focus on the problem of black-box red …
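
Bayesian optimization makes such red teaming query-efficient by fitting a surrogate model over candidate prompts and querying the target model only where the surrogate looks promising. A generic sketch of that loop (not the paper's exact method), assuming precomputed candidate-prompt embeddings and a black-box `harm_score` oracle:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bo_red_team(embeddings: np.ndarray, harm_score, n_queries: int = 30,
                n_init: int = 5):
    """Search a candidate pool for the prompt maximizing a black-box harm score."""
    queried = list(np.random.choice(len(embeddings), n_init, replace=False))
    scores = [harm_score(i) for i in queried]
    gp = GaussianProcessRegressor()
    for _ in range(n_queries - n_init):
        gp.fit(embeddings[queried], scores)          # surrogate over the pool
        mu, sigma = gp.predict(embeddings, return_std=True)
        ucb = mu + 2.0 * sigma                       # upper confidence bound
        ucb[queried] = -np.inf                       # never re-query a candidate
        nxt = int(np.argmax(ucb))
        queried.append(nxt)
        scores.append(harm_score(nxt))
    best = int(np.argmax(scores))
    return queried[best], scores[best]
```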

Improving the adversarial robustness of NLP models by information bottleneck

C Zhang, X Zhou, Y Wan, X Zheng, KW Chang… - arXiv preprint arXiv …, 2022 - arxiv.org
Existing studies have demonstrated that adversarial examples can be directly attributed to
the presence of non-robust features, which are highly predictive but can be easily …
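
An information bottleneck addresses this by compressing the representation so that it keeps what predicts the label while discarding such brittle input details. A minimal variational-IB head in PyTorch, written as a generic sketch rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class VIBHead(nn.Module):
    """Stochastic bottleneck: encode features into a code z, classify from z, and
    penalize KL(q(z|x) || N(0, I)) so z cannot retain every input detail."""
    def __init__(self, feat_dim: int, code_dim: int, n_classes: int):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * code_dim)
        self.cls = nn.Linear(code_dim, n_classes)

    def forward(self, h):
        mu, logvar = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return self.cls(z), kl

# Training loss: F.cross_entropy(logits, y) + beta * kl, with a small beta.
```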