SmoothLLM: Defending large language models against jailbreaking attacks

A Robey, E Wong, H Hassani, GJ Pappas - arXiv preprint arXiv …, 2023 - arxiv.org
Despite efforts to align large language models (LLMs) with human values, widely used
LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks …
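
The defense named here queries the LLM on several randomly perturbed copies of an incoming prompt and aggregates the results, exploiting the fragility of adversarial suffixes to character-level noise. A minimal sketch of that idea, assuming a caller-supplied `query_llm` function and a crude keyword refusal test (both stand-ins, not the authors' implementation):

```python
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters (one of several perturbation types)."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters)
    return "".join(chars)

def looks_jailbroken(response: str) -> bool:
    """Crude keyword stand-in for a real jailbreak judge."""
    refusals = ("I'm sorry", "I cannot", "I can't")
    return not any(r in response for r in refusals)

def smoothllm_flag(prompt: str, query_llm, n_copies: int = 10) -> bool:
    """Majority vote over responses to randomly perturbed copies of the prompt."""
    votes = [looks_jailbroken(query_llm(perturb(prompt))) for _ in range(n_copies)]
    return sum(votes) > n_copies / 2  # True: treat the prompt as a jailbreak attempt
```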

Defending against alignment-breaking attacks via robustly aligned LLM

B Cao, Y Cao, L Lin, J Chen - arXiv preprint arXiv:2309.14348, 2023 - arxiv.org
Recently, Large Language Models (LLMs) have made significant advancements and are
now widely used across various domains. Unfortunately, there has been a rising concern …
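
This entry instead wraps an already-aligned LLM in a robust check: if the model refuses enough randomly perturbed (e.g., partially dropped) copies of a prompt, the original prompt is flagged as an alignment-breaking attack. A rough sketch under that reading, again with a hypothetical `query_llm` and a keyword refusal test as placeholders:

```python
import random

def random_drop(prompt: str, drop_rate: float = 0.3) -> str:
    """Randomly drop a fraction of the words in the prompt."""
    kept = [w for w in prompt.split() if random.random() > drop_rate]
    return " ".join(kept) if kept else prompt

def is_refusal(response: str) -> bool:
    """Placeholder refusal test; a real system would use a stronger judge."""
    return any(r in response for r in ("I'm sorry", "I cannot", "I can't"))

def robust_alignment_check(prompt: str, query_llm, n: int = 20,
                           threshold: float = 0.2) -> bool:
    """Flag the prompt as an attack if enough perturbed copies get refused."""
    refusal_rate = sum(is_refusal(query_llm(random_drop(prompt))) for _ in range(n)) / n
    return refusal_rate > threshold
```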

Adversarial attack and defense on natural language processing in deep learning: A survey and perspective

H Dong, J Dong, S Yuan, Z Guan - … on machine learning for cyber security, 2022 - Springer
Natural language processing (NLP) has recently become a new paradigm and enables a
variety of applications such as text classification, information retrieval, and natural language …

An investigation on the efficiency of some text attack algorithms

A Koley, P Satpati, I Choudhary… - 2024 IEEE North …, 2024 - ieeexplore.ieee.org
Machine learning models trained on human language, also known as Natural Language
Processing (NLP) models, are susceptible to adversarial manipulation. These attacks, called NLP …

Why should adversarial perturbations be imperceptible? Rethink the research paradigm in adversarial NLP

Y Chen, H Gao, G Cui, F Qi, L Huang, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Textual adversarial samples play important roles in multiple subfields of NLP research,
including security, evaluation, explainability, and data augmentation. However, most work …

Certified robustness to text adversarial attacks by randomized [MASK]

J Zeng, J Xu, X Zheng, X Huang - Computational Linguistics, 2023 - direct.mit.edu
Very recently, a few certified defense methods have been developed to provably guarantee
the robustness of a text classifier to adversarial synonym substitutions. However, all the …
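
The certified defense here classifies many randomly masked copies of the input and takes a majority vote; randomized-smoothing arguments then bound how many adversarial word substitutions that vote provably tolerates. A bare-bones sketch of the voting step, assuming a black-box `classifier` that maps text to a label:

```python
import random
from collections import Counter

def mask_words(text: str, mask_rate: float = 0.3, mask_token: str = "[MASK]") -> str:
    """Replace each word with [MASK] independently with probability mask_rate."""
    return " ".join(mask_token if random.random() < mask_rate else w
                    for w in text.split())

def smoothed_classify(text: str, classifier, n: int = 100) -> str:
    """Majority vote over masked copies; the certificate comes from the vote margin."""
    votes = Counter(classifier(mask_words(text)) for _ in range(n))
    return votes.most_common(1)[0][0]
```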

SSPAttack: A simple and sweet paradigm for black-box hard-label textual adversarial attack

H Liu, Z Xu, X Zhang, X Xu, F Zhang, F Ma… - Proceedings of the …, 2023 - ojs.aaai.org
Hard-label textual adversarial attacks are challenging, as only the predicted label
information is available and the text space is discrete and non-differentiable. Relevant …
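
In the hard-label setting the attacker sees only the predicted class, so a common two-stage recipe, which this paper refines, is to first substitute synonyms until the label flips and then shrink the substitution set. A schematic sketch of that generic recipe (not SSPAttack's exact procedure), with `classifier` and `synonyms` as assumed black boxes:

```python
def hard_label_attack(words, label, classifier, synonyms):
    """Generic hard-label word-substitution attack: initialize, then reduce."""
    adv = list(words)
    # 1) Initialization: substitute synonyms until the predicted label flips.
    for i, w in enumerate(words):
        cands = synonyms(w)
        if cands:
            adv[i] = cands[0]
        if classifier(" ".join(adv)) != label:
            break
    if classifier(" ".join(adv)) == label:
        return None  # initialization failed: no adversarial example found
    # 2) Reduction: revert any substitution not needed to keep the label flipped.
    for i, w in enumerate(words):
        if adv[i] != w:
            trial = list(adv)
            trial[i] = w
            if classifier(" ".join(trial)) != label:
                adv = trial
    return " ".join(adv)
```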

Detection of word adversarial examples in text classification: Benchmark and baseline via robust density estimation

KY Yoo, J Kim, J Jang, N Kwak - arXiv preprint arXiv:2203.01677, 2022 - arxiv.org
Word-level adversarial attacks have proven successful against NLP models, drastically decreasing
the performance of transformer-based models in recent years. As a countermeasure …
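
The detection baseline here fits a density model to features of clean examples and flags inputs that land in low-density regions. A simplified sketch using a shrinkage-regularized Gaussian and a Mahalanobis score (the paper studies more robust estimators):

```python
import numpy as np

def fit_gaussian(clean_feats: np.ndarray, shrinkage: float = 0.1):
    """Fit a regularized Gaussian to features of clean training examples."""
    mu = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False) + shrinkage * np.eye(clean_feats.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(x: np.ndarray, mu: np.ndarray, prec: np.ndarray) -> float:
    """Mahalanobis distance: large values mean low density, i.e. likely adversarial."""
    d = x - mu
    return float(d @ prec @ d)

# Usage: flag inputs whose score exceeds a threshold tuned on held-out clean data.
```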

Query-efficient black-box red teaming via Bayesian optimization

D Lee, JY Lee, JW Ha, JH Kim, SW Lee, H Lee… - arXiv preprint arXiv …, 2023 - arxiv.org
The deployment of large-scale generative models is often restricted by their potential risk of
causing harm to users in unpredictable ways. We focus on the problem of black-box red …
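
Bayesian optimization makes such red teaming query-efficient by fitting a surrogate model over candidate prompts and querying the target model only where the surrogate looks promising. A generic sketch of that loop (not the paper's exact method), assuming precomputed candidate-prompt embeddings and a black-box `harm_score` oracle:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bo_red_team(embeddings: np.ndarray, harm_score, n_queries: int = 30,
                n_init: int = 5):
    """Search a candidate pool for the prompt maximizing a black-box harm score."""
    queried = list(np.random.choice(len(embeddings), n_init, replace=False))
    scores = [harm_score(i) for i in queried]
    gp = GaussianProcessRegressor()
    for _ in range(n_queries - n_init):
        gp.fit(embeddings[queried], scores)          # surrogate over the pool
        mu, sigma = gp.predict(embeddings, return_std=True)
        ucb = mu + 2.0 * sigma                       # upper confidence bound
        ucb[queried] = -np.inf                       # never re-query a candidate
        nxt = int(np.argmax(ucb))
        queried.append(nxt)
        scores.append(harm_score(nxt))
    best = int(np.argmax(scores))
    return queried[best], scores[best]
```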

Improving the adversarial robustness of NLP models by information bottleneck

C Zhang, X Zhou, Y Wan, X Zheng, KW Chang… - arXiv preprint arXiv …, 2022 - arxiv.org
Existing studies have demonstrated that adversarial examples can be directly attributed to
the presence of non-robust features, which are highly predictive but can be easily …
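
An information bottleneck addresses this by compressing the representation so that it keeps what predicts the label while discarding such brittle input details. A minimal variational-IB head in PyTorch, written as a generic sketch rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class VIBHead(nn.Module):
    """Stochastic bottleneck: encode features into a code z, classify from z, and
    penalize KL(q(z|x) || N(0, I)) so z cannot retain every input detail."""
    def __init__(self, feat_dim: int, code_dim: int, n_classes: int):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * code_dim)
        self.cls = nn.Linear(code_dim, n_classes)

    def forward(self, h):
        mu, logvar = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return self.cls(z), kl

# Training loss: F.cross_entropy(logits, y) + beta * kl, with a small beta.
```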