SmoothLLM: Defending large language models against jailbreaking attacks
Despite efforts to align large language models (LLMs) with human values, widely-used
LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks …
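The title names the mechanism: SmoothLLM wraps the model in a randomized-smoothing-style layer, exploiting the fact that adversarial suffixes are brittle to character-level noise. A minimal sketch of that idea, with hypothetical query_llm and is_refusal stand-ins (not the paper's released code):

```python
import random
import string

# Hypothetical model interface; query_llm and is_refusal are stand-ins,
# not part of the SmoothLLM release.
def query_llm(prompt: str) -> str:
    raise NotImplementedError

REFUSAL_MARKERS = ("I'm sorry", "I cannot", "As an AI")

def is_refusal(response: str) -> bool:
    return any(marker in response for marker in REFUSAL_MARKERS)

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly swap a fraction q of characters in the prompt."""
    chars = list(prompt)
    n_swap = max(1, int(q * len(chars)))
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothllm_defense(prompt: str, n_copies: int = 8) -> str:
    """Query the LLM on perturbed copies and return a response that agrees
    with the majority refused/complied verdict."""
    responses = [query_llm(perturb(prompt)) for _ in range(n_copies)]
    refusals = [is_refusal(r) for r in responses]
    majority_refused = sum(refusals) > n_copies / 2
    for r, refused in zip(responses, refusals):
        if refused == majority_refused:
            return r
    return responses[0]
```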
Defending against alignment-breaking attacks via robustly aligned LLM
Recently, Large Language Models (LLMs) have made significant advancements and are
now widely used across various domains. Unfortunately, there has been a rising concern …
Adversarial attack and defense on natural language processing in deep learning: A survey and perspective
Natural language processing (NLP) has now become a new paradigm and enables a
variety of applications such as text classification, information retrieval, and natural language …
An investigation on the efficiency of some text attack algorithms
A Koley, P Satpati, I Choudhary… - 2024 IEEE North …, 2024 - ieeexplore.ieee.org
Machine learning models trained on human language, also known as Natural Language
Processing (NLP) models, are susceptible to manipulation. These manipulations, called NLP …
Why should adversarial perturbations be imperceptible? Rethink the research paradigm in adversarial NLP
Textual adversarial samples play important roles in multiple subfields of NLP research,
including security, evaluation, explainability, and data augmentation. However, most work …
Certified robustness to text adversarial attacks by randomized [MASK]
Very recently, a few certified defense methods have been developed to provably guarantee
the robustness of a text classifier to adversarial synonym substitutions. However, all the …
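The defense votes over randomly masked copies of the input; the certificate itself comes from a probabilistic bound on the vote margin. A minimal voting sketch, assuming a hypothetical classify function and omitting the certification math:

```python
import random
from collections import Counter

# classify(text) -> label is a hypothetical downstream text classifier.
def classify(text: str) -> int:
    raise NotImplementedError

def random_mask(text: str, mask_rate: float = 0.3, mask_token: str = "[MASK]") -> str:
    """Replace a random subset of words with [MASK]."""
    words = text.split()
    k = max(1, int(mask_rate * len(words)))
    for i in random.sample(range(len(words)), k):
        words[i] = mask_token
    return " ".join(words)

def masked_vote(text: str, n_samples: int = 100) -> int:
    """Majority vote over randomly masked copies; the certified radius
    would be derived from the vote counts, which we skip here."""
    votes = Counter(classify(random_mask(text)) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```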
SSPAttack: a simple and sweet paradigm for black-box hard-label textual adversarial attack
Hard-label textual adversarial attack is a challenging task, as only the predicted label
information is available, and the text space is discrete and non-differentiable. Relevant …
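Hard-label attacks of this family typically run in two stages: first find any label-flipping substitution, then shrink the perturbation using only the predicted label. A generic skeleton in that spirit, with hypothetical predict and synonyms hooks, not SSPAttack's exact procedure:

```python
import random

# Hypothetical hooks: predict gives hard-label access to the victim model;
# synonyms returns candidate substitutes from some synonym lexicon.
def predict(text: str) -> int: ...
def synonyms(word: str) -> list: ...

def hard_label_attack(text: str, max_tries: int = 200):
    """Two-stage hard-label word-substitution skeleton: flip the label
    first, then restore original words to reduce the perturbation."""
    orig_label = predict(text)
    words = text.split()

    # Stage 1: random synonym substitutions until the label flips.
    adv = list(words)
    for _ in range(max_tries):
        i = random.randrange(len(adv))
        cands = synonyms(words[i])
        if cands:
            adv[i] = random.choice(cands)
        if predict(" ".join(adv)) != orig_label:
            break
    else:
        return None  # attack failed within the query budget

    # Stage 2: undo substitutions that are not needed to keep the flip.
    for i in range(len(adv)):
        if adv[i] != words[i]:
            trial = list(adv)
            trial[i] = words[i]
            if predict(" ".join(trial)) != orig_label:
                adv = trial
    return " ".join(adv)
```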
Detection of word adversarial examples in text classification: Benchmark and baseline via robust density estimation
Word-level adversarial attacks have shown success against NLP models, drastically decreasing
the performance of transformer-based models in recent years. As a countermeasure …
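The countermeasure named in the title fits a density model to features of clean text and flags low-density inputs at test time. A minimal version using scikit-learn's KernelDensity, with a hypothetical encode wrapper (e.g., [CLS] embeddings from the victim model):

```python
from sklearn.neighbors import KernelDensity

# encode(texts) -> (n, d) feature matrix is a hypothetical encoder wrapper.
def encode(texts):
    raise NotImplementedError

def fit_detector(clean_texts, bandwidth: float = 1.0) -> KernelDensity:
    """Fit a kernel density estimate on features of clean training text."""
    return KernelDensity(bandwidth=bandwidth).fit(encode(clean_texts))

def flag_adversarial(kde: KernelDensity, texts, threshold: float):
    """Flag inputs whose log-density falls below a threshold tuned on
    held-out clean data."""
    return kde.score_samples(encode(texts)) < threshold
```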
Query-efficient black-box red teaming via bayesian optimization
The deployment of large-scale generative models is often restricted by their potential risk of
causing harm to users in unpredictable ways. We focus on the problem of black-box red …
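Query efficiency here comes from a Bayesian-optimization loop: a cheap surrogate predicts which test cases are likely to elicit harm, and only the most promising ones are sent to the expensive target model. A sketch over a fixed candidate pool with a GP surrogate and a UCB acquisition; embed and harm_score are hypothetical hooks, not the paper's exact method:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical hooks: embed maps a prompt to a feature vector; harm_score
# queries the target model and scores the response with a toxicity rater.
def embed(prompt): ...
def harm_score(prompt): ...

def bo_red_team(candidate_prompts, n_init=5, n_iter=30, kappa=2.0):
    """Upper-confidence-bound BO over a fixed candidate pool: spend target
    queries only where the surrogate expects harmful behavior."""
    X_pool = np.array([embed(p) for p in candidate_prompts])
    tried = list(range(n_init))
    y = [harm_score(candidate_prompts[i]) for i in tried]

    gp = GaussianProcessRegressor()
    for _ in range(n_iter):
        gp.fit(X_pool[tried], np.array(y))
        mu, sigma = gp.predict(X_pool, return_std=True)
        ucb = mu + kappa * sigma
        ucb[tried] = -np.inf  # never re-query a tried candidate
        nxt = int(np.argmax(ucb))
        tried.append(nxt)
        y.append(harm_score(candidate_prompts[nxt]))

    best = int(np.argmax(y))
    return candidate_prompts[tried[best]], y[best]
```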
Improving the adversarial robustness of NLP models by information bottleneck
Existing studies have demonstrated that adversarial examples can be directly attributed to
the presence of non-robust features, which are highly predictive, but can be easily …
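The information-bottleneck remedy compresses representations so that highly predictive but easily flipped (non-robust) features are squeezed out. A minimal variational-IB head in PyTorch, a sketch of the general technique rather than the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    """Variational information bottleneck head: compress encoder features
    into a stochastic code z, then classify from z. The KL term pressures
    z to discard input details, including non-robust features."""

    def __init__(self, feat_dim: int, code_dim: int, n_classes: int):
        super().__init__()
        self.mu = nn.Linear(feat_dim, code_dim)
        self.logvar = nn.Linear(feat_dim, code_dim)
        self.head = nn.Linear(code_dim, n_classes)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        # KL(q(z|x) || N(0, I)), averaged over the batch
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return self.head(z), kl

def ib_loss(logits, labels, kl, beta=1e-3):
    # beta trades task accuracy against compression of the representation.
    return F.cross_entropy(logits, labels) + beta * kl
```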