The trickle-down impact of reward (in-)consistency on RLHF
Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves
optimizing against a Reward Model (RM), which itself is trained to reflect human preferences …
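As background for this first entry: the reward model in RLHF is typically fit to pairwise human preference labels with a Bradley-Terry style objective, and the policy is then optimized against the learned scores. Below is a minimal, hypothetical sketch of that preference-fitting step in PyTorch; the RewardModel class, the 16-dimensional embeddings, and the random tensors are illustrative stand-ins, not the paper's actual setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    # Toy reward head: maps a fixed-size response embedding to a scalar reward.
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm: RewardModel, chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: push r(chosen) above r(rejected).
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

rm = RewardModel(dim=16)
chosen = torch.randn(8, 16)    # stand-ins for embeddings of preferred responses
rejected = torch.randn(8, 16)  # stand-ins for embeddings of dispreferred responses
preference_loss(rm, chosen, rejected).backward()

An RM trained this way only approximates the preference data, which is why its (in-)consistency can matter downstream once the policy is optimized against it.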
Interpretability and transparency-driven detection and transformation of textual adversarial examples (IT-DT)
Transformer-based text classifiers like BERT, RoBERTa, T5, and GPT-3 have shown
impressive performance in NLP. However, their vulnerability to adversarial examples poses …
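To make the vulnerability these abstracts refer to concrete: a textual adversarial example typically swaps words for meaning-preserving substitutes that flip a classifier's prediction. A toy, self-contained illustration follows; the lexicon classifier and synonym table are invented for the example, whereas real attacks search substitutions guided by the victim model's scores or gradients.

POSITIVE_WORDS = {"great", "excellent", "wonderful"}

def toy_sentiment(text: str) -> str:
    # Naive lexicon classifier: positive iff a known positive word appears.
    return "positive" if set(text.lower().split()) & POSITIVE_WORDS else "negative"

SYNONYMS = {"great": "first-rate"}  # meaning-preserving swap outside the lexicon

def perturb(text: str) -> str:
    # Replace words with synonyms the classifier does not recognize.
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

original = "the movie was great"
adversarial = perturb(original)
print(toy_sentiment(original))     # -> positive
print(toy_sentiment(adversarial))  # -> negative, despite unchanged meaning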
Enhancing adversarial robustness in Natural Language Inference using explanations
A Koulakos, M Lymperaiou, G Filandrianos… - arXiv preprint arXiv …, 2024 - arxiv.org
The surge of state-of-the-art Transformer-based models has undoubtedly pushed the limits
of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the …
The Best Defense is Attack: Repairing Semantics in Textual Adversarial Examples
Recent studies have revealed the vulnerability of pre-trained language models to
adversarial attacks. Existing adversarial defense techniques attempt to reconstruct …
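The reconstruction direction this snippet is cut off on can be pictured as mapping suspected substitutions back to canonical words before classification. A hypothetical sketch continuing the toy example above; the canonical-form table and lookup rule are invented for illustration, while published defenses typically recover replacements with masked-language-model infilling or embedding-space search.

CANONICAL = {"first-rate": "great", "dreadful": "bad"}  # invented repair table

def repair(text: str) -> str:
    # Normalize likely adversarial substitutes back to canonical forms.
    return " ".join(CANONICAL.get(w, w) for w in text.split())

print(repair("the movie was first-rate"))  # -> "the movie was great"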
Adversarial Attack Against Different Pre-trained Language Models and its Defense
SK Dutta - researchgate.net
The rapid advancement of natural language processing (NLP) models has revolutionized
various domains, from sentiment analysis to language translation. However, this progress …