The Trickle-down Impact of Reward (In-)consistency on RLHF

L Shen, S Chen, L Song, L Jin, B Peng, H Mi… - arXiv preprint arXiv …, 2023 - arxiv.org
Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves
optimizing against a Reward Model (RM), which itself is trained to reflect human preferences …

Interpretability and Transparency-Driven Detection and Transformation of Textual Adversarial Examples (IT-DT)

B Sabir, MA Babar, S Abuadbba - arXiv preprint arXiv:2307.01225, 2023 - arxiv.org
Transformer-based text classifiers like BERT, RoBERTa, T5, and GPT-3 have shown
impressive performance in NLP. However, their vulnerability to adversarial examples poses …

Enhancing adversarial robustness in Natural Language Inference using explanations

A Koulakos, M Lymperaiou, G Filandrianos… - arXiv preprint arXiv …, 2024 - arxiv.org
The surge of state-of-the-art Transformer-based models has undoubtedly pushed the limits
of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the …

The Best Defense is Attack: Repairing Semantics in Textual Adversarial Examples

H Yang, K Li - arXiv preprint arXiv:2305.04067, 2023 - arxiv.org
Recent studies have revealed the vulnerability of pre-trained language models to
adversarial attacks. Existing adversarial defense techniques attempt to reconstruct …

Adversarial Attack Against Different Pre-trained Language Models and its Defense

SK Dutta - researchgate.net
The rapid advancement of natural language processing (NLP) models has revolutionized
various domains, from sentiment analysis to language translation. However, this progress …