The trickle-down impact of reward (in-)consistency on RLHF
Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves
optimizing against a Reward Model (RM), which itself is trained to reflect human preferences …
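As background for this first entry: the reward model in RLHF is typically fit to pairwise human preference labels with a Bradley-Terry style objective, and the policy is then optimized against the learned scores. Below is a minimal, hypothetical sketch of that preference-fitting step in PyTorch; the RewardModel class, the 16-dimensional embeddings, and the random tensors are illustrative stand-ins, not the paper's actual setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    # Toy reward head: maps a fixed-size response embedding to a scalar reward.
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm: RewardModel, chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: push r(chosen) above r(rejected).
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

rm = RewardModel(dim=16)
chosen = torch.randn(8, 16)    # stand-ins for embeddings of preferred responses
rejected = torch.randn(8, 16)  # stand-ins for embeddings of dispreferred responses
preference_loss(rm, chosen, rejected).backward()

An RM trained this way only approximates the preference data, which is why its (in-)consistency can matter downstream once the policy is optimized against it.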
Interpretability and transparency-driven detection and transformation of textual adversarial examples (IT-DT)
Transformer-based text classifiers like BERT, RoBERTa, T5, and GPT-3 have shown
impressive performance in NLP. However, their vulnerability to adversarial examples poses …
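To make the vulnerability these abstracts refer to concrete: a textual adversarial example typically swaps words for meaning-preserving substitutes that flip a classifier's prediction. A toy, self-contained illustration follows; the lexicon classifier and synonym table are invented for the example, whereas real attacks search substitutions guided by the victim model's scores or gradients.

POSITIVE_WORDS = {"great", "excellent", "wonderful"}

def toy_sentiment(text: str) -> str:
    # Naive lexicon classifier: positive iff a known positive word appears.
    return "positive" if set(text.lower().split()) & POSITIVE_WORDS else "negative"

SYNONYMS = {"great": "first-rate"}  # meaning-preserving swap outside the lexicon

def perturb(text: str) -> str:
    # Replace words with synonyms the classifier does not recognize.
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

original = "the movie was great"
adversarial = perturb(original)
print(toy_sentiment(original))     # -> positive
print(toy_sentiment(adversarial))  # -> negative, despite unchanged meaning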
Enhancing adversarial robustness in Natural Language Inference using explanations
A Koulakos, M Lymperaiou, G Filandrianos… - arXiv preprint arXiv …, 2024 - arxiv.org
The surge of state-of-the-art Transformer-based models has undoubtedly pushed the limits
of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the …
The Best Defense is Attack: Repairing Semantics in Textual Adversarial Examples
Recent studies have revealed the vulnerability of pre-trained language models to
adversarial attacks. Existing adversarial defense techniques attempt to reconstruct …
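The reconstruction direction this snippet is cut off on can be pictured as mapping suspected substitutions back to canonical words before classification. A hypothetical sketch continuing the toy example above; the canonical-form table and lookup rule are invented for illustration, while published defenses typically recover replacements with masked-language-model infilling or embedding-space search.

CANONICAL = {"first-rate": "great", "dreadful": "bad"}  # invented repair table

def repair(text: str) -> str:
    # Normalize likely adversarial substitutes back to canonical forms.
    return " ".join(CANONICAL.get(w, w) for w in text.split())

print(repair("the movie was first-rate"))  # -> "the movie was great"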
Adversarial Attack Against Different Pre-trained Language Models and its Defense
SK Dutta - researchgate.net
The rapid advancement of natural language processing (NLP) models has revolutionized
various domains, from sentiment analysis to language translation. However, this progress …