The trickle-down impact of reward (in)consistency on RLHF

L Shen, S Chen, L Song, L Jin, B Peng, H Mi… - arXiv preprint arXiv …, 2023 - arxiv.org
Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves
optimizing against a Reward Model (RM), which itself is trained to reflect human preferences …
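
As background for this entry, a minimal sketch (an assumption, not taken from the paper) of how a reward model is commonly fit to pairwise human preferences with a Bradley-Terry style loss; the reward_model callable and batch names are illustrative.

```python
# Hypothetical sketch: training a reward model on pairwise human preferences
# with the standard Bradley-Terry style objective. Names (reward_model,
# chosen/rejected batches) are illustrative, not from the cited paper.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Negative log-likelihood that the 'chosen' response outranks the 'rejected' one."""
    r_chosen = reward_model(chosen_ids)      # scalar reward per preferred sequence
    r_rejected = reward_model(rejected_ids)  # scalar reward per dispreferred sequence
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```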

SA-SGRU: combining improved self-attention and skip-GRU for text classification

Y Huang, X Dai, J Yu, Z Huang - Applied Sciences, 2023 - mdpi.com
When reading texts for text classification tasks, a large number of words are irrelevant, and
the traditional self-attention mechanism has the problem of weight …
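
For context, a minimal sketch (assumed, not from the paper) of the traditional scaled dot-product self-attention weight computation the snippet refers to; shapes and names are illustrative.

```python
# Hypothetical sketch of standard (scaled dot-product) self-attention weights:
# every token attends to every other token via a softmax over query-key scores.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise relevance scores
    weights = F.softmax(scores, dim=-1)       # attention weights over all tokens
    return weights @ v                        # weighted sum of value vectors
```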

Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges

J Becker, JP Wahle, B Gipp, T Ruas - arXiv preprint arXiv:2405.15604, 2024 - arxiv.org
Text generation has become more accessible than ever, and the increasing interest in these
systems, especially those using large language models, has spurred a growing number …

TextShield: Beyond successfully detecting adversarial sentences in text classification

L Shen, Z Zhang, H Jiang, Y Chen - arXiv preprint arXiv:2302.02023, 2023 - arxiv.org
Adversarial attacks serve as a major challenge for neural network models in NLP, precluding
their deployment in safety-critical applications. A recent line of work …

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

T Lu, L Shen, X Yang, W Tan, B Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) involves training policy models
(PMs) and reward models (RMs) to align language models with human preferences. Instead …
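
As background for this entry, a hedged sketch of the KL-penalized objective a policy model typically optimizes against a reward model in RLHF; the function name, tensor shapes, and the beta coefficient are assumptions, not drawn from the paper.

```python
# Hypothetical sketch of the standard RLHF policy objective: maximize the
# reward model's score while a KL penalty keeps the policy close to a
# frozen reference model. Names and beta are illustrative assumptions.
import torch

def rlhf_objective(rm_scores, policy_logprobs, reference_logprobs, beta=0.1):
    """rm_scores: (batch,) sequence-level rewards; logprobs: (batch, seq_len) per-token."""
    kl = policy_logprobs - reference_logprobs          # approximate per-token KL term
    shaped_reward = rm_scores - beta * kl.sum(dim=-1)  # penalize drift from the reference
    return shaped_reward.mean()                        # quantity to maximize (e.g. via PPO)
```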