HybridFlow: A flexible and efficient RLHF framework

G Sheng, C Zhang, Z Ye, X Wu, W Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language
Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node …