BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset J Ji, M Liu, J Dai, X Pan, C Zhang, C Bian, B Chen, R Sun, Y Wang, ... Advances in Neural Information Processing Systems 36, 2024 | 100 | 2024 |
AI alignment: A comprehensive survey J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang, Y Duan, Z He, J Zhou, ... arXiv preprint arXiv:2310.19852, 2023 | 85 | 2023 |
Aligner: Achieving efficient alignment through weak-to-strong correction J Ji, B Chen, H Lou, D Hong, B Zhang, X Pan, J Dai, Y Yang arXiv preprint arXiv:2402.02416, 2024 | 13 | 2024 |
Language Models Resist Alignment Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li ... https://ui.adsabs.harvard.edu/abs/2024arXiv240606144J/abstract, 2024 | | 2024 |
Efficient Model-agnostic Alignment via Bayesian Persuasion F Bai, M Wang, Z Zhang, B Chen, Y Xu, Y Wen, Y Yang arXiv preprint arXiv:2405.18718, 2024 | | 2024 |