AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

The WMDP benchmark: Measuring and reducing malicious use with unlearning

N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti… - arXiv preprint arXiv …, 2024 - arxiv.org
The White House Executive Order on Artificial Intelligence highlights the risks of large
language models (LLMs) empowering malicious actors in developing biological, cyber, and …

Detecting pretraining data from large language models

W Shi, A Ajith, M Xia, Y Huang, D Liu, T Blevins… - arXiv preprint arXiv …, 2023 - arxiv.org
Although large language models (LLMs) are widely deployed, the data used to train them is
rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but …

MUSE: Machine unlearning six-way evaluation for language models

W Shi, J Lee, Y Huang, S Malladi, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) are trained on vast amounts of text data, which may include private
and copyrighted content. Data owners may request the removal of their data from a trained …

Copyright protection in generative AI: A technical perspective

J Ren, H Xu, P He, Y Cui, S Zeng, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative AI has witnessed rapid advancement in recent years, expanding its
capabilities to create synthesized content such as text, images, audio, and code. The high …

Perils and opportunities in using large language models in psychological research

S Abdurahman, M Atari, F Karimi-Malekabadi… - PNAS …, 2024 - academic.oup.com
The emergence of large language models (LLMs) has sparked considerable interest in their
potential application in psychological research, mainly as a model of the human psyche or …

Knowledgeable preference alignment for LLMs in domain-specific question answering

Y Zhang, Z Chen, Y Fang, Y Lu, F Li, W Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Deploying large language models (LLMs) to real scenarios for domain-specific question
answering (QA) is a key thrust for LLM applications, which poses numerous challenges …

Knowledge unlearning for LLMs: Tasks, methods, and challenges

N Si, H Zhang, H Chang, W Zhang, D Qu… - arXiv preprint arXiv …, 2023 - arxiv.org
In recent years, large language models (LLMs) have spurred a new research paradigm in
natural language processing. Despite their excellent capability in knowledge-based …

Negative preference optimization: From catastrophic collapse to effective unlearning

R Zhang, L Lin, Y Bai, S Mei - arXiv preprint arXiv:2404.05868, 2024 - arxiv.org
Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data
during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from …

Towards safer large language models through machine unlearning

Z Liu, G Dou, Z Tan, Y Tian, M Jiang - arXiv preprint arXiv:2402.10058, 2024 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has demonstrated their vast
potential across various domains, attributed to their extensive pretraining knowledge and …