AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

A review on language models as knowledge bases

B AlKhamissi, M Li, A Celikyilmaz, M Diab… - arXiv preprint arXiv …, 2022 - arxiv.org
Recently, there has been a surge of interest in the NLP community on the use of pretrained
Language Models (LMs) as Knowledge Bases (KBs). Researchers have shown that LMs …

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

A Srivastava, A Rastogi, A Rao, AAM Shoeb… - arXiv preprint arXiv …, 2022 - arxiv.org
Language models demonstrate both quantitative improvement and new qualitative
capabilities with increasing scale. Despite their potentially transformative impact, these new …

Larger language models do in-context learning differently

J Wei, J Wei, Y Tay, D Tran, A Webson, Y Lu… - arXiv preprint arXiv …, 2023 - arxiv.org
We study how in-context learning (ICL) in language models is affected by semantic priors
versus input-label mappings. We investigate two setups: ICL with flipped labels and ICL with …

Mass-editing memory in a transformer

K Meng, AS Sharma, A Andonian, Y Belinkov… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent work has shown exciting promise in updating large language models with new
memories, so as to replace obsolete information or add specialized knowledge. However …

Towards automated circuit discovery for mechanistic interpretability

A Conmy, A Mavor-Parker, A Lynch… - Advances in …, 2023 - proceedings.neurips.cc
Through considerable effort and intuition, several recent works have reverse-engineered
nontrivial behaviors of transformer models. This paper systematizes the mechanistic …

Locating and editing factual associations in GPT

K Meng, D Bau, A Andonian… - Advances in Neural …, 2022 - proceedings.neurips.cc
We analyze the storage and recall of factual associations in autoregressive transformer
language models, finding evidence that these associations correspond to localized, directly …

Explainability for large language models: A survey

H Zhao, H Chen, F Yang, N Liu, H Deng, H Cai… - ACM Transactions on …, 2024 - dl.acm.org
Large language models (LLMs) have demonstrated impressive capabilities in natural
language processing. However, their internal mechanisms are still unclear and this lack of …

Editing large language models: Problems, methods, and opportunities

Y Yao, P Wang, B Tian, S Cheng, Z Li, S Deng… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite the ability to train capable LLMs, the methodology for maintaining their relevancy
and rectifying errors remains elusive. To this end, the past few years have witnessed a surge …