Challenges and applications of large language models

J Kaddour, J Harris, M Mozes, H Bradley… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

The clock and the pizza: Two stories in mechanistic explanation of neural networks

Z Zhong, Z Liu, M Tegmark… - Advances in Neural …, 2024 - proceedings.neurips.cc
Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known
algorithms? Several recent studies, on tasks ranging from group operations to in-context …
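
The "group operations" this entry refers to include the canonical modular-addition task. A minimal sketch of that task family follows; the modulus, split fraction, and seed here are illustrative stand-ins, not the paper's exact configuration.

```python
# Modular addition (a + b) mod p: the kind of well-understood algorithmic
# task on which trained networks can be checked against known algorithms.
import itertools
import random

p = 59  # small prime modulus (assumed value, not the paper's)

# Every input pair (a, b) with its ground-truth label (a + b) mod p.
pairs = list(itertools.product(range(p), repeat=2))
dataset = [((a, b), (a + b) % p) for a, b in pairs]

random.seed(0)
random.shuffle(dataset)
split = int(0.3 * len(dataset))  # train on a fraction, test generalization
train, test = dataset[:split], dataset[split:]

print(f"{len(train)} train / {len(test)} test examples, e.g. {train[0]}")
```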

Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods

T Hagendorff - arXiv preprint arXiv:2303.13988, 2023 - cybershafarat.com
Large language models (LLMs) are currently at the forefront of intertwining AI systems with
human communication and everyday life. Due to rapid technological advances and their …

Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
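
A minimal sparse autoencoder sketch in PyTorch may help fix ideas. Gemma Scope itself trains JumpReLU SAEs on Gemma 2 activations; the simplified variant below uses a plain ReLU encoder with an L1 sparsity penalty, and all sizes and coefficients are made up for illustration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # activations -> features
        self.decoder = nn.Linear(d_hidden, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse, non-negative codes
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_model=256, d_hidden=4096)  # overcomplete dictionary
x = torch.randn(32, 256)  # stand-in for a batch of LLM activations

recon, features = sae(x)
l1_coeff = 1e-3  # sparsity/reconstruction trade-off (assumed value)
loss = ((recon - x) ** 2).mean() + l1_coeff * features.abs().sum(-1).mean()
loss.backward()
```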

Tracr: Compiled transformers as a laboratory for interpretability

D Lindner, J Kramár, S Farquhar… - Advances in …, 2024 - proceedings.neurips.cc
We show how to "compile" human-readable programs into standard decoder-only
transformer models. Our compiler, Tracr, generates models with known structure. This …
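
A usage sketch based on Tracr's public README (exact signatures may differ across versions): write a RASP program, then compile it into a concrete transformer with known weights.

```python
from tracr.rasp import rasp
from tracr.compiler import compiling

# RASP program computing sequence length: select every position, then count.
all_true = rasp.Select(rasp.tokens, rasp.tokens, rasp.Comparison.TRUE)
length = rasp.SelectorWidth(all_true)

# Compile to a standard decoder-only transformer with known structure.
model = compiling.compile_rasp_to_model(
    length,
    vocab={"a", "b", "c"},
    max_seq_len=5,
    compiler_bos="BOS",
)

out = model.apply(["BOS", "a", "b", "b"])
print(out.decoded)  # per-position outputs of the compiled program
```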

Applying interpretable machine learning in computational biology—pitfalls, recommendations and opportunities for new developments

V Chen, M Yang, W Cui, JS Kim, A Talwalkar, J Ma - Nature methods, 2024 - nature.com
Recent advances in machine learning have enabled the development of next-generation
predictive models for complex computational biology problems, thereby spurring the use of …

Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla

T Lieberum, M Rahtz, J Kramár, N Nanda… - arXiv preprint arXiv …, 2023 - arxiv.org
Circuit analysis is a promising technique for understanding the internal mechanisms
of language models. However, existing analyses are done in small models far from the state …
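
Circuit analysis commonly relies on activation patching: cache an intermediate activation from a "clean" run, then overwrite the corresponding activation during a "corrupted" run to measure its causal effect. A self-contained sketch follows; the toy MLP and random inputs are stand-ins, not Chinchilla or the paper's multiple-choice setup.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
layer = model[1]  # patch at the post-ReLU hidden activation

clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)

cache = {}
def save_hook(mod, inp, out):
    cache["act"] = out.detach()  # cache the clean activation

def patch_hook(mod, inp, out):
    return cache["act"]  # replace the corrupted activation with the clean one

h = layer.register_forward_hook(save_hook)
model(clean)
h.remove()

h = layer.register_forward_hook(patch_hook)
patched_out = model(corrupted)
h.remove()

# The gap between the corrupted and patched outputs estimates how much this
# activation mediates the behavior under study.
print(model(corrupted) - patched_out)
```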

Sparse autoencoders find highly interpretable features in language models

H Cunningham, A Ewart, L Riggs, R Huben… - arXiv preprint arXiv …, 2023 - arxiv.org
One of the roadblocks to a better understanding of neural networks' internals is
polysemanticity, where neurons appear to activate in multiple, semantically distinct …