Challenges and applications of large language models
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …
AI alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
The clock and the pizza: Two stories in mechanistic explanation of neural networks
Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known
algorithms? Several recent studies, on tasks ranging from group operations to in-context …
Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods
T Hagendorff - arXiv preprint arXiv:2303.13988, 2023
Large language models (LLMs) are currently at the forefront of intertwining AI systems with
human communication and everyday life. Due to rapid technological advances and their …
Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
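To make the sparse-autoencoder idea in this and the later Cunningham et al. entry concrete, here is a minimal, illustrative PyTorch sketch; the class name, dimensions, and L1 coefficient are hypothetical and are not taken from either paper. The ReLU encoder plus L1 penalty is what pushes each activation vector to be explained by a small number of non-negative features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations (illustrative only)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # Non-negative feature activations, intended to be sparse per example.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse feature use.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Toy usage on random "activations"; in practice these would be residual-stream
# or MLP activations collected from a chosen layer of the model under study.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(64, 512)
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
loss.backward()
```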
Tracr: Compiled transformers as a laboratory for interpretability
We show how to "compile" human-readable programs into standard decoder-only
transformer models. Our compiler, Tracr, generates models with known structure. This …
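As a concrete illustration of the compiled-transformer idea, the sketch below adapts the fraction-of-previous-tokens example from the public Tracr repository; the module paths and function names (tracr.rasp, compiling.compile_rasp_to_model, .apply(...).decoded) are assumptions based on that repository's README and may differ across versions.

```python
# Illustrative sketch adapted from the Tracr repository's example;
# API names are assumed and may differ across Tracr versions.
from tracr.compiler import compiling
from tracr.rasp import rasp

def make_frac_prevs(bools: rasp.SOp) -> rasp.SOp:
    """RASP program: fraction of previous tokens for which `bools` is True."""
    prevs = rasp.Select(rasp.indices, rasp.indices, rasp.Comparison.LEQ)
    return rasp.numerical(rasp.Aggregate(prevs, rasp.numerical(bools)))

# Compile the human-readable program into a decoder-only transformer
# whose weights implement a known, inspectable algorithm.
model = compiling.compile_rasp_to_model(
    make_frac_prevs(rasp.tokens == "x"),
    vocab={"w", "x", "y", "z"},
    max_seq_len=10,
    compiler_bos="BOS",
)

print(model.apply(["BOS", "x", "y", "x"]).decoded)
```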
Applying interpretable machine learning in computational biology—pitfalls, recommendations and opportunities for new developments
Recent advances in machine learning have enabled the development of next-generation
predictive models for complex computational biology problems, thereby spurring the use of …
Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla
Circuit analysis is a promising technique for understanding the internal mechanisms
of language models. However, existing analyses are done in small models far from the state …
Sparse autoencoders find highly interpretable features in language models
H Cunningham, A Ewart, L Riggs, R Huben… - arXiv preprint arXiv …, 2023
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct …