Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
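
For orientation, here is a minimal PyTorch sketch of the decomposition this line of work trains: a vanilla ReLU + L1 sparse autoencoder over a model's activations. Gemma Scope itself uses a JumpReLU variant; the dimensions and coefficient below are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Vanilla sparse autoencoder over model activations.

    Decomposes an activation x into a sparse, overcomplete feature
    vector f, then reconstructs x from f. Dimensions are illustrative.
    """
    def __init__(self, d_model: int = 2304, d_sae: int = 16384):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_sae)
        self.W_dec = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.W_enc(x))   # sparse feature activations
        x_hat = self.W_dec(f)           # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that induces sparsity in f.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(-1).mean()
```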

KAN 2.0: Kolmogorov-Arnold networks meet science

Z Liu, P Ma, Y Wang, W Matusik, M Tegmark - arXiv preprint arXiv …, 2024 - arxiv.org
A major challenge of AI + Science lies in their inherent incompatibility: today's AI is primarily
based on connectionism, while science depends on symbolism. To bridge the two worlds …

RoseLoRA: Row and column-wise sparse low-rank adaptation of pre-trained language model for knowledge editing and fine-tuning

H Wang, T Liu, R Li, M Cheng, T Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Pre-trained language models, trained on large-scale corpora, demonstrate strong
generalizability across various NLP tasks. Fine-tuning these models for specific tasks …
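
The snippet cuts off before the method, so the following is only a toy illustration of the idea named in the title: a frozen weight plus a low-rank update that is masked to selected rows and columns, so only a small block of the weight matrix is effectively edited. The fixed index masks stand in for whatever selection criterion the paper actually uses and are an assumption.

```python
import torch
import torch.nn as nn

class SparseLoRALinear(nn.Module):
    """Toy row/column-sparse low-rank adapter on a frozen linear layer."""
    def __init__(self, base: nn.Linear, rank: int = 8, rows=None, cols=None):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pre-trained layer
        out_f, in_f = base.weight.shape
        # Standard LoRA init: update B @ A starts at zero.
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        # Binary masks zeroing the update outside the chosen rows/columns.
        row_mask = torch.ones(out_f) if rows is None else \
            torch.zeros(out_f).index_fill_(0, torch.tensor(rows), 1.0)
        col_mask = torch.ones(in_f) if cols is None else \
            torch.zeros(in_f).index_fill_(0, torch.tensor(cols), 1.0)
        self.register_buffer("row_mask", row_mask)
        self.register_buffer("col_mask", col_mask)

    def forward(self, x):
        # Low-rank update, zeroed outside the selected rows and columns.
        delta = (self.B @ self.A) * self.row_mask[:, None] * self.col_mask[None, :]
        return self.base(x) + x @ delta.T

# Usage: edit only rows 0-2 and columns 5-6 of a 768x768 weight.
layer = SparseLoRALinear(nn.Linear(768, 768), rank=8, rows=[0, 1, 2], cols=[5, 6])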

Recurrent neural networks learn to store and generate sequences using non-linear representations

R Csordás, C Potts, CD Manning, A Geiger - arXiv preprint arXiv …, 2024 - arxiv.org
The Linear Representation Hypothesis (LRH) states that neural networks learn to encode
concepts as directions in activation space, and a strong version of the LRH states that …
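
To make the hypothesis concrete: under the LRH, a concept's presence is read off linearly, as a projection of the activation onto a fixed direction in activation space. The direction and dimensions below are made up for illustration; the paper's point is that RNNs can also store such information in non-linear codes.

```python
import torch

d_model = 512
concept_dir = torch.randn(d_model)
concept_dir = concept_dir / concept_dir.norm()   # unit "concept direction"

def concept_score(activation: torch.Tensor) -> torch.Tensor:
    # Linear readout: dot product with the concept direction.
    return activation @ concept_dir

acts = torch.randn(10, d_model)   # a batch of hidden states
print(concept_score(acts))        # one scalar score per state
```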

Relational composition in neural networks: A survey and call to action

M Wattenberg, FB Viégas - arXiv preprint arXiv:2407.14662, 2024 - arxiv.org
Many neural nets appear to represent data as linear combinations of "feature vectors."
Algorithms for discovering these vectors have seen impressive recent success. However, we …

Disentangling dense embeddings with sparse autoencoders

C O'Neill, C Ye, K Iyer, JF Wu - arXiv preprint arXiv:2408.00657, 2024 - arxiv.org
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from
complex neural networks. We present one of the first applications of SAEs to dense text …

Evaluating open-source sparse autoencoders on disentangling factual knowledge in GPT-2 small

M Chaudhary, A Geiger - arXiv preprint arXiv:2409.04478, 2024 - arxiv.org
A popular new method in mechanistic interpretability is to train high-dimensional sparse
autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of …

A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions

O Shorinwa, Z Mei, J Lidard, AZ Ren… - arXiv preprint arXiv …, 2024 - arxiv.org
The remarkable performance of large language models (LLMs) in content generation,
coding, and common-sense reasoning has spurred widespread integration into many facets …

Robust AI-generated text detection by restricted embeddings

K Kuznetsov, E Tulchinskii, L Kushnareva… - arXiv preprint arXiv …, 2024 - arxiv.org
The growing amount and quality of AI-generated texts make detecting such content more
difficult. In most real-world scenarios, the domain (style and topic) of generated data and the …

SoK: On finding common ground in loss landscapes using deep model merging techniques

A Khan, T Nief, N Hudson, M Sakarvadia… - arXiv preprint arXiv …, 2024 - arxiv.org
Understanding neural networks is crucial to creating reliable and trustworthy deep learning
models. Most contemporary research in interpretability analyzes just one model at a time via …