Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
KAN 2.0: Kolmogorov-Arnold networks meet science
A major challenge of AI + Science lies in their inherent incompatibility: today's AI is primarily
based on connectionism, while science depends on symbolism. To bridge the two worlds …
RoseLoRA: Row and column-wise sparse low-rank adaptation of pre-trained language model for knowledge editing and fine-tuning
Pre-trained language models, trained on large-scale corpora, demonstrate strong
generalizability across various NLP tasks. Fine-tuning these models for specific tasks …
Recurrent neural networks learn to store and generate sequences using non-linear representations
The Linear Representation Hypothesis (LRH) states that neural networks learn to encode
concepts as directions in activation space, and a strong version of the LRH states that …
Relational composition in neural networks: A survey and call to action
M Wattenberg, FB Viégas - arXiv preprint arXiv:2407.14662, 2024 - arxiv.org
Many neural nets appear to represent data as linear combinations of "feature vectors."
Algorithms for discovering these vectors have seen impressive recent success. However, we …
Disentangling dense embeddings with sparse autoencoders
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from
complex neural networks. We present one of the first applications of SAEs to dense text …
Evaluating open-source sparse autoencoders on disentangling factual knowledge in GPT-2 small
M Chaudhary, A Geiger - arXiv preprint arXiv:2409.04478, 2024 - arxiv.org
A popular new method in mechanistic interpretability is to train high-dimensional sparse
autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of …
A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions
The remarkable performance of large language models (LLMs) in content generation,
coding, and common-sense reasoning has spurred widespread integration into many facets …
Robust AI-generated text detection by restricted embeddings
K Kuznetsov, E Tulchinskii, L Kushnareva… - arXiv preprint arXiv …, 2024 - arxiv.org
The growing amount and quality of AI-generated texts make detecting such content more
difficult. In most real-world scenarios, the domain (style and topic) of generated data and the …
SoK: On finding common ground in loss landscapes using deep model merging techniques
A Khan, T Nief, N Hudson, M Sakarvadia… - arXiv preprint arXiv …, 2024 - arxiv.org
Understanding neural networks is crucial to creating reliable and trustworthy deep learning
models. Most contemporary research in interpretability analyzes just one model at a time via …