Eliciting Language Model Behaviors using Reverse Language Models J Pfau, A Infanger, A Sheshadri, A Panda, J Michael, C Huebner NeurIPS SOLAR Workshop, 2023 | 6 | 2023 |
A mechanistic analysis of a transformer trained on a symbolic multi-step reasoning task J Brinkmann, A Sheshadri, V Levoso, P Swoboda, C Bartelt arXiv preprint arXiv:2402.11917, 2024 | 5 | 2024 |
Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ... arXiv preprint arXiv:2407.15549, 2024 | 1 | 2024 |
Robust Unlearning via Mechanistic Localizations PH Guo, A Syed, A Sheshadri, A Ewart, GK Dziugaite ICML 2024 Workshop on Mechanistic Interpretability, 2024 | | 2024 |
Robust Knowledge Unlearning via Mechanistic Localizations PH Guo, A Syed, A Sheshadri, A Ewart, GK Dziugaite ICML 2024 Next Generation of AI Safety Workshop | | |
Backward Chaining Circuits in a Transformer Trained on a Symbolic Reasoning Task J Brinkmann, A Sheshadri, V Levoso, P Swoboda, C Bartelt ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation … | | |