Title | Authors | Venue | Cited by | Year |
Eliciting latent predictions from transformers with the tuned lens | N Belrose, Z Furman, L Smith, D Halawi, I Ostrovsky, L McKinney, ... | arXiv preprint arXiv:2303.08112, 2023 | 89 | 2023 |
Overthinking the truth: Understanding how language models process false demonstrations | D Halawi, JS Denain, J Steinhardt | ICLR 2024, 2023 | 23 | 2023 |
Approaching Human-Level Forecasting with Language Models | D Halawi, F Zhang, C Yueh-Han, J Steinhardt | arXiv preprint arXiv:2402.18563, 2024 | 8 | 2024 |
Verifying source citations in the hadith literature | M Syed, D Halawi, B Sadeghi, N Saquib | Journal of Medieval Worlds 1 (3), 5-20, 2019 | 4 | 2019 |
Trophic analysis of a historical network reveals temporal information | C Shuaib, M Syed, D Halawi, N Saquib | Applied Network Science 7 (1), 31, 2022 | 3 | 2022 |
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation | D Halawi, A Wei, E Wallace, TT Wang, N Haghtalab, J Steinhardt | ICML 2024, 2024 | | 2024 |
Dominion: A New Frontier for AI Research | D Halawi, A Sarmasi, S Saltzen, J McCoy | CoRL 2022: Workshop on Strategic Multi-Agent Interactions, 2022 | | 2022 |