Open problems and fundamental limitations of reinforcement learning from human feedback S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... arXiv preprint arXiv:2307.15217, 2023 | 280 | 2023 |
The geometry of truth: Emergent linear structure in large language model representations of true/false datasets S Marks, M Tegmark arXiv preprint arXiv:2310.06824, 2023 | 52 | 2023 |
Sparse feature circuits: Discovering and editing interpretable causal graphs in language models S Marks, C Rager, EJ Michaud, Y Belinkov, D Bau, A Mueller arXiv preprint arXiv:2403.19647, 2024 | 22 | 2024 |
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ... arXiv preprint arXiv:2406.10162, 2024 | 4 | 2024 |
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data J Treutlein, D Choi, J Betley, C Anil, S Marks, RB Grosse, O Evans arXiv preprint arXiv:2406.14546, 2024 | 2 | 2024 |
Measuring progress in dictionary learning for language model interpretability with board game models A Karvonen, B Wright, C Rager, R Angell, J Brinkmann, L Smith, ... arXiv preprint arXiv:2408.00113, 2024 | 1 | 2024 |
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability A Mueller, J Brinkmann, M Li, S Marks, K Pal, N Prakash, C Rager, ... arXiv preprint arXiv:2408.01416, 2024 | | 2024 |
NNsight and NDIF: Democratizing Access to Foundation Model Internals J Fiotto-Kaufman, AR Loftus, E Todd, J Brinkmann, C Juang, K Pal, ... arXiv preprint arXiv:2407.14561, 2024 | | 2024 |
Prismatic F-crystals and Lubin-Tate (φq, Γ)-modules S Marks arXiv preprint arXiv:2303.07620, 2023 | | 2023 |
Laurent F-Crystals and Lubin-Tate (φq, Γ)-Modules S Marks Harvard University, 2023 | | 2023 |
p-adic Modular Forms à la Serre S Marks | | 2020 |
Derivatives of p-adic Siegel Eisenstein series and p-adic degrees of arithmetic cycles SP Marks Princeton University, 2019 | | 2019 |
p-Adic Properties of Hauptmoduln with Applications to Moonshine RC Chen, S Marks, M Tyler SIGMA. Symmetry, Integrability and Geometry: Methods and Applications 15, 033, 2019 | | 2019 |