Stephen Casper

Cited by

	All	Since 2019
Citations	794	793
h-index	13	13
i10-index	15	15

Year	2020	2021	2022	2023	2024
Citations	6	14	42	209	522

Public access

1 article available

0 articles not available

Based on funding mandates

Co-authors

Dylan Hadfield-Menell - Massachusetts Institute of Technology - Verified email at csail.mit.edu
Gabriel Kreiman - Professor, Harvard Medical School and Children's Hospital - Verified email at tch.harvard.edu
Daniel Filan - PhD Student, UC Berkeley - Verified email at berkeley.edu
Andrew Critch - UC Berkeley, Department of Electrical Engineering and Computer Sciences - Verified email at eecs.berkeley.edu
Stuart Russell - Professor of Computer Science, University of California, Berkeley - Verified email at cs.berkeley.edu
Shlomi Hod - PhD Candidate, Boston University - Verified email at bu.edu
Cody Wild - Google DeepMind - Verified email at google.com
Anson Ho - Epoch - Verified email at epochai.org
Xavier Boix - MIT - Verified email at mit.edu
Kasper Vinken - Harvard Medical School - Verified email at hms.harvard.edu
Arush Tagade - ML Researcher, Leap Laboratories - Verified email at leap-labs.com
Martin Schrimpf - EPFL - Verified email at epfl.ch
Kevin Honglin Zhang - Department of Economics, Illinois State University - Verified email at ilstu.edu
Kaivalya Hariharan - MIT CSAIL - Verified email at mit.edu
Jason Lin - Google / Stanford - Verified email at stanford.edu
Gatlen Culp - Massachusetts Institute of Technology - Verified email at mit.edu
Joe Kwon - MIT - Verified email at csail.mit.edu
Soroush Pour - Harmony Intelligence - Verified email at soroushjp.com
Javier Rando - ETH Zurich - Verified email at ai.ethz.ch
Rusheb Shah - Apollo Research - Verified email at apolloresearch.ai

Stephen Casper

PhD student, MIT

Verified email at mit.edu

AI safety, AI responsibility, red-teaming, robustness, auditing


Title	Cited by	Year
Open problems and fundamental limitations of reinforcement learning from human feedback. S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... arXiv preprint arXiv:2307.15217, 2023	259	2023
Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. T Räuker, A Ho, S Casper, D Hadfield-Menell. 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 464-483, 2023	127	2023
Explore, establish, exploit: Red teaming language models from scratch. S Casper, J Lin, J Kwon, G Culp, D Hadfield-Menell. arXiv preprint arXiv:2306.09442, 2023	54	2023
Scalable and transferable black-box jailbreaks for language models via persona modulation. R Shah, S Pour, A Tagade, S Casper, J Rando. arXiv preprint arXiv:2311.03348, 2023	46	2023
Rethinking machine unlearning for large language models. S Liu, Y Yao, J Jia, S Casper, N Baracaldo, P Hase, X Xu, Y Yao, H Li, ... arXiv preprint arXiv:2402.08787, 2024	43	2024
Foundational challenges in assuring alignment and safety of large language models. U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ... arXiv preprint arXiv:2404.09932, 2024	34	2024
Frivolous units: Wider networks are not really that wide. S Casper, X Boix, V D'Amario, L Guo, M Schrimpf, K Vinken, G Kreiman. Proceedings of the AAAI Conference on Artificial Intelligence 35 (8), 6921-6929, 2021	28*	2021
Clusterability in neural networks. D Filan, S Casper, S Hod, C Wild, A Critch, S Russell. arXiv preprint arXiv:2103.03386, 2021	27	2021
Red teaming deep neural networks with feature synthesis tools. S Casper, T Bu, Y Li, J Li, K Zhang, K Hariharan, D Hadfield-Menell. Advances in Neural Information Processing Systems 36, 80470-80516, 2023	25*	2023
Robust feature-level adversaries are interpretability tools. S Casper, M Nadeau, D Hadfield-Menell, G Kreiman. Advances in Neural Information Processing Systems 35, 33093-33106, 2022	25	2022
Black-box access is insufficient for rigorous AI audits. S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ... The 2024 ACM Conference on Fairness, Accountability, and Transparency, 2254-2272, 2024	19	2024
Eight methods to evaluate robust unlearning in LLMs. A Lynch, P Guo, A Ewart, S Casper, D Hadfield-Menell. arXiv preprint arXiv:2402.16835, 2024	16	2024
Probing neural dialog models for conversational understanding. A Saleh, T Deutsch, S Casper, Y Belinkov, S Shieber. arXiv preprint arXiv:2006.08331, 2020	15	2020
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? K Liu, S Casper, D Hadfield-Menell, J Andreas. arXiv preprint arXiv:2312.03729, 2023	13	2023
Detecting modularity in deep neural networks. S Hod, S Casper, D Filan, C Wild, A Critch, S Russell	11*	2021
Graphical clusterability and local specialization in deep neural networks. S Casper, S Hod, D Filan, C Wild, A Critch, S Russell. ICLR 2022 Workshop on PAIR^2Struct: Privacy …, 2022	9	2022
Diagnostics for deep neural networks with automated copy/paste attacks. S Casper, K Hariharan, D Hadfield-Menell. arXiv preprint arXiv:2211.10024, 2022	8	2022
Quantifying local specialization in deep neural networks. S Hod, D Filan, S Casper, A Critch, S Russell. arXiv preprint arXiv:2110.08058, 2021	8	2021
Defending Against Unforeseen Failure Modes with Latent Adversarial Training. S Casper, L Schulze, O Patel, D Hadfield-Menell. arXiv preprint arXiv:2403.05030, 2024	7	2024
Open problems and fundamental limitations of reinforcement learning from human feedback. CoRR, abs/2307.15217, 2023. doi: 10.48550 S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... arXiv preprint ARXIV.2307.15217	7
