Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models • Paper • 2403.19647 • Published Mar 28, 2024
Opening the AI black box: program synthesis via mechanistic interpretability • Paper • 2402.05110 • Published Feb 7, 2024
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback • Paper • 2307.15217 • Published Jul 27, 2023