Transferring Activation Features for model interventions • Collection • 22 items • Updated 26 days ago
Blog: Activations transfer for model interventions • Collection • Collects backdoor datasets, language models, and transfer mappings between these spaces. • 6 items • Updated May 10
Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models • Paper • 2310.08164 • Published Oct 12, 2023