Representation Engineering: A Top-Down Approach to AI Transparency Paper • 2310.01405 • Published Oct 2, 2023
Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks Paper • 1910.01279 • Published Oct 3, 2019
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal Paper • 2402.04249 • Published Feb 6, 2024
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning Paper • 2403.03218 • Published Mar 5, 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet Paper • 2408.15221 • Published Aug 27, 2024
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents Paper • 2410.13886 • Published Oct 11, 2024
Out-of-Distribution Detection & Applications With Ablated Learned Temperature Energy Paper • 2401.12129 • Published Jan 22, 2024
Representation Learning in Continuous-Time Dynamic Signed Networks Paper • 2207.03408 • Published Jul 7, 2022
A Careful Examination of Large Language Model Performance on Grade School Arithmetic Paper • 2405.00332 • Published May 1, 2024
Planning In Natural Language Improves LLM Search For Code Generation Paper • 2409.03733 • Published Sep 5, 2024
Learning Goal-Conditioned Representations for Language Reward Models Paper • 2407.13887 • Published Jul 18, 2024
A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift Paper • 2311.14743 • Published Nov 21, 2023
Federated Reconnaissance: Efficient, Distributed, Class-Incremental Learning Paper • 2109.00150 • Published Sep 1, 2021
Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data Paper • 2409.00238 • Published Aug 30, 2024