Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark Paper • 2304.03279 • Published Apr 6, 2023
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training Paper • 2406.10670 • Published Jun 15, 2024