PatronusAI/world_model_corpus
Viewer • Updated • 284k • 43
LLM Evaluation
Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis
MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments