Paper: StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models (arXiv:2403.07714)
Replicates the Play2Prompt paper conditions on StableToolBench using Llama-3.1-8B-Instruct.
Designed for extensibility: the four conditions are controlled by two pluggable components, tool descriptions and in-context examples. To test your own description or example types, drop replacement files into p2p_data/descriptions/ and p2p_data/examples/.
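As a rough illustration of how two pluggable components can yield four conditions, the sketch below enumerates a 2x2 grid of description and example variants. It is not the code in pipeline/: the variant labels (`original`, `custom`, `none`), the `.json` file names, and the `enumerate_conditions` helper are all assumptions made for this example, and only the two directory paths come from the text above.

```python
from itertools import product
from pathlib import Path

# Hypothetical variant labels; the real pipeline may name its variants differently.
DESCRIPTION_VARIANTS = ("original", "custom")
EXAMPLE_VARIANTS = ("none", "custom")


def enumerate_conditions(data_root="p2p_data"):
    """Return one config dict per (description variant, example variant) pair."""
    root = Path(data_root)
    conditions = []
    for desc, ex in product(DESCRIPTION_VARIANTS, EXAMPLE_VARIANTS):
        conditions.append({
            "name": f"desc-{desc}__ex-{ex}",
            # Assumed layout: one file per variant inside each directory.
            "descriptions": root / "descriptions" / f"{desc}.json",
            # "none" means the condition runs without in-context examples.
            "examples": None if ex == "none" else root / "examples" / f"{ex}.json",
        })
    return conditions


if __name__ == "__main__":
    for cond in enumerate_conditions():
        print(cond["name"], cond["descriptions"], cond["examples"])
```

Under this assumed layout, adding a new description type would amount to placing one more file in p2p_data/descriptions/ and adding its label to `DESCRIPTION_VARIANTS`.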
See the pipeline/ directory for all source code.