A new benchmark (DPAB-α) has been released that evaluates LLM function calling in both Pythonic and JSON approaches.
It shows that Pythonic function calling often outperforms traditional JSON-based methods, especially for complex multi-step tasks.
Key findings from benchmarks:
- Claude 3.5 Sonnet leads with 87% on Pythonic vs 45% on JSON
- Smaller models show impressive results (Dria-Agent-α-3B: 72% Pythonic)
- Even larger models like DeepSeek V3 (685B) show significant gaps (63% Pythonic vs 33% JSON)
If you're building or using LLM agents, these results suggest that how you implement function calling could impact performance. It might be worth reconsidering JSON-only approaches.
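To make the distinction concrete, here is a rough sketch of the two styles. It is not the benchmark's harness: the tools (get_weather, send_alert), the helper names, and the sample model outputs are all hypothetical. In the JSON style the model emits one structured call per turn and needs another round trip to use the result; in the Pythonic style it emits a short program that chains the calls and branches on intermediate values.

```python
import json

# Hypothetical tools, for illustration only (not part of DPAB-α).
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 31}

def send_alert(message: str) -> str:
    return f"sent: {message}"

TOOLS = {"get_weather": get_weather, "send_alert": send_alert}

def run_json_call(raw: str):
    """JSON style: parse one structured call and dispatch it by name."""
    call = json.loads(raw)
    return TOOLS[call["name"]](**call["arguments"])

def run_pythonic(code: str) -> dict:
    """Pythonic style: execute a model-written snippet with the tools in scope.

    Emptying __builtins__ is NOT a real sandbox; isolate untrusted code properly.
    """
    namespace = {"__builtins__": {}, **TOOLS}
    exec(code, namespace)
    return namespace

# JSON style: two model turns, with the first result fed back in between.
step1 = run_json_call('{"name": "get_weather", "arguments": {"city": "Izmir"}}')
second_turn = json.dumps({
    "name": "send_alert",
    "arguments": {"message": "Heat warning for " + step1["city"]},
})
step2 = run_json_call(second_turn)

# Pythonic style: one model turn containing the whole multi-step plan.
model_code = '''
weather = get_weather("Izmir")
if weather["temp_c"] > 30:
    result = send_alert("Heat warning for " + weather["city"])
'''
pythonic_result = run_pythonic(model_code)["result"]
```

Both paths end with the same alert, but the Pythonic version gets there in a single generation, which is one plausible reason the gap widens on multi-step tasks.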
The benchmark: https://github.com/firstbatchxyz/function-calling-eval
Blog post: https://huggingface.co/blog/andthattoo/dpab-a