A new benchmark (DPAB-α) has been released that evaluates LLM function calling in both Pythonic and JSON approaches.
It shows that Pythonic function calling often outperforms traditional JSON-based methods, especially for complex multi-step tasks.
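To make the distinction concrete, here is a minimal sketch of the two styles on a toy multi-step task. It is not taken from DPAB-α: the tools `get_price` and `add`, the helpers `run_json_call` and `run_pythonic`, and the example model outputs are all hypothetical, and the `exec` call is for illustration only, not a safe sandbox.

```python
import json

# Hypothetical tools, invented for this illustration (not part of DPAB-α).
def get_price(item: str) -> float:
    prices = {"apple": 0.5, "pear": 0.75}
    return prices[item]

def add(a: float, b: float) -> float:
    return a + b

TOOLS = {"get_price": get_price, "add": add}

def run_json_call(raw: str):
    """JSON style: the model emits one structured call; the caller dispatches it."""
    call = json.loads(raw)
    return TOOLS[call["name"]](**call["arguments"])

def run_pythonic(code: str) -> dict:
    """Pythonic style: the model emits a short program that chains calls itself."""
    # Expose only the tools to the generated code. This is NOT a real sandbox.
    scope = {"__builtins__": {}, **TOOLS}
    exec(code, scope)
    return scope

# JSON: three separate calls, with results passed back between them.
apple = run_json_call('{"name": "get_price", "arguments": {"item": "apple"}}')
pear = run_json_call('{"name": "get_price", "arguments": {"item": "pear"}}')
total_json = run_json_call(
    json.dumps({"name": "add", "arguments": {"a": apple, "b": pear}})
)

# Pythonic: the same task expressed as a single generated snippet.
model_output = "total = add(get_price('apple'), get_price('pear'))"
total_py = run_pythonic(model_output)["total"]

print(total_json, total_py)
```

In the JSON version, every intermediate result has to travel back through the caller before the next call can be made, while the Pythonic version composes the calls in one output. That difference grows with the number of steps, which is one plausible reading of why the gap is largest on multi-step tasks.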
Key findings from the benchmark:
— Claude 3.5 Sonnet leads with 87% on Pythonic vs 45% on JSON
— Smaller models show impressive results (Dria-Agent-α-3B: 72% Pythonic)
— Even larger models like DeepSeek V3 (685B) show significant gaps (63% Pythonic vs 33% JSON)
If you're building or using LLM agents, these results suggest that how you implement function calling can affect performance, so it may be worth reconsidering JSON-only approaches.
The benchmark: https://github.com/firstbatchxyz/function-calling-eval
Blog post: https://huggingface.co/blog/andthattoo/dpab-a