Berkeley Function Calling Leaderboard (BFCL) V4
BFCL: From Tool Use to Agentic Evaluation of Large Language Models

The Berkeley Function Calling Leaderboard (BFCL) V4 evaluates an LLM's ability to call functions (also known as tools) accurately. The leaderboard consists of real-world data and is updated periodically. For more information on the evaluation dataset and methodology, please refer to our blogs: BFCL-v1 introduced AST as an evaluation metric, BFCL-v2 introduced enterprise and OSS-contributed functions, BFCL-v3 introduced multi-turn interactions, and BFCL-v4 introduced holistic agentic evaluation. Check out the code and data.

Last Updated: 2025-12-16 [Change Log]

- FC = native support for function/tool calling. Prompt = a workaround for function calling that uses the model's normal text-generation capability.
- Cost is an estimate of the cost of running the entire benchmark, in USD. Latency is measured in seconds.
- Overall Accuracy is the unweighted average of all the sub-categories. For details on score composition, please refer to our blog.
- Format sensitivity test cases are only supported for prompt (non-FC) models.
- Click on a column header to sort. If you would like to add your model or contribute test cases, please contact us via Discord.
- Models are evaluated using commit f7cf735. All the model responses we obtained are available here. To reproduce the results, either check out our codebase at this checkpoint or install the PyPI package: pip install bfcl-eval==2025.12.17.

Wagon Wheel

The following chart compares the models on a few metrics. You can select and deselect which models to compare. More information on each metric can be found in the blog.

Error Type Analysis

This interactive treemap shows the distribution of error types across different models. The size of each block represents the number of errors encountered by that model.
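Since Overall Accuracy is defined as the unweighted average of the sub-category scores, it can be sketched in a few lines. The sub-category names and score values below are hypothetical placeholders, not actual BFCL results or category identifiers:

```python
# Hypothetical sub-category accuracy scores (illustrative only, not real BFCL data).
sub_scores = {
    "simple": 0.92,
    "multiple": 0.88,
    "parallel": 0.85,
    "multi_turn": 0.70,
    "agentic": 0.65,
}

# Overall Accuracy: the unweighted average of all sub-category scores,
# i.e. every category contributes equally regardless of its test-case count.
overall = sum(sub_scores.values()) / len(sub_scores)
print(f"Overall accuracy: {overall:.4f}")  # → 0.8000
```

Because the average is unweighted, a small agentic category moves the overall score just as much as a large single-turn category.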