Berkeley Function Calling Leaderboard (BFCL) V4

Explore The Berkeley Function Calling Leaderboard (also called The Berkeley Tool Calling Leaderboard...

Mar 24, 2026 · Admin

BFCL: From Tool Use to Agentic Evaluation of Large Language Models

Gorilla LLM Team

The Berkeley Function Calling Leaderboard (BFCL) V4 evaluates an LLM's ability to call functions (also known as tools) accurately. The leaderboard consists of real-world data and will be updated periodically. For more information on the evaluation dataset and methodology, please refer to our blogs: BFCL-v1, introducing AST as an evaluation metric; BFCL-v2, introducing enterprise and OSS-contributed functions; BFCL-v3, introducing multi-turn interactions; and BFCL-v4, introducing holistic agentic evaluation. Check out the code and data.

Last Updated: 2025-12-16 [Change Log]

FC = native support for function/tool calling.
Prompt = workaround for function calling, using the model's normal text generation capability.
Cost is an estimate of the cost of running the entire benchmark, in USD.
Latency is measured in seconds.
Overall Accuracy is the unweighted average of all the sub-categories. For details on score composition, please refer to our blog.
Format sensitivity test cases are only supported for prompt (non-FC) models.

If you would like to add your model or contribute test cases, please contact us via Discord.

Models are evaluated using commit f7cf735. All the model responses we obtained are available here. To reproduce the results, either check out our codebase at this checkpoint or install the PyPI package with pip install bfcl-eval==2025.12.17.

Wagon Wheel

The following chart compares the models on a few metrics. You can select and deselect which models to compare. More information on each metric can be found in the blog.

Error Type Analysis

This interactive treemap shows the distribution of error types across different models. The size of each block represents the number of errors encountered by that...
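The page defines Overall Accuracy as the unweighted average of all the sub-categories. A minimal sketch of that calculation follows. It assumes each sub-category score is a percentage, and the function name, category names, and scores are hypothetical placeholders rather than BFCL's own code or results. The blog remains the reference for the actual score composition.

```python
from statistics import fmean


def overall_accuracy(sub_category_scores: dict[str, float]) -> float:
    """Return the unweighted mean of per-sub-category accuracies.

    Every sub-category counts equally, regardless of how many
    test cases it contains.
    """
    if not sub_category_scores:
        raise ValueError("need at least one sub-category score")
    return fmean(sub_category_scores.values())


# Hypothetical sub-category names and scores, for illustration only.
example_scores = {
    "category_a": 80.0,
    "category_b": 60.0,
    "category_c": 70.0,
}

print(overall_accuracy(example_scores))
```

Because the average is unweighted, a small sub-category moves the overall number as much as a large one. This is worth keeping in mind when comparing models that differ mainly in one category.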
