
AlpacaEval Leaderboard
An Automatic Evaluator for Instruction-following Language Models

Length-controlled (LC) win rates alleviate the length bias of GPT-4, but GPT-4 may still favor models finetuned on its outputs.

Baseline: GPT-4 Preview (11/06) | Auto-annotator: GPT-4 Preview (11/06)
Leaderboard columns: Rank, Model Name, LC Win Rate, Win Rate

About AlpacaEval

AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, and reliable. It is based on the AlpacaFarm evaluation set, which tests the ability of models to follow general user instructions. Model responses to these instructions are compared to reference responses (Davinci003 for AlpacaEval, GPT-4 Preview for AlpacaEval 2.0) by the provided GPT-4 based auto-annotators, which yields the win rates presented above. AlpacaEval displays a high agreement rate with ground-truth human annotations, and leaderboard rankings on AlpacaEval are highly correlated with leaderboard rankings based on human annotators. Please see our documentation for more details on our analysis.

Adding new models

We welcome new model contributions to the leaderboard from the community! To do so, please follow the steps in the contributions section. Specifically, you'll need to run the model on the evaluation set, auto-annotate the outputs, and submit a PR with the model config and leaderboard results. We've also set up a Discord for community support and discussion.

Adding new evaluators or eval sets

We also welcome contributions for new evaluators or new eval sets! For making new evaluators, we release our ground-truth human annotations and comparison metrics. We also release a rough guide to follow for making new eval sets. We specifically encourage contributions for harder instruction distributions and for safety testing of LLMs.

AlpacaEval limitations

While AlpacaEval provides a useful comparison of model capabilities in following instructions, it is not a comprehensive or...
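As a rough illustration of the win rates described above, the sketch below averages pairwise auto-annotator preferences into a percentage. The function name `win_rate` and the 1.0 / 0.5 / 0.0 encoding of preferences are assumptions made for this example, not AlpacaEval's actual API or data format, and the sketch does not attempt the length-controlled (LC) variant.

```python
from statistics import mean


def win_rate(preferences):
    """Return the share of comparisons won by the model, as a percentage.

    Assumed (hypothetical) encoding of each comparison:
      1.0 -> the annotator preferred the model's response
      0.0 -> the annotator preferred the reference response
      0.5 -> tie
    """
    if not preferences:
        raise ValueError("need at least one comparison")
    return 100.0 * mean(preferences)


# Hypothetical annotations for four instructions: two wins, one loss, one tie.
example = [1.0, 0.0, 1.0, 0.5]
print(win_rate(example))  # prints 62.5
```

Under this encoding, a model that ties with the reference on every instruction would score 50, which is the natural reading of a win rate against a fixed baseline.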