FELM: Benchmarking Factuality Evaluation of Large Language Models

Mar 24, 2026 · Admin

Shiqi Chen¹, Yiran Zhao³, Jinghan Zhang², I-Chun Chern⁴, Siyang Gao¹, Pengfei Liu⁵, Junxian He²

¹City University of Hong Kong ²Hong Kong University of Science and Technology ³National University of Singapore ⁴Carnegie Mellon University ⁵Shanghai Jiaotong University

Video · 📖 Paper · Code · 🤗 Dataset

Abstract

FELM is a meta-benchmark for evaluating factuality evaluators of large language models. Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. The evaluators that assess factuality, however, themselves require suitable evaluation to gauge progress and foster advancement. This direction remains under-explored, creating substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated by LLMs and annotate factuality labels in a fine-grained, segment-level manner. FELM covers five distinct domains: World Knowledge, Science/Technology, Writing/Recommendation, Reasoning, and Math. We gather prompts for each domain from various sources, including standard datasets such as TruthfulQA, online platforms such as GitHub repositories, ChatGPT generation, and prompts drafted by the authors. We then obtain responses from ChatGPT for these prompts.

[Figure: Examples from each domain in FELM]

Dataset Statistics

Dataset Snapshot

Category            | Data
--------------------|-----
Number of Instances | 847
Number of Fields    | 5
Labeled Classes     | 2
Number of Labels    | 4427

Descriptive Statistics

Statistic         | All  | World Knowledge | Reasoning | Math | Science/Tech | Writing/Recommendation
------------------|------|-----------------|-----------|------|--------------|-----------------------
Segments          | 4427 | 532             | 1025      | 599  | 683          | 1588
Positive segments | 3642 | 385             | …         | …    | …            | …

(The source text is truncated after the first entries of the "Positive segments" row.)
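Because FELM labels individual response segments rather than whole responses, a factuality evaluator is naturally scored at the segment level. The sketch below shows one plausible way to do this: it is not the paper's official scoring code, and the toy labels are invented for illustration. Since positive (factually correct) segments dominate the dataset (3,642 of 4,427), raw accuracy alone is misleading, so the sketch also reports precision, recall, and F1 on the error class.

```python
from typing import List, Dict

def score_factuality_judge(gold: List[bool], pred: List[bool]) -> Dict[str, float]:
    """Score an evaluator's segment-level factuality predictions.

    True = segment is factually correct, False = segment contains an error.
    Precision/recall/F1 are computed on the (minority) error class.
    """
    assert len(gold) == len(pred), "one prediction per annotated segment"
    tp = sum(1 for g, p in zip(gold, pred) if not g and not p)  # errors correctly flagged
    fp = sum(1 for g, p in zip(gold, pred) if g and not p)      # correct segments flagged as errors
    fn = sum(1 for g, p in zip(gold, pred) if not g and p)      # errors the judge missed
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical example: six segments, two of which contain factual errors.
gold = [True, True, False, True, False, True]
pred = [True, False, False, True, False, True]   # judge also misflags segment 2
print(score_factuality_judge(gold, pred))
```

On this toy input the judge catches both errors (recall 1.0) but raises one false alarm, so error-class F1 is 0.8 while accuracy is 5/6, illustrating why class-sensitive metrics matter on an imbalanced benchmark like FELM.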
