MMToM-QA: Multimodal Theory of Mind Question Answering

Mar 24, 2026 | Admin

Chuanyang Jin (NYU), Yutong Wu (Harvard), Jing Cao (MIT), Jiannan Xiang (UCSD), Yen-Ling Kuo (UVA), Zhiting Hu (UCSD), Tomer Ullman (Harvard), Antonio Torralba (MIT), Joshua Tenenbaum (MIT), Tianmin Shu (JHU)

ACL 2024 Outstanding Paper Award

MMToM-QA systematically evaluates the cognitive ability to understand people's minds, both on multimodal data and on different kinds of unimodal data. The questions are categorized into seven types and evaluate belief inference and goal inference in rich and diverse situations. For a detailed explanation of the evaluation metric and an analysis of the results, please refer to our blog and paper. Instructions for using or submitting to the MMToM-QA benchmark are provided with the benchmark.

Leaderboard

Method                                                Belief  Goal  All

Multimodal
Human                                                 97.5    88.5  93.0
AutoToM + Model Spec. (w/ GPT-4o), Zhang et al., '25  94.0    65.7  79.8
BIP-ALM (w/ LLaMA 2), Jin et al., '24                 80.3    73.3  76.7
AutoToM (w/ GPT-4o), Zhang et al., '25                88.7    62.3  75.5
o3-mini                                               88.7    40.7  64.7
Gemini 2.0 Flash Thinking                             73.3    34.7  54.0
SimToM (w/ GPT-4o), Wilf et al., '23                  75.7    26.3  51.0
Gemini 2.0 Pro                                        57.0    44.7  50.8
Gemini 2.0 Flash                                      62.7    33.3  48.0
InstructBLIP                                          48.7    44.7  46.7
GPT-4o                                                55.7    32.3  44.0
Llama 3.1 70B                                         51.3    36.3  43.8
LLaVA                                                 43.0    44.0  43.5
Video-LLaMA 2                                         42.0    38.3  40.2
GPT-4V                                                55.3    34.7  40.0

Text Only
Human                                                 91.0    74.0  82.5
o1*                                                   95.1    59.2  76.5
o3-mini*                                              97.1    44.9  71.5
BIP-ALM (w/ LLaMA 2), Jin et al., '24                 82.3    58.7  70.5
...
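As a rough illustration of how Belief, Goal, and All columns like those in the leaderboard could be produced, the sketch below computes percentage accuracy per question category from a list of predictions. It assumes the scores are plain accuracy over multiple-choice questions; the record layout, the category labels "belief" and "goal", and the function name accuracy_by_category are hypothetical and are not the benchmark's official file format or scoring script.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return percentage accuracy per category plus an overall "all" score.

    `records` is an iterable of (category, predicted, gold) tuples.
    This layout is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        # Count every question once for its category and once overall.
        total[category] += 1
        total["all"] += 1
        if predicted == gold:
            correct[category] += 1
            correct["all"] += 1
    # Report accuracy as a percentage rounded to one decimal place.
    return {name: round(100.0 * correct[name] / total[name], 1) for name in total}


# Toy data: two belief questions (one answered correctly) and two goal
# questions (both answered correctly). These are not benchmark results.
example = [
    ("belief", "a", "a"),
    ("belief", "b", "a"),
    ("goal", "b", "b"),
    ("goal", "a", "a"),
]
scores = accuracy_by_category(example)
print(scores)
```

With equal numbers of belief and goal questions, the overall score computed this way equals the mean of the two category scores; with unequal counts it is weighted by the number of questions in each category.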
