MMToM-QA: Multimodal Theory of Mind Question Answering
Chuanyang Jin¹, Yutong Wu², Jing Cao³, Jiannan Xiang⁴, Yen-Ling Kuo⁵, Zhiting Hu⁴, Tomer Ullman², Antonio Torralba³, Joshua Tenenbaum³, Tianmin Shu⁶

¹NYU, ²Harvard, ³MIT, ⁴UCSD, ⁵UVA, ⁶JHU

ACL 2024 Outstanding Paper Award

Leaderboard

MMToM-QA systematically evaluates the cognitive ability to understand people's minds, both on multimodal data and on different unimodal data. The questions are categorized into seven types, evaluating belief inference and goal inference in rich and diverse situations. For a detailed explanation of the evaluation metric and analysis of the results, please refer to our blog and paper. The instructions for using or submitting to the MMToM-QA benchmark are available here.

Method                                                  Belief   Goal    All

Multimodal
Human                                                    97.5    88.5    93
AutoToM + Model Spec. (w/ GPT-4o), Zhang et al., '25     94.0    65.7    79.8
BIP-ALM (w/ LLaMA 2), Jin et al., '24                    80.3    73.3    76.7
AutoToM (w/ GPT-4o), Zhang et al., '25                   88.7    62.3    75.5
o3-mini                                                  88.7    40.7    64.7
Gemini 2.0 Flash Thinking                                73.3    34.7    54.0
SimToM (w/ GPT-4o), Wilf et al., '23                     75.7    26.3    51.0
Gemini 2.0 Pro                                           57.0    44.7    50.8
Gemini 2.0 Flash                                         62.7    33.3    48.0
InstructBLIP                                             48.7    44.7    46.7
GPT-4o                                                   55.7    32.3    44.0
Llama 3.1 70B                                            51.3    36.3    43.8
LLaVA                                                    43.0    44.0    43.5
Video-LLaMA 2                                            42.0    38.3    40.2
GPT-4V                                                   55.3    34.7    40.0

Text Only
Human                                                    91.0    74.0    82.5
o1*                                                      95.1    59.2    76.5
o3-mini*                                                 97.1    44.9    71.5
BIP-ALM (w/ LLaMA 2), Jin et al., '24                    82.3    58.7    70.5
...