Abstract
While Large Language Models (LLMs) can match human-level performance on many tasks, they still struggle with multi-step physics reasoning. To identify the shortcomings of existing models and to facilitate further research in this area, we curated MM-PhyQA, a novel dataset of well-constructed, high-school-level multimodal physics problems. By evaluating contemporary, publicly available LLMs on these problems, both with and without the multimodal elements, we aim to shed light on their capabilities. For questions with multimodal input (here, images and text), we generated answers with zero-shot GPT-4 and with LLaVA and LLaVA-1.5, the latter two fine-tuned on our dataset. For questions with text-only input, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7B models. We also introduce the novel Multi-Image Chain-of-Thought (MI-CoT) prompting technique; when used to train LLaVA-1.5 13B, it yielded the best results on our dataset, leading on most metrics and achieving the highest test-set accuracy of 71.65%.
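The abstract does not spell out how an MI-CoT prompt is assembled, but the core idea, pairing each image placeholder in a chain-of-thought prompt with one image so the model reasons step by step across multiple figures, can be sketched with a LLaVA-1.5 checkpoint. This is a minimal illustration under stated assumptions, not the authors' implementation: the llava-hf/llava-1.5-13b-hf checkpoint, the build_mi_cot_prompt helper, the exemplar text, and the image filenames are all hypothetical.

```python
# Hypothetical sketch of Multi-Image Chain-of-Thought (MI-CoT) prompting with
# a community LLaVA-1.5 checkpoint via Hugging Face transformers. Not the
# authors' code; multi-image prompts also depend on the transformers version.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-13b-hf"  # assumed community checkpoint

def build_mi_cot_prompt(exemplar_q: str, exemplar_cot: str, target_q: str) -> str:
    """Prepend a worked exemplar (its image and step-by-step solution) to the
    target question; each <image> placeholder is matched, in order, with one
    image in the list passed to the processor."""
    return (
        f"USER: <image>\n{exemplar_q}\n"
        f"ASSISTANT: Let's think step by step. {exemplar_cot}\n"
        f"USER: <image>\n{target_q} Let's think step by step.\n"
        "ASSISTANT:"
    )

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

prompt = build_mi_cot_prompt(
    exemplar_q="A 2 kg block rests on the incline shown. Find the normal force.",
    exemplar_cot="The incline angle is 30 deg, so N = mg cos 30 ≈ 17.0 N.",
    target_q="Using the circuit in the figure, find the current through R1.",
)
images = [Image.open("exemplar.png"), Image.open("target.png")]  # one per <image>
inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```

Ordering the images to match the placeholders is what lets a single prompt chain reasoning from the exemplar's figure to the target figure.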
Acknowledgements
Rajiv Ratn Shah is partly supported by the Infosys Center for AI, the Center of Design and New Media, and the Center of Excellence in Healthcare at IIIT Delhi.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Anand, A., et al. (2024). MM-PhyQA: Multimodal Physics Question-Answering with Multi-image CoT Prompting. In: Yang, D.-N., Xie, X., Tseng, V.S., Pei, J., Huang, J.-W., Lin, J.C.-W. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol. 14649. Springer, Singapore. https://doi.org/10.1007/978-981-97-2262-4_5
DOI: https://doi.org/10.1007/978-981-97-2262-4_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2264-8
Online ISBN: 978-981-97-2262-4
eBook Packages: Computer Science, Computer Science (R0)