Abstract
While Large Language Models (LLMs) can match human-level performance on many tasks, they still struggle with multi-step physics reasoning. To identify the shortcomings of existing models and to facilitate further research in this area, we curated MM-PhyQA, a novel dataset of well-constructed, high-school-level multimodal physics problems. By evaluating contemporary, publicly available LLMs on these problems, both with and without the multimodal elements, we aim to shed light on their capabilities. For questions with multimodal input (here, images and text), we generated answers with zero-shot GPT-4 and with LLaVA and LLaVA-1.5, the latter two fine-tuned on our dataset. For questions with text-only input, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7B models. We also introduce the novel Multi-Image Chain-of-Thought (MI-CoT) prompting technique; when used to train LLaVA-1.5 13B, it yielded the best results on our dataset, leading on most metrics and achieving the highest test-set accuracy of 71.65%.
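The abstract does not spell out how an MI-CoT prompt is assembled, but the core idea, pairing each image placeholder in a chain-of-thought prompt with one image so the model reasons step by step across multiple figures, can be sketched with a LLaVA-1.5 checkpoint. This is a minimal illustration under stated assumptions, not the authors' implementation: the llava-hf/llava-1.5-13b-hf checkpoint, the build_mi_cot_prompt helper, the exemplar text, and the image filenames are all hypothetical.

```python
# Hypothetical sketch of Multi-Image Chain-of-Thought (MI-CoT) prompting with
# a community LLaVA-1.5 checkpoint via Hugging Face transformers. Not the
# authors' code; multi-image prompts also depend on the transformers version.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-13b-hf"  # assumed community checkpoint

def build_mi_cot_prompt(exemplar_q: str, exemplar_cot: str, target_q: str) -> str:
    """Prepend a worked exemplar (its image and step-by-step solution) to the
    target question; each <image> placeholder is matched, in order, with one
    image in the list passed to the processor."""
    return (
        f"USER: <image>\n{exemplar_q}\n"
        f"ASSISTANT: Let's think step by step. {exemplar_cot}\n"
        f"USER: <image>\n{target_q} Let's think step by step.\n"
        "ASSISTANT:"
    )

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

prompt = build_mi_cot_prompt(
    exemplar_q="A 2 kg block rests on the incline shown. Find the normal force.",
    exemplar_cot="The incline angle is 30 deg, so N = mg cos 30 ≈ 17.0 N.",
    target_q="Using the circuit in the figure, find the current through R1.",
)
images = [Image.open("exemplar.png"), Image.open("target.png")]  # one per <image>
inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```

Ordering the images to match the placeholders is what lets a single prompt chain reasoning from the exemplar's figure to the target figure.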
Acknowledgements
Rajiv Ratn Shah is partly supported by the Infosys Center for AI, the Center of Design and New Media, and the Center of Excellence in Healthcare at IIIT Delhi.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Anand, A., et al. (2024). MM-PhyQA: Multimodal Physics Question-Answering with Multi-image CoT Prompting. In: Yang, D.-N., Xie, X., Tseng, V.S., Pei, J., Huang, J.-W., Lin, J.C.-W. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol. 14649. Springer, Singapore. https://doi.org/10.1007/978-981-97-2262-4_5
DOI: https://doi.org/10.1007/978-981-97-2262-4_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2264-8
Online ISBN: 978-981-97-2262-4
eBook Packages: Computer Science, Computer Science (R0)