Abstract
Automatic assessment of learning is a process in which a computer system automatically generates test items and evaluates the responses. Images are one of the major media for assessing learning capabilities. In this article, we propose a system, together with an image dataset, for evaluating a child's ability to answer pictorial questions. The system tests a set of skills such as counting ability, color concepts, and knowledge of objects and geometric shapes. It uses a pipeline of an RCNN and an LSTM, bridged by an object knowledge layer, to generate question–answer pairs, and it achieves promising results. We benchmarked our dataset with state-of-the-art deep learning methods and assessed performance on generating question–answer pairs from a given image. The source code and the dataset are available at https://github.com/skarifahmed/pic2question (to be made public after acceptance of the manuscript).
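To make the idea of image-derived question–answer pairs concrete, the following is a minimal sketch, not the pipeline described in the abstract. It assumes an object detector has already produced a list of detections, and it replaces the learned LSTM generator with fixed question templates. The `Detection` class, the `generate_qa_pairs` function, and the question wording are all hypothetical names introduced for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical stand-in for a detector's output)."""
    label: str   # object class, e.g. "apple"
    color: str   # dominant color, e.g. "red"


def generate_qa_pairs(detections):
    """Build counting and color question-answer pairs from detections.

    Template-based illustration only; the article's system generates
    questions with a learned model rather than fixed templates.
    """
    pairs = []
    counts = Counter(d.label for d in detections)

    # Counting questions: one per object class (naive plural: label + "s").
    for label in sorted(counts):
        pairs.append((f"How many {label}s are in the picture?", str(counts[label])))

    # Color questions: only when every instance of a class shares one color,
    # so the answer is unambiguous.
    for label in sorted(counts):
        colors = {d.color for d in detections if d.label == label}
        if len(colors) == 1:
            color = next(iter(colors))
            if counts[label] == 1:
                question = f"What color is the {label}?"
            else:
                question = f"What color are the {label}s?"
            pairs.append((question, color))
    return pairs


if __name__ == "__main__":
    scene = [Detection("apple", "red"), Detection("apple", "red"), Detection("ball", "blue")]
    for question, answer in generate_qa_pairs(scene):
        print(question, "->", answer)
```

Skipping color questions when instances of the same class differ in color keeps every generated answer unambiguous, which matters when the pairs are used for assessment.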
Funding
This study received no funding.
Author information
Contributions
All authors contributed equally and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
About this article
Cite this article
Das, B., Sekh, A.A., Majumder, M. et al. Can deep learning solve a preschool image understanding problem?. Neural Comput & Applic 33, 14401–14411 (2021). https://doi.org/10.1007/s00521-021-06080-w
DOI: https://doi.org/10.1007/s00521-021-06080-w