Can deep learning solve a preschool image understanding problem?

Das, Bidyut; Sekh, Arif Ahmed; Majumder, Mukta; Phadikar, Santanu

doi:10.1007/s00521-021-06080-w

Original Article
Published: 12 May 2021

Volume 33, pages 14401–14411 (2021)

Neural Computing and Applications

Abstract

Automatic assessment of learning is a process in which a computer system automatically generates test items and evaluates the responses. Images are one of the major media for assessing learning capabilities. In this article, we propose a system, together with a dataset of images, for evaluating a child’s ability to answer pictorial questions. The system tests a set of skills such as counting ability, color concepts, and knowledge of objects and geometric shapes. It uses a pipeline of RCNN and LSTM, bridged by an object knowledge layer, to generate question–answer pairs, and it achieved promising results. We benchmarked our dataset with state-of-the-art deep learning methods and assessed the performance of generating question–answer pairs from a given image. The source code and the dataset are available at https://github.com/skarifahmed/pic2question (to be made public after acceptance of the manuscript).
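
To make the idea of generating question–answer pairs from image content concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an object detector has already produced a list of objects with a label, a color, and a shape, and it replaces the LSTM generator described in the abstract with fixed question templates. The `Detection` class, its fields, and the `generate_qa_pairs` function are hypothetical names introduced only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object. The fields are hypothetical stand-ins for
    what a detector plus an object knowledge lookup might provide."""
    label: str  # object name, e.g. "apple"
    color: str  # dominant color, e.g. "red"
    shape: str  # geometric shape, e.g. "round"


def generate_qa_pairs(detections):
    """Build counting, color, and shape question-answer pairs.

    Template-based stand-in for a learned generator; pluralization is
    naive (appends "s") and is only meant for simple labels.
    """
    pairs = []
    counts = Counter(d.label for d in detections)

    # Counting questions: one per distinct object label.
    for label, n in sorted(counts.items()):
        pairs.append((f"How many {label}s are in the picture?", str(n)))

    # Attribute questions only for objects that appear exactly once,
    # so that the expected answer is unambiguous.
    for d in detections:
        if counts[d.label] == 1:
            pairs.append((f"What color is the {d.label}?", d.color))
            pairs.append((f"What shape is the {d.label}?", d.shape))
    return pairs


if __name__ == "__main__":
    scene = [
        Detection("apple", "red", "round"),
        Detection("apple", "green", "round"),
        Detection("ball", "blue", "round"),
    ]
    for question, answer in generate_qa_pairs(scene):
        print(question, "->", answer)
```

In this sketch the three skills named in the abstract (counting, color, and shape) each map to one template. A learned generator would presumably produce more varied phrasings than these templates can.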

Funding

This study received no funding.

Author information

Authors and Affiliations

Haldia Institute of Technology, Haldia, India
Bidyut Das
UiT The Arctic University of Norway, Tromsø, Norway
Arif Ahmed Sekh
University of North Bengal, Darjeeling, India
Mukta Majumder
Maulana Abul Kalam Azad University of Technology, West Bengal, India
Santanu Phadikar

Contributions

All authors contributed equally and approved the final manuscript.

Corresponding author

Correspondence to Bidyut Das.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Das, B., Sekh, A.A., Majumder, M. et al. Can deep learning solve a preschool image understanding problem? Neural Comput & Applic 33, 14401–14411 (2021). https://doi.org/10.1007/s00521-021-06080-w

Received: 22 July 2020
Accepted: 20 April 2021
Published: 12 May 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s00521-021-06080-w
