Can deep learning solve a preschool image understanding problem?

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Automatic assessment of learning is a process in which a computer system automatically generates test items and evaluates the responses. Images are one of the major media used to assess learning capabilities. In this article, we propose a system, together with a dataset of images, for evaluating a child’s ability to answer pictorial questions. The system tests a set of skills: counting ability, the concept of color, and knowledge of objects and geometric shapes. It uses a pipeline of an RCNN and an LSTM, bridged by an object knowledge layer, to generate question–answer pairs, and it achieved promising results. We benchmarked our dataset with state-of-the-art deep learning methods and assessed the performance of generating question–answer pairs from a given image. The source code and the dataset are available at https://github.com/skarifahmed/pic2question (to be made public after acceptance of the manuscript).
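
To make the idea of question–answer generation from detected objects concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that an object detector has already produced a list of objects with a label, a color, and a shape, and it substitutes fixed question templates for the LSTM described above. The names `Detection` and `generate_qa_pairs` are hypothetical and do not come from the article or its repository.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical structure, assumed for this sketch)."""
    label: str  # object name, e.g. "apple"
    color: str  # dominant color, e.g. "red"
    shape: str  # geometric shape, e.g. "circle"


def generate_qa_pairs(detections):
    """Build (question, answer) pairs from detections using fixed templates.

    Templates stand in for the learned language model; this is an assumption
    made to keep the example small and self-contained.
    """
    pairs = []
    counts = Counter(d.label for d in detections)

    # Counting questions: one per distinct object label.
    for label in sorted(counts):
        question = f"How many {label} objects are in the picture?"
        pairs.append((question, str(counts[label])))

    # Color and shape questions: asked only when every object with that
    # label agrees, so the answer is unambiguous.
    for label in sorted(counts):
        members = [d for d in detections if d.label == label]
        colors = {d.color for d in members}
        shapes = {d.shape for d in members}
        if len(colors) == 1:
            pairs.append((f"What color is the {label}?", next(iter(colors))))
        if len(shapes) == 1:
            pairs.append((f"What shape is the {label}?", next(iter(shapes))))
    return pairs


if __name__ == "__main__":
    scene = [
        Detection("apple", "red", "circle"),
        Detection("apple", "red", "circle"),
        Detection("box", "blue", "square"),
    ]
    for question, answer in generate_qa_pairs(scene):
        print(question, "->", answer)
```

The sketch covers the same skill categories named in the abstract (counting, color, objects, shapes), but a template-based generator cannot vary its phrasing the way a trained sequence model can.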

Funding

This study received no funding.

Author information

Contributions

All authors contributed equally and approved the final manuscript.

Corresponding author

Correspondence to Bidyut Das.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Das, B., Sekh, A.A., Majumder, M. et al. Can deep learning solve a preschool image understanding problem? Neural Comput & Applic 33, 14401–14411 (2021). https://doi.org/10.1007/s00521-021-06080-w

  • DOI: https://doi.org/10.1007/s00521-021-06080-w
