Computer Vision for Image Understanding: A Comprehensive Review

  • Luis-Roberto Jácome-Galarza
  • Miguel-Andrés Realpe-Robalino
  • Luis-Antonio Chamba-Eras
  • Marlon-Santiago Viñán-Ludeña
  • Javier-Francisco Sinche-Freire
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1066)


Computer Vision has its own Turing test: can a machine describe the contents of an image or a video the way a human being would? In this paper, the progress of Deep Learning for image recognition is analyzed in order to answer this question. In recent years, Deep Learning has considerably increased the accuracy of many computer vision tasks. Many datasets of labeled images are now available online, which has led to pre-trained models for many computer vision applications. In this work, we gather information on the latest techniques for image understanding and description. We conclude that combining Natural Language Processing (using Recurrent Neural Networks and Long Short-Term Memory) with Image Understanding (using Convolutional Neural Networks) could enable new kinds of powerful and useful applications in which the computer answers questions about the content of images and videos. Building datasets of labeled images requires a great deal of effort, and most such datasets are built through crowdsourcing. These new applications have the potential to raise human-machine interaction to new levels of usability and user satisfaction.
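The CNN-plus-LSTM pairing the abstract refers to can be sketched in a few lines. This is a purely illustrative toy, not any system the paper reviews: the "CNN" is replaced by a random feature vector (in practice a pretrained backbone such as VGG or Inception would produce it), and all sizes, weights, and vocabulary tokens are invented, untrained stand-ins.

```python
# Toy CNN + LSTM captioning sketch: image features condition an LSTM
# language model that greedily emits one word per step. All parameters
# below are random, illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

FEAT, HIDDEN, VOCAB, EMB = 8, 16, 5, 8   # feature/hidden/vocab/embedding sizes
vocab = ["<start>", "a", "dog", "runs", "<end>"]

# Stand-in for CNN image features (normally the output of a conv backbone).
image_feature = rng.normal(size=FEAT)

# LSTM cell parameters: input at each step is [word embedding ; image feature].
W = rng.normal(scale=0.1, size=(4 * HIDDEN, EMB + FEAT + HIDDEN))
b = np.zeros(4 * HIDDEN)
embed = rng.normal(scale=0.1, size=(VOCAB, EMB))     # word embeddings
W_out = rng.normal(scale=0.1, size=(VOCAB, HIDDEN))  # hidden -> vocab logits

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    """One LSTM step: the four gates are computed jointly from [x ; h]."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)                # input, forget, output, candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Greedy decoding: feed the previous word plus the image feature each step.
h, c = np.zeros(HIDDEN), np.zeros(HIDDEN)
word = 0  # <start>
caption = []
for _ in range(6):
    x = np.concatenate([embed[word], image_feature])
    h, c = lstm_step(x, h, c)
    word = int(np.argmax(W_out @ h))
    if vocab[word] == "<end>":
        break
    caption.append(vocab[word])

print(" ".join(caption))
```

With trained weights instead of random ones, the same loop is the core of the "Show and Tell"-style caption generators surveyed in this paper; the design choice worth noting is that the image feature is injected at every step, which is one of several common conditioning schemes.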


Computer vision · Deep Learning · Image understanding · CNN · Scene recognition · Object classification



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Escuela Superior Politécnica del Litoral, Facultad de Ingeniería en Electricidad y Computación, CIDIS, Guayaquil, Ecuador
  2. Universidad Nacional de Loja, Grupo de Investigación en Tecnologías de la Información y Comunicación (GITIC), Carrera de Ingeniería en Sistemas, Loja, Ecuador
  3. Universidad Nacional de Loja, Carrera de Ingeniería en Sistemas, Loja, Ecuador
