Improving Audiovisual Content Annotation Through a Semi-automated Process Based on Deep Learning

  • Luís Vilaça
  • Paula Viana
  • Pedro Carvalho
  • Teresa Andrade
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 942)


In recent years, Deep Learning has become one of the most active research fields within Artificial Intelligence, and several approaches have been developed to address conventional AI challenges. In computer vision, these methods provide the means to solve tasks such as image classification, object detection and feature extraction.

In this paper, several approaches to face detection and recognition are presented and analyzed in order to identify the one with the best performance. The main objective is to automate the annotation of a large dataset and thus avoid the costly and time-consuming process of manual content annotation. The approach follows the concept of incremental learning, and an R-CNN model was implemented. Tests were conducted with the objective of detecting and recognizing one personality within image and video content.

Results coming from this initial automatic process are then made available to an auxiliary tool that enables further validation of the annotations prior to uploading them to the archive.
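The semi-automated flow described above (automatic detection and recognition, with low-confidence results queued for manual validation that in turn feeds the incremental-learning loop) can be sketched roughly as follows. This is a minimal illustration only: the detector and recognizer are hypothetical stubs, and the names, thresholds and data structures are assumptions rather than the paper's actual R-CNN implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Annotation:
    frame: int
    box: Tuple[int, int, int, int]  # (x, y, w, h) face bounding box
    label: str                      # recognized identity
    confidence: float               # recognizer score in [0, 1]
    validated: bool = False

def detect_faces(frame) -> List[Tuple[int, int, int, int]]:
    """Hypothetical stand-in for an R-CNN-style face detector."""
    # A real system would run a trained region-based CNN on the frame.
    return [(10, 10, 64, 64)]

def recognize(box) -> Tuple[str, float]:
    """Hypothetical recognizer returning (identity, confidence)."""
    return ("target_personality", 0.93)

def annotate_video(frames, threshold=0.8):
    """Automatic pass: confident annotations are kept, the rest queued for review."""
    auto, review = [], []
    for i, frame in enumerate(frames):
        for box in detect_faces(frame):
            label, conf = recognize(box)
            ann = Annotation(i, box, label, conf)
            (auto if conf >= threshold else review).append(ann)
    return auto, review

def incorporate_feedback(review: List[Annotation], accepted: List[bool]):
    """Manually validated annotations become new samples for incremental retraining."""
    new_samples = []
    for ann, ok in zip(review, accepted):
        ann.validated = True
        if ok:
            new_samples.append(ann)  # would be added to the training set
    return new_samples
```

The design point is the confidence threshold: it splits the output between annotations that go straight to the archive and those routed to the auxiliary validation tool, whose corrections then enlarge the training set.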

Tests show that, even with a small dataset, the results obtained are satisfactory.


Content annotation · Computer Vision · Machine Learning · Deep Learning · Object detection · Facial detection · Facial recognition



The work presented was partially supported by the following projects: FourEyes, a Research Line within project “TEC4Growth: Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020”, financed by the North Portugal Regional Operational Programme (NORTE 2020) under the PORTUGAL 2020 Partnership Agreement and through the European Regional Development Fund (ERDF); and FotoInMotion, funded by the H2020 Framework Programme of the European Commission.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Luís Vilaça (1)
  • Paula Viana (1, 2)
  • Pedro Carvalho (2)
  • Teresa Andrade (2, 3)
  1. School of Engineering, Polytechnic of Porto, Porto, Portugal
  2. INESC TEC, Porto, Portugal
  3. Faculty of Engineering, University of Porto, Porto, Portugal
