Improving Audiovisual Content Annotation Through a Semi-automated Process Based on Deep Learning
In recent years, Deep Learning has become one of the most popular research fields of Artificial Intelligence. Several approaches have been developed to address conventional challenges of AI. In computer vision, these methods provide the means to solve tasks such as image classification, object identification and feature extraction.
In this paper, several approaches to face detection and recognition are presented and analyzed in order to identify the one with the best performance. The main objective is to automate the annotation of a large dataset and to avoid the costly and time-consuming process of manual content annotation. The approach follows the concept of incremental learning, and an R-CNN model was implemented. Tests were conducted with the objective of detecting and recognizing one personality within image and video content.
The results of this initial automatic process are then made available to an auxiliary tool that enables further validation of the annotations before they are uploaded to the archive.
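The hand-off from automatic detection to manual validation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Detection` schema, the `review_queue` and `validated` helpers, the sample values, and the idea of presenting the least confident detections first are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Detection:
    """One automatic annotation proposed by the detector (hypothetical schema)."""
    frame: int
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    label: str
    score: float  # detector confidence in [0, 1]


def review_queue(detections: List[Detection]) -> List[Detection]:
    """Order detections so the least confident ones are validated first (assumed policy)."""
    return sorted(detections, key=lambda d: d.score)


def validated(detections: List[Detection],
              decisions: Dict[Detection, bool]) -> List[Detection]:
    """Keep only detections that a human reviewer explicitly approved."""
    return [d for d in detections if decisions.get(d, False)]


# Made-up detections of a single personality across three video frames.
detections = [
    Detection(frame=10, box=(40, 30, 120, 140), label="person_A", score=0.95),
    Detection(frame=11, box=(42, 31, 121, 139), label="person_A", score=0.55),
    Detection(frame=57, box=(200, 80, 260, 150), label="person_A", score=0.30),
]

queue = review_queue(detections)
# Simulated reviewer: rejects the weakest detection and approves the others.
decisions = {queue[0]: False, queue[1]: True, queue[2]: True}
to_upload = validated(detections, decisions)
print([d.frame for d in to_upload])
```

In an incremental learning setting, the approved annotations in `to_upload` could also be added to the training set for the next round of model training, although how the paper organizes that feedback loop is not stated here.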
Tests show that, even with a small dataset, the results obtained are satisfactory.
Keywords: Content annotation, Computer Vision, Machine Learning, Deep Learning, Object detection, Facial detection, Facial recognition
The work presented was partially supported by the following projects: FourEyes, a Research Line within project “TEC4Growth: Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020” financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF); FotoInMotion, funded by the H2020 Framework Programme of the European Commission.