Abstract
Text in videos contains rich semantic information, which is useful for content-based video understanding and retrieval. Although many state-of-the-art methods have been proposed to detect text in images and videos, few works focus on spatiotemporal text localization in videos. In this paper, we present a spatiotemporal text localization method with improved detection efficiency and performance. Concretely, we propose a unified framework consisting of the sampling-and-recovery model (SaRM) and the divide-and-conquer model (DaCM). SaRM exploits the temporal redundancy of text to increase detection efficiency for videos. DaCM is designed to efficiently localize text in the spatial and temporal domains simultaneously. In addition, we construct a challenging video overlaid text dataset named UCAS-STLData, which contains 57,070 frames with spatiotemporal ground truths. In the experiments, we comprehensively evaluate the proposed method on publicly available overlaid text datasets and on UCAS-STLData. Compared with state-of-the-art methods for spatiotemporal text localization, the proposed method achieves a slight performance improvement and a significant efficiency improvement.
Notes
The dataset will be publicly available soon for researchers.
In general, videos are played at 25 frames per second.
Acknowledgments
This work is supported by the National Key R&D Program of China under contract No. 2017YFB1002203 and by the National Natural Science Foundation of China (NSFC) under Grant No. 61772495.
About this article
Cite this article
Cai, Y., Wang, W., Huang, S. et al. Spatiotemporal text localization for videos. Multimed Tools Appl 77, 29323–29345 (2018). https://doi.org/10.1007/s11042-018-6081-7
DOI: https://doi.org/10.1007/s11042-018-6081-7