Abstract
A video file contains audio, text metadata, and visual content in the form of frames, that is, a sequence of images that conveys motion. Effective video categorization should use and analyze all of these resources. In this paper, we introduce a video categorization method that examines the essential elements of a video: text, audio, and frames. The proposed method consists of three modules, which analyze the text, audio, and visual content respectively; their results are then combined to produce the final output. A set of fundamental properties is analyzed and compared with standard values acquired from a training data set to determine the genre of each video, which is then tagged with the most probable category. We have conducted several tests with the proposed method, and the simulation results show that it effectively categorizes video sequences.
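The combination step described above can be illustrated with a minimal sketch. The abstract does not state how the three module outputs are merged, so the weighted average used here is an assumption, not the authors' fusion rule. The function names (`fuse_scores`, `most_probable_category`), the equal default weights, and the representation of each module's output as a category-to-score mapping are all hypothetical.

```python
from typing import Dict, Sequence

# Each module is assumed to emit one score per candidate category.
Scores = Dict[str, float]


def fuse_scores(text: Scores, audio: Scores, visual: Scores,
                weights: Sequence[float] = (1.0, 1.0, 1.0)) -> Scores:
    """Combine per-category scores from the three modules.

    Assumption: a weighted average. A category missing from one
    module's output contributes a score of 0.0 for that module.
    """
    w_text, w_audio, w_visual = weights
    total = w_text + w_audio + w_visual
    # Sort so that iteration order, and thus tie-breaking, is deterministic.
    categories = sorted(set(text) | set(audio) | set(visual))
    return {
        c: (w_text * text.get(c, 0.0)
            + w_audio * audio.get(c, 0.0)
            + w_visual * visual.get(c, 0.0)) / total
        for c in categories
    }


def most_probable_category(fused: Scores) -> str:
    """Return the category with the highest fused score."""
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    text_scores = {"music": 0.6, "sports": 0.4}
    audio_scores = {"music": 0.7, "sports": 0.3}
    visual_scores = {"music": 0.4, "sports": 0.6}
    fused = fuse_scores(text_scores, audio_scores, visual_scores)
    print(most_probable_category(fused))
```

With unequal weights, a more reliable module can be given more influence; for example, `weights=(0.0, 0.0, 1.0)` makes the decision depend on the visual module alone.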
Cite this article
Amin, H.M.A., Arefin, M.S. & Dhar, P.K. A method for video categorization by analyzing text, audio, and frames. Int. j. inf. tecnol. 12, 889–898 (2020). https://doi.org/10.1007/s41870-019-00338-2
DOI: https://doi.org/10.1007/s41870-019-00338-2