A method for video categorization by analyzing text, audio, and frames

  • Original Research
International Journal of Information Technology

Abstract

A video file naturally contains audio, text metadata, and visual content in the form of frames, since a video is a series of images with sufficient motion between them. Efficient video categorization therefore needs to use and analyze all of these available resources. In this paper, we introduce a video categorization method that examines the essential elements of a video: text, audio, and frames. The proposed method consists of three modules, which analyze the text, audio, and visual contents, respectively; their analysis results are then combined to produce the final output. A set of fundamental properties is analyzed and compared with standard values acquired from a training data set to infer the genre of the video, which is then tagged with the most probable category. We have conducted different tests using the proposed method, and the simulation results show that it effectively categorizes video sequences.
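The abstract does not specify how properties are compared with the training values or how the three module outputs are combined. The following Python sketch shows one plausible reading under stated assumptions: each module scores categories by the closeness of its feature vector to per-category reference values, and the three score sets are merged by a weighted average before picking the most probable category. The function names, the distance-based similarity, and the equal default weights are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence

Scores = Dict[str, float]


def score_against_references(features: Sequence[float],
                             references: Dict[str, List[float]]) -> Scores:
    """Score each category by closeness to its reference values.

    Assumption: similarity is 1 / (1 + Euclidean distance). The paper's
    actual comparison rule is not given in the abstract.
    """
    return {category: 1.0 / (1.0 + math.dist(features, ref))
            for category, ref in references.items()}


def normalize(scores: Scores) -> Scores:
    """Rescale scores so they sum to 1 (uniform if all scores are zero)."""
    total = sum(scores.values())
    if total <= 0:
        return {c: 1.0 / len(scores) for c in scores}
    return {c: v / total for c, v in scores.items()}


def fuse_scores(text: Scores, audio: Scores, frames: Scores,
                weights: Sequence[float] = (1.0, 1.0, 1.0)) -> Scores:
    """Combine the three module outputs by a weighted average (assumed rule)."""
    modules = [normalize(text), normalize(audio), normalize(frames)]
    categories = set().union(*modules)
    weight_sum = sum(weights)
    return {
        c: sum(w * m.get(c, 0.0) for w, m in zip(weights, modules)) / weight_sum
        for c in categories
    }


def most_probable_category(fused: Scores) -> str:
    """Return the highest-scoring category; ties break alphabetically."""
    return max(sorted(fused), key=lambda c: fused[c])


# Illustrative values only; these are not results from the paper.
text_scores = {"sports": 0.7, "news": 0.3}
audio_scores = {"sports": 0.4, "news": 0.6}
frame_scores = score_against_references(
    [0.9, 0.1], {"sports": [1.0, 0.0], "news": [0.0, 1.0]}
)
fused = fuse_scores(text_scores, audio_scores, frame_scores)
label = most_probable_category(fused)
```

Combining scores at the decision level like this keeps the three modules independent, so one module can be retrained or reweighted without touching the others; whether the authors use this form of fusion is not stated in the abstract.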

Author information

Corresponding author

Correspondence to Mohammad Shamsul Arefin.

About this article

Cite this article

Amin, H.M.A., Arefin, M.S. & Dhar, P.K. A method for video categorization by analyzing text, audio, and frames. Int. j. inf. tecnol. 12, 889–898 (2020). https://doi.org/10.1007/s41870-019-00338-2

  • DOI: https://doi.org/10.1007/s41870-019-00338-2
