Video Indexing System Based on Multimodal Information Extraction Using Combination of ASR and OCR

  • Conference paper
Big-Data-Analytics in Astronomy, Science, and Engineering (BDA 2021)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 13167))

Abstract

With internet penetration increasing across the world, the volume of content on the World Wide Web has surged, and video has proven to be one of the most popular media. The COVID-19 pandemic has accelerated this trend by forcing learners to turn to e-learning platforms. When videos lack relevant descriptions, metadata must be generated from the content of the video itself. This paper attempts to index videos based on their visual and audio content. The visual content is extracted by applying Optical Character Recognition (OCR) to the stack of frames obtained from a video, while the audio content is transcribed using Automatic Speech Recognition (ASR). The OCR- and ASR-generated texts are combined to obtain the final description of each video. The dataset contains 400 videos spread across 4 genres. To quantify the accuracy of the descriptions, clustering is performed on the video descriptions to test how well they separate the genres.
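The combine-then-cluster idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the OCR and ASR texts are merged or which clustering algorithm is used, so everything here is an assumption made for illustration: plain concatenation as the merge rule, TF-IDF weighting, a small cosine-similarity k-means with a deterministic farthest-point initialisation, and four hard-coded toy "videos" in two genres standing in for real OCR and ASR output.

```python
import math
import re
from collections import Counter


def combine(ocr_text, asr_text):
    """Merge the two modalities by plain concatenation (an assumed rule)."""
    return f"{ocr_text} {asr_text}".strip()


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned as is."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def tfidf_vectors(docs):
    """Turn each description into a unit-length TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [
            tf[t] * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0) for t in vocab
        ]
        vectors.append(normalize(weights))
    return vectors


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Cluster unit vectors by cosine similarity and return one label each."""
    # Deterministic start: the first vector, then repeatedly the vector
    # that is least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(dot(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its most similar centroid.
        labels = [max(range(k), key=lambda j: dot(v, centroids[j])) for v in vectors]
        # Move each centroid to the normalised mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels


# Toy (OCR text, ASR text) pairs; real input would come from OCR and ASR tools.
videos = [
    ("Binary Search Tree insertion", "today we insert a node into the binary search tree"),
    ("Tree traversal inorder", "we traverse the binary tree in order"),
    ("Pasta recipe ingredients", "boil the pasta and add tomato sauce"),
    ("Tomato sauce recipe", "chop the tomato and simmer the sauce for the pasta"),
]
descriptions = [combine(ocr, asr) for ocr, asr in videos]
labels = kmeans(tfidf_vectors(descriptions), k=2)
print(labels)
```

On this toy input the two data-structure lectures share one label and the two cooking videos share the other. With labelled genres available, the same labels could be compared against the ground truth to score the descriptions.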

Notes

  1. https://www.youtube.com/

  2. https://pypi.org/project/pytube/

  3. https://pypi.org/project/pytesseract/

  4. https://www.nltk.org/

  5. https://cloud.ibm.com/apidocs/speech-to-text

  6. https://pypi.org/project/Whoosh/

Author information

Correspondence to Sandeep Varma, Arunanshu Pandey, Shivam, Soham Das or Soumya Deep Roy.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Varma, S., Pandey, A., Shivam, Das, S., Roy, S.D. (2022). Video Indexing System Based on Multimodal Information Extraction Using Combination of ASR and OCR. In: Sachdeva, S., Watanobe, Y., Bhalla, S. (eds) Big-Data-Analytics in Astronomy, Science, and Engineering. BDA 2021. Lecture Notes in Computer Science, vol 13167. Springer, Cham. https://doi.org/10.1007/978-3-030-96600-3_14

  • DOI: https://doi.org/10.1007/978-3-030-96600-3_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96599-0

  • Online ISBN: 978-3-030-96600-3

  • eBook Packages: Computer Science, Computer Science (R0)
