A video indexing and retrieval computational prototype based on transcribed speech

Multimedia Tools and Applications

Abstract

Voice interaction with computer systems is attractive in medicine and other areas because it is friendly and flexible. Video indexing and retrieval have benefited from this resource, yet few initiatives use speech recognition to support both tasks. This work develops and evaluates a prototype system that indexes and retrieves videos from speech transcription. The user narrates each video’s content, and the system captures the utterance, transcribes it into text and timestamps it. Simple text processing techniques are then applied to the transcript before indexing. Afterward, the user can query by speech or text to find relevant, previously indexed videos. We evaluated the prototype experimentally on two sets of public videos, one with 50 videos and one with 10. One collaborator manually narrated the 50 videos, while four others narrated a subset of 13 videos. An automatic narration scheme was also applied to this subset and to the set of 10 videos. The evaluation showed promising results for Brazilian Portuguese speech recognition and for retrieval performance: the average word error rate was as low as 0.03 and the mean average precision was as high as 1.00. Besides performing well, the tool is flexible, since few changes are required to support other languages.
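
To make the speech recognition metric concrete, the following is a minimal sketch of how a word error rate (WER) is commonly computed: the word-level edit distance between a reference text and the recognized transcript, divided by the number of reference words. The function name and the lowercase, whitespace-based tokenization are assumptions for illustration; the article page does not state how the prototype normalizes text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: texts are lowercased and split on whitespace. The
    normalization actually used by the authors is not given here.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deletion ("base") and one substitution ("aprovado" -> "aprovada")
    # over five reference words.
    print(word_error_rate("o texto base foi aprovado", "o texto foi aprovada"))
```

Under this definition a WER of 0.03 corresponds to roughly three word errors per hundred reference words.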

Data availability

The videos we used are publicly available in Portuguese at the links reported in Sect. 3.3. More information on the data or the prototype’s configuration files is available upon request from the authors.

Code availability

Currently, the code is not available.

Notes

  1. http://tiny.cc/5x5gvy

  2. https://astah.net/products/astah-community/

  3. https://www.google.com/chrome/demos/speech.html

  4. https://www.apachefriends.org/index.html

  5. http://tomcat.apache.org/

  6. https://www.ffmpeg.org

  7. https://www.eclipse.org

  8. https://netbeans.org/

  9. https://www.oracle.com/br/virtualization/virtualbox

  10. https://www.docker.com

  11. https://www.audacityteam.org

  12. https://www.camara.leg.br/tv

  13. http://tvines.org.br

  14. https://www12.senado.leg.br/tv#/

  15. https://cultura.uol.com.br/programas/rodaviva/

  16. http://www.kurento.org/

  17. https://github.com/mozilla/DeepSpeech

Funding

We would like to thank the Araucária Foundation for the Support of the Scientific and Technological Development of Paraná for a Research and Technological Productivity Scholarship for H. D. Lee (grant 028/2019). We would also like to thank PGEEC/UNIOESTE for a postdoctoral scholarship for N. Spolaôr, the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001 for an MSc scholarship for L. A. Ensina, and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for grant number 142050/2019-9 for A. R. S. Parmezan. These agencies had no further involvement in this paper.

Author information

Corresponding author

Correspondence to Newton Spolaôr.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Calibration text in Brazilian Portuguese

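Tudo indica que a Reforma da Previdência será o tema de destaque. A proposta deve começar a ser discutida no plenário na quinta-feira, mas a expectativa de votação é só pra semana que vem. Está na pauta do plenário ainda uma proposta que parcela dívidas dos produtores rurais com a previdência, que substitui uma medida provisória que perdeu a validade. O texto base foi aprovado.

English translation: Everything indicates that the Pension Reform will be the featured topic. The proposal should begin to be discussed in the plenary on Thursday, but the vote is only expected next week. Also on the plenary agenda is a proposal to pay rural producers' social security debts in installments, which replaces a provisional measure that has expired. The base text was approved.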

1.2 Queries used for video retrieval

Table 12 First part of the queries applied in the databases built from news videos
Table 13 Second part of the queries applied in the databases built from news videos
Table 14 Third part of the queries applied in the databases built from news videos
Table 15 Fourth part of the queries applied in the databases built from news videos
Table 16 Queries applied in the database built from talk show videos

1.3 Speech recognition results

Table 17 WER associated with each transcript used to build the M1_50 database
Table 18 WER associated with each transcript used to build the M1_13 database
Table 19 WER associated with each transcript used to build the M2_13 database
Table 20 WER associated with each transcript used to build the M3_13 database
Table 21 WER associated with each transcript used to build the M4_13 database
Table 22 WER associated with each transcript used to build the M5_13 database
Table 23 WER associated with each transcript used to build the R_13 database
Table 24 WER associated with each transcript used to build the R_10 database

1.4 Video retrieval results

Table 25 Retrieval performance associated with each video indexed in the M1_50 database. The retrieval time (rt) in seconds is also reported
Table 26 Retrieval performance associated with each video indexed in the M1_13 database. The retrieval time (rt) in seconds is also reported
Table 27 Retrieval performance associated with each video indexed in the M2_13 database. The retrieval time (rt) in seconds is also reported
Table 28 Retrieval performance associated with each video indexed in the M3_13 database. The retrieval time (rt) in seconds is also reported
Table 29 Retrieval performance associated with each video indexed in the M4_13 database. The retrieval time (rt) in seconds is also reported
Table 30 Retrieval performance associated with each video indexed in the M5_13 database. The retrieval time (rt) in seconds is also reported
Table 31 Retrieval performance associated with each video indexed in the R_13 database. The retrieval time (rt) in seconds is also reported
Table 32 Retrieval performance associated with each video indexed in the R_10 database. The retrieval time (rt) in seconds is also reported
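
The abstract summarizes retrieval quality as mean average precision (MAP). As a minimal sketch under stated assumptions, the standard definition can be written as follows: for each query, precision is taken at every rank where a relevant video appears and averaged over the number of relevant videos; MAP is the mean of these values over all queries. The function names and video identifiers are hypothetical, and each ranking is assumed to list a video at most once. Whether the prototype's evaluation follows exactly this variant is not stated on this page.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """Average of precision@k over the ranks k that hold a relevant video."""
    relevant: Set[str] = set(relevant_ids)
    if not relevant:
        raise ValueError("a query needs at least one relevant video")
    hits = 0
    precision_sum = 0.0
    for rank, video_id in enumerate(ranked_ids, start=1):
        if video_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant videos that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Sequence[Tuple[Sequence[str], Iterable[str]]]) -> float:
    """Mean of the per-query average precision values."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical queries: (ranking returned by the system, relevant videos).
    runs = [
        (["v1", "v2"], {"v1"}),  # relevant video ranked first
        (["v2", "v1"], {"v1"}),  # relevant video ranked second
    ]
    print(mean_average_precision(runs))
```

A MAP of 1.00 therefore means that, for every query, all relevant videos were ranked ahead of every non-relevant one.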

About this article

Cite this article

Spolaôr, N., Lee, H.D., Takaki, W.S.R. et al. A video indexing and retrieval computational prototype based on transcribed speech. Multimed Tools Appl 80, 33971–34017 (2021). https://doi.org/10.1007/s11042-021-11401-1

  • DOI: https://doi.org/10.1007/s11042-021-11401-1
