Multimedia Tools and Applications, Volume 78, Issue 2, pp 1495–1510

Spoken keyword search system using improved ASR engine and novel template-based keyword scoring

  • Ilyes Rebai
  • Yassine Ben Ayed
  • Walid Mahdi


Keyword search for spoken documents has become increasingly important as the amount of spoken data grows. The typical system combines an Automatic Speech Recognition (ASR) system with information retrieval methods. Although a number of studies have sought to optimize system performance, KeyWord Search (KWS) systems still suffer from two main drawbacks. First, system performance depends strongly on the ASR transcripts, which are inherently inexact: because of the variability of the speech signal, ASR systems are still far from robust. Second, KWS systems make detection decisions based on the lattice-based posterior probability, which is not comparable across keywords. In addition, the posterior probabilities of true detections usually fall into different ranges, which degrades spotting performance. This paper addresses both problems: the accuracy of ASR transcriptions and the keyword detection decision based on posterior probabilities. More specifically, we propose to improve the accuracy of the ASR transcripts by introducing a new ASR architecture that integrates data augmentation and ensemble learning techniques into a single framework. We also propose a novel keyword rescoring method that provides scores from a new perspective. Inspired by the template-based KWS approach, similarity scores between the detected keywords are computed from the distance between their acoustic features and are used as new scores for the detection decision. Experiments on French and English datasets show that the proposed KWS system potentially leads to more accurate keyword results than conventional systems.
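The rescoring idea described above can be illustrated with a minimal sketch. The abstract states only that similarity scores are derived from the distance between acoustic features; the use of dynamic time warping (DTW), the Euclidean frame distance, the length normalization, and the mapping `1 / (1 + d)` are all assumptions made here for illustration, and the function names `dtw_distance` and `rescore` are hypothetical rather than taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature frames of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Length-normalised DTW alignment cost between two frame sequences.

    Assumption: DTW stands in for the unspecified acoustic distance.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def rescore(detections):
    """Score each detection of one keyword by its similarity to the others.

    `detections` is a list of frame sequences (one per hypothesised
    occurrence). The score is 1 / (1 + mean DTW distance to the other
    detections), an illustrative mapping into (0, 1].
    """
    if len(detections) < 2:
        raise ValueError("need at least two detections to compare")
    scores = []
    for i, det in enumerate(detections):
        dists = [dtw_distance(det, other)
                 for j, other in enumerate(detections) if j != i]
        scores.append(1.0 / (1.0 + sum(dists) / len(dists)))
    return scores


if __name__ == "__main__":
    # Toy 2-dimensional "features": two similar detections and one outlier.
    hits = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [1.0, 1.0]],
        [[5.0, 5.0], [6.0, 6.0]],
    ]
    print(rescore(hits))
```

In this toy example the two acoustically similar detections receive a higher score than the outlier, which is the behaviour one would want from a score that is comparable across keywords; a real system would operate on features such as MFCCs or bottleneck features and would need a tuned decision threshold.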


Keywords

  • Speech recognition
  • Keyword spotting
  • Template-based keyword rescoring
  • Acoustic model fusion
  • Score normalization




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. MIRACL: Multimedia InfoRmation System and Advanced Computing Laboratory, University of Sfax, Sfax, Tunisia
  2. College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
