Multimedia Tools and Applications, Volume 62, Issue 1, pp 5–34

Evolutionary discriminative confidence estimation for spoken term detection

  • Javier Tejedor
  • Alejandro Echeverría
  • Dong Wang
  • Ravichander Vipperla


Spoken term detection (STD) is the task of searching for occurrences of spoken terms in audio archives. It relies on robust confidence estimation to make a hit/false alarm (FA) decision. To optimize the decision in terms of the STD evaluation metric, the confidence has to be discriminative. Multi-layer perceptrons (MLPs) and support vector machines (SVMs) perform well at producing discriminative confidence; however, they are severely limited by their continuous objective functions and are therefore less capable of dealing with complex decision tasks. This leads to a substantial performance reduction on out-of-vocabulary (OOV) terms, where the high diversity in term properties usually leads to a complicated decision boundary. In this paper we present a new discriminative confidence estimation approach based on evolutionary discriminant analysis (EDA). Unlike MLPs and SVMs, EDA uses the classification error as its objective function, resulting in a model optimized towards the evaluation metric. In addition, EDA combines heterogeneous projection functions and classification strategies in decision making, leading to a highly flexible classifier that is capable of dealing with complex decision tasks. Finally, the evolutionary strategy of EDA reduces the risk of local minima. We tested the EDA-based confidence on an English meeting domain corpus with a state-of-the-art phoneme-based STD system, which uses a phoneme speech recognition system to produce lattices in which the phoneme sequences corresponding to the query terms are searched. The test corpora comprise 11 h of speech recorded with individual head-mounted microphones in 30 meetings held at several institutes, including ICSI, NIST, ISL, LDC, the Virginia Polytechnic Institute and State University, and the University of Edinburgh.
The experimental results demonstrate that EDA considerably outperforms MLPs and SVMs on both classification and confidence measurement in STD, and that the advantage is more pronounced on OOV terms than on in-vocabulary (INV) terms. In terms of classification performance, EDA achieved an equal error rate (EER) of 11% on OOV terms, compared to 34% and 31% with MLPs and SVMs respectively; for INV terms, EDA obtained an EER of 15%, compared to 17% with both MLPs and SVMs. In terms of STD performance on OOV terms, EDA yielded a significant relative improvement in average term-weighted value (ATWV) of 1.4% over MLPs and 2.5% over SVMs.
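
As a rough illustration of the two metrics quoted above, the following Python sketch computes an equal error rate from a list of confidence scores and a term-weighted value from per-term detection counts. It is not the authors' evaluation code. The function names, the input layout and the toy numbers are assumptions made for this example only, and the weight beta = 999.9 and the one-trial-per-second convention are taken from the NIST STD 2006 evaluation plan rather than from this abstract. ATWV is the term-weighted value measured at the system's actual decision threshold.

```python
from typing import Dict, Sequence, Tuple


def equal_error_rate(scores: Sequence[float],
                     labels: Sequence[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for detections labelled 1 (hit) or 0 (FA).

    Hypothetical helper: sweeps every observed score as a threshold and
    keeps the one where miss rate and false alarm rate are closest.
    """
    n_hit = sum(labels)
    n_fa = len(labels) - n_hit
    if n_hit == 0 or n_fa == 0:
        raise ValueError("need at least one hit and one false alarm")
    best = None
    for thr in sorted(set(scores)):
        # A detection is accepted when its score is >= thr.
        miss = sum(1 for s, y in zip(scores, labels)
                   if y == 1 and s < thr) / n_hit
        fa = sum(1 for s, y in zip(scores, labels)
                 if y == 0 and s >= thr) / n_fa
        gap = abs(miss - fa)
        if best is None or gap < best[0]:
            best = (gap, (miss + fa) / 2.0, thr)
    return best[1], best[2]


def term_weighted_value(counts: Dict[str, Tuple[int, int, int]],
                        speech_seconds: float,
                        beta: float = 999.9) -> float:
    """Return TWV = 1 - mean over terms of (P_miss + beta * P_FA).

    counts maps term -> (n_correct, n_false_alarm, n_true). Terms with
    no true occurrence are skipped because P_miss is undefined for them.
    The number of non-target trials is approximated as
    speech_seconds - n_true (assumption: NIST STD 2006 convention).
    """
    total = 0.0
    n_terms = 0
    for n_correct, n_false_alarm, n_true in counts.values():
        if n_true == 0:
            continue
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_false_alarm / (speech_seconds - n_true)
        total += p_miss + beta * p_fa
        n_terms += 1
    if n_terms == 0:
        raise ValueError("no term with a true occurrence")
    return 1.0 - total / n_terms


# Toy data, invented for illustration (not taken from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer, threshold = equal_error_rate(toy_scores, toy_labels)

toy_counts = {"term_a": (8, 0, 10), "term_b": (5, 1, 10)}
twv = term_weighted_value(toy_counts, speech_seconds=10010.0)
```

Because beta is large, a single false alarm costs far more than a single miss in the term-weighted value, which is one reason a confidence measure tuned for EER does not automatically give the best ATWV.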


Spoken term detection; Confidence measurement; Evolutionary discriminant analysis



This work was partially supported by the French Ministry of Industry (Innovative Web call) under the contract ‘Collaborative Annotation for Video Accessibility’ (ACAV) and by ‘The Adaptable Ambient Living Assistant’ (ALIAS) project funded through the joint national Ambient Assisted Living (AAL) programme.



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Javier Tejedor (1)
  • Alejandro Echeverría (2)
  • Dong Wang (3)
  • Ravichander Vipperla (3)
  1. Human Computer Technology Laboratory, Universidad Autónoma de Madrid, Madrid, Spain
  2. Machine Learning Group, Universidad Autónoma de Madrid, Madrid, Spain
  3. Multimedia Communications Department, EURECOM, Sophia-Antipolis, France
