Multimedia Systems, Volume 18, Issue 6, pp 499–518

Speech information retrieval: a review

Regular Paper


Speech is an information-rich component of multimedia. Information can be extracted from a speech signal in a number of different ways, and there are accordingly several well-established research fields in speech signal analysis. These fields include speech recognition, speaker recognition, event detection, and fingerprinting. The information that can be extracted with the tools and methods developed in these fields can greatly enhance multimedia systems. In this paper, we present the current state of research in each of the major speech analysis fields. The goal is to introduce enough background for someone new to the field to quickly gain a high-level understanding, and to provide direction for further study.


Keywords: Speech signal processing; Speech event detection; Speech classification; Speech segmentation; Speech analysis features; Speech recognition; Speaker recognition; Indexing and retrieval; Multilingual analysis; Acoustic fingerprinting



Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  1. Pacific Northwest National Laboratory, Richland, USA
