International Journal of Speech Technology, Volume 19, Issue 4, pp 669–675

Stream fusion for multi-stream automatic speech recognition

  • Hesam Sagha
  • Feipeng Li
  • Ehsan Variani
  • José del R. Millán
  • Ricardo Chavarriaga
  • Björn Schuller


Abstract

Multi-stream automatic speech recognition (MS-ASR) has been shown to improve recognition performance in noisy conditions. In such systems, the generation and the fusion of the streams are the essential parts and must be designed to reduce the effect of noise on the final decision. This paper shows how to improve MS-ASR performance by addressing two questions: (1) how many streams should be combined, and (2) how should they be combined. First, we propose a novel approach based on stream reliability for selecting the number of streams to be fused. Second, a fusion method based on parallel hidden Markov models is introduced. Applying the method to two datasets (TIMIT and RATS) under different noise conditions, we demonstrate improved MS-ASR performance.
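The two steps sketched in the abstract — ranking streams by a reliability score and then fusing the selected streams — can be illustrated schematically. The snippet below is a minimal sketch only, not the paper's method: it assumes each stream yields frame-level class posteriors and a scalar reliability score (the paper's actual reliability measure and PHMM fusion are not reproduced here), and fuses the top-k streams by reliability-weighted averaging.

```python
import numpy as np

def select_and_fuse(posteriors, reliabilities, k):
    """Select the k most reliable streams and fuse their frame-level
    posteriors by reliability-weighted averaging (illustrative only)."""
    order = np.argsort(reliabilities)[::-1][:k]        # indices of the k most reliable streams
    w = np.asarray([reliabilities[i] for i in order], dtype=float)
    w /= w.sum()                                       # normalise the fusion weights
    stacked = np.stack([posteriors[i] for i in order]) # shape (k, frames, classes)
    fused = np.tensordot(w, stacked, axes=1)           # weighted average over streams
    return fused, order

# Toy example: 3 streams, 4 frames, 2 phone classes
rng = np.random.default_rng(0)
posteriors = [rng.dirichlet([1.0, 1.0], size=4) for _ in range(3)]
reliabilities = [0.2, 0.9, 0.5]
fused, chosen = select_and_fuse(posteriors, reliabilities, k=2)
```

Because the fused output is a convex combination of per-stream posteriors, each frame's fused scores still form a valid distribution, which keeps the result usable by a downstream decoder.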


Keywords

Multi-stream speech recognition · Performance monitor · Classifier ensemble creation and fusion



Acknowledgements

The authors would like to thank Professor Hynek Hermansky for his valuable comments.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Hesam Sagha (1)
  • Feipeng Li (2, 3)
  • Ehsan Variani (2, 4)
  • José del R. Millán (5)
  • Ricardo Chavarriaga (5)
  • Björn Schuller (1, 6)

  1. Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany
  2. Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA
  3. Apple Inc., San Francisco Bay Area, USA
  4. Google, San Francisco Bay Area, USA
  5. Defitech Chair in Brain-Machine Interface, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  6. Department of Computing, Imperial College, London, UK
