International Journal of Speech Technology, Volume 21, Issue 3, pp 601–618

Distant speech processing for smart home: comparison of ASR approaches in scattered microphone network for voice command

  • Benjamin Lecouteux
  • Michel Vacher
  • François Portet


Voice command in multi-room smart homes, intended to assist people with loss of autonomy in their daily activities, faces several challenges, one of which is the distant-speech condition, which degrades ASR performance. This paper presents an overview of several techniques for fusing multi-source audio (pre-, middle-, and post-fusion) for automatic speech recognition applied to in-home voice command. Robustness of the speech models is obtained by adaptation to the environment and to the task. Experiments are based on several publicly available realistic datasets in which participants enact activities of daily life. The corpora were recorded in natural conditions, meaning that background noise is sporadic rather than extensive. The smart home is equipped with one or two microphones in each room, the distance between them being larger than 1 m. The evaluation shows that the most suitable techniques improve voice command recognition at the decoding level by using multiple sources and model adaptation. Although the Word Error Rate (WER) is between 26 and 40%, the Domotic Error Rate (identical to the WER, but at the level of the voice command) is less than 5.8% for deep neural network models; the method using Feature space Maximum Likelihood Linear Regression (fMLLR) with speaker adaptive training and Subspace Gaussian Mixture Model (SGMM) exhibits comparable results.
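To make the two metrics concrete, the following is a minimal sketch of how a word-level and a command-level error rate could be computed with a standard edit distance. It is an illustration only: the function names, the example utterances, and the command labels are hypothetical, and the command-level rate simply follows the abstract's description ("identical to the WER, but at the level of the voice command"); the exact scoring used in the paper may differ.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(refs: List[str], hyps: List[str]) -> float:
    """Total word edits divided by the total number of reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def command_error_rate(ref_cmds: List[str], hyp_cmds: List[str]) -> float:
    """Same computation, but each voice command counts as a single token
    (assumed reading of the abstract, not the paper's exact definition)."""
    return edit_distance(ref_cmds, hyp_cmds) / len(ref_cmds)


# Hypothetical utterances: two word errors, yet both commands are still recognisable.
refs = ["nestor turn on the light", "nestor close the blinds"]
hyps = ["nestor turn on light", "nestor close the blind"]
ref_cmds = ["light_on", "blinds_close"]
hyp_cmds = ["light_on", "blinds_close"]

wer = word_error_rate(refs, hyps)
der = command_error_rate(ref_cmds, hyp_cmds)
print(f"WER = {wer:.3f}, command-level error rate = {der:.3f}")
```

In this toy case the transcripts contain word errors while the intended commands are still recovered, which is one way a command-level error rate can sit well below the WER.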


Home automation, Voice command, Smart Home, Ambient assisted living, Multichannel analysis



This work is supported by the Agence Nationale de la Recherche under grant ANR-09-VERS-011. The authors would like to thank the participants who agreed to take part in the experiments.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
