Silent Speech Interaction for Ambient Assisted Living Scenarios

  • António Teixeira
  • Nuno Vitor
  • João Freitas
  • Samuel Silva
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10297)


In many Ambient Assisted Living (AAL) contexts, the speech signal cannot be used, or speech recognition performance is severely degraded by ambient noise from televisions or music players. These difficulties have motivated the exploration of Silent Speech Interfaces (SSI), which use other means to obtain information about what the user is uttering, even when no acoustic speech signal is produced.

The automatic recognition of what has been said, based only on images of the face, is the purpose of Visual Speech Recognition (VSR) systems, a type of SSI. However, despite the potential of VSR for enabling the interaction of older adults with new AAL applications, and current advances in SSI technologies, no real VSR application can be found in the literature.

Building on recent SSI work for European Portuguese, a first working VSR application targeting older adults is presented, along with results from an initial evaluation. The system performed well, enabling real-time control of a media player with an accuracy of 81.3% and classification times of around 1.3 s. At this stage, results vary from speaker to speaker, and the system performs better when words are clearly articulated. The distance between the speaker and the video sensor (a Kinect One) proved not to affect system accuracy.
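As an illustrative aside (not the authors' actual pipeline, which is not detailed here), word recognition from per-frame visual features can be sketched as template matching under dynamic time warping, a classic technique from speech recognition. The feature vectors, template dictionary, and function names below are all assumptions made for the sake of the example:

```python
import math

def dtw_distance(a, b):
    """Dynamic-time-warping cost between two feature sequences.

    a, b: lists of per-frame feature vectors (lists of floats), possibly
    of different lengths; returns the accumulated alignment cost.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

def classify(seq, templates):
    """Return the word whose stored template is nearest to seq under DTW.

    templates: dict mapping a word label to a list of example sequences.
    """
    return min(
        ((word, dtw_distance(seq, ex))
         for word, examples in templates.items()
         for ex in examples),
        key=lambda pair: pair[1],
    )[0]

# Toy usage: one-dimensional "mouth opening" trajectories standing in
# for real visual features; a rising query matches the rising template.
play = [[i / 9.0] for i in range(10)]        # rising trajectory
stop = [[1.0 - i / 9.0] for i in range(10)]  # falling trajectory
query = [[i / 7.0] for i in range(8)]        # rising, different length
predicted = classify(query, {"play": [play], "stop": [stop]})
```

DTW's alignment step is what lets templates and queries differ in duration, which matters for visual speech since articulation speed varies between speakers, echoing the speaker-to-speaker variation reported above.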


Keywords: Silent Speech Interfaces (SSI) · Visual Speech Recognition (VSR) · Ambient Assisted Living (AAL) · Elderly



Research partially funded by IEETA Research Unit funding (UID/CEC/00127/2013) and Marie Curie Actions IRIS (ref. 610986, FP7-PEOPLE-2013-IAPP). Samuel Silva acknowledges funding from FCT grant SFRH/BPD/108151/2015. The authors also thank the participants in the evaluation.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • António Teixeira (1, 2)
  • Nuno Vitor (1)
  • João Freitas (3, 4)
  • Samuel Silva (2)
  1. Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
  2. Institute of Electronics and Informatics Engineering of Aveiro (IEETA), Aveiro, Portugal
  3. Microsoft Language Development Center (MLDC), Lisbon, Portugal
  4. DefinedCrowd, Lisbon, Portugal
