Abstract
Brain and muscular activity drive changes in the shape and position of articulators such as the tongue and lips and, as a consequence, the vocal tract assumes different configurations. Most of these changes, of both the articulators and the tract, are internal and not easy to measure, but in some cases, such as the lips or the tongue tip, the changes are visible or have visible effects. Even without the production of audible speech, these different articulator configurations provide valuable information that can be exploited by silent speech interfaces (SSIs). In this chapter, the reader finds an overview of the technologies used to assess articulatory and visual aspects of speech production, and of how researchers have exploited their capabilities for the development of silent speech interfaces.
Copyright information
© 2017 The Author(s)
Cite this chapter
Freitas, J., Teixeira, A., Dias, M.S., Silva, S. (2017). SSI Modalities II: Articulation and Its Consequences. In: An Introduction to Silent Speech Interfaces. SpringerBriefs in Electrical and Computer Engineering(). Springer, Cham. https://doi.org/10.1007/978-3-319-40174-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40173-7
Online ISBN: 978-3-319-40174-4
eBook Packages: Engineering