Circuits, Systems, and Signal Processing, Volume 38, Issue 8, pp 3521–3547

Automatic Hypernasality Detection in Cleft Palate Speech Using CNN

  • Xiyue Wang
  • Ming Tang
  • Sen Yang
  • Heng Yin
  • Hua Huang
  • Ling He


Automatic hypernasality detection in cleft palate speech can support diagnosis by speech-language pathologists. This paper describes a feature-independent, end-to-end algorithm that uses a convolutional neural network (CNN) to detect hypernasality in cleft palate speech, with the speech spectrogram as input. The average F1-scores for hypernasality detection are 0.9485 on a dataset of children's speech and 0.9746 on a dataset of adults' speech. The experiments explore how spectral resolution influences detection performance: higher spectral resolution highlights the vocal tract parameters of hypernasality, such as formants and spectral zeros. The CNN learns effective features through two-dimensional filtering, whereas the feature extraction capability of shallow classifiers is limited. Compared with a deep neural network and shallow classifiers, the CNN achieves the highest F1-score, 0.9485. Among the network architectures compared, the convolutional filter of size 1 × 8 achieves the highest F1-score; it takes more frequency information into account and is better suited to hypernasality detection than filters of size 3 × 3, 4 × 4, 5 × 5, and 6 × 6. An analysis of hypernasality-sensitive vowels indicates that /i/ is the vowel most sensitive to hypernasality. The proposed CNN-based system achieves better detection performance than state-of-the-art methods in the literature. An experiment on a heterogeneous corpus demonstrates that the CNN handles speech variability better than the shallow classifiers.
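To make the spectrogram input, the 1 × 8 filter, and the F1-score concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The frame length, hop size, random placeholder signal, random kernel weights, and all function names are assumptions chosen for illustration. The sketch also assumes that the 8 in "1 × 8" spans the frequency axis, which is consistent with the abstract's remark that this filter considers more frequency information.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude spectrogram with shape (time frames, frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation, the core operation of a CNN layer."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)      # placeholder, not real speech
spec = spectrogram(signal)              # 31 frames x 129 frequency bins

# Assumed orientation: 1 time frame x 8 adjacent frequency bins.
kernel_1x8 = rng.standard_normal((1, 8))
kernel_3x3 = rng.standard_normal((3, 3))

map_1x8 = conv2d_valid(spec, kernel_1x8)
map_3x3 = conv2d_valid(spec, kernel_3x3)

print(spec.shape, map_1x8.shape, map_3x3.shape)
print(f1_score(tp=90, fp=10, fn=10))    # illustrative counts only
```

Under this orientation, each output value of the 1 × 8 filter summarizes eight neighboring frequency bins within a single frame, whereas a 3 × 3 filter sees only three bins but mixes three consecutive frames.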


Keywords: Cleft palate speech · Hypernasality · Convolutional neural network · End-to-end · Speech spectrogram



This work is supported by the National Natural Science Foundation of China (Grant No. 61503264).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. College of Electrical Engineering and Information Technology, Sichuan University, Chengdu, China
  2. Hospital of Stomatology, Sichuan University, Chengdu, China
