
Unsupervised Speech Signal-to-Symbol Transformation for Language Identification

Published in: Circuits, Systems, and Signal Processing

Abstract

This paper presents a new approach for unsupervised segmentation and labeling of acoustically homogeneous segments from speech signals. The virtual labels thus obtained are used to build unsupervised acoustic models in the absence of manual transcriptions. We refer to this approach as unsupervised speech signal-to-symbol transformation. The approach involves three steps: (i) segmenting the speech signal into acoustically homogeneous regions, (ii) assigning consistent labels to acoustic segments with similar characteristics and (iii) iteratively modeling the acoustic segments that share the same label. This work focuses on improving the initial segmentation and the acoustic segment labeling. A new kernel-Gram matrix-based approach is proposed for segmentation; it determines the number of segments automatically and achieves performance comparable to state-of-the-art algorithms. Segment labeling is formulated in a graph clustering framework. Since graph clustering methods require extensive computational resources for large datasets, a new graph growing-based strategy is proposed to make the algorithm scalable. A two-stage iterative modeling refines the segment boundaries and segment labels alternately. The proposed method achieves the highest normalized mutual information and purity on the TIMIT dataset. The quality of the virtual labels is assessed by building a language identification (LID) system for Indian languages, with a bigram language model built over the virtual phones. The LID system built using these virtual labels and the corresponding language model performs very close to a system trained on manual labels and to an i-vector-based LID system.
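The segmentation step can be illustrated with a toy sketch: acoustically homogeneous regions show up as blocks along the diagonal of a kernel-Gram matrix over frame-level features, and block transitions can be detected by sliding a checkerboard kernel along that diagonal. This is an illustrative assumption-laden reconstruction (RBF kernel, checkerboard novelty, simple peak picking), not the paper's exact algorithm; `sigma`, `win` and the synthetic features are all hypothetical choices.

```python
import numpy as np

def segment_boundaries(features, sigma=1.0, win=10):
    """Toy kernel-Gram segmentation sketch (illustrative only):
    homogeneous regions form blocks on the Gram-matrix diagonal;
    a checkerboard kernel responds strongly where two blocks meet."""
    n = len(features)
    # RBF kernel-Gram matrix over frame-level feature vectors
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    gram = np.exp(-sq / (2 * sigma ** 2))
    # Checkerboard kernel: +1 within-block quadrants, -1 across-block quadrants
    checker = np.kron(np.array([[1.0, -1.0], [-1.0, 1.0]]), np.ones((win, win)))
    novelty = np.zeros(n)
    for t in range(win, n - win):
        patch = gram[t - win:t + win, t - win:t + win]
        novelty[t] = (patch * checker).sum()
    # Boundaries = local maxima of the novelty curve above a simple threshold;
    # the number of segments thus falls out automatically
    thresh = novelty.mean() + novelty.std()
    return [t for t in range(1, n - 1)
            if novelty[t] > thresh
            and novelty[t] >= novelty[t - 1]
            and novelty[t] >= novelty[t + 1]]

# Synthetic "speech": three homogeneous 60-frame regions with distinct means
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(m, 0.3, size=(60, 13)) for m in (0.0, 2.0, -2.0)])
print(segment_boundaries(feats))  # peaks expected near frames 60 and 120
```

On such clean synthetic data the novelty curve peaks only at the two true transitions; real speech would need smoothing and a trained threshold.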
The fusion of scores from our unsupervised LID system and the i-vector system outperforms the LID system built under the supervision of manual labels by a relative margin of 31.19%, demonstrating that unsupervised LID systems using virtual labels can be on par with supervised systems.
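Score-level fusion of two LID systems can be sketched as a convex combination of their per-language scores. In practice the fusion weights are trained on a development set (e.g. by logistic regression); the fixed weight `w` and the score vectors below are purely hypothetical.

```python
import numpy as np

def fuse_scores(scores_a, scores_b, w=0.5):
    """Toy score-level fusion sketch: a fixed convex combination of
    per-language scores from two systems. Real fusion would learn
    the weights on held-out data."""
    return w * np.asarray(scores_a) + (1 - w) * np.asarray(scores_b)

# Hypothetical per-language scores for one utterance
unsup = np.array([1.2, 0.4, -0.3])   # unsupervised virtual-phone system
ivec = np.array([0.8, 1.1, -0.5])    # i-vector system
fused = fuse_scores(unsup, ivec)
print(int(np.argmax(fused)))  # index of the language chosen after fusion
```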


Notes

  1. https://sites.google.com/a/iith.ac.in/shekhar-nayak/resources.


Author information


Corresponding author

Correspondence to Shekhar Nayak.

Additional information

Funding was provided by the Ministry of Human Resource Development.



About this article


Cite this article

Bhati, S., Nayak, S. & Kodukula, S.R.M. Unsupervised Speech Signal-to-Symbol Transformation for Language Identification. Circuits Syst Signal Process 39, 5169–5197 (2020). https://doi.org/10.1007/s00034-020-01408-8
