Abstract
This paper presents a new approach for unsupervised segmentation and labeling of acoustically homogeneous segments from speech signals. The virtual labels thus obtained are used to build unsupervised acoustic models in the absence of manual transcriptions. We refer to this approach as unsupervised speech signal-to-symbol transformation. It involves three main steps: (i) segmenting the speech signal into acoustically homogeneous regions, (ii) assigning consistent labels to acoustic segments with similar characteristics, and (iii) iteratively modeling the acoustic segments that share the same label. This work focuses on improving the initial segmentation and the acoustic segment labeling. A new kernel-Gram matrix-based approach is proposed for segmentation; it determines the number of segments automatically and achieves performance comparable to state-of-the-art algorithms. Segment labeling is formulated in a graph clustering framework. Because graph clustering methods require extensive computational resources on large datasets, a new graph-growing strategy is proposed to make the algorithm scalable. A two-stage iterative modeling alternately refines the segment boundaries and the segment labels. The proposed method achieves the highest normalized mutual information and purity on the TIMIT dataset. The quality of the virtual labels is assessed by building a language identification (LID) system for Indian languages, with a bigram language model built over the virtual phones. The LID system built using these virtual labels and the corresponding language model performs very close to a system trained on manual labels and to an i-vector-based LID system.
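The idea behind Gram-matrix-based segmentation can be illustrated with a minimal checkerboard-novelty sketch: frames within a homogeneous segment form a high-similarity block on the diagonal of the Gram matrix, and sliding a checkerboard kernel along that diagonal peaks at block transitions. This is only an illustrative sketch under that assumption, not the paper's exact algorithm; the function name `novelty_boundaries` and the toy features are hypothetical.

```python
import numpy as np

def novelty_boundaries(feats, kernel_size=8, threshold=0.5):
    """Hypothesize segment boundaries from a cosine Gram matrix of
    frame features by sliding a checkerboard kernel along its diagonal."""
    # Cosine-similarity Gram matrix of the feature frames.
    unit = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
    gram = unit @ unit.T

    # Checkerboard kernel: rewards within-segment similarity on the
    # diagonal blocks and dissimilarity on the off-diagonal blocks.
    half = kernel_size // 2
    sign = np.ones(kernel_size)
    sign[half:] = -1
    kernel = np.outer(sign, sign)

    n = len(feats)
    novelty = np.zeros(n)
    for t in range(half, n - half):
        patch = gram[t - half:t + half, t - half:t + half]
        novelty[t] = np.sum(kernel * patch)

    # Local maxima above a relative threshold are hypothesized boundaries.
    novelty = np.maximum(novelty, 0)
    if novelty.max() > 0:
        novelty /= novelty.max()
    return [t for t in range(1, n - 1)
            if novelty[t] > threshold
            and novelty[t] >= novelty[t - 1]
            and novelty[t] >= novelty[t + 1]]

# Toy example: two homogeneous 40-frame regions with different means,
# so a boundary should be hypothesized near frame 40.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (40, 13)) + 1.0,
                   rng.normal(0, 0.1, (40, 13)) - 1.0])
print(novelty_boundaries(feats))
```

Note that this sketch takes the number of segments from a fixed threshold, whereas the paper's approach determines it automatically.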
The fusion of scores from our unsupervised LID system and the i-vector system outperforms the LID system trained under the supervision of manual labels by a relative margin of 31.19%, demonstrating that unsupervised LID systems built on virtual labels can be on par with supervised systems.
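Score-level fusion of two LID systems can be sketched as follows: normalize each system's utterance-by-language score matrix to a common scale, then take a convex combination before picking the highest-scoring language. This is a minimal sketch assuming equal fusion weights and z-normalization; the paper's actual calibration and fusion procedure may differ, and all names and score values below are hypothetical.

```python
import numpy as np

def fuse_scores(scores_a, scores_b, w=0.5):
    """Late (score-level) fusion of two LID systems: z-normalize each
    utterance-by-language score matrix per language, then combine with
    a convex weight w."""
    def znorm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean(axis=0)) / (s.std(axis=0) + 1e-12)
    return w * znorm(scores_a) + (1 - w) * znorm(scores_b)

# Hypothetical scores for 3 utterances over 4 candidate languages.
virtual_label_sys = np.array([[2.1, 0.3, 0.5, 0.1],
                              [0.2, 1.9, 0.4, 0.6],
                              [0.3, 0.2, 0.1, 1.8]])
ivector_sys = np.array([[1.5, 0.7, 0.2, 0.4],
                        [0.1, 2.2, 0.3, 0.2],
                        [0.5, 0.4, 0.3, 2.0]])

fused = fuse_scores(virtual_label_sys, ivector_sys)
predictions = fused.argmax(axis=1)  # language decision per utterance
print(predictions)
```

Per-language z-normalization is one simple way to put heterogeneous scores (e.g., bigram log-likelihoods and i-vector backend scores) on a comparable scale before weighting.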
Additional information
Funding was provided by the Ministry of Human Resource Development.
Cite this article
Bhati, S., Nayak, S. & Kodukula, S.R.M. Unsupervised Speech Signal-to-Symbol Transformation for Language Identification. Circuits Syst Signal Process 39, 5169–5197 (2020). https://doi.org/10.1007/s00034-020-01408-8