Abstract
This paper presents a new approach for unsupervised segmentation and labeling of acoustically homogeneous segments from speech signals. The virtual labels thus obtained are used to build unsupervised acoustic models in the absence of manual transcriptions. We refer to this approach as unsupervised speech signal-to-symbol transformation. It involves three main steps: (i) segmenting the speech signal into acoustically homogeneous regions, (ii) assigning consistent labels to acoustic segments with similar characteristics, and (iii) iteratively modeling the acoustic segments that share the same label. This work focuses on improving the initial segmentation and the acoustic segment labeling. A new kernel-Gram matrix-based approach is proposed for segmentation; it determines the number of segments automatically and achieves performance comparable to state-of-the-art algorithms. Segment labeling is formulated in a graph clustering framework. Because graph clustering methods require extensive computational resources on large datasets, a new graph-growing strategy is proposed to make the algorithm scalable. A two-stage iterative modeling alternately refines the segment boundaries and the segment labels. The proposed method achieves the highest normalized mutual information and purity on the TIMIT dataset. The quality of the virtual labels is assessed by building a language identification (LID) system for Indian languages, with a bigram language model built over the virtual phones. The LID system built using these virtual labels and the corresponding language model performs very close to a system trained on manual labels and to an i-vector-based LID system.
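The idea behind Gram-matrix-based segmentation can be illustrated with a minimal checkerboard-novelty sketch: frames within a homogeneous segment form a high-similarity block on the diagonal of the Gram matrix, and sliding a checkerboard kernel along that diagonal peaks at block transitions. This is only an illustrative sketch under that assumption, not the paper's exact algorithm; the function name `novelty_boundaries` and the toy features are hypothetical.

```python
import numpy as np

def novelty_boundaries(feats, kernel_size=8, threshold=0.5):
    """Hypothesize segment boundaries from a cosine Gram matrix of
    frame features by sliding a checkerboard kernel along its diagonal."""
    # Cosine-similarity Gram matrix of the feature frames.
    unit = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
    gram = unit @ unit.T

    # Checkerboard kernel: rewards within-segment similarity on the
    # diagonal blocks and dissimilarity on the off-diagonal blocks.
    half = kernel_size // 2
    sign = np.ones(kernel_size)
    sign[half:] = -1
    kernel = np.outer(sign, sign)

    n = len(feats)
    novelty = np.zeros(n)
    for t in range(half, n - half):
        patch = gram[t - half:t + half, t - half:t + half]
        novelty[t] = np.sum(kernel * patch)

    # Local maxima above a relative threshold are hypothesized boundaries.
    novelty = np.maximum(novelty, 0)
    if novelty.max() > 0:
        novelty /= novelty.max()
    return [t for t in range(1, n - 1)
            if novelty[t] > threshold
            and novelty[t] >= novelty[t - 1]
            and novelty[t] >= novelty[t + 1]]

# Toy example: two homogeneous 40-frame regions with different means,
# so a boundary should be hypothesized near frame 40.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (40, 13)) + 1.0,
                   rng.normal(0, 0.1, (40, 13)) - 1.0])
print(novelty_boundaries(feats))
```

Note that this sketch takes the number of segments from a fixed threshold, whereas the paper's approach determines it automatically.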
The fusion of scores from our unsupervised LID system and the i-vector system outperforms the LID system trained under the supervision of manual labels by a relative margin of 31.19%, demonstrating that unsupervised LID systems built on virtual labels can be on par with supervised systems.
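Score-level fusion of two LID systems can be sketched as follows: normalize each system's utterance-by-language score matrix to a common scale, then take a convex combination before picking the highest-scoring language. This is a minimal sketch assuming equal fusion weights and z-normalization; the paper's actual calibration and fusion procedure may differ, and all names and score values below are hypothetical.

```python
import numpy as np

def fuse_scores(scores_a, scores_b, w=0.5):
    """Late (score-level) fusion of two LID systems: z-normalize each
    utterance-by-language score matrix per language, then combine with
    a convex weight w."""
    def znorm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean(axis=0)) / (s.std(axis=0) + 1e-12)
    return w * znorm(scores_a) + (1 - w) * znorm(scores_b)

# Hypothetical scores for 3 utterances over 4 candidate languages.
virtual_label_sys = np.array([[2.1, 0.3, 0.5, 0.1],
                              [0.2, 1.9, 0.4, 0.6],
                              [0.3, 0.2, 0.1, 1.8]])
ivector_sys = np.array([[1.5, 0.7, 0.2, 0.4],
                        [0.1, 2.2, 0.3, 0.2],
                        [0.5, 0.4, 0.3, 2.0]])

fused = fuse_scores(virtual_label_sys, ivector_sys)
predictions = fused.argmax(axis=1)  # language decision per utterance
print(predictions)
```

Per-language z-normalization is one simple way to put heterogeneous scores (e.g., bigram log-likelihoods and i-vector backend scores) on a comparable scale before weighting.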
Additional information
Funding was provided by the Ministry of Human Resource Development.
Cite this article
Bhati, S., Nayak, S. & Kodukula, S.R.M. Unsupervised Speech Signal-to-Symbol Transformation for Language Identification. Circuits Syst Signal Process 39, 5169–5197 (2020). https://doi.org/10.1007/s00034-020-01408-8