Unsupervised Segmentation of Speech Signals Using Kernel-Gram Matrices

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 841)


The objective of this paper is to develop an unsupervised method for segmenting speech signals into phoneme-like units. The proposed algorithm is based on the observation that feature vectors from the same segment exhibit a higher degree of similarity than feature vectors from different segments. The kernel-Gram matrix of an utterance is formed by computing the similarity between every pair of feature vectors in the Gaussian kernel space. The kernel-Gram matrix consists of square patches along the principal diagonal, corresponding to different phoneme-like segments in the speech signal. The proposed method detects the number of segments, as well as their boundaries, automatically. It requires no prior information about the input utterances, such as the exact distribution of segment lengths or the correct number of segments in an utterance. The proposed method outperforms state-of-the-art blind segmentation algorithms on the Zero Resource 2015 databases and the TIMIT database.
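The kernel-Gram matrix described in the abstract can be illustrated with a minimal sketch. It assumes the common Gaussian kernel form K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)); the function name `gaussian_gram`, the bandwidth `sigma`, and the synthetic two-segment features are all hypothetical choices made for illustration, not details taken from the paper. The sketch only builds the matrix and checks its block structure; it does not reproduce the paper's boundary-detection step.

```python
import numpy as np

def gaussian_gram(features, sigma=1.0):
    """Return the N x N matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    x = np.asarray(features, dtype=float)
    sq_norms = np.sum(x * x, axis=1)
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Toy "utterance" (assumed data): two synthetic segments of 5 and 7 frames,
# each frame a 3-dimensional feature vector clustered around its own mean.
rng = np.random.default_rng(0)
seg_a = rng.normal(loc=0.0, scale=0.1, size=(5, 3))
seg_b = rng.normal(loc=3.0, scale=0.1, size=(7, 3))
K = gaussian_gram(np.vstack([seg_a, seg_b]), sigma=1.0)

# Mean similarity inside the two diagonal blocks versus the off-diagonal block.
within = (K[:5, :5].mean() + K[5:, 5:].mean()) / 2.0
across = K[:5, 5:].mean()
```

With features like these, `within` should be much larger than `across`, which is the square-patch pattern along the principal diagonal that the abstract refers to. Real speech features would give less clean blocks, and the choice of `sigma` would matter.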


Blind segmentation; Gaussian kernel; Kernel-Gram matrix; Phonetic segmentation



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Department of Electrical Engineering, IIT Hyderabad, Hyderabad, India
