
Phoneme Segmentation-Based Unsupervised Pattern Discovery and Clustering of Speech Signals


Abstract

This paper proposes a new method that detects repeated keyword/phrase patterns in speech utterances by performing pattern discovery at the phoneme level. Prior to this method, we developed a pattern discovery method using frame-level features. Although that method performs reasonably well, it produces a considerable number of false positives. This paper aims to extract the desired keywords by examining the match between speech utterances at the phoneme level instead of the frame level, thereby reducing false positives and improving accuracy. In this work, we first segment the speech utterances into phoneme-like regions using an affinity matrix. Then, the matched phoneme regions present in a pair of speech utterances are identified. A new 3-neighbor depth-first search traversal technique is proposed to discover the sequence of phoneme matches. Finally, the distance scores in the sequence of phoneme matches are validated to identify the desired keyword patterns. The performance of the proposed method is evaluated on Hindi and Bengali news databases and compared with state-of-the-art techniques. Based on the detected keyword patterns, the speech utterances are divided into groups using a standard clustering algorithm. The derived clusters represent broader domain-specific groups that are useful for efficient speech retrieval tasks.
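The abstract describes the 3-neighbor depth-first search traversal only at a high level. The sketch below illustrates, under assumed details, how such a traversal might trace a low-distance path of phoneme matches through a pairwise phoneme-level distance matrix. The function name three_neighbor_dfs, the threshold value, and the choice of forward neighbors (right, down, diagonal) are illustrative assumptions, not the authors' specification; the paper's score-validation step is likewise omitted here.

```python
import numpy as np

def three_neighbor_dfs(dist, start, threshold=0.4):
    """Trace a sequence of phoneme-level matches from a seed cell.

    `dist` is a hypothetical phoneme-by-phoneme distance matrix between two
    utterances. From each accepted cell (i, j) only the three forward
    neighbors (i+1, j), (i, j+1), (i+1, j+1) are explored, keeping cells
    whose distance falls below `threshold`.
    """
    n_rows, n_cols = dist.shape
    path, stack, visited = [], [start], set()
    while stack:
        i, j = stack.pop()
        if (i, j) in visited or dist[i, j] > threshold:
            continue                       # skip revisits and poor matches
        visited.add((i, j))
        path.append((i, j))
        # 3-neighbor expansion: right, down, and diagonal moves only
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            ni, nj = i + di, j + dj
            if ni < n_rows and nj < n_cols and (ni, nj) not in visited:
                stack.append((ni, nj))
    return path

# Toy example: a 5x5 distance matrix with a low-distance diagonal band
# standing in for a repeated phoneme sequence shared by two utterances.
rng = np.random.default_rng(0)
dist = rng.uniform(0.5, 1.0, size=(5, 5))
for k in range(5):
    dist[k, k] = 0.1                       # simulated phoneme matches
print(three_neighbor_dfs(dist, start=(0, 0)))
```

Run as-is, the toy example recovers the diagonal cells (0, 0) through (4, 4), i.e., the contiguous sequence of phoneme matches that a keyword repetition would produce in such a matrix.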


Data Availability

The datasets used in the current study are available at the IIT Kharagpur speech group repository, http://cse.iitkgp.ac.in/~ksrao/res.html


Acknowledgements

The authors would like to thank Annu Debnath and Sutapa Bhattacharya (speakers) for their support in the creation of Hindi and Bengali speech corpora.

Author information

Corresponding author

Correspondence to Kishore Kumar Ravi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ravi, K.K., Krothapalli, S.R. Phoneme Segmentation-Based Unsupervised Pattern Discovery and Clustering of Speech Signals. Circuits Syst Signal Process 41, 2088–2117 (2022). https://doi.org/10.1007/s00034-021-01876-6

