
Phoneme Segmentation-Based Unsupervised Pattern Discovery and Clustering of Speech Signals


Abstract

This paper proposes a new method that detects repeated keyword/phrase patterns in speech utterances by performing pattern discovery at the phoneme level. Prior to this method, we developed a pattern discovery method using frame-level features. Although that method performs reasonably well, it produces a considerable number of false positives. This paper aims to extract the desired keywords by examining the match between speech utterances at the phoneme level instead of the frame level, thereby reducing false positives and improving accuracy. In this work, we first segment the speech utterances into phoneme-like regions using an affinity matrix. Then, the matched phoneme regions present in a pair of speech utterances are identified. A new 3-neighbor depth-first search traversal technique is proposed to discover the sequence of phoneme matches. Finally, the distance scores in the sequence of phoneme matches are validated to identify the desired keyword patterns. The performance of the proposed method is evaluated on Hindi and Bengali news databases and compared with state-of-the-art techniques. Based on the detected keyword patterns, the speech utterances are divided into groups using a standard clustering algorithm. The derived clusters represent broader domain-specific groups that are useful for efficient speech retrieval tasks.
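The abstract describes the 3-neighbor depth-first search traversal only at a high level. The sketch below illustrates, under assumed details, how such a traversal might trace a low-distance path of phoneme matches through a pairwise phoneme-level distance matrix. The function name three_neighbor_dfs, the threshold value, and the choice of forward neighbors (right, down, diagonal) are illustrative assumptions, not the authors' specification; the paper's score-validation step is likewise omitted here.

```python
import numpy as np

def three_neighbor_dfs(dist, start, threshold=0.4):
    """Trace a sequence of phoneme-level matches from a seed cell.

    `dist` is a hypothetical phoneme-by-phoneme distance matrix between two
    utterances. From each accepted cell (i, j) only the three forward
    neighbors (i+1, j), (i, j+1), (i+1, j+1) are explored, keeping cells
    whose distance falls below `threshold`.
    """
    n_rows, n_cols = dist.shape
    path, stack, visited = [], [start], set()
    while stack:
        i, j = stack.pop()
        if (i, j) in visited or dist[i, j] > threshold:
            continue                       # skip revisits and poor matches
        visited.add((i, j))
        path.append((i, j))
        # 3-neighbor expansion: right, down, and diagonal moves only
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            ni, nj = i + di, j + dj
            if ni < n_rows and nj < n_cols and (ni, nj) not in visited:
                stack.append((ni, nj))
    return path

# Toy example: a 5x5 distance matrix with a low-distance diagonal band
# standing in for a repeated phoneme sequence shared by two utterances.
rng = np.random.default_rng(0)
dist = rng.uniform(0.5, 1.0, size=(5, 5))
for k in range(5):
    dist[k, k] = 0.1                       # simulated phoneme matches
print(three_neighbor_dfs(dist, start=(0, 0)))
```

Run as-is, the toy example recovers the diagonal cells (0, 0) through (4, 4), i.e., the contiguous sequence of phoneme matches that a keyword repetition would produce in such a matrix.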


Data Availability

The datasets used in the current study are available at the IIT Kharagpur speech group repository, http://cse.iitkgp.ac.in/~ksrao/res.html


Acknowledgements

The authors would like to thank Annu Debnath and Sutapa Bhattacharya (speakers) for their support in the creation of Hindi and Bengali speech corpora.

Author information

Corresponding author

Correspondence to Kishore Kumar Ravi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ravi, K.K., Krothapalli, S.R. Phoneme Segmentation-Based Unsupervised Pattern Discovery and Clustering of Speech Signals. Circuits Syst Signal Process 41, 2088–2117 (2022). https://doi.org/10.1007/s00034-021-01876-6

