Abstract
Extracting features from speech signals in the presence of background excitation is a challenging task. In this research, we propose a phoneme subspace integrated with the linear visual assessment tendency (LVAT) algorithm to retrieve audio features based on spectral depth analysis. The LVAT algorithm clusters different spectral features to define the intensity of the signal weight. The fast Fourier transform (FFT) projects a selection of weight-estimated samples from the signal for the phoneme subspace. The FFT-phoneme subspace combination enhances the features by analyzing the low-, middle- and high-frequency signals based on the phoneme subspace weight update. Traditional feature extraction techniques, namely mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC) and power-normalized cepstral coefficients (PNCC), are analyzed under different noise conditions and compared with the results of LVAT clustering combined with PNCC. The experimental results show improved performance on objective measures such as sensitivity, specificity, accuracy and recognition rate.
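The abstract does not give the band limits or the weight update rule, so the following is only a minimal sketch of the general idea of splitting a frame's FFT spectrum into low, middle and high bands and expressing each band as a share of the total energy. It is not the LVAT algorithm or the authors' phoneme subspace method. The function names `fft` and `band_energies`, the Hann window, and the 1 kHz and 4 kHz band edges are all assumptions made for illustration.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def band_energies(frame, sample_rate, edges=(1000.0, 4000.0)):
    """Return the share of spectral energy in the low, middle and high bands.

    `edges` (in Hz) are hypothetical cut-offs, not values from the paper.
    """
    n = len(frame)
    # Hann window to reduce spectral leakage between bands.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    spectrum = fft(windowed)
    energies = [0.0, 0.0, 0.0]
    for k in range(n // 2 + 1):  # non-negative frequencies only
        freq = k * sample_rate / n
        power = abs(spectrum[k]) ** 2
        if freq < edges[0]:
            energies[0] += power
        elif freq < edges[1]:
            energies[1] += power
        else:
            energies[2] += power
    total = sum(energies) or 1.0  # guard against an all-zero frame
    return [e / total for e in energies]


# Usage: a synthetic 2 kHz tone sampled at 16 kHz, one 512-sample frame.
sr, n = 16000, 512
tone = [math.sin(2 * math.pi * 2000 * i / sr) for i in range(n)]
weights = band_energies(tone, sr)
print(weights)  # [low, middle, high] shares that sum to 1
```

For the synthetic tone, nearly all of the energy falls in the middle band, which is the kind of per-band weighting a clustering stage could then operate on. A real system would more likely use an optimized FFT routine and band limits tuned to the data.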
Acknowledgements
The authors sincerely thank Mr. Bin Gao for providing access to the Standard English Language Speech Database for Speaker Recognition (ELSDS). This data set was used in the experimental analysis of this research.
Cite this article
Therese, S.S., Lingam, C. A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system. J Ambient Intell Human Comput (2017). https://doi.org/10.1007/s12652-017-0653-7
DOI: https://doi.org/10.1007/s12652-017-0653-7