A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Feature extraction from speech signals in the presence of background excitation is a challenging task. In this research, we propose a phoneme subspace integrated with the linear visual assessment tendency (LVAT) algorithm to retrieve audio features based on spectral depth analysis. The LVAT algorithm clusters different spectral features to define the intensity of the signal weight. The fast Fourier transform (FFT) projects a selection of weight-estimated samples from the signal onto the phoneme subspace. The FFT-phoneme subspace combination enhances the features by analyzing the low-, middle- and high-frequency signals based on the phoneme subspace weight update. Traditional feature extraction techniques such as mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC) and power-normalized cepstral coefficients (PNCC) are analyzed under different noise conditions and compared with the results of clustering with power-normalized cepstral coefficients. The experimental results demonstrate improved performance on objective measures such as sensitivity, specificity, accuracy and recognition rate.
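The objective measures named in the abstract follow standard definitions. As a minimal sketch, assuming a binary recognition decision summarized by confusion counts, sensitivity, specificity and accuracy can be computed as below. The class name `ConfusionCounts`, the helper `_ratio` and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper. Recognition rate is omitted because the abstract does not define it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for binary confusion-matrix counts."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def sensitivity(c: ConfusionCounts) -> float:
    """True positive rate: TP / (TP + FN)."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """True negative rate: TN / (TN + FP)."""
    return _ratio(c.tn, c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all decisions that are correct."""
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


# Illustrative counts only (not results from the paper).
example = ConfusionCounts(tp=45, fp=10, tn=40, fn=5)
print(sensitivity(example), specificity(example), accuracy(example))
```

Reporting sensitivity and specificity alongside accuracy is useful when the classes are imbalanced, since accuracy alone can hide a high miss rate on the rarer class.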

Acknowledgements

The authors sincerely thank Mr. Bin Gao for providing access to the Standard English Language Speech Database for Speaker Recognition (ELSDS), which was used in the experimental analysis of this research.

Author information

Corresponding author

Correspondence to S. Shanthi Therese.

About this article

Cite this article

Therese, S.S., Lingam, C. A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system. J Ambient Intell Human Comput (2017). https://doi.org/10.1007/s12652-017-0653-7

  • DOI: https://doi.org/10.1007/s12652-017-0653-7
