Auditory-Inspired Morphological Processing of Speech Spectrograms: Applications in Automatic Speech Recognition and Speech Enhancement

Published in Cognitive Computation.

Abstract

This paper presents new auditory-inspired speech processing methods that combine spectral subtraction with two-dimensional non-linear filtering techniques originally conceived for image processing. In particular, mathematical morphology operations such as erosion and dilation are applied to noisy speech spectrograms using structuring elements specifically designed to reflect the masking properties of the human auditory system. This is complemented with a pre-processing stage comprising the conventional spectral subtraction procedure and auditory filterbanks. The methods were tested on both speech enhancement and automatic speech recognition tasks. For the former, time-frequency anisotropic structuring elements applied to grey-scale spectrograms were found to provide better perceptual quality than isotropic ones, proving more appropriate (under a number of perceptual quality estimation measures and several signal-to-noise ratios on the Aurora database) for retaining the structure of speech while removing background noise. For the latter, the combination of spectral subtraction and auditory-inspired morphological filtering was found to improve recognition rates on a noise-contaminated version of the Isolet database.
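The paper's structuring elements are derived from auditory masking data and are not reproduced here; the sketch below only illustrates the underlying grey-scale morphology (erosion, dilation, and their composition into an opening) on a toy log-spectrogram, using a hypothetical cross-shaped structuring element elongated along time to stand in for a time-frequency anisotropic shape. Python with NumPy is assumed.

```python
import numpy as np

def grey_erode(S, se):
    """Grey-scale erosion with a flat structuring element `se` (boolean mask):
    each point takes the minimum of S over the SE support centred on it."""
    kf, kt = se.shape
    pf, pt = kf // 2, kt // 2
    P = np.pad(S, ((pf, pf), (pt, pt)), mode="edge")
    out = np.empty_like(S)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            out[i, j] = P[i:i + kf, j:j + kt][se].min()
    return out

def grey_dilate(S, se):
    """Grey-scale dilation: maximum over the reflected SE support."""
    kf, kt = se.shape
    pf, pt = kf // 2, kt // 2
    P = np.pad(S, ((pf, pf), (pt, pt)), mode="edge")
    se_r = se[::-1, ::-1]  # reflect the SE for dilation
    out = np.empty_like(S)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            out[i, j] = P[i:i + kf, j:j + kt][se_r].max()
    return out

# Hypothetical anisotropic SE: 5 frames along time, 3 bins along frequency,
# loosely mimicking the fact that temporal masking extends further in time.
se = np.zeros((3, 5), dtype=bool)
se[1, :] = True
se[:, 2] = True

S = np.random.default_rng(0).normal(size=(20, 40))  # toy log-spectrogram
opened = grey_dilate(grey_erode(S, se), se)          # morphological opening
print(opened.shape)  # (20, 40)
```

An opening with a flat structuring element is anti-extensive (it never increases any value) and suppresses bright time-frequency specks narrower than the SE, which is the kind of behaviour exploited here for removing residual noise peaks while preserving speech ridges that extend along the SE's orientation.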

Notes

  1. See [47] for a discussion on the acoustic cues employed by humans in each of the levels.

  2. To pave the way for later notation, we introduce our own names, mz and ERB rate, respectively, for these logarithmic scales.

  3. The basic quantity in terms of which the perception of sound is measured is sound pressure level, a normalised, logarithmic sound pressure—unless otherwise noted, log refers in this paper to base-10 logarithms—\(L_p = 20\log\frac{p}{p_0} \; (\hbox{dB SPL})\), where p is the sound pressure and \(p_0 = 20 \; \mu\hbox{Pa}\) is the reference sound pressure, the lowest audible pressure for human ears at mid-frequencies. A related quantity is sound (intensity) level, a normalised, logarithmic intensity \(L_I = 10\log\frac{I}{I_0} \; (\hbox{dB SL})\), where \(I \propto p^2\) is the acoustic intensity, an energy-related quantity. When using \(I_0 = 10^{-12} \; \hbox{W/m}^2\) for reference, both levels can be equated and we drop the subindex:

    $$ L = 20\log\frac{p}{p_0} \; (\hbox{dB SPL}) = 10\log\frac{I}{I_0} \; (\hbox{dB SL}) $$

    Both dB SPL and dB SL will be notated simply as dB.

  4. At least not with the reduced number of filterbanks suggested by psychoacoustical experiments.
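The level equivalence in note 3 can be checked numerically. Since intensity is proportional to squared pressure, writing \(I = I_0 (p/p_0)^2\) (which amounts to choosing the proportionality constant so that the two references coincide, as they very nearly do in air for plane waves) makes the two levels agree for any pressure. A minimal sketch:

```python
import math

P0 = 20e-6  # reference sound pressure: 20 micropascal
I0 = 1e-12  # reference intensity: 1e-12 W/m^2

def spl(p):
    """Sound pressure level, dB SPL."""
    return 20 * math.log10(p / P0)

def sil(i):
    """Sound (intensity) level, dB SL."""
    return 10 * math.log10(i / I0)

p = 0.2                    # pascal
i = I0 * (p / P0) ** 2     # intensity consistent with I ∝ p^2
print(round(spl(p)), round(sil(i)))  # 80 80
```

Both functions return 80 dB here, illustrating why the text can drop the subindex and write a single level L.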

References

  1. Baker J. The Dragon system—an overview. IEEE Trans Acoust Speech Signal Process. 1975;23(1):24–29.

  2. Beerends J, Hekstra A, Rix A, Hollier M. Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment. Part II: psychoacoustic model. J Audio Eng Soc. 2002;50(10):765–78.

  3. Berouti M, Schwartz R, Makhoul J. Enhancement of speech corrupted by acoustic noise. In: IEEE International Conference on Acoustics, Speech, and Signal Processing. 1979;4:208–11.

  4. Bourlard H, Morgan N. Hybrid HMM/ANN systems for speech recognition: overview and new research directions. Adapt Process Seq Data Struct. 1998;389–417.

  5. Cole R, Muthusamy Y, Fanty M. The ISOLET spoken letter database. 2011. http://www.cslu.ogi.edu/corpora/isolet.

  6. Davis S, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process. 1980;28(4):357–66.

  7. Dougherty ER, Lotufo RA. Hands-on morphological image processing. Tutorial texts in optical engineering, vol. TT59. SPIE Press; 2003.

  8. Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process. 1984;32(6):1109–21.

  9. Evans N, Mason J, Roach M, et al. Noise compensation using spectrogram morphological filtering. In: Proceedings of the 4th IASTED International Conference on Signal and Image Processing. 2002. pp. 157–61.

  10. Ezeiza A, López de Ipiña K, Hernández C, Barroso N. Enhancing the feature extraction process for automatic speech recognition with fractal dimensions. Cogn Comput. 2012;1–6.

  11. Fastl H, Zwicker E. Psycho-acoustics: facts and models. 3rd ed. New York: Springer; 2007.

  12. Faundez-Zanuy M, Hussain A, Mekyska J, Sesa-Nogueras E, Monte-Moreno E, Esposito A, Chetouani M, Garre-Olmo J, Abel A, Smekal Z, López de Ipiña K. Biometric applications related to human beings: there is life beyond security. Cogn Comput. 2012;1–16.

  13. Florentine M, Fastl H, Buus S. Temporal integration in normal hearing, cochlear impairment, and impairment simulated by masking. J Acoust Soc Am. 1988;84(1):195–203.

  14. Gelbart D, Hemmert W, Holmberg M, Morgan N. Noisy ISOLET and ISOLET testbeds. 2011. http://www.icsi.berkeley.edu/Speech/papers/eurospeech05-onset/isolet/.

  15. Glasberg B, Moore B. Derivation of auditory filter shapes from notched-noise data. Hear Res. 1990;47(1–2):103–38.

  16. Gonzalez R, Woods R. Digital image processing. Boston: Addison-Wesley; 1993.

  17. Greenberg S. From here to utility. In: The integration of phonetic knowledge in speech technology. Text, Speech and Language Technology, vol. 25. New York: Springer; 2005. pp. 107–32.

  18. Gunawan TS, Ambikairajah E, Epps J. Perceptual speech enhancement exploiting temporal masking properties of human auditory system. Speech Commun. 2010;52:381–93.

  19. Hansen J, Pellom B. An effective quality evaluation protocol for speech enhancement algorithms. In: International Conference on Spoken Language Processing. Sydney, Australia; 1998. pp. 2819–22.

  20. Heckmann M, Domont X, Joublin F, Goerick C. A hierarchical framework for spectro-temporal feature extraction. Speech Commun. 2010;53:736–52.

  21. Hirsch H, Pearce D. The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ASR2000—Automatic Speech Recognition: Challenges for the New Millennium, ISCA Tutorial and Research Workshop (ITRW); 2000.

  22. Hu Y, Loizou P. Evaluation of objective quality measures for speech enhancement. IEEE Trans Audio Speech Lang Process. 2008;16(1):229–38.

  23. Hu Y, Loizou P. Evaluation of objective measures for speech enhancement. In: Proceedings of Interspeech. 2006. pp. 1447–50.

  24. Hurmalainen A, Virtanen T. Modelling spectro-temporal dynamics in factorisation-based noise-robust automatic speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2012. pp. 4113–16.

  25. Irino T, Patterson R. A time-domain, level-dependent auditory filter: the gammachirp. J Acoust Soc Am. 1997;101(1):412–19.

  26. Irino T, Patterson R. A dynamic compressive gammachirp auditory filterbank. IEEE Trans Audio Speech Lang Process. 2006;14(6):2222–32.

  27. Jelinek F, Bahl L, Mercer R. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Trans Inf Theory. 1975;21(3):250–56.

  28. Jesteadt W, Bacon SP, Lehman JR. Forward masking as a function of frequency, masker level, and signal delay. J Acoust Soc Am. 1982;71(4):950–62.

  29. Klatt D. Prediction of perceived phonetic distance from critical-band spectra: a first step. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 7. 1982. pp. 1278–81.

  30. Loizou P. Matlab software. 2011. http://www.utdallas.edu/loizou/speech/software.htm.

  31. Martínez C, Goddard J, Milone D, Rufiner H. Bioinspired sparse spectro-temporal representation of speech for robust classification. Comput Speech Lang. 2012;26:336–48.

  32. Matheron G, Serra J. The birth of mathematical morphology. In: Proceedings of the 6th International Symposium on Mathematical Morphology. Sydney, Australia; 2002. pp. 1–16.

  33. Meddis R. Simulation of mechanical to neural transduction in the auditory receptor. J Acoust Soc Am. 1986;79(3):702–11.

  34. Meddis R. Simulation of auditory-neural transduction: further studies. J Acoust Soc Am. 1988;83(3):1056–63.

  35. Meyer B, Kollmeier B. Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition. Speech Commun. 2010;53:753–67.

  36. Moore B, Glasberg B. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. J Acoust Soc Am. 1983;74:750.

  37. Moore B, Glasberg B. A revised model of loudness perception applied to cochlear hearing loss. Hear Res. 2004;188(1–2):70–88.

  38. Patterson R, Robinson K, Holdsworth J, McKeown D, Zhang C, Allerhand M. Complex sounds and auditory images. Aud Physiol Percept. 1992;83:429–46.

  39. Peláez-Moreno C, García-Moral A, Valverde-Albacete F. Analyzing phonetic confusions using formal concept analysis. J Acoust Soc Am. 2010;128(3):1377–90.

  40. Quackenbush S, Barnwell T, Clements M. Objective measures of speech quality. Englewood Cliffs: Prentice Hall; 1988.

  41. Quatieri TF. Discrete-time speech signal processing: principles and practice. Upper Saddle River: Prentice Hall; 2002.

  42. Rabiner L, Juang BH. Fundamentals of speech recognition. Upper Saddle River: Prentice Hall; 1993.

  43. Rix A, Hollier M, Hekstra A, Beerends J. Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment. Part I: time-delay compensation. J Audio Eng Soc. 2002;50(10):755–64.

  44. Scalart P, Filho J. Speech enhancement based on a priori signal to noise estimation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing. 1996. pp. 629–32.

  45. Serra J, Soille P, editors. Mathematical morphology and its application to image processing. Computational Imaging and Vision. Kluwer Academic; 1994.

  46. Stevens SS, Volkmann J, Newman EB. A scale for the measurement of the psychological magnitude of pitch. J Acoust Soc Am. 1937;8:185–90.

  47. Summerfield Q, Culling J. Auditory segregation of competing voices: absence of effects of FM or AM coherence. Philos Trans R Soc Lond. 1992;336:357–66.

  48. ten Bosch L, Kirchhoff K. Editorial note: bridging the gap between human and automatic speech recognition. Speech Commun. 2007;49(5):331–5.

  49. Weiss NA, Hasset MJ. Introductory statistics. Reading: Addison-Wesley; 1993. pp. 407–08.

  50. Yeh J, Chen C. Auditory front-ends for noise-robust automatic speech recognition. In: 7th International Symposium on Chinese Spoken Language Processing (ISCSLP). 2010. pp. 205–08.

  51. Yin H, Hohmann V, Nadeu C. Acoustic features for speech recognition based on gammatone filterbank and instantaneous frequency. Speech Commun. 2010;53:707–15.

  52. Zwicker E, Feldtkeller R. The ear as a communication receiver. Woodbury: Acoustical Society of America; 1999.

  53. Zwicker E, Jaroszewski A. Inverse frequency dependence of simultaneous tone-on-tone masking patterns at low levels. J Acoust Soc Am. 1982;71(6):1508–12.

  54. Zwicker E, Terhardt E. Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J Acoust Soc Am. 1980;68:1523.

Acknowledgments

This work has been partially supported by the Spanish Ministry of Science and Innovation under CICYT projects TEC2008-06382/TEC and TEC2011-26807.

Author information

Corresponding author

Correspondence to Joyner Cadore.

About this article

Cite this article

Cadore, J., Valverde-Albacete, F.J., Gallardo-Antolín, A. et al. Auditory-Inspired Morphological Processing of Speech Spectrograms: Applications in Automatic Speech Recognition and Speech Enhancement. Cogn Comput 5, 426–441 (2013). https://doi.org/10.1007/s12559-012-9196-6
