CSI Transactions on ICT, Volume 5, Issue 2, pp 167–178

Processing of speech signals for robust recognition in practical environments

  • Vishala Pannala
Special Issue Visvesvaraya 2016 of CSIT


In automatic speech recognition systems, the information in the speech signal is traditionally extracted as feature vectors representing sub-word units, and these features are then converted into human-readable text. However, such systems perform poorly when speech is degraded by varying environmental conditions. To improve performance, two main issues must be addressed: (a) determination of speech regions in speech data collected in degraded environments, and (b) recognition of speech sounds from the degraded speech in the detected speech regions. Although a wide variety of techniques address these issues, most are applicable to clean speech synthetically degraded by stationary noise, because they need large amounts of training data for statistical modeling. The present work focuses on methods of processing the signals so as to determine the desired speech regions in degraded conditions. To this end, signal processing methods are explored to extract speech-specific characteristics that are independent of the characteristics of the degradations.


Speech region detection, Varying degradations, Signal processing methods, Speech-specific characteristics



Copyright information

© CSI Publications 2017

Authors and Affiliations

  1. Speech and Vision Lab, LTRC, International Institute of Information Technology (IIIT), Hyderabad, India
