Abstract
This paper evaluates the impact of combined transcoding and packet-loss degradation on speech used as input for an interactive voice response (IVR) service and proposes a method for classifying user input according to speech quality. Every segment of a communication system needs careful optimization, because the quality of the user’s experience is becoming a more prominent factor in the overall acceptance and desirability of modern services. In our research, an emulation environment was developed and the behavior of the IVR was analyzed under different packet-loss and transcoding conditions. A set of frequently used vocoders was tested for its performance with an automatic speech recognition module under degraded conditions. Further, a quality estimation classifier based on Gaussian mixture models was proposed to determine the user’s best input modality. Various training and test parameters were investigated to provide more detailed insight into input quality estimation for an IVR service working under error-prone conditions.
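As a rough illustration of the classification idea summarized above, the following minimal sketch scores an utterance against two Gaussian mixture models and picks the class with the higher log-likelihood. It is not the authors' implementation: the one-dimensional features, the mixture parameters, the class labels, and the function names are all hypothetical, and a real system would train the mixtures on multi-dimensional speech features.

```python
import math

# Hypothetical 1-D mixtures: (weight, mean, variance) per component.
# The values are illustrative only and are not taken from the paper.
GMM_GOOD = [(0.6, 0.0, 1.0), (0.4, 1.5, 0.5)]
GMM_DEGRADED = [(0.5, 4.0, 1.0), (0.5, 6.0, 2.0)]


def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gmm_log_likelihood(frames, gmm):
    """Sum over frames of log(sum_k w_k * N(x | mean_k, var_k))."""
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
        peak = max(terms)  # log-sum-exp trick for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def classify_quality(frames):
    """Return the label of the mixture that scores the utterance higher."""
    good = gmm_log_likelihood(frames, GMM_GOOD)
    degraded = gmm_log_likelihood(frames, GMM_DEGRADED)
    return "good" if good >= degraded else "degraded"
```

A service could map the returned label to an input modality; that mapping, like the feature extraction, is left out of the sketch.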
Acknowledgments
This research was partly supported by the European Social Fund as part of the EU Operational Programme for Human Resources Development for the period 2007–2013 and partly supported by the Slovene Research Agency (ARRS) under Contract Number P2-0069. We gratefully acknowledge their support.
Cite this article
Lovrenčič, T., Štular, M., Kačič, Z. et al. QoS Estimation and Prediction of Input Modality in Degraded IP Networks. Wireless Pers Commun 80, 837–857 (2015). https://doi.org/10.1007/s11277-014-2044-0
DOI: https://doi.org/10.1007/s11277-014-2044-0