The combined effect of speech codec quality and transmission delay on human performance during complex spoken interactions

International Journal of Speech Technology

Abstract

This paper examines how speech codec output quality interacts with simulated satellite or VoIP transmission delay to affect talker performance in a complex interaction. A hardware test codec, in both single and tandem configurations, was compared against a number of processed speech reference conditions with known Mean Opinion Scores (MOS) to determine the relative subjective quality of the two test codec conditions. The two codec conditions, plus an additional higher-quality condition, were then used in an experiment that examined the interaction of transmitted speech quality and simulated transmission delay in a speech shadowing task and an accompanying error repair task involving two speakers. One person (the “reader”) read a passage. The second person (the “shadower”) shadowed the passage by immediately repeating the words spoken by the reader. The reader, whilst reading, also listened for errors made by the shadower and repaired those errors by verbally reporting them to the shadower. A significant interaction between codec quality and transmission delay was found for the error repair task, but only in cases where the shadower made a significant number of errors. These results suggest that, for highly complex interactions involving significant cognitive load, human performance will degrade more rapidly with increasing delay in transmission systems that use speech codecs with lower-quality output. This is assumed to be due to the additional demands upon working memory imposed by the transmission delay.

Author information

Corresponding author

Correspondence to R. Mannell.

Additional information

A much shorter preliminary report on this research was presented at the 9th Australian International Conference on Speech Science and Technology held in Melbourne, Australia in December 2002 and also appeared in the conference proceedings distributed to attendees on CD-ROM.

About this article

Cite this article

Mannell, R. The combined effect of speech codec quality and transmission delay on human performance during complex spoken interactions. Int J Speech Technol 9, 53–74 (2006). https://doi.org/10.1007/s10772-008-9006-4

  • DOI: https://doi.org/10.1007/s10772-008-9006-4
