A Dimension-Based Approach to Mouth-to-Ear Speech Transmission Quality

Part of the book series: T-Labs Series in Telecommunication Services (TLABS)

Abstract

Both technological components of PSTN and VoIP speech transmission as well as aspects of the perception of transmitted speech are systematically described in this chapter. It is explained how quality is formed on the basis of perceived dimensions, how quality and quality dimensions can subjectively be measured, and how quality can be modeled on the basis of dimensions. State-of-the-art signal-based and parametric quality models are presented and the advantages of a new class of models with diagnostic capabilities are stressed. Finally, the research topics of this book are formulated: The identification of relevant quality dimensions, the development of an efficient subjective test method for dimension assessment, and the development of a parametric dimension-based model for speech quality.

Notes

  1.

    Gestures as a part of human face-to-face communication are disregarded in this consideration, although they become more important, for example, in video telephony. The present work is focused solely on acoustic speech communication.

  2.

    Speech comprehension is related to several other terms. Whether a message is comprehended by the recipient depends, besides her/his willingness to comprehend it, on a series of factors. Articulation measures the percentage of correctly identified word fragments, syllables, phonemes, and meaningless words for a transmission path; it is the prerequisite for comprehensibility. Comprehensibility describes how well the speech signal, that is, the sign carrier, is capable of conveying information. Comprehensibility, in turn, constitutes the prerequisite for intelligibility, which describes the percentage of correctly identified meaningful words, phrases, or sentences. In parallel, communicability refers to how well a message can serve to communicate. Context as well as the recipient’s knowledge influence the process of comprehension (where the definition of context depends on the level of comprehension). For more information, see Raake (2006, pp. 9–11), Jekosch (2005b, pp. 97–102), and Möller (2000, pp. 26–27).

  3.

    The terms transmitter and receiver are ambiguous, as they may also refer to human beings. To avoid confusion, the technical entities for sending and receiving signals may also be called sending and receiving apparatus, respectively (Richards 1973, p. 14).

  4.

    In traditional narrowband telephony, the (modified) intermediate reference system (IRS) is used as a reference (see ITU-T Rec. P.48 1988 and ITU-T Rec. P.830 1996, Annex D). For wideband telephony (50–7000 Hz), frequency weights are given in ITU-T Rec. P.79 (2007, Annex G) for a reference system that yields a loudness rating of 0 dB when compared to an IRS.

  5.

    For simplicity, no distinction is made between variables denoting signals in the time-continuous domain and in the time-discrete domain. That is, \(x\) might represent the continuous signal \(x(t)\), \(t \in \mathbb{R}\), or the discrete version \(x(k)\), where \(t=kT\) and \(T=1/f_{\mathrm{s}}\). Here, \(f_{\mathrm{s}}\) denotes the sampling rate and \(k \in \mathbb{Z}\). Moreover, the amplitude of time-discrete signals is assumed to be quantized.

  6.

    The acoustic properties of the room also have an influence on the receiving signal \(x\), which, as a result, affects perception. These sound field components are represented by the gray arrows in Fig. 2.3 and are briefly discussed in Sect. 2.2.4.

  7.

    Complementary information on Voice over IP (VoIP), in particular with regard to the time-varying behavior of packet-based networks and the resulting effects, can be found in Raake (2006).

  8.

    Speech distortions that originate from early or late room reflections at the send side are not compensated. In Brüggen (2001), however, methods are proposed for “sound decolorization” applicable in HFT-telephony.

  9.

    Amplitudes of speech are not uniformly distributed. Instead, Laplace or Gamma density distributions can be observed, indicating that low energy levels are most frequent in speech and should thus be quantized with a higher resolution than higher energy levels.

  10.

    Note that the bandwidth for the highest bitrate mode is approximately 50–6600 Hz; the upper cutoff frequency decreases slightly with decreasing bitrate (Raake 2006, p. 59).

  11.

    The assignment of RTP to the OSI framework is ambiguous. RTP is related to the transport layer, the session layer, as well as the presentation layer (Perkins 2003, p. 57).

  12.

    With preceding upsampling from \(f_{\mathrm{s}}=8\) kHz to 16 kHz.

  13.

    Although this cannot ultimately be guaranteed, as new perceptual factors might emerge or perceptual factors that are important today might become irrelevant in the future.

  14.

    In general, a distinction is made between monaural and binaural listening. In monaural listening, it is assumed that only one ear is engaged, typically when a handset terminal is used in a more or less quiet environment. In binaural listening, with headphones for example, a further distinction is made between monotic, diotic, and dichotic listening.

  15.

    All events that are of multidimensional nature can be represented by points in a multidimensional space. For a description of these points, vector notation is used from now on. In particular, multidimensional events are represented by position vectors, indicated by boldface variables. By such vectors, points in a space are represented in relation to the origin of this space. Thus, a vector’s elements are equal to the coordinates of a point in space representing the event.

  16.

    Experiments for investigating speech perception are regarded as a special case in psycho-acoustics (Jekosch 2005b, p. 60).

  17.

    Although this schematic is employed here for the auditory modality (i.e., for psycho-acoustic measurements), it can analogously be used for other psycho-physical measurements analyzing visual, tactile, olfactory, or gustatory perception. Instead of the terms sound event and auditory event, one can more generally refer to physical event and perceptual event, respectively (Möller 2010, pp. 23–25).

  18.

    “By reflection, the real experience of perception is interpreted and thus given intellectual properties” (Jekosch 2005b, p. 58).

  19.

    “Psycho-physics measures the relationship between physical phenomena and phenomena of perception” (Jekosch 2005b, p. 61).

  20.

    Note that quality is said to be of multidimensional nature in everyday language. However, according to the definitions provided here, integral quality is a one-dimensional (scalar) value, whereas both the perceptual event and the internal reference are of multidimensional nature.

  21.

    Acceptability is typically measured as the ratio between the number of potential users and the number of actual users of a service, cf. Möller (2000, p. 14).

  22.

    Efficiency is defined as “the resources expended in relation to the accuracy and completeness with which users achieve specified goals” (ETSI Guide EG 201 013 1997).

  23.

    Note that the term “features” is here meant to include “quality” as defined in Sect. 2.3.4 as well.

  24.

    Alternative, indirect methods for measuring similarity are given, for example, in Tsogo et al. (2000).

  25.

    If the personal and external modifying factors can be assumed to be fixed, for example for a particular auditory experiment, the expectation \(\varvec{r}_0\) can be assumed to be fixed as well. For this specific setting, quality can be regarded as absolute.

  26.

    In ITU-T Rec. P.800.1 (2006), a terminology is presented in order to avoid ambiguities between \({ MOS}\) values obtained in different types of tests, as well as from different instrumental models. For simplicity, this terminology is not used here, since only listening-only tests are considered in this book.

  27.

    The term “parametric” here is used as an indicator that the 10 scales are parameters of quality, “each [measuring] one aspect [...] of composite acceptability”. Thus, these parametric scales, except for the two scales directly related to quality, are here understood in a very similar way as the quality features, see Sect. 2.3.4.

  28.

    Note that the proximity data itself can be obtained indirectly by subjective tests, see Tsogo et al. (2000), for example.

  29.

    It is a part of this book to develop an analytic test method for non-expert listeners (see Sect. 2.7 and Chap. 4).

  30.

    A systematic investigation of the “psychological role” of different user interfaces for speech transmission quality assessment as a function of the NB and WB channel bandwidth is presented in Raake (2006, pp. 192–197).

  31.

    In the past, test results from different experiments were recommended to be transformed according to the “equivalent \(Q\)-method”, using MNRU stimuli as reference conditions and a normalization procedure based on a fixed relation between \({ MOS}\) and \(Q\) of these references, see Möller (2000, pp. 123–129) for details. This method is not recommended today due to the perceptual inappropriateness of MNRU distortions compared to the distortions introduced by low-bitrate codecs, see discussion above.

  32.

    General properties of the logistic as well as the log-logistic curves are described in Appendix A and, e.g., in Allnatt (1983, pp. 6–8).

  33.

    Note that the variable \(t\) in Eq. (2.5) corresponds to the scaled quality \(b\) in the context of the present work. In the remainder of this section, the nomenclature used in Allnatt (1975) and Allnatt (1983) is used.

  34.

    Note that Stevens and Galanter (1957) compared several data sets obtained both by ratio scaling and by category scaling. For the “prothetic” type of concepts (i.e., stimuli that change in perceptual intensity such as loudness), they showed that the relationship between the two scales is non-linear, and sometimes the category ratings are linearly related to the logarithm of the values obtained on the ratio scales, that is, apparent magnitude. As Allnatt (1975) argues, however, it is likely that such a relationship only applies over a limited range since conceptually there is no definite limit to magnitude, both with regard to physics and perception.

  35.

    The term “external” means that the analysis of preference takes place in relation to a given set of dimensions determined a priori. In contrast, internal preference mapping is entirely based on a set of preference data (see, e.g., Mattila 2001 for an application example).

  36.

    Allnatt distinguishes between \(I\) and \(J\), depending on whether the variables represent subjective values gained from a discrete category scale, or from a continuous scale.

  37.

    In the following, it is assumed that the expectation \(\varvec{r}_0\) is known. In a further generalization of the following equation, it can in principle be replaced by an estimate \(\widehat{\varvec{r}}_0\) of the expectation.

  38.

    Similar to Sect. 2.5, the perceptual event \(\varvec{w}_0\) is not directly available for practical modeling. Thus, the context-effects-reduced version \(\varvec{\beta }^{\prime }\) is used instead.

  39.

    It is noteworthy that signal-based models usually transform the input signal(s) to an internal representation, which itself can result in parameters. Thus, the term “parametric” model, though being established, is not very precise.

  40.

    In full-reference models, this transformation usually includes signal preprocessing such as level- and time-alignment, a perceptual transformation modeling part of the peripheral human auditory system, and a comparison unit.

  41.

    Note that although \(R\) is defined for the range \(R \in [0; R_{o, max} + A]\) in Eq. (2.19), the zero is arbitrary and not absolute in the sense of a ratio scale, cf. Sect. 2.4.3.1.

  42.

    Although for bandwidth distortions, the scale is practically bounded according to human hearing capabilities.

  43.

    If all parameters are set to their default values and all impairment factors are taken into account, an \(R\)-value of \(R=93.2\) is obtained, corresponding to a standard ISDN connection (ITU-T Rec. G.107 2011).
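
To give a feeling for what the default rating in the last note means on the more familiar MOS scale, the following is a minimal sketch of the conversion from \(R\) to an estimated MOS. It assumes the mapping given in Annex B of ITU-T Rec. G.107, which is not reproduced in these notes; the function name `r_to_mos` is chosen here purely for illustration.

```python
def r_to_mos(r: float) -> float:
    """Map an E-model rating R to an estimated MOS.

    Assumption: the piecewise cubic mapping of ITU-T Rec. G.107, Annex B.
    """
    if r <= 0.0:
        return 1.0  # lower bound of the MOS scale
    if r >= 100.0:
        return 4.5  # upper bound reached by the mapping
    # Cubic mapping for 0 < R < 100
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


if __name__ == "__main__":
    # Default E-model setting (R = 93.2), see the last note above
    print(round(r_to_mos(93.2), 2))  # prints 4.41
```

Under this assumed mapping, the default rating of \(R=93.2\) corresponds to an estimated MOS of roughly 4.4, that is, close to but below the upper end of the five-point scale.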

Author information

Corresponding author

Correspondence to Marcel Wältermann.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Wältermann, M. (2013). A Dimension-Based Approach to Mouth-to-Ear Speech Transmission Quality. In: Dimension-based Quality Modeling of Transmitted Speech. T-Labs Series in Telecommunication Services. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35019-1_2

  • DOI: https://doi.org/10.1007/978-3-642-35019-1_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35018-4

  • Online ISBN: 978-3-642-35019-1

  • eBook Packages: Engineering, Engineering (R0)
