Experimental setup and procedure
It is likely that the average person participating in an online study neither sits in an acoustically optimized room nor uses high-end loudspeakers or headphones. Thus, it was decided that HALT should perform in ordinary non-optimized listening environments and with sound devices of diverse characteristics. For that reason, the laboratory experiment took place in a non-optimized laboratory room of the Hanover Music Lab (HML; see S1 in the Supplemental Materials for room acoustical measurements) with a variety of low-, average-, and high-quality transducers. In general, we believe that it is difficult to assign a quality level to a device, as this involves weighting playback device characteristics whose importance depends on the purpose. The quality level assigned in this study is therefore only a subjective classification. A precise assignment is unnecessary, since we do not aim to develop a quality index for playback devices. To keep the length of the test procedure manageable, only four devices were used:
- Beyerdynamic DT 770 Pro 250 Ohm, closed circumaural, high-quality headphones.
- No-name earbuds, open, intra-aural, low-quality headphones.
- A pair of Yamaha HS8M loudspeakers (near field monitor) of average quality.
- Apple MacBook Pro, 13″ (Retina, early 2015) low-quality loudspeakers/laptop.
The opening angle of the laptop was 110°. The measurement device used was a head and torso simulator (HATS, GRAS 45BC-11 KEMAR). The loudspeakers and the HATS created an isosceles triangle with a long edge length of 1.11 m. All devices and furniture positions were marked with colored tape on the carpet floor to guarantee reliable reconstruction of the setup (see S2 in the Supplemental Materials).
Data collection in the laboratory was based on the browser-based survey platform SoSci Survey (www.soscisurvey.de; Leiner, 2020). A complete retest using all four devices was conducted. After giving demographic information, participants started with the average-quality loudspeaker condition (Yamaha HS8M), followed by the low-quality loudspeaker/laptop (Apple MacBook Pro), high-quality headphones (Beyerdynamic DT 770 Pro), and low-quality no-name headphones (see S3 in the Supplemental Materials for the procedure). During the experiment, the experimenter and the participant were located in separate rooms. Digital levels were monitored and recorded using a second screen in the experimenter’s room (split screen extension of the participant's computer). The digital amplification values for the loudspeaker and headphone playback were provided by the RME Totalmix FX software (version 1.65; Audio AG, 2020). In the laptop condition, Apple’s Audio MIDI Setup application was used to display the playback amplification. Each listening session lasted approximately 90 minutes, including instructions, pauses, and retests.
Stimuli and task development
We developed stimuli and associated tasks to control the basic level adjustment (A.1), to check for level invariances and unwanted level manipulations (A.2), to check for mono and stereo (A.3), and to estimate the lower frequency limits of playback devices (A.4). As the main principle for stimulus construction, a counting paradigm was used to set up a comprehensive test procedure. All stimuli were created on an Apple MacBook Pro, 13″ (mid-2012) using Logic Pro X. In general, researcher-developed stimuli were limited to −1 dBFS (decibels relative to full scale, true peak) to avoid clipping through the Gibbs phenomenon (Oppenheim & Schafer, 2014). For each condition and counting task, a separate stimulus was created to avoid the influence of memory effects on responses. To prevent forward and backward masking, a gap of around 200 ms between auditory events within stimuli was included (Plack, 2010). Most of the stimuli use noise as a main component. Pink noise (20 Hz to 20,000 Hz) was used, as its power spectral density is similar to that of music. Additionally, the signal covered a wide frequency range. As a result, a wide transmission range of the playback devices and also local peaks were made audible. Responses were collected via the SoSci Survey (www.soscisurvey.de) browser interface (Leiner, 2020). To respond to the counting tasks, subjects had to enter numerical values on the website. All tasks with the associated stimuli can be tested in a demo version of the HALT (http://testing.musikpsychologie.de/HALT_demo_no_screening/). The program code (R package) is freely available on GitHub (https://github.com/KilianSander/HALT).
(A.1.) Item development for basic level adjustment
Three stimulus classes/types (M = music, N = noise, L = loop) were used to develop test items (stimuli and task) for adjusting the volume. Stimulus M was an excerpt of 30 s from the song “Menschen Leben Tanzen Welt” (Jim Pandzko, 2017). This song is quite characteristic of pop music production, including low-frequency enhancement and strong amplitude compression (long-term LUFS [loudness units relative to full scale] = −8.4, range LU [loudness units] = 6, Level = −0.2 dBFS true peak). The task was to listen to the excerpt and set the volume to a personally comfortable level, which the participant would prefer in an online study.
The second stimulus (N) consisted of 12 low-level pink-noise segments at −46 dBFS true peak. The participants were instructed to adjust the volume in such a way that the noise segments could be barely heard but were still perceivable. This stimulus was used to set the baseline for the subsequent loop method (explained in the next section). The general idea of this task was that the participants would set the level just above the background noise in the room. As we are not aware of any studies on the topic, the level of −46 dBFS true peak was chosen arbitrarily. We were aiming for a final playback level of around 85 dBSPL (sound pressure level, A-weighted) including the 1 dB gain reduction to avoid clipping through the Gibbs phenomenon. The A-weighting approximates the frequency-dependent sensitivity of human hearing, while Z-weighting represents a flat frequency response.
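As a rough illustration of the digital levels quoted above, the following sketch converts dBFS values to linear peak amplitude using the standard relation amplitude = 10^(dB/20), with digital full scale taken as 1.0. The function names are hypothetical and are not taken from the HALT code.

```python
import math


def dbfs_to_amplitude(level_dbfs: float) -> float:
    """Convert a peak level in dBFS to linear amplitude (full scale = 1.0)."""
    return 10 ** (level_dbfs / 20)


def amplitude_to_dbfs(amplitude: float) -> float:
    """Convert a positive linear peak amplitude back to dBFS."""
    return 20 * math.log10(amplitude)


# Levels mentioned in the text (hypothetical variable names).
headroom_peak = dbfs_to_amplitude(-1)    # peak limit used for all stimuli
low_noise_peak = dbfs_to_amplitude(-46)  # low-level pink-noise segments
print(f"-1 dBFS  -> {headroom_peak:.4f}")
print(f"-46 dBFS -> {low_noise_peak:.5f}")
```

Under these assumptions, the −46 dBFS segments have a peak amplitude of roughly 0.5% of full scale, which illustrates how far below the −1 dBFS limit the low-level noise sits.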
Stimulus L comprised low-level and high-level pink noise segments. Low-level noise segments were presented at irregular time intervals and always had the same level of −46 dBFS true peak. High-level segments were regularly presented at a level of −1 dBFS true peak to keep participants from increasing the volume. A loop stimulus always contained a true/correct number of noise segments (low-level and high-level). The task was to count all the heard segments. In this way, we created an objective decision criterion (true number/correct number of segments reported, too many/too few reported) to assume correct and incorrect sound level adjustments. When participants tried to solve the listening task by increasing the volume, the unpleasant loud noise events had a deterring function. If participants reported too many events, they had to repeat the task and were prompted to listen more carefully. If a participant reported too few counts, it was assumed that the volume was set too low, which meant that the task could not be solved correctly. Accordingly, participants were prompted to increase the volume by the smallest possible value and to repeat the task. If the true number of noise segments was reported, the participant progressed to the next task. Through the direct response in the form of prompts, a feedback loop was created that allowed control of sound level adjustments.
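The decision rule of the loop task described above can be sketched as a small function. This is a minimal illustration of the three outcomes named in the text; the function name and the returned labels are hypothetical and do not reflect the actual HALT implementation.

```python
def loop_feedback(reported: int, true_count: int) -> str:
    """Map a reported segment count to the next step of the loop task.

    The returned labels are illustrative placeholders for the prompts
    described in the text.
    """
    if reported > true_count:
        # Too many events reported: repeat and listen more carefully.
        return "repeat_listen_carefully"
    if reported < true_count:
        # Too few events reported: volume assumed too low.
        return "increase_volume_and_repeat"
    # Correct count: the participant progresses to the next task.
    return "proceed"


# Example with a hypothetical loop stimulus containing 7 segments.
for answer in (5, 7, 9):
    print(answer, "->", loop_feedback(answer, true_count=7))
```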
After each of the three types of stimuli (M, N, L), participants were asked to rate the perceived loudness of a pop song (Jim Pandzko, 2017) on a three-point rating scale (too soft, comfortable, too loud). In addition, in all four playback conditions, the digital amplification values set by the participants were documented. In a later stage, the adjusted sound levels in all three adjustment-method conditions can be compared.
(A.2.) Item development for determining participants’ adjustment accuracy/manipulation check
The loop method (consisting of the loop stimulus and the loop task) described in the previous section guaranteed only a minimum volume. However, after successful completion of the loop method (true number of noise segments was reported), it was still unknown how loudly or how accurately the volume had been adjusted above the minimum volume. To build a method to assess the adjustment accuracy in internet experiments, we used a stimulus comprising pink noise events at different levels (−52/−46/−40 dBFS true peak). Participants had to count all noise events they perceived. Since the participants previously went through the loop method (that ensured audibility of noise segments at −46 dBFS true peak), we assumed that all participants would hear the noise events at −46 dBFS true peak and louder (−40 dBFS true peak).
As all events in the stimulus were present in a different quantity, conclusions could be drawn from the response behavior as to which levels could not be heard by the participants.
The following series of events serves as an example: 3 × −52 dBFS true peak, 4 × −46 dBFS true peak, and 2 × −40 dBFS true peak (nine noise events in total). There are two ways to use the information obtained from this task. One is to check the accuracy of the set volume. For this purpose, the task has to be presented directly after completing the loop method for sound level adjustment. If the participants identify nine events, every noise segment can probably be heard, meaning the volume is set too loud. If six events are counted, the −52 dBFS segments probably cannot be heard, resulting in the setting being called “accurate.” If only two events are counted, the volume presumably is set too low, although the loop method was completed. The participants may have correctly solved the loop task by chance. We apply moderate criteria (±1 count), classifying everyone in our example who counts five, six, or seven events as accurate. Participants who count more than seven are classified as “too loud” and those who count fewer than five as “too soft.” As the number of noise events in each condition and for each level is different, the classification criteria are applied to different thresholds in each condition.
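The classification rule from this example can be written down compactly. The sketch below assumes that the target count is the number of events at −46 and −40 dBFS true peak (six in the example) and applies the ±1 tolerance; the function and parameter names are hypothetical.

```python
def classify_adjustment(reported: int, target_count: int, tolerance: int = 1) -> str:
    """Classify a counting response as 'too soft', 'accurate', or 'too loud'.

    target_count is the number of events at -46 and -40 dBFS true peak,
    i.e. the events expected to be audible after the loop method.
    """
    if reported > target_count + tolerance:
        return "too loud"
    if reported < target_count - tolerance:
        return "too soft"
    return "accurate"


# Example from the text: 3 x -52, 4 x -46, 2 x -40 dBFS -> target = 4 + 2 = 6.
target = 4 + 2
for count in (2, 5, 6, 7, 9):
    print(count, "->", classify_adjustment(count, target))
```

Because the number of events per level differs between conditions, `target_count` would need to be set separately for each stimulus.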
Another way of using the counting responses is to detect possible unwanted level manipulations. The task is to be presented repeatedly at a later time. The first accuracy measurement serves as a baseline to help determine if the level settings have been manipulated. The second measurement is then used to identify whether the test taker is classified in the same group again (too soft, accurate, or too loud). If so, it can be assumed that the volume remained unchanged. In case of a volume change, the direction of a possible group-change indicates whether the volume was reduced (from “too loud” to “accurate,” from “accurate” to “too soft,” or from “too loud” to “too soft”) or increased (from “too soft” to “accurate,” from “accurate” to “too loud,” or from “too soft” to “too loud”). However, in our laboratory study, participants were instructed not to change the volume during the survey. The experimenter regularly checked for compliance. To check whether HALT could detect volume changes, we simulated two volume manipulations. We refer to the original stimulus set as condition 0 dB. In Duplicate A of the set, the overall level of the stimulus set was increased by 3 dB (+3 dB condition). In Duplicate B, the level was decreased by 3 dB (−3 dB condition). For all playback device conditions and level manipulations (−3 dB, 0 dB, +3 dB), there was one trial each.
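The group-change logic for detecting level manipulations can be illustrated as follows. This is a sketch under the assumption that the three groups are ordered from “too soft” to “too loud”; the names are hypothetical, and an unchanged group only means that no change was detectable at the resolution of the task.

```python
# Ordered ranks for the three accuracy groups (assumed ordering).
GROUP_RANK = {"too soft": 0, "accurate": 1, "too loud": 2}


def volume_change(baseline_group: str, later_group: str) -> str:
    """Infer the direction of a volume change from two classifications."""
    difference = GROUP_RANK[later_group] - GROUP_RANK[baseline_group]
    if difference > 0:
        return "increased"
    if difference < 0:
        return "reduced"
    return "unchanged"


print(volume_change("too loud", "accurate"))  # reduced
print(volume_change("too soft", "too loud"))  # increased
print(volume_change("accurate", "accurate"))  # unchanged
```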
(A.3.) Item development to check for mono/stereo playback settings
The stimulus consisted of pink noise events (−1 dBFS true peak) that alternated irregularly between the left and right channel. The noise never sounded on both channels at the same time. Between the two stereo channels, the number of events always differed for every playback device condition to avoid memory effects. The task was to count all audible noise segments on the right channel only. There was one trial for each playback device condition. In the case of mono playback, all noise events would have been audible on the right channel. As a result, a participant would have reported the total number of all noise segments. In the presence of interchanged channels or difficulties with right-left discrimination, we expected the participant to report the number of noise events from the left channel. If the number entered was equal to the number of noise events on the left channel, it was assumed that the channels were swapped. To control for difficulties with right-left discrimination, we created a visual task in which participants had to indicate the position of a circle relative to a triangle.
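The interpretation of the right-channel count can be sketched as a simple lookup. The category labels and the function name are hypothetical; the rule itself follows the three cases described above and relies on the left and right counts being different.

```python
def classify_channel_check(reported: int, n_left: int, n_right: int) -> str:
    """Interpret the number of events reported for the right channel."""
    if n_left == n_right:
        raise ValueError("left and right event counts must differ")
    if reported == n_right:
        return "stereo_ok"
    if reported == n_left + n_right:
        # All events were heard on the right channel: mono playback.
        return "mono"
    if reported == n_left:
        # Swapped channels or difficulties with right-left discrimination.
        return "swapped_or_left_right_confusion"
    return "unclear"


# Hypothetical stimulus with 3 events on the left and 5 on the right.
for answer in (5, 8, 3, 4):
    print(answer, "->", classify_channel_check(answer, n_left=3, n_right=5))
```

The visual circle/triangle task mentioned above would then be needed to separate swapped channels from left-right confusion, since both produce the same count.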
(A.4.) Item development to estimate the lower-frequency limits of playback devices
The stimuli consisted of randomly presented pure tones (−1 dBFS true peak) located between regularly presented sections of loud pink noise (−1 dBFS true peak). Again, the loud noise was added to prevent the subjects from increasing the volume to solve the task. The task gives an estimate of what the sound transducer can reproduce in a best-case scenario, when the capabilities of the playback devices are pushed to the limit. We, therefore, chose a high level for the pure tones (−1 dBFS true peak). In order to keep the workload low, we selected four frequencies (20, 60, 100, and 140 Hz) and tested their audibility in subtasks. Participants were asked to indicate the total number of pure tone events that they had heard. There was only one trial for each frequency. We assumed that the pure tones could only be heard if the playback device was capable of reproducing the respective frequency adequately. For interpretation, the entire reproduction chain and the perception of the participants have to be taken into account. As a control procedure, the lower-frequency limits of every transducer determined by the HATS measurements (see next section) were compared with those determined by HALT.
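One possible way to turn the four counting responses into a single estimate is sketched below. The scoring rule is an assumption made for illustration only: a frequency is treated as audible when the reported number of tones equals the true number, and the lowest such frequency is taken as the estimated lower limit. The text does not specify the rule actually used, and all names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def lowest_audible_frequency(
    responses: Dict[int, Tuple[int, int]]
) -> Optional[int]:
    """Return the lowest frequency (Hz) whose tones were counted correctly.

    responses maps frequency in Hz to (reported_count, true_count).
    Returns None when no frequency was counted correctly.
    """
    audible = [
        frequency
        for frequency, (reported, true_count) in responses.items()
        if reported == true_count  # assumed audibility criterion
    ]
    return min(audible) if audible else None


# Hypothetical responses for the four test frequencies.
example = {20: (0, 4), 60: (1, 3), 100: (5, 5), 140: (2, 2)}
print(lowest_audible_frequency(example))
```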
Electroacoustical analysis of playback devices used in the laboratory study
To assess the relationships between the results of the perceptual tasks and the electroacoustic properties of the reproduction setups, we measured total harmonic distortion (B.1), frequency responses and limits (B.2), linearity (B.3) and the stimulus level (B.4) for each playback device. We used a GRAS 45BC-11 KEMAR Head and Torso Simulator (HATS) with anthropometric pinnae and low-noise ear simulator in combination with an Audio Precision APx525 measurement system. The analysis and evaluation were conducted with a routine scripted with MATLAB. The electroacoustic parameters were selected according to the International Electrotechnical Commission (IEC) standard IEC 60268-5 (2003) for loudspeaker and the standard IEC 60268-7 (2010) for headphone measurement, and could be derived from logarithmic sweep measurements (Farina, 2000). To investigate the devices’ behavior under conditions comparable to the experimental conditions, we chose an open loop measurement approach (Begin, 2020) which could be interpreted as a sequential dual-channel fast Fourier transform (FFT) method (Müller & Massarani, 2001). Specifically, test signals were created as audio files, which were then transferred and played back with the actual reproduction setup (i.e., MacBook Pro ➔ RME Babyface ➔ loudspeaker/headphones; see S4 in the Supplemental Materials for details of the signal chain).
(B.1.) Analysis of total harmonic distortion
An analysis of harmonic distortion as a function of digital amplification gain (in dB) was carried out. Based on this method, the optimal voltage for driving the individual transducers with an acceptable influence of artifacts could be determined. In the case of loudspeaker reproduction, stereo presentation was assumed; for example, crosstalk from the right loudspeaker to the left ear was taken into account, and resulting total harmonic distortions (THD) were summed up accordingly. For better clarity, the THD is depicted in Table 1 as the average for left and right ears in % for specific frequencies.
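For orientation, one common textbook definition of THD relates the root-sum-square of the harmonic amplitudes to the amplitude of the fundamental. The sketch below uses that definition and then averages left and right ears; it is not the sweep-based routine used in the study, the source does not state which THD variant was applied, and all names and values are hypothetical.

```python
import math
from typing import Sequence


def thd_percent(fundamental_rms: float, harmonic_rms: Sequence[float]) -> float:
    """THD in percent, relative to the fundamental (assumed definition)."""
    harmonic_sum = math.sqrt(sum(h ** 2 for h in harmonic_rms))
    return 100 * harmonic_sum / fundamental_rms


def mean_thd_percent(thd_left: float, thd_right: float) -> float:
    """Average of the left-ear and right-ear THD values."""
    return (thd_left + thd_right) / 2


# Hypothetical RMS values for one frequency and one gain setting.
left = thd_percent(1.0, [0.03, 0.04])
right = thd_percent(1.0, [0.06, 0.08])
print(f"left {left:.1f} %, right {right:.1f} %, "
      f"mean {mean_thd_percent(left, right):.1f} %")
```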
Table 1 Total harmonic distortion in % for selected frequencies and amplification gain settings of the loudspeaker pair Yamaha HS8M
Table 1 shows that the minimum mean THD of the Yamaha HS8M for the selected frequencies occurred for a digital playback level of −6 dBFS. For lower levels (−12...−40 dB) the THD values increase again, as the noise at multiples of the fundamental frequency is misinterpreted as harmonic distortion. This behavior is due to the THD calculation algorithm, which is based on short-time Fourier transform (Farina, 2000). The best-case digital gain settings concerning the resulting THD for all reproduction devices are shown in Table 2.
Table 2 Aggregated best-case THD in % for the reproduction devices depending on the digital gain settings and resulting SPL at 1 kHz
The loudspeaker gave higher THD values than both headphone types, as expected given the relatively small membrane surface and displacement (Klippel, 2006). The excessive mean THD for the laptop (MacBook Pro) occurred mainly for low to mid-frequencies (0.1–1 kHz). However, as shown in the analysis of the frequency response (Fig. 2), the sound pressure generated over this frequency range was very low.
(B.2.) Analysis of frequency response and limits
The analysis of the frequency response in the following section is based on the magnitude spectra of the transfer functions of the individual devices. These can be found in Fig. 1 for the headphones and Fig. 2 for the loudspeakers. Both figures show the transfer functions for the left and right ears separately as third-octave smoothed magnitude responses. Figure 1 shows the range of five reseating measurements (taking off and putting on the headphones to account for positioning effects) of the headphones as shaded areas. The respective response curves denote the complex mean. The bold horizontal lines indicate the logarithmically sampled median magnitudes of the transfer functions in the range between 20 Hz and 20 kHz while frequency limits at which the magnitude fell below the median by 3, 6, and possibly even 20 dB are marked with vertical arrows. As stereo reproduction was used in all cases, the loudspeaker responses in Fig. 2 include crosstalk contributions (e.g., from the left speaker to right ear).
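The idea of reading a lower frequency limit from a magnitude response relative to its median can be sketched as follows. This is a simplified illustration, assuming logarithmically spaced frequency points and taking the limit as the first frequency, scanning upward, at which the magnitude is no more than a given number of dB below the median. The data are invented and the exact procedure of the MATLAB routine may differ.

```python
import statistics
from typing import Optional, Sequence


def lower_frequency_limit(
    freqs_hz: Sequence[float], mags_db: Sequence[float], drop_db: float
) -> Optional[float]:
    """Lowest frequency at which the magnitude reaches median - drop_db.

    freqs_hz must be sorted in ascending order (log-spaced points assumed).
    """
    threshold = statistics.median(mags_db) - drop_db
    for frequency, magnitude in zip(freqs_hz, mags_db):
        if magnitude >= threshold:
            return frequency
    return None


# Hypothetical octave-spaced response of a small transducer (dB).
freqs = [20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480]
mags = [-30, -18, -8, -2, 0, 0, 1, 0, -1, -2, -6]
for drop in (3, 6, 20):
    print(f"-{drop} dB limit:", lower_frequency_limit(freqs, mags, drop), "Hz")
```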
(B.3.) Analysis of linearity
The frequency responses were obtained for input voltages giving the best THD. However, investigations of other input voltages showed areas of nonlinear behavior of the respective devices, namely, areas in which changes in input voltage did not lead to the same changes in the acoustic output. This behavior can be explained by the loss of force when the voice coil leaves the magnet gap at high displacements (Klippel, 2006). In addition, it was expected that electronic consumer devices, such as the audio output of the laptop (MacBook Pro), contain integrated nonlinear dynamic processing such as compressors, expanders, and limiters to subjectively enhance the output of the low-quality built-in loudspeaker. Because there is no total control over the equipment used in online listening test scenarios, the influence of nonlinear behavior should be considered. Figure 3 shows the linearity of the devices under test.
The plots show the deviations of the magnitude responses for various gain settings relative to the response with the best THD. The curves are separated for better visibility. The dashed gray lines denote the respective 0 dB line for each gain. Perfect linearity would result in straight horizontal curves. Frequency areas with deviations beyond ±1 dB are marked with dashed curves. The average- to high-quality devices on the left side (panels A and C) showed only small deviations in magnitude response for most gain settings. It could be expected that the timbre of the reproduction would not vary with increasing or decreasing gain and level. In contrast, the low-quality devices (panels B and D) showed highly varying magnitude responses across different gain settings. In particular, the laptop (MacBook Pro, purple curves in the bottom right subfigure) showed large deviations throughout the investigated frequency and gain range. The measurements revealed that nonlinearities—in this case a mismatch between amplification gain and acoustic level—varied with both frequency and output voltage. This leads to the conclusion that the timbre of the reproduced audio stimuli might vary with gain setting or level. However, this observation is quantified according to physical acoustics while the perception of reproduced stimuli may lead to smaller and/or other deviations.
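The notion of linearity used here can be made concrete with a short sketch. It assumes that the deviation at each frequency is the measured level change relative to the reference response minus the change in digital gain, so that a perfectly linear device yields 0 dB everywhere. The function names and data are hypothetical, and the study's own computation may differ in detail.

```python
from typing import List, Sequence


def linearity_deviation(
    ref_mags_db: Sequence[float],
    ref_gain_db: float,
    mags_db: Sequence[float],
    gain_db: float,
) -> List[float]:
    """Deviation (dB) of a response from the gain-shifted reference response."""
    expected_shift = gain_db - ref_gain_db
    return [m - r - expected_shift for m, r in zip(mags_db, ref_mags_db)]


def nonlinear_frequencies(
    freqs_hz: Sequence[float], deviation_db: Sequence[float], limit_db: float = 1.0
) -> List[float]:
    """Frequencies whose deviation exceeds +/- limit_db."""
    return [f for f, d in zip(freqs_hz, deviation_db) if abs(d) > limit_db]


# Hypothetical data: reference at -6 dB gain, comparison at -12 dB gain.
freqs = [100, 1000, 5000]
reference = [60.0, 70.0, 72.0]
measured = [54.0, 62.5, 66.0]
deviation = linearity_deviation(reference, -6.0, measured, -12.0)
print(deviation)
print(nonlinear_frequencies(freqs, deviation))
```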
(B.4.) Analysis of stimulus levels
The previous electroacoustic analysis dealt with the individual reproduction systems independent of the specific stimuli. Subsequent investigations were related to the actual musical stimulus (stimulus M) that was used for the level adjustment process (see stimulus M in the section Item development for basic level adjustment [A.1]). Keeping the previously analyzed acoustical properties in mind—namely magnitude spectra, harmonic distortion, and linearity—it was possible to analyze the resulting overall sound pressure level.
Figure 4 shows the equivalent A- and Z-weighted sound pressure level LAeq and LZeq, respectively, as a function of the gain setting for the music stimulus. These values were based on the convolution of the individual impulse responses with the raw stimulus including crosstalk when appropriate. The differences between A- and Z-weighted levels of the individual devices mainly indicate low-frequency loss. Small differences between A- and Z-weighted levels indicate that only a small amount of low-frequency energy was reproduced, caused either by the stimulus itself or by the capabilities of the device. In case of the laptop, the A-weighted level was higher than the Z-weighted level, indicating dominant spectral energy between 1 and 6 kHz.
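To illustrate why the gap between A- and Z-weighted levels reflects low-frequency content, the sketch below evaluates the standard analytic A-weighting curve (as commonly given for IEC 61672, normalized to about 0 dB at 1 kHz). Z-weighting corresponds to 0 dB at all frequencies. The function name is hypothetical and the code is not part of the study's analysis routine.

```python
import math


def a_weighting_db(frequency_hz: float) -> float:
    """A-weighting gain in dB at a given frequency (analytic approximation)."""
    f2 = frequency_hz ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    # The +2.00 dB offset normalizes the curve to roughly 0 dB at 1 kHz.
    return 20 * math.log10(numerator / denominator) + 2.00


for f in (50, 100, 1000, 3000):
    print(f"{f:>5} Hz: {a_weighting_db(f):6.1f} dB")
```

Low frequencies are strongly attenuated by the A-weighting, so a signal with substantial bass energy shows a Z-weighted level well above its A-weighted level, whereas the slight boost between roughly 1 and 6 kHz can raise the A-weighted level above the Z-weighted one for signals dominated by that range.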
Nonlinear effects occurred at high amplification gains, especially for the laptop for gains from −12 dB to 0 dB and less so for the low-quality headphones for gains from −6 dB to 0 dB. To determine the sound level as a function of gain adjusted by participants, we searched for the mathematical relation between dBFS values and dBSPL measurements. The HATS measurements for the right and left ears were averaged for each dBFS value. We aimed for the simplest regression equation that fitted the data with a coefficient of determination of R² ≥ .99. For the loudspeaker and headphones, linear equations were sufficient. For the laptop, quadratic equations were used. We adjusted the R² to take the complexity of the equation into account. As there were four transducer conditions and two types of level (A- and Z-weighted), eight equations were used to estimate the sound levels set by the participants (see S5 in the Supplemental Materials for more details).
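The linear case of this gain-to-level mapping can be sketched with an ordinary least-squares fit and the usual adjusted R² formula, 1 − (1 − R²)(n − 1)/(n − p − 1), where p is the number of predictors. The data points below are invented for illustration, the function names are hypothetical, and the quadratic fit used for the laptop is not covered.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Coefficient of determination."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2: float, n: int, n_predictors: int) -> float:
    """R squared penalized for the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical measurements: digital gain (dBFS) vs. ear-averaged level (dB SPL).
gain_dbfs = [-40, -30, -20, -10]
level_spl = [45.2, 54.9, 65.1, 74.8]
slope, intercept = fit_line(gain_dbfs, level_spl)
predicted = [slope * g + intercept for g in gain_dbfs]
r2 = r_squared(level_spl, predicted)
print(f"slope={slope:.3f}, intercept={intercept:.2f}, "
      f"R2={r2:.5f}, adj. R2={adjusted_r_squared(r2, len(gain_dbfs), 1):.5f}")
```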
Participants
The study was conducted in June and July 2020. Participants were recruited through university mailing lists, advertising posters with a QR code, and social media posts. A total of 40 participants (mean age = 31.8 years, SD = 13.5, n = 15 male) took part in the study and gave written informed consent. Thirty-five participants reported normal hearing, whereas five participants reported hearing loss (e.g., tinnitus, perception of noise). Each participant was paid €15 as reimbursement for participation. The study was performed in accordance with relevant institutional and national guidelines and regulations (Deutsche Gesellschaft für Psychologie, 2016; Hanover University of Music, Drama and Media, 2017) and with the principles outlined in the Declaration of Helsinki. Formal approval of the study by the ethics committee of the Hanover University of Music, Drama and Media was not mandatory, as the study adhered to all required regulations.