Introduction

Scope of the study

A high degree of control over the playback situation is important in conducting experiments on auditory perception. The greatest degree of control can be achieved in laboratory experiments. However, if a large sample size is needed, an Internet experiment is usually the method of choice. In this situation, a high number of participants and a high level of control seem mutually exclusive. With the Headphone and Loudspeaker Test (HALT) Part I and Part II, we wanted to provide a tool to counter this predicament by remotely testing playback device characteristics in Internet experiments. In the previous study, HALT Part I (Wycisk et al., 2022), we suggested a procedure to standardize level adjustments, detect stereo/mono playback, and assess the lower frequency limits of playback devices. In HALT Part II, we focus on the identification of playback device types and suggest a comprehensive concept for distinguishing between headphones and loudspeakers. HALT Part I and Part II together form the complete HALT procedure, which can help improve the quality of Internet experiments on auditory perception.

In general, there can be various reasons to control for the type of sound reproduction device. First, playback device types, such as headphones or loudspeakers, can have an impact on how participants perceive stimuli. For example, Zelechowska et al. (2020) investigated the effects of headphone and loudspeaker playback on spontaneous body movement to rhythmic music. The authors found a “significant higher mean velocity of the head and body motion” (p. 14) in the headphone condition compared to the loudspeaker condition. For this reason, headphone or loudspeaker playback can be regarded as a confounding factor that should be controlled for. Second, a control procedure may be necessary due to the use of special audio samples, such as 3D binaural headphone mixes. As the 3D impression of such stimuli would be lost in the case of loudspeaker reproduction, it must be ensured that participants use headphones.

Existing playback device screening tests offer a promising way to control for either loudspeaker or headphone playback. However, it is a major challenge to assess, compare, and select tests for a specific application. This study aimed to address these challenges. To compare the quality and capability of screening tests in general, several parameters must be determined. These parameters help in selecting a screening test and screening strategy for a specific use case. In the current study, we used signal detection theory (SDT; Macmillan & Creelman, 2005; Treat & Viken, 2012) as a paradigm for evaluating screening tests. The detection of headphone or loudspeaker playback is logically similar to the detection of a disease: In both cases, a screening test can be used to check whether a characteristic is present or absent. Mathematical and statistical methods and standards from disease detection can be transferred. For that reason, we expanded the analysis by using an epidemiological approach. In the following, we introduce the most important terms and parameters in this context.

Nomenclature, definitions, and fundamentals of diagnostics

We define a screening test as a procedure with its tasks and stimuli for which a certain sensitivity and specificity can be reported. In contrast, a screening strategy encompasses the initiation, embedding, and targeted application of a certain screening test. A screening test containing more than one task or item requires a threshold or cutoff for classification, that is, the minimum number of correct responses at or above which the test result is positive.

“Sensitivity is defined as the ability of a test to detect all those with the disease in the screened population” (Miller, 2014, p. 767). In our application, the presence of headphones is equated with the presence of the disease, whereas the absence of headphones is equated with being free of the disease. We decided on this assignment to ensure comparability with parameters from other studies; theoretically, it could also be inverted. In terms of SDT, headphones are the signal to be detected. A person for whom the test gives a positive result is classified as a headphone user; a person for whom the test gives a negative result is classified as a loudspeaker user.

In our context, a headphone user for whom the screening test yields a positive result is considered a true positive (TP) case or a hit whereas one with a negative test result is considered a false negative (FN) case or a miss. Following from this, the sensitivity or hit rate according to SDT expresses the proportion of true headphone users for whom the screening test gave a positive result (TP). Formulated as a conditional probability, sensitivity is the probability of a positive test result given the presence of headphones. See Eq. (1) (Miller, 2014, p. 768) for the calculation (P for probability).

$$\textrm{Sensitivity}= Sen=P\left(\textrm{test}\ \textrm{positive}|\textrm{headphones}\right)=\frac{TP}{TP+ FN}$$
(1)

“Specificity is defined as the ability of a test to detect all those free of the disease in the screened population” (Miller, 2014, p. 767). In our application, a loudspeaker user for whom the test gives a negative result is considered a true negative (TN) case or a correct rejection whereas one with a positive result is considered a false positive (FP) case or a false alarm. The specificity or, in terms of SDT, correct rejection rate expresses the proportion of true loudspeaker users for whom the screening test gave a negative result (TN). See Table 1 for the confusion matrix regarding true condition and screening test result. Formulated as a conditional probability, specificity is the probability of a negative test result given the absence of headphones, that is, the presence of loudspeakers. See Eq. (2) (Miller, 2014, p. 768) for the calculation.

$$\textrm{Specificity}= Spe=P\left(\textrm{test}\ \textrm{negative}|\textrm{loudspeakers}\right)=\frac{TN}{TN+ FP}$$
(2)
Table 1 Confusion matrix for the classification according to signal detection theory (SDT) and epidemiology

Sensitivity and specificity are measures of the intrinsic accuracy of screening tests and are considered constant and independent of prevalence (Zhou et al., 2011). For a test with more than one item or trial, the measures depend on the threshold: In general, a lower threshold increases sensitivity and decreases specificity compared to a higher threshold (Treat & Viken, 2012, pp. 727–728).

Prevalence is defined as the proportion of people who have a particular disease or condition at a specific time (Rothman & Greenland, 2014). In our application the prevalence or base rate expresses the proportion of potential headphone users in a population. Let π denote the prevalence. In this case, the probability of randomly drawing a headphone user from the respective population of headphone and loudspeaker users equals π (see Eq. 3).

$$\textrm{Prevalence}=P\left(\textrm{headphones}\right)=\pi =\frac{TP+ FN}{TP+ FN+ TN+ FP}$$
(3)

The sensitivity, specificity, and prevalence can be used to calculate the positive predictive value (PPV) and the negative predictive value (NPV). Both values describe the conditional probability that the true state matches the test result (Kestenbaum, 2019, p. 163; Miller, 2014, p. 768). In our application, the PPV expresses the probability of headphone usage when the screening test is positive, and the NPV expresses the probability of loudspeaker usage when the screening test is negative. The predictive values are not inherent to the screening tests, as they are influenced by the prevalence. See Eq. (4) as well as Eq. (5) for the calculation (Fletcher & Fletcher, 2005, p. 39; Kestenbaum, 2019, p. 164).

$$\begin{array}{r}PPV=P\left(\mathrm{headphones}|\mathrm{test}\ \mathrm{positive}\right)=\frac{TP}{TP+ FP}\\=\frac{Sen\times \pi }{Sen\times \pi +\left(1- Spe\right)\times \left(1-\pi \right)\ }\end{array}$$
(4)
$$\begin{array}{r}NPV=P\left(\textrm{loudspeaker}|\textrm{test}\ \textrm{negative}\right)=\frac{TN}{TN+ FN}\\=\frac{Spe\times \left(1-\pi \right)}{\left(1- Sen\right)\times \pi + Spe\times \left(1-\pi \right)}\end{array}$$
(5)
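
As a minimal illustration of Eqs. (4) and (5), the following R sketch computes the two predictive values from sensitivity, specificity, and an assumed prevalence; the function names are ours and are not part of the HALT package.

```r
# Minimal sketch (not part of the HALT package): predictive values from
# sensitivity, specificity, and prevalence according to Eqs. (4) and (5).
ppv <- function(sen, spe, prev) {
  (sen * prev) / (sen * prev + (1 - spe) * (1 - prev))
}
npv <- function(sen, spe, prev) {
  (spe * (1 - prev)) / ((1 - sen) * prev + spe * (1 - prev))
}

# Example: a hypothetical test with 90% sensitivity and specificity
# applied to a sample with a headphone prevalence of 25%
ppv(sen = 0.90, spe = 0.90, prev = 0.25)  # 0.75
npv(sen = 0.90, spe = 0.90, prev = 0.25)  # approx. 0.96
```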

Similar to the predictive values, the overall utility can be used to describe the performance of a test for a given prevalence. Utility is a “value placed on a specific decision-making outcome” corresponding to its desirability (Treat & Viken, 2012, p. 725). The overall utility (Uoverall, see Eq. (6); Treat & Viken, 2012, p. 736) describes “a utilities-weighted sum of the probabilities of the four decision-making outcomes” (Treat & Viken, 2012, p. 725). After choosing appropriate utilities (0 ≤ UTP, UFN, UTN, UFP ≤ 1), the overall utility can be used to select the “best” test for the application, that is, the one with the highest overall utility among the candidates (one test with different threshold values or different tests).

$$\begin{aligned}\begin{array}{c}{U}_{\text{overall}}=P(TP)\times {U}_{\text{TP}}+P(FN)\times {U}_{\text{FN}}\\+P(TN)\times {U}_{\text{TN}}+P(FP)\times {U}_{\text{FP}}\\ {U}_{\text{overall}}=\pi \times Sen\times {U}_{\text{TP}}+\pi \times \left(1- Sen\right)\times {U}_{\text{FN}}\\+\;\left(1-\pi \right)\times Spe\times {U}_{\text{TN}}+\;\left(1-\pi \right)\times \left(1- Spe\right)\times {U}_{\text{FP}}\end{array}\end{aligned}$$
(6)

For the common goal of maximizing the percentage of correct classifications, the weights would be UTP = UTN = 1 for correct classifications and UFN = UFP = 0 for incorrect classifications (Treat & Viken, 2012, p. 736).
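
Continuing the sketch above, the overall utility of Eq. (6) can be written in a few lines of R; the default weights correspond to maximizing the percentage of correct classifications, and the function name is again ours, not part of the HALT package.

```r
# Sketch of the overall utility (Eq. 6). With the default weights the result
# equals the expected proportion of correct classifications.
overall_utility <- function(sen, spe, prev,
                            u_tp = 1, u_fn = 0, u_tn = 1, u_fp = 0) {
  prev * sen * u_tp +
    prev * (1 - sen) * u_fn +
    (1 - prev) * spe * u_tn +
    (1 - prev) * (1 - spe) * u_fp
}

overall_utility(sen = 0.90, spe = 0.90, prev = 0.25)  # 0.90
```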

Existing screening tests

There are only a few screening tests to determine headphone or loudspeaker playback. Woods et al. (2017) developed a now widely used screening test based on destructive interference. The stimuli for this test are based on a 200-Hz sinusoidal tone and differ in their level (normal level and low level) or in the phase between the two stereo channels (normal level but phase-shifted). In one trial of the test, all three stimuli are played sequentially. When reproduced via stereo loudspeakers, the level of the phase-shifted sinusoidal tone (one of the three stimuli) drops compared to the other stimuli due to destructive interference. The participant’s task is to name the softest of the three tones. Ideally, when played back over loudspeakers, the phase-shifted tone is selected as the softest. When listening over headphones, the participant should select the low-level tone as the softest. There are a total of six trials in the complete screening procedure. If the low-level tone is selected in at least five of the six trials, headphone playback is assumed. A more detailed description of the test can be found in the section Method – Main Study. Unfortunately, the study by Woods et al. (2017) lacks an appropriate measurement theory like SDT and, therefore, no information on sensitivity and specificity was reported. The accuracy of the screening procedure by Woods et al. (2017) thus remains unclear. Moreover, the sample size was relatively small (N = 20 each for the loudspeaker and headphone groups). The characteristics of the screening procedure should be determined with state-of-the-art methods on a larger sample and with a greater variety of playback devices.

More recently, another approach to screening for headphone playback was developed by Milne et al. (2021). The procedure is based on the perception of dichotic pitch (Huggins Pitch; see Cramer & Huggins, 1958). The stimulus consists of white noise presented on both the left and right channels. On one channel, the white noise is phase-shifted (180°) over a narrow frequency band. A tone embedded in the noise is perceived when the stimulus is played back over headphones but not when it is played back over loudspeakers. Milne et al. (2021) reported a sensitivity of 85% and a specificity of 70% for a test length of six trials and a threshold of five out of six correct responses. For comparison, Milne et al. (2021) also collected data on the Woods et al. (2017) screening method and calculated a sensitivity of 86% and a specificity of 58% for the same threshold of five out of six correct responses.

Evaluating data quality after applying screening tests

Sensitivity and specificity are important parameters usually used for the evaluation of screening tests (Kestenbaum, 2019; Newman, 2001). However, for the evaluation of screening results from Internet studies, this approach alone may be insufficient: Sensitivity and specificity are calculated based on a verified proportion of events (e.g., headphones and loudspeakers), but in Internet studies, the base rate of playback devices in the population is unknown. The results of a test should not be interpreted independently of the prevalence. A short example shows the importance of including prevalence when interpreting screening results: A headphone screening method with a sensitivity of 90% and a specificity of 90% is used to collect a data set with headphone users only. As soon as 100 cases classified as headphone users have been collected, the study is stopped. Sensitivity (true positive rate) and specificity (true negative rate) are inherent to the screening test. The main question is which proportion of the positive screening results is caused by errors of the screening and which proportion reflects the playback devices actually used. To adequately assess the data quality, researchers must use a measure such as the PPV (Eq. 4) or NPV (Eq. 5), which incorporates the prevalence and is thus not inherent to the screening test (Kestenbaum, 2019). In the above case of 100 headphone users, we could use the PPV (Kestenbaum, 2019, p. 164) to reveal the probability that headphones were used given that the screening method states the use of headphones. Assuming that headphones were used by 18% (prevalence) of all participants who took part in the screening test, the PPV would be about 66%.

That means if the test states that headphones were used, the probability that this statement is correct would be 66%. Therefore, the expected number of true headphone users in the hypothetical sample is n = 66, and the expected number of true loudspeaker users is n = 34, even though the screening test identified all subjects as headphone users. This calculation example is extreme since it assumes a blunt screening without requesting headphone usage from participants, resulting in a worst-case scenario for the prevalence. It demonstrates that the prevalence has a dramatic impact on data quality even for screening tests with high sensitivity and specificity. In other words, for any screening method, reliable information on the prevalence of a feature in the target population is of central importance and must always be taken into account for a meaningful interpretation of findings. A main challenge is that the percentage of verified headphone users entering a screening test is usually unknown. To the best of our knowledge, information on the prevalence of playback devices and on participants’ behavior when certain playback devices are requested is currently unavailable. Therefore, it cannot be estimated which proportion of a sample was rejected in earlier studies due to loudspeaker use and which proportion due to the inherent error of the screening method itself (economics of the screening). For the same reason, no conclusion can be drawn as to how many true headphone users were in the group of participants who were classified by the test as headphone users (data quality). At first glance, this may seem counterintuitive, as the screening test itself is intended to identify headphone users. Even with information on the prevalence of playback devices, there is no easy-to-use strategy for including playback device base rates in the preliminary considerations. However, we will make suggestions for the reliable estimation of the prevalence based on empirical evidence.
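
For transparency, inserting the assumed values (Sen = Spe = .90, π = .18) into Eq. (4) reproduces the figure quoted above:

$$PPV=\frac{0.90\times 0.18}{0.90\times 0.18+\left(1-0.90\right)\times \left(1-0.18\right)}=\frac{0.162}{0.244}\approx 0.66$$

With n = 100 test-positive cases, the expected number of true headphone users is therefore 100 × 0.66 ≈ 66.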

Screening strategies

Screening strategies are application methods that can improve the effectiveness of screening tests. For example, strategies can be used to avoid response bias and to increase the number of potentially suitable participants. When headphones are required in Internet experiments, there are basically two strategies to gain control over the playback device used.

Filtering without Request (FWR)

For this screening strategy, the playback device used by the participants has to be recorded either via self-report or via a screening test. The playback device required for the study has to be concealed to prevent response bias. Based on the self-report or screening result, the participants can be grouped into headphone users (desired playback device – H) and loudspeaker users (undesired playback device – L). If a sample is expected to have a low headphone prevalence, for example, 25%, it is not wise, when screening for headphones, to exclude participants based on self-report, as the proportion of available headphone users will never exceed the prevalence. The data quality may be high, but so is the number of excluded subjects. In some cases, it may become impossible to achieve a certain minimum number of participants. When a screening test is used, a low PPV would be expected due to the low prevalence. A hypothetical test with both a high sensitivity and specificity of 90% would lead to a PPV of 75%. The data quality can therefore be described as poor.

Filtering after Request (FAR)

A more economical strategy is to request a specific device and to screen for compliance. At first, all participants are required to use headphones. It can be expected that some loudspeaker users switch to headphones. Afterwards, a screening test can be applied, and the participants can be filtered based on their screening result. The biggest problem with this method is that the true initial prevalence is unknown. In addition, it is difficult to estimate how many people actually switched to headphones. This seems contradictory, since the point of a playback device screening is to determine the rate of headphone and loudspeaker users. However, to determine the PPV, an estimate of the headphone rate is necessary. Otherwise, the data quality cannot be evaluated.

Both Strategy 1 (FWR) and 2 (FAR) were used in several Internet-based studies (Brown et al., 2018; Lavan et al., 2019; McPherson et al., 2020; Mehr et al., 2018; Niarchou et al., 2022; Ramsay et al., 2019; Tzeng et al., 2021; Woods & McDermott, 2018; Zelechowska et al., 2020).

Study aims

From the challenges and problems elucidated above, we derived the following main study aims: In a first laboratory pre-study, we developed screening tests to detect headphone and loudspeaker playback. The aim was to check the general function of the tests under controlled laboratory conditions. Based on the knowledge gained, the tests’ length was then to be adjusted if necessary to improve the test characteristics.

In a second Internet-based main study, the improved screening tests were checked on the basis of more data and with a wider variety of playback devices. Furthermore, we wanted to gain reliable data on the Woods et al. (2017) screening test. In addition, we collected information on headphone prevalence. On this basis, parameters were calculated to evaluate screening tests and to develop screening strategies.

We wanted to develop a comprehensive method for planning and conducting playback device screening in Internet experiments by bringing all information and parameters of screening tests together in an online tool. Researchers can use this tool to select suitable screening tests and tailor optimal test combinations and thresholds for a specific use case. The overall approach makes it possible to estimate the required sample sizes and the data quality when applying screening tests. This is a major advantage over selecting single screening tests solely on the basis of sensitivity and specificity, as it improves both the economics of the study and the knowledge about the data quality. In addition, a method to combine more than two screening tests was developed. Moreover, the screening tests are integrated into a common procedure (HALT Part I and Part II), which enables standardized conditions for testing playback devices.

Method – Pre-study

Experimental setup and procedure

As in HALT Part I (Wycisk et al., 2022), HALT Part II was meant to be performed in ordinary, non-optimized listening environments and with sound devices of diverse quality. For that reason, the laboratory experiment took place in a non-optimized laboratory room of the Hanover Music Lab (HML; for details, see Tables S1, S2 and S3 in the Supplemental Material) with a variety of low- to average- and high-quality transducers:

  • Beyerdynamic DT 770 Pro 250 Ohm, closed circumaural, high-quality headphones;

  • No-name earbuds, open, intra-aural, low-quality headphones;

  • A pair of Yamaha HS8M loudspeakers (near field monitor) of average quality;

  • Apple MacBook Pro, 13” (Retina, early 2015) low-quality loudspeakers/laptop.

The assigned quality level in this study is only a subjective classification. As in HALT Part I (Wycisk et al., 2022), we used the browser-based survey platform SoSci Survey (www.soscisurvey.de; Leiner, 2020) for the data collection in the laboratory. After giving demographic information, participants started with the above-mentioned average-quality loudspeaker condition, followed by the laptop, high-quality headphones, and low-quality headphones (see Fig. S1 in the Supplemental Material for the procedure). During the experiment, the experimenter and the participant were located in two separate rooms. Volume levels were monitored and recorded by the experimenter via a second screen (split-screen extension of the participant’s computer). Each listening session lasted approximately 90 min, including instructions, pauses, and retests.

Stimuli and task development

We developed stimuli and associated tasks to detect headphone and loudspeaker playback. All stimuli were created on an Apple MacBook Pro, 13” (Mid 2012) using Logic Pro X. In general, researcher-developed stimuli were limited to −0.5 dBFS (true peak) to avoid clipping through the Gibbs phenomenon (Oppenheim & Schafer, 2014). Two different stimulus types were used in developing screening tests A and B for the identification of headphone and loudspeaker users.

Test A

The first stimulus was based on interaural time differences (ITD) and was extracted from a CD with examples of dichotic pitch (Bilsen & Raatgever, 2002). For an illustration of the basic stimulus construction, see Fig. 1. In this case, there was identical continuous noise on the left and right channels. At the beginning of the stimulus, both channels had a time offset of 40 samples (0.907 ms), with the right channel ahead of the left channel. In other words, a specific section of the noise would first sound on the right channel and, after 40 samples (0.907 ms), on the left channel. In intervals of 1 s, the continuous noise on the right channel was gradually but slightly delayed in eight steps (see Fig. 1, T1, T2, …). At each step, a time delay of ten samples (0.227 ms) was added, and the resulting gap was filled with noise. Counting the initial offset plus the eight steps of delaying the signal, there are nine different segments within a stimulus. At the end of the stimulus, the right channel was 40 samples (0.907 ms) behind the left channel. In general, if the stimulus is played back over headphones, the best-case perceptual correlate is a noise that moves stepwise from right to left (i.e., this response is classified as a hit in terms of SDT). If the audio sample is played back over loudspeakers, then ideally the impression of a noise jumping irregularly from one side to the other and back again would be generated (i.e., correct rejection). Figure 1 illustrates a possible perception over loudspeakers (T1: noise on the left, T2: noise slightly on the right, T3: noise further to the right, T4: noise on the far left). In order to create an independent but comparable second stimulus, we swapped the left and right channels. The final test included four trials to allow thresholds to be set. We labelled this screening approach Test A.

Fig. 1

Stimulus description and perceptual correlates for the headphone screening based on interaural time differences (Test A). Note. Perception over loudspeaker: Noise jumping irregularly from one side to the other and back again (T1: noise on the left, T2: noise slightly on the right, T3: noise further to the right, T4: noise on the far left). Perception over headphones: Noise that moves stepwise from one side to the other. The example in the figure shows a movement from right to left

Test B

The second stimulus was based on the Franssen effect (Ballou, 2008; Franssen, 1960), an auditory illusion related to the precedence effect (Plack, 2010). In general, the stimulus consisted of two short transient tones on one channel and a sustained tone on the other. For an illustration of the basic stimulus construction, see Fig. 2. All characteristics of the stimulus were taken from Hartmann and Rakerd (1989). The left channel had a pure tone of 1 kHz with a total duration of 32 ms. During the first 2 ms, the level was constant; immediately afterwards, the tone began to decay exponentially (see Fig. 2, T1). Simultaneously with the beginning decay on the left channel, the same sharp-onset (but sustained) sinusoidal tone began on the right channel with a fade-in of 30 ms (total duration = 1998 ms, Fig. 2, T2). At the end of the sustained tone, it started to decay exponentially over a period of 30 ms while a short sharp-onset tone on the other (left) channel (total duration = 30 ms) increased exponentially with a phase shift of 180° over a period of 30 ms (Fig. 2, T3). The total duration of the audio was 2 s. The task was to identify the perceived channel of the pure tone. In the case of headphone usage, the pure tone would ideally jump from one side to the other and back again (i.e., this response is classified as a hit). In the case of loudspeaker playback, the sound source of the pure tone would ideally be perceived as being the left speaker (i.e., correct rejection). Here, too, the right and left channels were swapped to create a second comparable stimulus. Thus, the perception was the same – only the sides were changed. Again, the final test included four trials to allow thresholds to be set. We called this screening method Test B.

Fig. 2

Stimulus description and perceptual correlates for the headphone screening based on the Franssen effect (Test B). Note. Perception over loudspeaker: Pure tone on the left side. Perception over headphones: Pure tone jumping from one side to the other and back again

Pretest for pre-study

The aim of the pre-study was to check the functionality of the newly developed screening tests. To identify issues regarding the study design of the pre-study, the HALT Part II procedure was pretested by students and laboratory assistants. As a result, we observed that randomizing the order of the playback conditions might influence the participants' responses due to the Franssen effect: Presumably, the headphone condition revealed the true nature of the stimulus composition of screening Test B. Thus, we decided against randomizing the playback conditions. The final order of playback conditions was loudspeakers (1st), laptop (2nd), high-quality headphones (3rd), and low-quality headphones (4th).

Participants

The study was conducted in June and July 2020. Participants were recruited through university mailing lists, advertising posters with a QR code, and social media posts. A total of N = 40 participants (mean age = 31.83 years, SD = 13.48, n = 15 male) took part in the study and gave written informed consent. The study was performed in accordance with relevant institutional and national guidelines (Deutsche Gesellschaft für Psychologie, 2016; Hanover University of Music, Drama and Media, 2017) and with the principles expressed in the Declaration of Helsinki. Formal approval of the study by the Ethics Committee of the Hanover University of Music, Drama and Media was not mandatory, as the study adhered to all required regulations.

According to self-disclosure, 35 participants reported normal hearing whereas five participants indicated a hearing loss (e.g., tinnitus, perception of noise). Additional screening to identify hearing loss is not always possible in Internet experiments. Since hearing loss can always be present among participants, we followed a conservative strategy and decided not to exclude participants with hearing loss. We believe that this approach allows for a more realistic assessment of screening test performance because lower-performing listeners are represented in the data. None of the participants used a hearing aid. Each participant was paid €15 as reimbursement for participation.

Data analysis

Test A (based on ITD) and Test B (based on the Franssen effect) were used to check for headphones or loudspeaker playback. For the analysis, we examined the properties of individual Tests A and B and their combinations. As combination approaches for parallel tests, we used the A AND B method (which classifies a participant as a headphone user when both tests are positive and results in a decreased sensitivity and increased specificity) and the A OR B method (which classifies a participant as a headphone user when at least one test is positive and results in an increased sensitivity and decreased specificity; Cebul et al., 1982).
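
Under the assumption of conditional independence (see below), the characteristics of these parallel combinations follow directly from the individual tests' values. A minimal R sketch of the standard combination rules; the function names are ours and purely illustrative.

```r
# Sketch of the parallel-combination rules for two conditionally independent
# tests. "AND": positive only if both tests are positive; "OR": positive if
# at least one test is positive.
combine_and <- function(sen_a, spe_a, sen_b, spe_b) {
  c(sensitivity = sen_a * sen_b,                  # both tests must hit
    specificity = 1 - (1 - spe_a) * (1 - spe_b))  # one correct rejection suffices
}
combine_or <- function(sen_a, spe_a, sen_b, spe_b) {
  c(sensitivity = 1 - (1 - sen_a) * (1 - sen_b),  # one hit suffices
    specificity = spe_a * spe_b)                  # both tests must reject correctly
}
```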

In addition to the aforementioned measures of diagnostic accuracy, we used receiver operating characteristic (ROC) curves from SDT, in which the hit rate is plotted against the false-alarm rate for each threshold, and calculated the area under the curve (AUC; Treat & Viken, 2012, pp. 731–735). The AUC is independent of a chosen threshold and can be interpreted as the probability that a randomly selected pair of a headphone user and a loudspeaker user will be classified correctly by the test. Additionally, score confidence intervals according to Agresti and Coull (1998) were calculated for sensitivity and specificity (see Table 2 for details).
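
The score (Wilson) interval recommended by Agresti and Coull (1998) can be sketched in a few lines of R; this is our own illustrative helper, not the implementation used for the reported values.

```r
# Sketch of the score (Wilson) confidence interval for a proportion
# (Agresti & Coull, 1998); x = number of successes, n = number of trials.
score_ci <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p <- x / n
  centre     <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half_width <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - half_width, upper = centre + half_width)
}

# Example: 80 successes in 211 trials yields roughly [.316, .446]
score_ci(80, 211)
```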

The individual and combined evaluations of the tests provided several parameters. To assess those parameters, we defined general target criteria for a useful test:

  1. Data Quality – The probability of correctly detecting loudspeaker users should be high (high specificity). In other words, in order to ensure high data quality, only a small number of loudspeaker users should be classified as headphone users.

  2. Economy – The probability of correctly detecting headphone users should be high (high sensitivity). In other words, to ensure the economics of the study, only a small number of headphone users should be classified as loudspeaker users.

However, there might be conflicts in the selection of an adequate screening test. For example, a test with the highest data quality could produce many misses so that the target number of subjects would not be achieved.

Results and discussion – Pre-study

The characteristics for Test A and Test B show that only either sensitivity or specificity achieved a high value; no test scored high on both parameters at the same time (see Table 2 for details). The thresholds in Table 2 refer to the four trials of each test. The AUC from the ROC analysis was .642 for Test A and .735 for Test B. Thus, the discriminative power of the individual tests could be considered mediocre at best.

Table 2 Characteristics of the screening procedures depending on different thresholds in the pre-study

To calculate the characteristics of test combinations directly from the characteristics of the individual tests, the tests have to be statistically independent conditional on the true state (Cebul et al., 1982; Zhou et al., 2011, p. 409), that is, the device used. The tests are statistically independent conditional on the true playback device if P(A = a|B = b, d) = P(A = a|d) for all a, b ∈ {0, 1} and d ∈ {headphones, loudspeakers}. Each of the 16 possible threshold combinations of Test A and Test B was checked for conditional independence by using a chi-square test or an exact multinomial test where assumptions of the former were violated (Bortz & Lienert, 2008, pp. 72–76; Bortz & Schuster, 2010, pp. 142–143) at an α-level of .10 (for details, see Table S4). Since the null hypothesis of these tests was of interest, we chose the comparatively high α-level to protect against the β-error. The characteristics of all combinations of Test A and B could thus be calculated from their individual characteristics (for details, see Table S5 in the Supplemental Material). In general, either the sensitivity or the specificity of Tests A and B as well as their combinations was relatively low. Unfortunately, comparing the mentioned parameters alone was insufficient for the selection of a method. The initial decision about whether headphones or loudspeakers are required in a study, for example, determines whether sensitivity or specificity is more important for evaluating the data quality and economics of a screening method. Additionally, the estimated quality of the data is influenced by the prevalence of the required playback device in the target sampling group. All factors together influence the total number of participants who have to be invited in order to achieve the desired sample size of participants with the verified playback device. We decided to conduct an online study (Main Study) to address those problems by extending the length of the screening procedures (thus improving test performance) and collecting data on prevalence (evaluating real-life application).

Method – Main study

To gain more knowledge about the fundamental question of the likely headphone prevalence in a sample and to calculate the characteristics of the screening methods on a larger database and with a wider variety of playback devices, we conducted an Internet study.

Experimental setup and procedure

SoSci Survey (Leiner, 2020) was used as a browser-based survey platform (www.soscisurvey.de) for collecting the sample’s response data via the Internet. HALT Part I (Wycisk et al., 2022) was implemented to control for playback characteristics and standardize loudness adjustments. Each session lasted approximately 15 min. A cover story was used to disguise the purpose of the survey. This strategy was essential so that we could determine the unbiased prevalence of the individual playback devices used. Several safety precautions and screen-out methods were implemented to avoid data confounding. The processing time was measured for each page. Minimum (5, 10, 12 s) and maximum screen-out criteria (60, 120, 180, 300 s) were defined depending on the questionnaire page. To raise the participants' awareness that processing times were registered, we implemented a time loop right before the main questionnaire pages started. If the time-loop page was passed too quickly, participants were informed that the criterion for a minimum processing duration was not fulfilled (see the flowchart of the exclusion procedure in the Supplementary Material, Fig. S2). The time-loop page was then repeated (as often as necessary). Based on this method, we could also prevent the use of rapid autofill scripts (https://help.alchemer.com/help/use-autofill-javascript-to-save-time-taking-surveys; Domagalski, 2020).

An additional strategy to filter for such scripts was to leave the input field for “age” open to any number of digits; for example, it was possible to enter an age of 999. All participants who entered the questionnaire link were prefiltered by the panel provider regarding their age (18–60 years). Pretests had shown that the aforementioned autofill script could not produce content-appropriate input for this question. Based on these precautions, every age entry that fell below or exceeded our requested range (18–60 years) was filtered out. To check the participants' attention, an instructed response item (Leiner, 2019; “Answer this item with the scale step ‘strongly disagree’”) was embedded into the research items. An incorrect response resulted in a screen-out. In addition, we screened out people with hearing aids, self-reported hearing loss, problems with right-left discrimination, users of smartphones/tablets/monitors/TVs as playback devices, and subjects with interchanged stereo channels or mono playback. To control for unknown hearing loss, we used the Quick Hearing Check (QHC; Kochkin & Bentler, 2010).

Stimuli

The same screening tests A and B were used as in the pre-study. However, the number of trials for each test was raised from four to six to improve diagnostic accuracy. Additionally, a complete version of Woods et al.’s (2017) screening procedure (six items) was added, which will be referred to as Test C in the following.

Woods et al.’s (2017) test is based on an intensity-discrimination task (see Fig. 3). Three different stimuli were created by using a 200-Hz sinusoidal tone. The first stimulus was unmodified and, therefore, could be called the standard. A second stimulus used a lower gain (– 6 dB) compared to the first standard stimulus. The third stimulus had the same gain as the first stimulus, but one channel was phase-shifted by 180°. All stimuli differed in terms of their level or the phase between the two stereo channels. After presenting the three stimuli successively, we asked the participants to decide which stimulus (first, second, third) was the softest. In total, the task was presented six times using a randomized stimulus sequence. In the case of headphone usage, the stimulus with the lowered gain (second) would be perceived as the softest (i.e., hit). If the stimuli were reproduced via stereo loudspeakers, the level of the phase-shifted sinusoidal tone (third) would drop compared to the other stimuli (i.e., correct rejection).
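
For illustration only, the three Test C stimuli can be sketched in base R as stereo sample matrices; the sampling rate and duration are our assumptions, as the original stimuli were created in Logic Pro X.

```r
# Illustrative sketch of the three stimuli of Test C (Woods et al., 2017).
# Sampling rate and duration are assumptions for illustration only.
fs  <- 44100                              # sampling rate in Hz (assumed)
dur <- 1                                  # duration in seconds (assumed)
t   <- seq(0, dur - 1 / fs, by = 1 / fs)
amp <- 10^(-0.5 / 20)                     # peak limited to -0.5 dBFS
tone <- amp * sin(2 * pi * 200 * t)       # 200-Hz sinusoid

standard      <- cbind(left = tone, right = tone)                 # unmodified
low_level     <- cbind(left = tone, right = tone) * 10^(-6 / 20)  # -6 dB gain
phase_shifted <- cbind(left = tone, right = -tone)                # 180° shift on one channel
```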

Fig. 3

Stimulus description and perception for the headphone screening based on destructive interference (Test C) by Woods et al. (2017). Note. Perception over loudspeaker: The tone with phase-shifted channels will be perceived as softest. Perception over headphones: The tone with phase-shifted channels will be perceived as loud

Participants

The study was conducted in November and December 2020. Participants were invited by an external panel provider (mo'web, https://www.mowebresearch.com). In total, 1,545 people took part in the study. After exclusion of participants according to the various aforementioned filter criteria, N = 211 valid cases remained (mean age = 42.40 years, SD = 11.35, n = 117 female; for details on participant exclusion, see Fig. S2 in the Supplemental Material). According to the QHC inventory, n = 8 participants reported an unknown moderate to severe hearing loss and n = 1 an unknown severe to profound hearing loss. Each participant gave informed written consent and received a small gratuity from the panel provider after successfully finishing the survey.

Playback devices used by participants

HALT Part I (Wycisk et al., 2022) was used to control characteristics of the playback devices used by the participants. As indicated by self-report, the majority of valid participants (N = 211) used a laptop (n = 102 [48.34%]) as a playback device, followed by headphones (n = 80 [37.91%]) and freestanding loudspeakers (n = 29 [13.74%]). The headphone users could be further divided into circumaural (n = 20 [9.48%]), intra-aural (n = 37 [17.54%]; earbuds and in-ears), supra-aural (n = 16 [7.58%]), and unknown types (n = 7 [3.31%]). As we excluded all subjects (n = 187) using mono playback, swapped channels, or giving uninterpretable input during the stereo test, 100% of the valid cases (N = 211) used in the analysis fulfilled the criteria for stereo playback and correct channel assignment. The test for frequency limits revealed that more than 80% of the participants could hear frequencies of 100 Hz and above. As in the laboratory study, there was a peculiar result among laptop users: n = 21 of these participants gave correct responses for the counting task at 20 Hz. From the physical perspective, however, this seemed to be an unrealistic finding because frequencies in this very low range cannot be reproduced by laptop speakers. Thus, as in the laboratory study, we assumed that these artifacts were produced by the audio processing and transmission system of the laptops themselves during playback. Details for all playback devices regarding the frequency limits can be found in Table 3.

Table 3 Lower frequency limits of the self-reported playback devices in the main study (Internet)

Data analysis

Combination of screening tests

The main objective was to combine the three screening tests and to calculate the corresponding parameters. The following considerations form the basis of the analysis. As a reminder, each individual test consists of six trials, and threshold values can be set for each test individually. For the time being, threshold values are not considered, only the logically possible results. If 1 stands for headphones and 0 for loudspeakers, then combining two screening tests A and B produces four possible outcome pairs regardless of the truth:

$$\begin{aligned}&\left(\textrm{A}=1,\textrm{B}=1\right),\left(\textrm{A}=1,\textrm{B}=0\right),&\\&\left(\textrm{A}=0,\textrm{B}=1\right),\left(\textrm{A}=0,\textrm{B}=0\right)\end{aligned}$$

Combining three screening tests A, B and C produces eight possible outcome triples regardless of the truth:

$$\begin{aligned}\left(\textrm{A}=1,\textrm{B}=1,\textrm{C}=1\right),\left(\textrm{A}=1,\textrm{B}=1,\textrm{C}=0\right),\\\left(\textrm{A}=1,\textrm{B}=0,\textrm{C}=1\right),\left(\textrm{A}=1,\textrm{B}=0,\textrm{C}=0\right),\\\left(\textrm{A}=0,\textrm{B}=1,\textrm{C}=1\right),\left(\textrm{A}=0,\textrm{B}=1,\textrm{C}=0\right),\\\left(\textrm{A}=0,\textrm{B}=0,\textrm{C}=1\right),\left(\textrm{A}=0,\textrm{B}=0,\textrm{C}=0\right)\end{aligned}$$

Using machine learning methods for multiple classifiers (ensemble learning, or ensemble methods), these pairs or triples can be used to form one global result G(A, B) or G(A, B, C), respectively, according to a voting combiner (Brown, 2017). Each screening test in an ensemble is assigned a weight, e.g., ωA is the weight for Test A. These weights are uniform for a simple vote in which all tests are of equivalent value for the final classification, i.e., ωA = ωB = ωC, or non-uniform for a weighted vote (Brown, 2017). The global result is headphones, G(A, B, C) = 1, if ωA · A + ωB · B + ωC · C ≥ 1, and loudspeakers, G(A, B, C) = 0, otherwise.

In a preparatory step, we compiled a list of test combinations, that is, assignments of weights to the three individual tests within a voting combiner. The individual tests, pairwise tests and also three-way tests were included. Only combinations where the final classification made sense were considered. All in all, there were 18 such combinations (3 individual tests, 6 pairwise tests, 9 three-way tests). A number was assigned to each one of them, which we called evaluation key (EK; for all combinations, see Table S6 in the Supplemental Material). Besides the weights, each combination can be described by logical statements: for example, “at least two times headphones” would be the statement describing the uniform weights \({\omega}_A={\omega}_B={\omega}_C=\frac{1}{2}\) (EK 11) whereas “(Test A OR Test B) AND Test C” describes the weights \({\omega}_A={\omega}_B=\frac{1}{4}\) and \({\omega}_C=\frac{3}{4}\) (EK 13). Note that “OR” and “AND” are logical operators here and therefore Test C is mandatory and Test A and Test B are not mutually exclusive for EK 13. Considering all possible thresholds for each test (reasonable thresholds ranging from one to six) together with the 18 combinations, there are 2178 screening methods (three individual tests with six thresholds each, six pairwise tests with 6 × 6 thresholds each, nine three-way tests with 6 × 6 × 6 thresholds each).
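
As a minimal sketch of this voting combiner and the two evaluation keys mentioned above (our own code, not the HALT implementation), the decision rule can be written as:

```r
# Sketch of the voting combiner for three binary test results
# (1 = headphones, 0 = loudspeakers); the function name is ours.
vote <- function(a, b, c, w = c(a = 1/2, b = 1/2, c = 1/2)) {
  as.integer(w["a"] * a + w["b"] * b + w["c"] * c >= 1)
}

# EK 11, "at least two times headphones" (uniform weights):
vote(1, 1, 0)                                    # 1 -> headphones
# EK 13, "(Test A OR Test B) AND Test C":
vote(1, 0, 1, w = c(a = 1/4, b = 1/4, c = 3/4))  # 1 -> headphones
vote(1, 1, 0, w = c(a = 1/4, b = 1/4, c = 3/4))  # 0 -> loudspeakers
```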

In a first step, the individual sensitivity and specificity (see Eqs. 1 and 2) as well as the score confidence interval (Agresti & Coull, 1998) were calculated for Tests A, B, and C at threshold values ranging from one to six. When the criterion of conditional independence was met for a test combination, the probability of any outcome pair or triple was calculated from the values of the individual tests (Cebul et al., 1982; Zhou et al., 2011) and, thus, the sensitivity and specificity for this combination. Without conditional independence, the characteristics of a test combination were calculated from the data, treating the combination as one individual test and using Eqs. (1) and (2).
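
Under conditional independence, this calculation amounts to summing the probabilities of all outcome triples that the voting rule classifies as positive (for the sensitivity) or negative (for the specificity). A minimal R sketch with purely illustrative values; the code is ours, not the HALT implementation.

```r
# Sketch: sensitivity and specificity of a three-test combination under
# conditional independence, obtained by enumerating all outcome triples.
# sen and spe hold the individual tests' values at their chosen thresholds;
# 'classify' is the voting rule (cf. the vote() sketch above).
combo_characteristics <- function(sen, spe, classify) {
  outcomes <- expand.grid(a = 0:1, b = 0:1, c = 0:1)
  positive <- apply(outcomes, 1, function(o) classify(o["a"], o["b"], o["c"]))
  p_hp <- apply(outcomes, 1, function(o) prod(ifelse(o == 1, sen, 1 - sen)))
  p_ls <- apply(outcomes, 1, function(o) prod(ifelse(o == 1, 1 - spe, spe)))
  c(sensitivity = sum(p_hp[positive == 1]),
    specificity = sum(p_ls[positive == 0]))
}

# Example: EK 11 ("at least two times headphones") with illustrative values
combo_characteristics(sen = c(.80, .85, .90), spe = c(.70, .80, .60),
                      classify = function(a, b, c) as.integer(a + b + c >= 2))
```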

Results – Main study

Prevalence – Empirical determination

In the main study, the prevalence of headphones was determined by trusted and unbiased self-report. This information enables a practical evaluation of possible screening methods. Along the lines of Agresti and Coull (1998), we calculated the score confidence interval. Basically, two methods for the calculation of a base rate for headphones (Prevalence A and B) were used. To allow comparability with the results of the laboratory study, participants who used a smartphone, tablet, or monitor/TV for playback were filtered out (Prevalence A). Prevalence A (N = 211) was assumed as the unbiased base rate for headphones (37.91%, 95% CI [31.6%, 44.6%], n = 80) amongst valid cases after we applied all exclusion criteria to the data set (smartphones, tablets, and monitors/TVs were not allowed as playback devices). Prevalence B (N = 1,194) was assumed as the unbiased base rate for headphones (17.67%, 95% CI [15.6%, 19.9%], n = 211) among all participants who reached the playback device filter (but before we applied the filter criterion to the data set), leaving smartphones, tablets, and monitors/TVs included. Therefore, Prevalence B also comprised cases that would have been excluded for other reasons in the further course of the questionnaire.

Evaluation of the screening tests

Individual screening tests

Three different tests (Test A based on ITD, Test B based on the Franssen effect, and Test C based on destructive interference [Woods et al., 2017]) were used to check for headphone or loudspeaker playback. For all tests, we calculated the sensitivity and specificity. In accordance with Agresti and Coull (1998), we calculated the score confidence interval for both sensitivity and specificity (see Table 4). The ROC curves for the three individual tests (A, B, and C) are shown in Fig. 4. For Test A, we calculated an AUC of .768 and for Test B an AUC of .844. Compared with the AUC values of the pre-study, we thus successfully increased the discriminative power of Tests A and B. For Test C, we calculated an AUC of .807. The performance distribution for each test is shown in Fig. 5 for headphones and in Fig. 6 for loudspeakers. Considering the confidence intervals, our determined sensitivity (92.5%) and specificity (58.0%) of the procedure by Woods et al. (2017) for a threshold of five out of six were similar to the results of Milne et al. (2021), who calculated a sensitivity of 86% and a specificity of 58%. For comparison, Milne et al. (2021) reported a sensitivity of 85% and a specificity of 70% (test length of six trials and a threshold of five out of six) for their procedure based on the Huggins Pitch, along with an AUC of .821. In comparison, Test B showed a lower sensitivity (more misses), better specificity (higher data quality), and a higher AUC value (better overall test performance). However, at this point it would not be appropriate to speak of one test’s superiority: The confidence intervals show the uncertainty in the determined test characteristics. Rather, Test B and the test by Milne et al. might complement each other very well in the future.

Table 4 Characteristics of the screening procedures depending on different thresholds (minimum number of correct responses required) in the main study. N = 211
Fig. 4

ROC curves. Test performances at all six thresholds (N = 211)

Fig. 5

Performance distribution for Test A, B, and C in the self-reported headphone users (n = 80)

Fig. 6

Performance distribution for Test A, B, and C in the self-reported loudspeaker users (n = 131)

Combination of screening tests

In contrast to the pre-study, we now combined Test C (Woods et al., 2017) with Tests A and B to increase the diagnostic accuracy. To decide whether the characteristics of a combination of screening methods could be calculated directly from the individual characteristics of Tests A, B, and C, we checked those tests for conditional independence by using a chi-square test or an exact multinomial test where assumptions of the former were violated (α = .10). In general, we found that the tests were not conditionally independent for thresholds greater than 2, given the use of headphones (for details, see Tables S7, S8, S9 and S10 in the Supplemental Material). Therefore, we treated each of the combinations as a single screening test and determined their characteristics according to Eqs. (1) and (2). Sensitivity and specificity for the 2160 screening procedures using more than one individual test are tabulated in Table S11 of the Supplemental Material. Among them are procedures with fairly high sensitivity and specificity. For example, the three-way test with uniform weights (“at least two times headphones”, EK = 11) and thresholds 5, 5, and 6 for Tests A, B, and C, respectively, has a sensitivity of 83.75% and a specificity of 83.97%. Another high-performing procedure is the two-way combination of Tests B and C (EK = 6) with thresholds 3 and 5 for Test B and Test C, respectively, which has a sensitivity of 86.25% and a specificity of 84.73%.

Selection of a screening procedure and a screening strategy

Because sensitivity, specificity, and AUC do not take the prevalence of playback devices into account, and balancing the two values PPV and NPV is not a straightforward task, we suggest selecting a screening procedure and screening strategy using either the overall utility (with utility weights set to maximize the percentage of correct classifications) or an approach that additionally accounts for the required sample size.

Considerations for the strategies “Filtering without Request” (FWR) and “Filtering after Request” (FAR)

Choosing the optimal screening test or combination of tests for the strategies FWR and FAR requires an estimate of the prevalence of headphone users in the target population. The estimate must be made separately for both strategies because a request (FAR) changes the base rate. The first way to select the “best” screening test or combination is to use the overall utility as described in the section Nomenclature, definitions, and fundamentals of diagnostics. Another approach takes the target sample size into account. Let us consider a screening test with known sensitivity and specificity as well as a sample of n persons with a positive result on that test. Let H denote the number of true headphone users in this sample. When the population from which the sample is drawn is not too small, we can conceptualize H as a random variable following a binomial distribution with size n and a probability of success equaling the PPV for the prevalence estimate and the screening test used. Thus, the probability that the true number of headphone users equals k is given by Eq. (7) (Jacod & Protter, 2004, pp. 23, 24, 30).

$$P\left(H=k\right)=\binom{n}{k}{PPV}^k{\left(1- PPV\right)}^{n-k}$$
(7)

Let us consider the event that k or more persons were using headphones and let ϑ denote its probability:

$$\vartheta :=P\left(H\ge k\right)=\sum_{i=k}^n\binom{n}{i}{PPV}^i{\left(1- PPV\right)}^{n-i}$$
(8)

When we have conducted a study using FWR or FAR and one screening test or screening test combination, we can use Eq. (8) for a post hoc estimation of data quality. Data quality would be H — or k as a lower limit of H — divided by n. For a given k, and thus a given data quality, we can compute its probability ϑ. Similarly, for a given ϑ, we can use the quantile function of this binomial distribution to determine k and state that, with a probability of ϑ, we have a data quality of at least k divided by n. Equation (8) also forms the basis for selecting a screening test or combination of tests a priori. To do so, we set k to the desired number of headphone users, e.g., the sample size from an a priori power analysis, and ϑ to a minimum probability — or, more colloquially, a minimum certainty — and determine n for which P(H ≥ k) ≥ ϑ. To compute this n, we used a normal approximation of the binomial distribution with a continuity correction based on the De Moivre–Laplace theorem (see, for example, Georgii, 2004, pp. 129–135). For ϑ > 0.5, we get

$$n=-\frac{a}{2}+\sqrt{{\left(\frac{a}{2}\right)}^2-b}$$
(9)

where \(a=-{PPV}^{-1}\left(2k-1+\left(1- PPV\right){\left({\phi}^{-1}\left(1-\vartheta \right)\right)}^2\right)\) and \(b={PPV}^{-2}{\left(k-0.5\right)}^2\), with \({\phi}^{-1}\) denoting the inverse cumulative distribution function of the standard normal distribution (for details, see Section S1 in the Supplemental Material). This can be done for all available screening tests and combinations. We then selected the screening procedure for which the smallest n was computed and thus the data quality would be maximized.
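
A minimal R sketch of Eqs. (8) and (9) may clarify how the two calculations are used; these are our own helper functions, not the implementation behind the HALT online calculator.

```r
# Post hoc (Eq. 8): probability that at least k of n test-positive
# participants truly used headphones, with H ~ Binomial(n, PPV).
prob_at_least_k <- function(k, n, ppv) {
  pbinom(k - 1, size = n, prob = ppv, lower.tail = FALSE)
}

# A priori (Eq. 9): smallest n of test-positive participants such that
# P(H >= k) >= theta, via the normal approximation with continuity correction.
required_n <- function(k, theta, ppv) {
  z <- qnorm(1 - theta)
  a <- -(2 * k - 1 + (1 - ppv) * z^2) / ppv
  b <- (k - 0.5)^2 / ppv^2
  ceiling(-a / 2 + sqrt((a / 2)^2 - b))
}

# Example: 100 true headphone users desired with 95% certainty at a PPV of .75
n <- required_n(k = 100, theta = 0.95, ppv = 0.75)  # 145
prob_at_least_k(k = 100, n = n, ppv = 0.75)         # approx. .96, i.e., >= .95
```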

Considerations for a Split–Convince–Compare strategy

We propose another screening strategy that aims at increasing the number of participants with the target playback device. This is done by reducing the number of misses among participants who start a study with the target device and prompting the remaining participants to switch to the target device. We call this strategy Split–Convince–Compare (SCC).

  1. Split: At first, all participants have to be asked what playback device is being used. At this point, the desired device for the study must be concealed to avoid response bias. As conclusions can be drawn about the exclusion criteria when the desired device is disclosed, particular caution is required with paid participants from panel providers. Incentives are usually only paid once the questionnaire has been completed in full. Therefore, it is to be expected that test subjects from panel providers will try to avoid triggering exclusion criteria. After the playback device used has been determined by self-report, all participants can be split into two groups: participants who were already using the desired playback device (D1) and participants who were not using it (D0). With this method, the unbiased base rate (prevalence) of headphones and loudspeakers in the sample can be estimated. Furthermore, the theoretical prevalence of the desired playback device based on self-report is assumed to be 100% in D1 and 0% in D0. When a large number of participants, for example with headphones, has to be achieved, it is not always sufficient to only allow people from D1 (in this case headphone users) to take part in the study. We found a headphone Prevalence B of 17.67% (95% CI [15.6%, 19.9%]). Therefore, it could be expected that only about 17.67% of all participants who started the questionnaire were using the desired device. Based on that, about 82.33% (100% − 17.67%) of the participants would have to be excluded if only subjects from D1 had been allowed to take part in the study.

  2. Convince: A better alternative is to convince all participants who were not using the desired playback device to switch to it. The request must be addressed to all participants in group D0. It can be assumed that after the request, some participants will switch devices. Therefore, the prevalence in group D0 will increase to a value greater than 0%.

  3. Compare: After the desired playback device is requested, all participants have to complete a screening test with known sensitivity and specificity. This allows the experimenter to further divide both groups D1 and D0 into test positive and test negative. This results in four different groups: participants from D1 for whom the screening test says that they use headphones (D1+) or that they do not use headphones (D1−); participants from D0 for whom the test says that they use headphones (D0+) or that they do not use headphones (D0−). Obviously, these further classifications based on screening tests are subject to error, as can be seen in D1−, where theoretically all participants are misclassified. But also in D0+ and D0−, misclassifications will occur, and both groups will contain participants who switched to the target device and others who did not.

The final sample for the SCC strategy will consist of group D1 (assuming that the self-report was unbiased) and group D0+ (accepting some misclassifications). Therefore, SCC theoretically outperforms FAR, since the final FAR sample consists only of the groups D1+ and D0+, whereas D1− would be missing; the sketch below illustrates this comparison with assumed numbers.
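
The following R sketch computes the expected sizes of the four groups for a hypothetical sample, using our prevalence estimate together with assumed values for the switching prevalence, sensitivity, and specificity; N, sigma, sen, and spe in the example call are illustrative assumptions, not measured values.

```r
# Expected group sizes under SCC for N participants, given the prevalence pi
# of self-reported target-device use, the switching prevalence sigma in D0,
# and a screening test with sensitivity sen and specificity spe.
scc_groups <- function(N, pi, sigma, sen, spe) {
  d1        <- N * pi                # started the study with the target device
  d0_switch <- N * (1 - pi) * sigma  # D0 members who switch when prompted
  d0_stay   <- N * (1 - pi) * (1 - sigma)
  c(D1_pos = d1 * sen,                               # confirmed users within D1
    D1_neg = d1 * (1 - sen),                         # misses in D1 (kept by SCC, lost by FAR)
    D0_pos = d0_switch * sen + d0_stay * (1 - spe),  # hits and false alarms within D0
    D0_neg = d0_switch * (1 - sen) + d0_stay * spe)
}

g <- scc_groups(N = 1000, pi = 0.1767, sigma = 0.55, sen = 0.95, spe = 0.95)
sum(g[c("D1_pos", "D1_neg", "D0_pos")])  # expected final SCC sample (D1 and D0+)
sum(g[c("D1_pos", "D0_pos")])            # expected final FAR sample (D1+ and D0+)
```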

To select the “best” screening test or combination of tests for SCC, we need an estimate of the probability that a participant who indicates the use of a playback device other than headphones actually switches to headphones. Let ς denote this probability, the switching prevalence, for the target population and let \(\hat{\varsigma}\) be its estimate. Since all participants in D1 will be included in the final sample, the “best” test or test combination has to perform optimally in group D0, where the prevalence is ς. Therefore, we now calculate the overall utility by replacing π with \(\hat{\varsigma}\) in Eq. (6) and select the test with the maximum value. For the post hoc estimation of data quality when SCC was used, the two groups D1 and D0+ are first considered separately. D1 is assumed to have a data quality of 100%. For D0+, we calculate a PPV by replacing π with \(\hat{\varsigma}\) in Eq. (4) and use Eq. (8) to estimate the data quality in this group. The data quality for the whole sample (i.e., D1 and D0+ combined) is then the number of participants in D1 plus the k obtained for D0+ from Eq. (8), divided by the total number of participants in D1 and D0+; this estimate holds with probability ϑ. For an a priori, sample-size-based selection of a screening test or test combination within SCC, we use an approach similar to FWR and FAR. Again, let H be the number of true headphone users, conceptualized as a random variable with a binomial distribution. The size of this distribution is the number of participants in D1 and D0+ and has to be determined. The probability of success \(\overset{\sim }{p}\) is the probability that a participant used headphones given that they indicated the use of headphones (member of group D1) or got a positive test result (member of groups D1+ or D0+). Therefore, estimates for the prevalence of the target device and the switching prevalence are required: \(\hat{\pi}\) and \(\hat{\varsigma}\), respectively. Again, we assume an unbiased self-report. Thus, the probability that a participant uses headphones and indicates this is \(\hat{\pi}\). The probability that a participant uses headphones after being prompted is \(\left(1-\hat{\pi}\right)\times \hat{\varsigma}\). Therefore, the probability of success is

$$\overset{\sim }{p}=P\left(\textrm{headphones}\ |\ \textrm{D}1\ \textrm{or}\ \textrm{D}{0}^{+}\right)=\frac{\hat{\pi}+\left(1-\hat{\pi}\right)\times \hat{\varsigma}\times Sen}{\hat{\pi}+\left(1-\hat{\pi}\right)\left(\hat{\varsigma}\times Sen+\left(1-\hat{\varsigma}\right)\times \left(1- Spe\right)\right)}$$
(10)

To compute n, the number of participants in D1 and D0+, we use Eq. (9) and substitute the PPV with the new probability of success \(\overset{\sim }{p}\) (for details, see Section S2 and Fig. S4 in the Supplemental Material).
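
As a minimal sketch of these two steps, the following R code computes \(\overset{\sim }{p}\) from Eq. (10) and then searches for the smallest number of participants in D1 and D0+ such that at least k of them truly use headphones with probability ϑ. The exact binomial search stands in for the closed form of Eq. (9), and the function names p_tilde and n_scc are our own.

```r
# Probability of success p-tilde from Eq. (10): the probability that a
# participant in D1 or D0+ actually uses headphones.
p_tilde <- function(pi, sigma, sen, spe) {
  (pi + (1 - pi) * sigma * sen) /
    (pi + (1 - pi) * (sigma * sen + (1 - sigma) * (1 - spe)))
}

# Smallest n (participants in D1 or D0+) such that at least k of them are
# true headphone users with probability theta (exact binomial search).
n_scc <- function(k, theta, pi, sigma, sen, spe) {
  p <- p_tilde(pi, sigma, sen, spe)
  n <- k
  while (pbinom(k - 1, size = n, prob = p, lower.tail = FALSE) < theta) n <- n + 1
  n
}
```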

Online tool

To facilitate the selection of the best combinations of screening methods, a calculator was programmed that is available online (http://testing.musikpsychologie.de/HALTConfig/) and is part of the HALT R package (the package can be retrieved from https://github.com/KilianSander/HALT). By entering the desired playback device and an estimated prevalence (among other parameters), the calculator selects a screening method that either maximizes data quality (fewer false alarms) or the efficiency of data collection (fewer misses). Both a priori and post hoc calculations are possible. Additional information on how to use HALT and the calculator can be found at https://osf.io/3tks7/.

To illustrate, we give an application example. The best screening procedure is to be found for the following requirements and preconditions:

  • Screening strategy: SCC

  • Target device: HP

  • Minimum number of target device users k = 70

  • Minimum probability (certainty) ϑ = 0.80

  • Prevalence estimate \(\hat{\pi}\) = 17.67% (Prevalence B)

  • Switching prevalence estimate \(\hat{\varsigma}\) = 55%

The online tool identifies the following test procedure as the best choice:

Test combination “all HP” (EK 12) with thresholds 6, 2, and 6 for Test A, Test B, and Test C, respectively, yields \(\overset{\sim }{p}\) = 0.9614 and a sample size of 74 participants who either reported the use of headphones or got a positive test result after being prompted to use headphones. With a probability of ϑ = 0.80, at least 70 out of those 74 participants actually used headphones.
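
A quick plausibility check of this output with the exact binomial distribution (a sketch, not the calculator's internal computation) is consistent with the reported result: given \(\overset{\sim }{p}\) = 0.9614, 74 participants suffice, whereas 73 would not.

```r
# P(at least 70 true headphone users) for n = 74 vs. n = 73, with p-tilde = 0.9614
pbinom(69, size = 74, prob = 0.9614, lower.tail = FALSE)  # about 0.84, >= 0.80
pbinom(69, size = 73, prob = 0.9614, lower.tail = FALSE)  # about 0.69, <  0.80
```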

Discussion

The results of our studies revealed three main components for the successful application of a screening test: a screening test with high accuracy, information on the prevalence of the required equipment (in our case, headphones or loudspeakers), and a reliable screening strategy. In a first laboratory pre-study, we successfully developed two new playback device screening tests (Tests A and B) to control for headphone and loudspeaker usage. Based on an Internet survey (Main Study), we improved both tests, compared them to existing headphone screening procedures, and collected data on headphone prevalence. Widespread screening strategies were discussed, and a new, superior strategy (Split–Convince–Compare, SCC) was suggested. Finally, by combining (a) the newly developed screening calculator with (b) the SCC screening strategy and (c) information on the prevalence of headphone use, we provide valid tools for the control of playback devices in Internet studies.

Still, there are a number of issues to be considered. In general, more information is required on the biased prevalence of end devices, that is, the actual use of a certain playback device after it has been requested. Currently, it is unclear how participants truly behave when they are asked to use a certain playback device. Furthermore, the screening tests need to be evaluated with a wider variety of devices that use built-in speakers. Tablets and smartphones are especially important in this context: because these devices are small, their built-in stereo speakers are closely spaced, which can degrade the performance of Tests A and B. Built-in sound processing can also confound test results. Test C (Woods et al., 2017), for example, may be prone to certain kinds of sound processing: in devices with bass management and a crossover frequency above 200 Hz, a complete cancellation can occur before the stimulus is reproduced as airborne sound. The perceptual component of the screening test would then be lost because the stimulus is not physically present, which can lead to an overestimation of Test C’s properties. Additionally, dynamic level interventions of devices could distort the level standardization with HALT.

In the future, the integration of the screening test by Milne et al. could help increase the overall performance of the combined screening tests. Finally, we hope that the suggested HALT procedures will contribute to improved data quality, efficiency, and overall study performance in Internet experiments on auditory perception. Data quality comparable to that of laboratory settings is a prerequisite for the future acceptance of Internet listening experiments.

Summary

In this study (HALT – Part II), we developed a comprehensive screening procedure to detect headphones and loudspeakers. The complete HALT procedure, with a duration of about 8 min, consists of both Part I (Wycisk et al., 2022) and Part II. The procedure allows the standardization of loudness adjustments, the detection of stereo/mono playback, the assessment of lower frequency limits of playback devices, and the detection of headphone and loudspeaker playback (see Fig. S3 in the Supplemental Material for the sequence plan of the complete HALT procedure). For a standalone demo version of HALT, please visit http://testing.musikpsychologie.de/HALT. For research purposes, either the HALT R package can be downloaded from GitHub (https://github.com/KilianSander/HALT) or the ready-made online version DOTS (DGM Online Testing) of the DGM (Deutsche Gesellschaft für Musikpsychologie) is available (http://testing.musikpsychologie.de/dots_home/). To adapt HALT to the needs of a specific study, the procedure can easily be configured via a web interface (http://testing.musikpsychologie.de/HALTConfig/).