Introduction

A necessary foundation for an advanced understanding of music perception is auditory scene analysis (ASA): the process by which the auditory system organises the acoustic environment into separate, coherent events and streams (Bregman, 1990). This process allows listeners to make sense of complex sounds and to distinguish between different sources or elements within an auditory scene. ASA is critical not only for normal hearing in natural environments but also for music perception, because disentangling simultaneous streams of sound (e.g., an oboe within an orchestra, or a tenor voice within a choir) is a key part of music appreciation and can be difficult, particularly for hearing-impaired individuals (Greasley et al., 2020; Madsen & Moore, 2014; Siedenburg et al., 2021). Although music psychology has long acknowledged the fundamental role of ASA in shaping music perception, no efficient and ecologically valid test that precisely quantifies listeners' ASA ability in realistic musical scenarios has yet been published.

Prior research on ASA has largely employed atomistic approaches, isolating and analysing individual components or features within auditory scenes and simplifying complex stimuli in order to understand the underlying perceptual mechanisms (e.g., Bregman & Campbell, 1971; Micheyl et al., 2013). For example, Bey and McAdams (2002) explored the role of schema-based processes in streaming using a melody recognition task with two unfamiliar six-tone sequences. Performance improved with increasing frequency difference between target and distractor tones, and listeners performed better when the target was played alone first, although only when the target and distractor tones differed in mean frequency. Although such approaches have contributed significantly to our understanding of ASA, they may not fully capture the intricacies of real-world listening experiences, particularly in the context of music perception.

In a more recent study, Kirchberger and Russo (2015) developed the adaptive music perception (AMP) test, which includes subtests for metre, harmony, melody, and timbre. They additionally introduced a melody-to-chord subtest, which taps into the realm of ASA by asking participants to identify a target melody presented simultaneously with a chordal accompaniment. The task requires participants to segregate the target melody from the background chords, an essential aspect of ASA. The AMP incorporates an adaptive testing method that dynamically adjusts the difficulty of test items based on an individual's performance in real time. Nonetheless, the melody-to-chord subtest appeared to be particularly difficult for many participants: roughly a quarter of normal-hearing (NH) participants and a third of hearing-impaired (HI) participants were unable to complete the task. Moreover, the AMP uses artificial sound stimuli. Other tests have similarly highlighted difficulties with ASA tasks. Siedenburg et al. (2020) adaptively measured signal-to-masker ratio thresholds of NH and HI listeners in a melody and timbre discrimination task, but also needed to discard data from HI listeners who yielded uninterpretable results. Another, non-adaptive study, in which participants were asked to track target instruments in a classical piece while also attending to other instruments playing simultaneously, likewise had to discard data from HI participants due to chance performance (Siedenburg et al., 2021). These documented challenges in measuring ASA abilities motivated us to develop an adaptive, computer-driven measurement instrument suitable for assessing ASA abilities in the context of music for individuals with a broad range of listening abilities (that is, suitable for both NH and HI listeners). By employing an ecologically valid methodology that incorporates authentic, recorded music, we aim to capture the intricate and dynamic nature of auditory scenes, integrating the multiple auditory features and cognitive processes that shape the overall listening experience, and to provide meaningful measurements for both NH and HI individuals.

HI listeners are known to perform poorly compared to NH listeners on music-related perceptual tasks such as timbre identification (e.g., Emiroglu & Kollmeier, 2008; Siedenburg et al., 2020), rhythm perception, pitch discrimination (Uys & van Dijk, 2011), melodic intonation (Kirchberger & Russo, 2015; Siedenburg et al., 2020), and auditory scene separation (e.g., Bayat et al., 2013). Older HI listeners also experience degraded spatial auditory processing (Akeroyd et al., 2007). While these effects are thought to result from damage to the ear, the auditory nerve, or the nervous system (e.g., Cai et al., 2013), some studies suggest that part of this reduced performance can be explained by ageing and the associated cognitive decline (Garami et al., 2020; Goossens et al., 2017; Gordon-Salant & Cole, 2016; Vinay & Moore, 2020). Yet others argue that neither ageing nor hearing impairment alone can fully explain the observed individual differences, and that a combination of these factors is at work (Lentz et al., 2022). Furthermore, a growing body of research deals with the influence of musical sophistication and musical training on auditory perception skills. Even though there is an ongoing debate as to whether musical training has a beneficial effect on speech perception (e.g., Bidelman & Yoo, 2020; McKay, 2021; Parbery-Clark et al., 2009), several studies have demonstrated a positive link between musical training and music perception as well as other acoustical abilities (e.g., Madsen et al., 2019; Siedenburg et al., 2020; von Berg et al., 2021; Zendel & Alain, 2012). Moreover, musicians have been reported to outperform non-musicians in basic cognitive tasks, such as those related to working memory (Talamini et al., 2017).

Although there is no clear picture of the causal factors underlying individual differences in ASA, ASA in music can hardly be understood accurately without taking these differences in the ability to process complex auditory scenes into account. One method that accounts for the large variability in listeners' abilities is adaptive testing. In contrast to standard testing procedures, where a fixed set of items is presented to all test-takers regardless of their abilities or performance, adaptive testing adjusts the difficulty level of test items based on the responses provided by the test-taker (for an overview see van der Linden & Glas, 2000). This allows the difficulty of the administered items to be tailored to the test-taker's ability level, rather than presenting a fixed set of items that may be too difficult or too easy for some test-takers. Adaptive testing has further benefits. The comparatively shorter testing times can reduce inattention effects and test bias, and make it less likely that test-takers gain an advantage by guessing, resulting in more precise measurement estimates than standard procedures. This yields greater reliability, even when the testing time is reduced by 50–80% compared to non-adaptive tests (de Ayala, 2009; van der Linden & Glas, 2007; Weiss & Kingsbury, 1984). There are several examples of music-related tests that rely on adaptive testing procedures. Modern tests include measures of beat perception (Harrison & Müllensiefen, 2018), melody discrimination (Harrison et al., 2017), and mistuning perception (Larrouy-Maestri et al., 2019). These tests generally apply item response theory (IRT) models, a flexible approach that can be adapted to a wide range of testing situations. However, IRT models require a calibrated item bank with a known difficulty level for all items. We consequently explored factors suitable for manipulating item difficulty in a task that entails the detection of target sounds in mixtures of popular music.

A common approach for identifying factors that affect the underlying construct of interest is to examine the cognitive processes involved in a task (Embretson, 1983). One key aspect of ASA is the ability to segregate individual sound sources from background signals in a complex auditory mixture. This involves identifying and grouping together sounds that are similar along perceptual dimensions such as pitch and timbre, or that exhibit similar temporal patterning or common onsets (the principles of similarity, proximity, continuity, and common fate; see Bregman, 1990). Accordingly, instruments (or vocals) with distinctive timbral qualities are often easier to distinguish in an ensemble because their unique acoustic properties enable them to stand out from the mixture. Bürgel et al. (2021) found that participants' performance in a detection task generally depended on the target's instrument category, with lead vocals showing a particularly robust attentional salience regardless of low-level acoustic cues. A second relevant component is the acoustical complexity of a mixture. The greater the number of instruments contributing to a musical mixture, the higher the probability of energetic masking (one sound spectrally masks or obscures a quieter sound, making it difficult or impossible to hear) and informational masking (one sound interferes with or disrupts the perception of another sound). Thus, a more complex musical scene should make the segregation of individual instruments more difficult. Due to the same masking processes, the relative level of the instruments is also expected to provide (or hide) important cues for segregating the target sound from the background. Louder sounds are typically perceived as more salient than softer sounds; thus, if the target instrument is presented at a lower level than the background mixture, it may be more difficult to segregate and detect. Another primitive cue important for segregation is a signal's spatial location, as documented by a large range of studies on spatial release from masking (e.g., Litovsky et al., 2021): spatial separation between a target sound source and interfering sound sources improves target signal detection. The effect of a priori knowledge about the target location has been studied as well. For instance, in a study by Kidd et al. (2005), participants were asked to identify keywords from a target talker in the presence of two distractors in a setting with spatially separated loudspeakers. Conditions in which a priori knowledge about the target location was provided yielded higher performance than conditions in which no cue was provided.

The present study

To account for these processes with respect to individual differences among test-takers, we designed a simple 'yes–no' task (also known as the 'A–not A' task; see Düvel & Kopiez, 2022) that required participants to decide whether a single target instrument (or the lead vocals) was part of a two-second mixture of instruments. In a calibration phase, two online experiments were conducted to establish item characteristics that could be used as predictors in an explanatory IRT model. This phase is necessary to fine-tune the test items for the adaptive test version, ensuring that they accurately measure the intended construct and provide meaningful results across a broad range of ability levels. Experiment 1 of the calibration phase investigated the influence of the target-to-mixture level ratio (designated as LEVEL), the choice of the target instrument (TARGET), and the number of instruments in the mixture (NUM) on the test results. Experiment 2 focused on the effect of spatial separation in azimuth (at stereo widths of 0°, 90°, and 180°) introduced by inter-aural level differences (ILD).

Cognitive model of the Musical Scene Analysis Test (MSA)

In the present task, the full cognitive process model includes the following stages: (1) participants perceive the target instrument as a distinct auditory object; (2) participants store the mental representation of the sound of the target instrument in working memory; (3) participants use bottom-up and top-down processing to separate the target instrument from the background mixture based on its acoustic features and prior knowledge. Within this process, stream segregation is guided according to common principles such as similarity, proximity, continuity, and common fate; (4) participants selectively attend to the target instrument's timbre within the mixture (if present) based on their working memory representation while disregarding other sounds in the mixture. This includes a comparison of all segregated auditory streams to an internal mental representation or template of the target instrument's sound; (5) based on the similarity between the segregated auditory stream and the internal template, the listener decides whether the target instrument is present in the mixture or not. Accordingly, the test imposes demands on various cognitive processes, such as perception, working memory, segregation processes, attention, and decision-making. The task's difficulty is presumably influenced by a combination of factors, including the number and relative prominence of instruments within the mixture, as well as the nature of the target instrument.

Based on the proposed cognitive model of the MSA task and the reviewed literature, we formulated four hypotheses for the calibration phase:

1. Decreasing the target-to-mixture level ratio will make it more difficult for listeners to accurately identify and separate the target sound from the mixture, resulting in lower accuracy.

2. The listener's ability to detect the target instrument within the mixture will decrease as the number of musical instruments in the excerpt increases.

3. There will be differences in detection accuracies for various target instruments. Although the literature provides limited guidance on the direction of these differences, we expect that lead vocals will be the easiest and bass the most difficult to detect, as indicated by Bürgel et al. (2021).

4. An increase in stereo width (induced by inter-aural level differences) will make it easier for listeners to localise and segregate the target sound within the mixture, leading to improved accuracy.

By examining these factors (i.e., LEVEL, TARGET, NUM, ILD), the calibration phase aims to optimise the MSA test items for effectively assessing individual differences in auditory scene analysis abilities in a musical context, irrespective of prior musical experience and hearing impairments. In addition to the online calibration phase, we conducted a validation experiment (experiment 3) in a laboratory context to verify the results under controlled conditions. This enabled us to assess the consistency of the MSA test through test–retest reliability analysis and to compare individuals' scene analysis abilities with their performance on a range of other psychoacoustic and music listening tests.

Experiment 1: Calibration phase—Part 1

Methods

Test battery

Musical Scene Analysis Test (MSA)

The MSA is a 'yes–no' test that reflects a two-alternative forced-choice (2-AFC) testing paradigm. It assesses participants' ASA abilities in realistic musical scenarios by asking them to detect a single target instrument (or the lead vocals) in a mixture of instruments. Each trial consisted of a two-second audio excerpt of a single instrument or voice (the target), followed by one second of silence, and a two-second excerpt with multiple instruments (the mixture). Participants were then asked to decide whether the target was part of the mixture or not (see Fig. 1 for a schematic illustration).
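
To make the trial structure concrete, the following minimal R sketch assembles one trial's audio under stated assumptions: a 44.1-kHz sampling rate and stereo sample matrices named target and mixture (all names are hypothetical; the original stimuli were generated in MATLAB).

```r
# Minimal sketch of the trial timeline: 2-s target, 1-s gap, 2-s mixture.
# 'target' and 'mixture' are assumed to be (2 * fs) x 2 matrices of samples.
fs  <- 44100                                # sampling rate in Hz (assumed)
gap <- matrix(0, nrow = fs, ncol = 2)       # 1 s of stereo silence

make_trial <- function(target, mixture) {
  rbind(target, gap, mixture)               # concatenate along time
}
```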

Fig. 1

Exemplary schematic illustration of the procedure. Each trial consisted of a two-second excerpt of a single instrument or lead vocals (the target), followed by a one-second gap, and a two-second excerpt with multiple instruments (the mixture). The listener's task was then to decide whether the target instrument was embedded in the mix or not

All excerpts were drawn from an open-source music database (MedleyDB; Bittner et al., 2014, 2016), which consists of real-world multitrack music recordings representing a wide range of musical genres (e.g., pop, rock, world/folk, fusion, jazz, rap, classical). Prior to the extraction, a professional musician with a background in music production meticulously adjusted and post-processed each mix to improve overall audio quality. This process involved refining the balance among individual instruments, fine-tuning volume levels, and minimising signal leakage. The excerpts were generated using the programming environment MATLAB (MathWorks Inc, 2020). To identify a set of suitable candidate tracks, the sound level of each individual instrument within each song was calculated as the root-mean-square (RMS) average over 500-ms time windows for the full duration of the song. If one instrument in the target category and two to six additional instruments had sound levels above −20 dB relative to the instrument's maximum sound level in the song, the song qualified as a candidate base track. By setting this minimum sound level threshold, we aimed to include only those songs in which all chosen instruments are clearly audible.
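
The screening rule can be illustrated with a short sketch. The original analysis was implemented in MATLAB; the following R translation, with hypothetical names and a mono track vector x, computes the windowed RMS levels and flags the windows in which a track stays within 20 dB of its maximum.

```r
# RMS level in dB per non-overlapping 500-ms window of a mono track
rms_db <- function(x, fs, win_s = 0.5) {
  len <- fs * win_s                          # samples per window
  n   <- floor(length(x) / len)              # number of complete windows
  w   <- matrix(x[1:(n * len)], nrow = len)  # one window per column
  20 * log10(sqrt(colMeans(w^2)) + 1e-12)    # dB, guarded against log(0)
}

# TRUE for windows in which the instrument is "clearly audible",
# i.e., within -20 dB of the track's maximum windowed level
is_audible <- function(x, fs) {
  lev <- rms_db(x, fs)
  lev >= max(lev) - 20
}
```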

The candidate list comprised 12,126 potential excerpts, each extracted from a distinct two-second time window within one of the 117 eligible songs in the database. From this list, excerpts were selected pseudo-randomly, with a deliberate effort to minimise duplications of the base song. The selection protocol also ensured an equal distribution of excerpts in terms of the designated target instrument (lead vocals, guitar, bass, or piano) and the number of instruments in the mixture (either three or six). The specific target instruments were selected due to their diverse and widespread accuracy reported in Bürgel et al. (2021), which employed a similar detection task in one of their experimental conditions. The composition of instruments within each mixture was preserved in its original configuration, meaning that it could include a diverse array of instruments such as lead vocals, backing vocals, bass, drums, guitars, keys, piano, percussion, strings, or winds, depending on the base songs used. In half of the mixes, the target instrument did not play in the mixture. In such instances, only excerpts featuring an additional instrument were selected to guarantee the preservation of three or six instrument signals within the musical mixtures for all items. For example, we utilised excerpts originally containing four instruments when a mixture with three instruments was required. Detailed information regarding the specific composition of instruments within each excerpt can be found in the MSA GitHub repository (https://github.com/rhake14/MSA).
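
A sketch of such a balanced draw, assuming a data frame candidates with hypothetical columns for the design factors (the actual protocol additionally minimised base-song duplication, which is omitted here):

```r
library(dplyr)

set.seed(1)  # for a reproducible draw
selected <- candidates |>
  group_by(target, num_instruments, target_present) |>
  slice_sample(n = 10) |>   # 4 x 2 x 2 design cells x 10 = 160 excerpts
  ungroup()
```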

Overall, this yielded a 4 (target instrument categories) × 2 (number of instruments in the mixture) × 2 (presence of the target in the mixture) design, for which 160 different excerpts from 98 base songs were compiled. In addition to the experimental factors of target instrument and number of instruments in the mixture, the first calibration experiment explicitly examined the influence of the target-to-mixture level ratio. To this end, four versions varying in target-to-mixture level ratio (0, −5, −10, and −15 dB) were created for those excerpts in which the mix contained the target instrument. Overall, a total of 400 items were created for the experimental task. Apart from these manipulations, the musical material in the excerpts was left unchanged (i.e., only excerpts were chosen in which the number of instruments corresponded to the desired condition). A logarithmic fade-in and fade-out with a duration of 200 ms was applied to the beginning and end of the audio signals. To allow for use with an online testing platform, all stimuli were converted from WAV format to MP3 with a bit rate of 320 kbit/s stereo (i.e., perceptually lossless compression). All resources, including the MSA test, task description, stimulus details, and example excerpts, are available in the project's GitHub repository (https://github.com/rhake14/MSA).
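
The two signal manipulations can be sketched as follows, assuming mono sample vectors and hypothetical function names; a level ratio of r dB corresponds to a linear gain of 10^(r/20), and the fade is shown under one common reading of a 'logarithmic' gain curve.

```r
# Attenuate the target relative to the mixture by ratio_db decibels
apply_level_ratio <- function(target, ratio_db) {
  target * 10^(ratio_db / 20)     # e.g., -10 dB -> gain of about 0.316
}

# Logarithmic 200-ms fade-in and fade-out applied to a mono signal vector
log_fade <- function(x, fs, dur_s = 0.2) {
  n <- round(fs * dur_s)
  g <- log10(seq(1, 10, length.out = n))         # log-shaped 0 -> 1 ramp
  m <- length(x)
  x[1:n] <- x[1:n] * g                           # fade-in
  x[(m - n + 1):m] <- x[(m - n + 1):m] * rev(g)  # fade-out
  x
}
```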

Degree of hearing impairment

Participants were asked to fill out an adaptation of the HAfM (Hearing Aids for Music) National Survey on hearing impairment (e.g., Greasley, 2022). These questions assessed the type and degree of hearing impairment. Participants were asked 'Do you feel you have a hearing loss?' and could respond using five options ranging from 'No, I do not feel that I have a hearing loss' to 'Yes, I have the feeling of being profoundly hearing impaired'. For each option, a short description was given (e.g., 'Yes, I have the feeling of being mildly hearing impaired: When I am talking to a person in a quiet room, I can usually understand a conversation. In noisy situations (e.g., in a pub) and in group conversations, I sometimes have problems understanding speech.'). See Tables A4 and A5 for the complete self-assessment survey.

Goldsmiths Musical Sophistication Index (Gold-MSI; Müllensiefen et al., 2014)

The Gold-MSI is a brief, 39-item self-report questionnaire that assesses several aspects of musical expertise. It was designed to capture subscales for active engagement, emotions, musical training, perceptual abilities, and singing abilities. Participants were asked to respond on a seven-point Likert scale (1 = completely disagree; 4 = neither agree nor disagree; 7 = completely agree). For both calibration experiments, the two sub-scores for musical training (7 items, for example 'I engaged in regular, daily practice of a musical instrument (including voice) for ___ years.') and for perceptual abilities (9 items, for example 'I can tell when people sing or play out of time with the beat.') were used. A composite score ranging from 1 to 7, with 7 being the highest possible score, was generated for each subscale. Both the validated English and German versions that were used, as well as other relevant materials, are freely available on the Gold-MSI home page (https://gold-msi.org).

Huggins headphone screening (Milne et al., 2020)

This 3-AFC task probes for headphone usage and makes use of an illusory pitch phenomenon called the Huggins pitch. The procedure involves presenting a white noise stimulus to one ear and the same white noise stimulus to the other ear, but 180° phase-shifted over a narrow frequency band at about 600 Hz. A faint tone can then be detected, but only when the stimuli are presented dichotically over headphones; when the stimuli are presented to one ear alone or over loudspeakers, the percept is very weak or absent. To pass the test, listeners needed to correctly identify the tone in five out of six trials. Because participants with severe HI struggled with this task, and because the test was originally calibrated with NH individuals only, only NH participants had to pass the headphone screening in order not to be excluded from the data analysis. A free demo implementation of the task can be found in a GitHub repository (https://github.com/ChaitLabUCL/HeadphoneCheck_Test).
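
The dichotic construction can be sketched in a few lines of R: white noise is copied to the second channel with its phase inverted (a 180° shift) within a narrow band around 600 Hz. The bandwidth below is an assumption for illustration and need not match the exact parameters of Milne et al. (2020).

```r
# Generate a Huggins-pitch stimulus: identical noise in both ears except for
# a phase-inverted narrow band centred at f0 (heard as a faint tone only
# when listening dichotically over headphones).
make_huggins <- function(fs = 44100, dur_s = 1, f0 = 600, bw = 0.06) {
  n    <- fs * dur_s
  left <- rnorm(n)                       # white noise, left channel
  spec <- fft(left)
  f    <- (0:(n - 1)) * fs / n           # frequency of each FFT bin
  band <- (f > f0 * (1 - bw) & f < f0 * (1 + bw)) |
          (f > fs - f0 * (1 + bw) & f < fs - f0 * (1 - bw))  # mirrored bins
  spec[band] <- -spec[band]              # negate = 180-degree phase shift
  right <- Re(fft(spec, inverse = TRUE)) / n
  s <- cbind(left, right)
  s / max(abs(s))                        # normalise to avoid clipping
}
```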

Demographics questionnaire

The demographics questionnaire consisted of several items designed to gather participants' background information. Participants were asked to provide information on their age, gender, and educational level. This demographic data helped to characterise the study sample and provided context for interpreting the results of the main experimental measures.

Procedure

Ethical approval for the study was obtained from the ethics committees at the University of Oldenburg and the University of Leeds. Informed consent was obtained from all participants tested. Two different samples of participants were recruited for experiment 1. For Sample 1, the experiment was conducted using testable.org, a web-browser-based application for creating behavioural experiments and surveys online (e.g., Rezlescu et al., 2020); for Sample 2, it was conducted using psychTestR (Harrison, 2020), an R package for creating web-browser-based behavioural experiments. The study was conducted in a single online session, with an average completion time of about 35 min for participants in Sample 1 and 10 min for those in Sample 2. The study was available in both English and German. All participants provided digital consent by signing an electronic form and explicitly agreed, via a checkbox, to remain in a quiet and distraction-free environment for the duration of the experiment. Participants who usually wore hearing aids were instructed to remove them for the study. All participants were financially compensated based on an hourly rate of €10: participants in Sample 1 received €5, while those in Sample 2 received €2 (or the equivalent in British pounds for UK residents), reflecting the different administration times for each sample group. Prior to the main experiment, a calibration sound was presented, and participants were instructed to adjust the volume of their playback device to a loud but comfortable level. The headphone screening followed. Participants then underwent an MSA training session featuring five unique excerpts from the candidate list that were not included in the main test set. Immediate feedback was provided after each response, and participants could repeat the training phase as often as desired. The training phase was followed by the main experiment, that is, the MSA, in which no feedback was given. A total of 160 trials, each presenting a single target-to-mixture level-ratio version of one of the 160 excerpts, were administered. The order in which the 160 items were presented and the selection of the target-to-mixture level-ratio version were randomised. Participants were allowed to pause at any time, with a recommended pause after half of the trials. After completing the experimental part of the study, a questionnaire regarding personal information, including degree of hearing impairment, age, and gender, as well as the two subscales of the Gold-MSI for musical training and perceptual abilities, was administered. The same procedure was followed for Sample 2, but participants were presented with only 32 items, with each combination of parameters occurring no more than once, effectively reducing the duration of the experiment to approximately 10 min. The rationale for this approach was to obtain a diverse participant sample with individuals exhibiting varied profiles, such as differing hearing abilities, ages, and genders. Accordingly, in Sample 2, we prioritised participant diversity over measurement accuracy.

Participants

The first sample was recruited through a newspaper article, mailing lists, and a call for participation posted on the online job board of the University of Oldenburg; 126 participants (69 female) took part. Among these, 47 self-reported having at least a mild hearing impairment (14 female; M = 61.6 years, SD = 16.6), whereas 79 reported having no hearing impairment (55 female; M = 26.9 years, SD = 9.3). The second sample was recruited via the online market research company SoundOut, located in the United Kingdom, from which 1078 individuals completed the experiment (M = 30.19 years, SD = 11.72, 598 female). To ensure some control over the playback conditions during the experiment, participants were instructed to wear two-channel headphones, which was checked with a screening test (Milne et al., 2020). Of the initial sample, 548 individuals failed the headphone screening and were thus excluded. Consequently, the analysis for experiment 1 included a total of 525 NH participants with ages ranging from 18 to 72 years (M = 28.6 years, SD = 11.02, 274 female) and 131 HI participants with ages ranging from 23 to 82 years (M = 40.5 years, SD = 20.6, 67 female). The geographical locations of participants were as follows: Australia (1), Canada (10), United Kingdom (135), Ireland (1), New Zealand (2), United States (322), Germany (126), and not specified (59). For a detailed overview with respect to individuals' self-rated degree of hearing loss, see Fig. 2E.

Fig. 2

Results of experiment 1. Panels (A), (B), and (C) show the proportion of correct scores of individual test items for each of the parameters employed. Individual dots (left of the boxplots) represent the observed accuracy averaged over participants for each item. Light-blue boxplots correspond to items in which the target was present in the mixture (orange: target not present in the mixture). '−∞' corresponds to items in which no target was presented in the mixture. Panel (D) shows the conditional effects plot of the final B-GLMM (see Table A1). Median estimates with inner 50% (the length of the blue bars around the median) and outer 95% (the length of the black bars around the median) high-density intervals (HDI) of the estimated parameters are shown. HDIs are uncertainty intervals and reflect the probability of the parameter estimate lying within the given intervals. Panel (E) shows boxplots of participants' average accuracy as a function of the degree of hearing impairment. Panels (F) and (G) show scatterplots illustrating the relationship between participants' observed average detection scores and the two subscales of the Gold-MSI (musical training and musical perception). The individual dots represent the score for each participant averaged over all items

Data analysis

Bayesian generalised logistic mixed-effects models (B-GLMM) were used for the analysis. B-GLMMs are powerful and flexible alternatives to more commonly used frequentist approaches. In particular, they account for uncertainty in parameter estimation and can provide stable estimates for categorical variables with many levels and for smaller sample sizes with the help of informative prior distributions (e.g., Dienes & Mclatchie, 2018; Stegmueller, 2013). Using (Bayesian) mixed-effects models with a binary dependent variable (correct/incorrect participant response), we can measure how different aspects of a musical excerpt affect its perceptual processing difficulty. We report the median estimates and 95% credible intervals of the conditional effects for the final Bayesian logistic mixed-effects model. These values were obtained by averaging the conditional-effects estimates across the respective factor of interest. By examining credible intervals and comparing the posterior probabilities of different hypotheses, we evaluate the extent to which detection accuracies differ among the respective conditions. The observed descriptive statistics, including the overall results averaged across items, can be found in the accompanying plots, which offer a visual representation of the data and highlight the main patterns observed in the study (see Figs. 2 and 4). Before conducting the mixed-effects analysis, the data were inspected for unexpected response patterns. As a result, a total of 29 items were removed from the final analysis. An individual inspection of these problematic items showed that in some cases, the low target-to-mixture level ratios made the target inaudible. In other cases, backing vocals were so similar to the lead vocals that the experimental task was in fact ill-defined. Eight items with lead vocals as the target instrument were also excluded, as they showed a success rate of 100% and thus did not provide discriminative information. It should be mentioned that although some participants scored below the 50% chance performance level (occurring only in Sample 2), all participant scores were retained in the analysis. The rationale for this decision is twofold. First, the task inherently varies in item difficulty, and excluding low-performing participants could introduce selection bias, potentially underestimating the true challenge posed by certain items and compromising the generalisability of the results. Second, the sample had already been subjected to a rigorous screening process: a sizeable cohort (N = 548) was excluded for failing the headphone screening test, presumably filtering out participants not adequately committed to the task. Retaining all participant scores was thus deemed crucial for preserving the integrity and representativeness of the data.

All analyses were executed in RStudio (v2022.07.2+576; RStudio Team, 2020) and the Stan modelling language (v2.21.7; Carpenter et al., 2017), using the package brms (v2.18.0; Bürkner, 2017) as an interface from R to Stan.

Bayesian GLMM fitting

We fitted several B-GLMMs (Bernoulli family with identity link; estimated using Markov chain Monte Carlo [MCMC] sampling with 35,643 observations, four chains of 6000 iterations each, and a warmup of 3000) to predict participants' performance (SCORE) at the level of each individual trial (binary item responses, with 0 = incorrect and 1 = correct). The model was built step by step in a hierarchical way by adding one parameter at a time, which allowed us to evaluate the individual impact of each variable. We first added each parameter separately to the same model structure to gain a first understanding of the general predictive performance of each parameter in isolation (as shown in Table 1).

Table 1 Model comparison of all Bayesian (GLM) models of experiment 1

We compared three models, each with a single fixed effect, against a null model that included only random effects, using Bayes factors (BF). The BF quantifies the evidence for one model over another: a BF of 1 indicates that both models are equally likely, while a BF greater than 1 suggests that the data better support the alternative model. Because models 1A, 1B, and 1C differ only in the fixed effect used, we interpret the BF as an indicator of which predictor is most strongly supported by the data for explaining the dependent variable, namely the MSA scores.
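
In brms, such comparisons can be set up roughly as follows. This is a simplified sketch with hypothetical column names; the authors' models additionally contained the guessing and inattention parameters described below, which are omitted here for brevity. Bridge sampling for Bayes factors requires saving all parameter draws.

```r
library(brms)

# Null model: random intercepts only, no fixed effects
fit_null <- brm(
  score ~ 1 + (1 | participant) + (1 | excerpt),
  data = d, family = bernoulli(),
  save_pars = save_pars(all = TRUE)   # needed for bridge sampling
)

# Single-predictor model, e.g., model 1C with LEVEL as the only fixed effect
fit_level <- update(fit_null, formula. = . ~ . + LEVEL)

bayes_factor(fit_level, fit_null)     # BF > 1 favours the LEVEL model
```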

Model 1C, which incorporated LEVEL as a fixed effect, and model 1A (TARGET) provided the strongest evidence compared to the null model (BF_LEVEL = ∞; BF_TARGET = 1.84e+48). These were followed by model 1B, which included NUM as a fixed effect (BF = 7.39e+07). These results suggest that incorporating TARGET and especially LEVEL as fixed effects explains the observed data better than the number of instruments in the mixture (NUM), highlighting the potential importance of the choice of the target instrument and the target-to-mixture level ratio for the MSA task. However, these results must be interpreted with caution, as the Bayes factor only provides relative evidence between models and does not directly quantify the effect size of each fixed effect. Further investigation, including effect size estimation and comparison, is needed to draw more definitive conclusions about the influence of these factors on the outcome variable (as indicated by the medians of the conditional effects and their accompanying density intervals; see Makowski et al., 2019a, 2019b).

Based on this analysis, we planned four versions of the model, each adding a parameter in the order of its hypothesised importance for predicting the MSA test scores. We then used the leave-one-out cross-validation information criterion (LOOIC) and BF to provide a statistical measure of the models' predictive performance. LOOIC is a model comparison metric derived from the concept of cross-validation: it estimates the predictive accuracy of a model by leaving out one observation at a time, fitting the model to the remaining data, and then predicting the left-out observation. A lower LOOIC value indicates better predictive performance, and the model with the lowest LOOIC value is considered the best (Vehtari et al., 2017). The final model was chosen based on a combination of these measures, considering the model's relative fit, its complexity, and its predictive performance. Model 1G (see Table 1) was excluded from the pool of candidate models, as its Pareto k estimates indicated potential issues with the model's convergence or MCMC sampling efficiency. The high BF value suggests that model 1F is the model most likely to explain the observed data when compared to the other candidates. Additionally, the low LOOIC value indicates that model 1F offers the optimal balance between predictive ability and complexity (see Table 1). The final model used (1) LEVEL, (2) TARGET, and (3) NUM as fixed factors. For each of these factors, an interaction with the presence of the target in the mixture (PRESENCE) was included. Both excerpt and participant were added as random effects, allowing the intercept to vary across participants and excerpts. Priors over the guessing and inattention parameters were set as beta distributions (α = 1, β = 1), bounded between 0.4 and 0.6 for the guessing parameter (the expected success rate if a participant answered randomly) and between 0 and 0.1 for the inattention parameter (the expected probability of a participant not paying attention).
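
To make the final model structure concrete, a brms specification in the spirit of model 1F could look as follows. This is a sketch under assumptions: variable names are hypothetical, and the guessing/inattention structure is expressed as a non-linear formula in which the success probability is bounded by the two parameters (hence the identity link).

```r
library(brms)

bform <- bf(
  # P(correct) = guess + (1 - guess - inatt) * inv_logit(latent ability)
  score ~ guess + (1 - guess - inatt) * inv_logit(eta),
  eta   ~ LEVEL * PRESENCE + TARGET * PRESENCE + NUM * PRESENCE +
          (1 | participant) + (1 | excerpt),   # random intercepts
  guess ~ 1, inatt ~ 1,                        # constant across trials
  nl = TRUE
)

bprior <- prior(beta(1, 1), nlpar = "guess", lb = 0.4, ub = 0.6) +
          prior(beta(1, 1), nlpar = "inatt", lb = 0,   ub = 0.1)

fit_1f <- brm(bform, data = d, family = bernoulli(link = "identity"),
              prior = bprior, chains = 4, iter = 6000, warmup = 3000)

loo(fit_1f)   # LOOIC, as used in the sequential model comparison
```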

Results

Model fit

Convergence and stability of the Bayesian sampling for the final model (1F) were assessed using Ȓ, which was below 1.01 (Vehtari et al., 2019), while the effective sample size (ESS) was above 1000 (Bürkner, 2017). All Pareto k estimates were good (k < 0.5; Vehtari et al., 2019). To assess the model's explanatory power, we used its classification accuracy, that is, the proportion of correct predictions out of the total number of predictions. Although classification accuracy does not provide a direct measure of explanatory power, it offers insight into how well the model predicts binary outcomes. The final model correctly identified 78.3% of the observed response data. When the random-effect information was excluded, the three fixed-effect factors alone still accounted for a classification accuracy of 70.7%. The estimates obtained for this model are summarised in Table A1. Overall, the model estimated a guessing parameter of 0.43, which lies reasonably close to the theoretically assumed 0.5. According to the model, the inattention parameter was below 0.01, suggesting that inattention effects were negligible for the short test durations employed in the present research. A detailed overview of the estimated conditional effects of the model (effects of the parameters employed relative to the reference conditions) can be found in Fig. 2D (see also Table A2). The results and statistical effects for the individual experimental factors are described in the following.
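
Classification accuracy of this kind can be computed from the posterior expected values, sketched here with the hypothetical fit_1f and data frame d from the sketch above; re_formula = NA drops the random effects so that only the fixed effects contribute.

```r
# Proportion of trials whose thresholded posterior mean prediction
# matches the observed binary response
p_full <- colMeans(posterior_epred(fit_1f))            # per-trial P(correct)
mean((p_full > 0.5) == d$score)                        # full model accuracy

p_fixed <- colMeans(posterior_epred(fit_1f, re_formula = NA))
mean((p_fixed > 0.5) == d$score)                       # fixed effects only
```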


Level ratio between target and mixture

Descriptive statistics and overall results are presented in Fig. 2A. The model estimates show that as the target-to-mixture level ratio decreased, the MSA task became increasingly difficult. The median difference in accuracy when changing the level ratio from 0 dB (M = 93.6%; CI = [90%; 96.3%]) to −5 dB (M = 89.8%; CI = [85.4%; 93.5%]) was 3.8 percentage points. Based on the B-GLMM, non-linear one-sided hypothesis testing was performed, indicating that the posterior probability (PP) of a negative difference between 0 dB and −5 dB was above 0.95. This can be interpreted as evidence for a difference between conditions (for uniform priors, the posterior probabilities correspond exactly to frequentist one-sided p-values; see, e.g., Marsman & Wagenmakers, 2017). A substantial difference in performance was also observed as the level ratio between the target and mixture was further decreased, with the detection rate dropping from a median percentage correct of 80.1% (CI = [73.3%, 85.5%]) at −10 dB to 56.6% (CI = [50.5%, 64.2%]) at −15 dB. When there was no target in the mix, the median detection accuracy was 79% (CI = [70.2%, 86.6%]). In short, the model yielded strong evidence that this difference was meaningful. The congruence between the descriptive statistics of the observed data and the model estimates highlights the strong model fit and underscores the robustness of the observed effect.
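
Such one-sided tests can be run with brms::hypothesis(); the parameter name below is hypothetical and depends on the factor coding (here, the dummy-coded contrast of the −5-dB level against the 0-dB reference on the latent scale).

```r
# Posterior probability that the -5 dB contrast is negative, i.e., that
# accuracy is lower at -5 dB than at the 0-dB reference level
h <- hypothesis(fit_1f, "eta_LEVELm5 < 0", class = "b")
print(h)   # the "Post.Prob" column gives the posterior probability
```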

Choice of the target instrument

The model indicates strong interaction effects of both the number of instruments in the mix and the target category with the presence of the target in the mix (PRESENCE). Thus, differences in correct detection rates must be interpreted in light of this interaction. In line with previous research (e.g., Bürgel et al., 2021), lead vocals yielded outstanding accuracy. Even though several items were excluded from this condition due to their perfect detection rates (100% of participants answered correctly), lead vocals still demonstrated the highest detection rates both when the target was present in the mix (M = 87.2%; CI = [82.5%; 91.2%]) and when it was not (M = 94.1%; CI = [88.7%; 97.0%]). The bass, on the other hand, was the most difficult to detect and also showed the greatest variation: for items with the target in the mix the median was 60.8% (CI = [54.3%; 68.0%]), and it was 80.2% (CI = [70.3%; 88.5%]) when the target instrument was not included. Both guitar (M = 84.8%; CI = [80.0%; 89.1%]) and piano (M = 87.1%; CI = [82.4%; 91.3%]) remained in an easy difficulty range for items in which the target was present, but became moderately difficult when the target was not present (Mguitar = 70.8%; CI = [61.1%; 80.5%]; Mpiano = 70.2%; CI = [60.7%; 80.3%]). As with the LEVEL factor, non-linear hypothesis testing was performed. Non-negligible differences were found between the bass and all other target instruments, both when the items included the target in the mixture and when the target was missing. When the target was not part of the mixture, the lead vocals also showed substantial differences in detection rates compared to all other instruments. For items in which the target played in the mixture, the differences in performance for lead vocals were apparent only in comparison to the bass. Guitar and piano had comparable detection rates.

Number of instruments in the mixture

The B-GLMM indicated a relevant interaction effect between the number of instruments and the presence of the target in the mixture. When the target did not play in the mixture, more complex musical excerpts with six instruments showed lower detection rates (M = 74.1%; CI = [65.4%; 82.7%]) compared to simpler mixtures with three instruments (M = 83.6%; CI = [75.0%; 90.4%]). When the target was present in the mixture, the median detection accuracy differed only slightly between the two mixture sizes: 77.4% (CI = [72.2%; 82.5%]) for the six-instrument mix and 82.5% (CI = [77.4%; 87.3%]) for the three-instrument mix. Based on the model, this difference was found to be meaningful.

Individual differences factors

We assessed the relationship between the model-derived MSA scores, which include participant random intercepts adjusted for the guessing and inattention parameters, and the Gold-MSI perceptual and training scores using Pearson's product–moment correlation. We found a small correlation between participants' model-based MSA scores and the musical training subscale (r = 0.103, p = 0.008), and a slightly larger correlation between MSA scores and the musical perception subscale (r = 0.188, p < 0.001). Even though the inter-individual variation was quite large within each hearing group, we observed a decrease in performance with increasing self-reported degree of hearing impairment. Averaged across all items, the mean instrument detection accuracy decreased from 79.8% (CI = [78.7%; 80.9%]) for participants with no HI to 73.9% (CI = [69.8%; 77.9%]) for individuals with mild HI, 77.1% (CI = [72.8%; 81.4%]) for those with moderate HI, 74.5% (CI = [69.3%; 79.6%]) for participants with severe HI, and 69.0% (CI = [53.2%; 84.9%]) for individuals with profound hearing impairment. Participants' observed responses are displayed in Fig. 2E, F, G.

Discussion

As hypothesised, the model estimates suggested all three employed parameters had a robust influence on the participants’ accuracy. Experiment 1 showed that the accuracy in the MSA test depends on the choice of the target instrument, the number of instruments in the mixture, and the level ratio between the target and mixture. Lead vocals had the highest detection rates, while the bass was the most difficult to process. The detection rates were lower for more complex musical excerpts with six instruments in the mixture compared to simpler mixtures with three instruments. As indicated by the sequential model comparison (see BF in Table 1), the effect of the level ratio between the target and mixture had the most substantial effect on the difficulty of the task, with the task becoming increasingly difficult as the target-to-mixture level ratio decreased. When included in the model, both the number of instruments in the mixture and the choice of the target instruments added comparatively smaller but non-negligible improvements to the predictive performance of the model. The degree of hearing impairment also showed a relationship with the MSA test, with a decrease in performance as the degree of hearing impairment increased. There was a weak correlation between simple MSA sum scores and musical training and a slightly larger correlation between MSA scores and musical perception. Even among the most highly musically trained participants in this sample, no ceiling effects could be observed, whereas some of the more severely hearing-impaired individuals performed at chance level (see Fig. 2E and G). This suggests that the test was challenging enough to provide meaningful results for individuals with prior musical expertise and is also likely sufficient to measure ASA performance in the context of complex multi-source music for individuals with both normal hearing and severe hearing impairments.

Experiment 1 showed the strongest effects for the level ratio between target and mixture. In experiment 2, we explored whether another important factor of music production, spatialisation in terms of a stereo image, would prove similarly powerful for adjusting item difficulty.

Experiment 2: Calibration phase—Part 2

Methods

Stimuli

In the second experiment, the same stimuli as in experiment 1 were used. For this experiment, however, instruments were presented spatially separated along the azimuth. To create the stereo image, inter-aural level differences (ILD) were adjusted. ILDs are critical binaural cues that contribute to sound source localisation (e.g., Stecker & Gallun, 2012). Here, we used the equal-power-panning method (e.g., Blauert & Braasch, 2008). Four conditions were thus generated, each characterised by varying panning widths, leading to stereo images with angular widths of (A) 0°, (B) 90°, and (C and D) 180°, yielding a total of 640 items (see Fig. 3 for a schematic illustration). Specifically, condition A served as a reference, where all instruments and the target were centrally localised. In condition B, both the target and the other instruments were randomly allocated to one of the positions at − 45°, − 27°, − 9°, 9°, 27°, or 45° (i.e., 90° of the frontal azimuth) within the stereo field. In conditions C and D, the positions were expanded to − 90°, − 45°, − 18°, 18°, 45°, or 90° (i.e., 180° of the frontal azimuth). In contrast to the fixed spatial positions of the target instrument in the preceding conditions, condition D introduced a modification: the spatial location of the target changed between its isolated presentation and its occurrence within the mixture. This alteration effectively removed the informative cue provided by a consistent target location. To avoid ceiling effects, the target-to-mixture level ratio was kept constant at − 10 dB in all conditions.
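
Equal-power panning realises the ILDs with complementary channel gains whose squares sum to one, so that the total power stays constant across azimuths. A minimal sketch, assuming a mono source vector and hypothetical names:

```r
# Pan a mono signal to an azimuth between -90 (hard left) and +90 (hard
# right) degrees using equal-power gains: gL^2 + gR^2 = 1 at every angle.
pan_equal_power <- function(x, azimuth_deg) {
  p <- (azimuth_deg + 90) / 180 * (pi / 2)   # map [-90, 90] to [0, pi/2]
  cbind(left = x * cos(p), right = x * sin(p))
}

# e.g., place a source at -45 degrees within the 180-degree-wide image:
# stereo <- pan_equal_power(source_mono, -45)
```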

Fig. 3

Schematic illustration of the stereo width conditions in frontal azimuth in experiment 2. (A) All instruments are presented at 0°, i.e., the monaural reference; (B) the instruments are distributed evenly at a stereo width of 90° (that is, each instrument is randomly allocated to one of the positions at −45°, −27°, −9°, 9°, 27°, 45°); (C) all instruments were distributed evenly across the full stereo width of 180°; (D) all instruments were distributed evenly across the full stereo width of 180°, but the position of the target instrument was changed when it was presented during the target presentation phase compared to when it was presented within the mixture (during the mixture presentation phase)

Procedure

The procedure and materials used in this experiment were similar to those of experiment 1 and progressed in the following order: sound level calibration, Huggins headphone screening task, training phase of the MSA, data collection phase of the MSA, demographics questionnaire, and the two subscales for musical training and perceptual abilities of the German version of the Gold-MSI self-report questionnaire. Participants listened to all 160 audio excerpts, and for each excerpt, they were presented with one of the four different stereo width conditions. The order in which the excerpts were presented and the selection of the stereo width conditions were randomised across trials.

Participants

The second experiment included a total of 81 participants (53% female), of whom 40 were self-reported NH listeners and 41 self-reported HI listeners. One HI listener was removed from the analysis for stating that he did not use headphones. The remaining 80 participants were on average 43.2 years old (SD = 21.45); the NH individuals were predominantly younger, with a mean age of 26.4 years (SD = 8.4, range: 20–62 years), and the HI individuals predominantly older, with a mean age of 60.2 years (SD = 17, range: 17–84 years). For a detailed overview with respect to the degree of hearing impairment, see Fig. 4F.

Fig. 4

Results of experiment 2. Panels (A), (B), and (C) show the proportion of correct scores of individual test items for each of the parameters employed. Individual dots (left of the boxplots) represent the observed accuracy averaged over participants for each item. Light-blue boxplots correspond to items in which the target was present in the mixture. Panels (D) and (E) show the conditional effects plots for all conditions of the final B-GLMM (see Table 2). Median estimates with inner 50% (the length of the blue bars around the median) and outer 95% (the length of the black bars around the median) high-density intervals (HDI) of the estimated parameters are shown. HDIs are uncertainty intervals and reflect the probability of the parameter estimate lying within the given intervals. Panel (F) shows boxplots of participants' average accuracy as a function of the degree of hearing impairment. Panels (G) and (H) show scatterplots illustrating the relationship between participants' observed average detection scores and the two subscales of the Gold-MSI (musical training and musical perception). The individual dots represent the score for each participant averaged over all items

Data analysis

Bayesian GLMM fitting

Similar to the first experiment, we used a Bayesian GLMM (Bernoulli family with identity link; estimated using MCMC sampling with 12,800 observations, four chains of 6000 iterations, and a warmup of 3000) to predict participant performance. The strategy for building the random- and fixed-effects structure was highly similar to that of experiment 1. However, instead of the LEVEL parameter, the stereo width condition (ILD, i.e., inter-aural level differences), interacting with the presence of the target in the mixture (PRESENCE), was included as a fixed effect. First, we used Bayes factors to compare three models, each including only one fixed effect (see Table 2). Model 2B, which included TARGET as a fixed effect, provided the strongest evidence compared to the null model (BF = 1.84e+06), followed by model 2A, which included NUM as a fixed effect (BF = 1.67e+06), and model 2C, with ILD as a fixed effect (BF = 1.21e+05). Similar to experiment 1, these results emphasise the potential importance of the choice of the target instrument. In comparison, the evidence for ILD was the weakest.

For the model selection process, the same sequential comparison approach as in experiment 1 was used. Based on this approach, model 2F was selected as the best model due to its balance between fitting performance and model complexity. Model 2G was not considered, as its effective sample size (ESS) was too low, indicating that the model had not fully converged. Model 2F included the factors ILD, TARGET, and NUM, each with an interaction effect with PRESENCE. Furthermore, random intercepts were included to account for variation across participants and excerpts, and the priors were defined as in experiment 1.

Results

Model fit

The model (2F) converged (Ȓ < 1.01, ESS > 1000) and all Pareto k estimates were in an acceptable range (i.e., k < 0.5). Overall, the model correctly predicted 79% of the trial-level observations. When the random effects (random intercepts for excerpts and participants) were excluded, the model could still correctly identify 70.4% of the observed data. See Table A2 for a detailed model summary of the estimates and model fit. In line with the procedure in the first experiment, all items with 100% detection rates were excluded from the analysis (4 items with bass as the target, 14 with guitar, 20 with piano, and 41 with lead vocals).

Stereo width

In this experiment, we investigated the impact of ILD on accuracy across four distinct stereo conditions. The B-GLMM identified a relevant interaction between stereo conditions and target presence within the mixture, emphasising the need to account for this interaction when interpreting the results. With the target absent from the mixture, median accuracy was as follows: 77.5% (CI = [67.8%; 86.0%]) for condition A (0°), 78.8% (CI = [68.9%; 87.1%]) for condition B (90°), 81.8% (CI = [72.3%; 89.4%]) for condition C (180°), and 77.4% (CI = [67.5%; 86.0%]) for condition D (180° + R). Accordingly, when the target was absent in the mix, a trend was present only from condition A to C (see Fig. 4E). Conversely, when the target was present in the mixture, a consistent trend in the median accuracy was apparent: Median accuracy was 79.5% (CI = [70.7%; 86.4%]) for condition A (0°), 82.1% (CI = [73.9%; 88.6%]) for condition B (90°), 82.8% (CI = [74.8%; 89.1%]) for condition C (180°), and 86.9% (CI = [79.9%; 92.2%]) for condition D (180° + R). In line with experiment 1, non-linear hypothesis tests were conducted, confirming a trend of improved accuracy. This improvement was most pronounced in the full stereo width conditions of 180° (C and D) in comparison to the monaural condition A (refer to Fig. 4D).

Choice of the target instrument

In experiment 2, we observed a pattern of interaction effects between the target instrument category and the presence of the target in the mix similar to that in experiment 1. Consistent with our previous findings, lead vocals demonstrated exceptional accuracy (again, despite the exclusion of several items in this condition due to perfect detection rates). When the target was present in the mix, lead vocals showed a median detection rate of 92.3% (CI = [84.9%; 96.4%]), and when the target was absent, the rate was 88.3% (CI = [79.6%; 93.9%]). In contrast, the bass was the most challenging instrument to detect, with the largest degree of variability: the median was 58.1% (CI = [50.0%; 69.3%]) when the target was in the mix and 80.4% (CI = [70.3%; 88.5%]) when it was not. The guitar and piano remained comparatively easy when the target was present in the mix (Mguitar = 88.2%, CI = [78.6%; 94.2%]; Mpiano = 92.8%, CI = [85.7%; 96.6%]), but became moderately difficult when the target was absent, with median detection rates of 77.2% (CI = [67.1%; 86.2%]) for the guitar and 69.6% (CI = [59.5%; 79.8%]) for the piano.

Number of instruments in the mixture

The results regarding the number of instruments in the mix in experiment 2 were in line with those from experiment 1. More complex musical excerpts with six instruments in the mixture yielded lower detection rates than simpler mixtures with only three instruments. When the target was not present in the mixture, the median for simpler three-instrument mixtures was 89.1% (CI = [80.4%; 94.5%]), while for more complex six-instrument mixtures it was 68.6% (CI = [57.9%; 79.7%]); according to the model, this difference was substantial and meaningful. A similar, albeit less pronounced, trend was observed when the target was present in the mixture: the median correct detection rate was 85.5% (CI = [78.1%; 91.1%]) for the three-instrument mix and 80.2% (CI = [71.5%; 87.1%]) for the six-instrument mix.

Individual differences factors

The model indicated a decrease in accuracy with an increasing degree of hearing impairment. On average, individuals with NH had a mean accuracy of 84.7% (CI = [82.5%; 86.9%]). The mean accuracy for individuals with mild HI was 80.5% (CI = [76.1%; 84.9%]), while for those with moderate HI it was 80.0% (CI = [77.1%; 82.9%]). For individuals with severe HI, the mean accuracy was 70.2% (CI = [64.0%; 76.4%]), and the lowest mean accuracy was seen in individuals with profound HI, at 54.2% (CI = [44.8%; 63.5%]). The correlation between model-based MSA scores and the GMS musical training scores remained small (r = −0.09, p = 0.45), while the correlation with musical perception abilities was notably stronger (r = 0.20, p = 0.073).

Discussion

Experiment 1 demonstrated that accuracy was dependent on the target instrument, with lead vocals yielding the highest and bass the lowest overall performance. Furthermore, mixture complexity influenced detection rates, as simpler three-instrument mixtures resulted in higher detection rates compared to more complex six-instrument mixtures. The results also suggested that individuals who reported better musical perceptual abilities tended to perform better on the MSA test, while the role of musical training remained inconclusive. Additionally, listeners with more severe hearing impairments exhibited lower MSA scores. Experiment 2 corroborated these findings.

We hypothesised that increased stereo width, as induced by ILDs, would facilitate target localisation and segregation within the mixture, leading to improved accuracy irrespective of the presence of the target. However, this was not the case when the target was absent from the mixture. One possible explanation is that ILD cues alone are not strong enough to improve accuracy via the stereo percept. This is unlikely, though, since we observed improved accuracy when the target was present in the mixture, particularly in stereo conditions C (180°) and D (180° + R) compared to monaural condition A. Interestingly, condition D displayed even higher accuracy, despite the target position changing between the presentation of the target alone and the mixture. This outcome was unexpected; we had initially assumed that changing the target position would cause confusion and thus lower detection rates in comparison to condition C. There are two possible explanations for this behaviour. First, participants were not informed about the positional cue, and the selection of conditions was fully randomised. Accordingly, given the fully randomised and undisclosed study design, participants might never have learned to rely on the positional cue and thus were not confused by the change in position within condition D. However, this explanation cannot account for the improvement observed between conditions C and D. A second explanation is the unexpectedness of the positional change of the target, which might have produced an acoustic novelty effect that shifted the locus of attention towards the unexpected stimulus (i.e., the target instrument at the unexpected position). This assumption is supported by research repeatedly showing that infrequent auditory changes in a series of otherwise repeated sounds trigger an automatic response to the novel or deviant stimulus (e.g., Parmentier, 2014). Further research is necessary to elucidate the underlying mechanisms responsible for these performance differences. Overall, the results indicate that increasing stereo width leads to comparably small improvements in accuracy (see the BF for model 2C), but only for trials where the target is present in the mix.

Integrated discussion: Calibration phase

Experiments 1 and 2 were designed to identify suitable parameters for manipulating test item difficulty so as to precisely probe the MSA abilities of both NH and HI listeners, irrespective of age, musical sophistication, and musical training. We hypothesised that (1) the target-to-mix level ratio, (2) the choice of target instrument, (3) the number of instruments in the mixture, and (4) sound localisation cues would have a substantial influence on auditory scene analysis abilities in the context of music. We demonstrated that all four parameters influence scene analysis performance. Notably, despite the diversity in participants' musical perceptual abilities, prior musical training, and degree of hearing impairment, test scores showed a reasonable degree of variability within groups, and we observed no strong ceiling or floor effects. The results of the two calibration experiments thus provide conclusive evidence that all four parameters are suitable for manipulating test item difficulty.

Although the results identified four parameters that influence performance on the MSA test, we opted to include only the first three in our test implementation. Our findings indicated only a modest effect of stereo width (i.e., ILDs), and only for items that contained the target. By excluding the ILD parameter, we aimed to make our approach more adaptable to online experimental setups that use mono audio and to circumvent potential complications for participants with asymmetric hearing impairments. Moreover, since the perception of stereo width would mainly have served to make test items less difficult, and the other parameters were more efficient at generating easier items, we chose a more parsimonious model comprising only the three primary factors.

Based on the findings from experiment 1 and experiment 2, we can infer that individual differences in abilities play a notable role in the performance of the MSA task. The predictive accuracy of the final model, including both fixed and random effects, was consistently higher than the model with fixed effects alone. In experiments 1 and 2, the inclusion of random effects led to an increase in predictive accuracy of seven percentage points. This suggests that accounting for individual differences is crucial for understanding and predicting MSA performance. These results have significant implications for the development of adaptive MSA testing based on item response theory (IRT). By incorporating individual differences into the model, adaptive testing procedures can more efficiently estimate the underlying ability of participants and tailor the test items to their specific needs. This enables more precise and efficient measurement of MSA performance, while also reducing the likelihood of floor and ceiling effects that may be present in a one-size-fits-all approach. Furthermore, the model's strong predictive performance indicates that the selected factors adequately represent the underlying cognitive processes in an ASA task.
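The comparison of predictive accuracy with and without random effects can be computed as follows (a sketch with hypothetical names, assuming a brms fit and a 0/1 response column):

```r
# Posterior-predictive classification accuracy, with and without
# group-level (random) effects.
p_full  <- predict(model_2f, re_formula = NULL)[, "Estimate"]  # includes random effects
p_fixed <- predict(model_2f, re_formula = NA)[, "Estimate"]    # fixed effects only

mean((p_full  > 0.5) == trials$correct)   # trial-level accuracy, full model
mean((p_fixed > 0.5) == trials$correct)   # trial-level accuracy, fixed effects only
```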

It is also important to acknowledge a few limitations of experiments 1 and 2. The assessment of participants’ degree of hearing impairment relied on self-report and therefore lacked audiometric precision. Another limitation is the online test setting: conducting listening tests remotely poses inherent challenges, such as variation in headphone quality and uncontrollable ambient noise, which can affect the reliability and accuracy of the assessments. Notably, the headphone screening test was not administered to participants with hearing impairments, leaving the presentation conditions uncontrolled; for instance, both the types of playback devices used and compliance with the instruction to remove hearing aids remain unknown, further complicating the evaluation of playback conditions. Overall, these challenges highlight the importance of carefully considering the limitations and potential problems of online listening tests when evaluating individuals with hearing impairments. In experiment 3, the validation experiment, we sought to address these limitations.

Experiment 3: Test validation

The primary aim of experiment 3 was to validate an adaptive version of the MSA test under controlled laboratory conditions. IRT models offer a flexible approach to creating an adaptive test: they predict the probability of a correct response based on specific item parameters that capture different ways in which items might vary, such as item difficulty, item discrimination, and parameters for guessing and inattention. In this study, the B-GLMM used in experiment 1 can be considered an explanatory IRT model (Wilson & De Boeck, 2004), as it describes the relationship between a person's latent trait level (i.e., the ability to identify musical instruments in a mixture) and their probability of responding correctly to test items. By incorporating item characteristics (e.g., target instrument, number of instruments in the mix, target-to-mixture level ratio, presence of the target in the mix) and person-specific random effects, the B-GLMM offers a flexible and robust approach to modelling response data. This makes it suitable for IRT applications in adaptive testing scenarios that assess participants' ability to identify musical instruments within complex auditory scenes. The IRT model then allows us to establish validity estimates of the adaptive MSA by examining its correlation with other measures of related constructs, such as speech-in-noise perception, melody discrimination, and mistuning perception. In addition to establishing validity, we sought to assess the test–retest reliability of the MSA test. Within the IRT framework, reliability can be estimated both from the standard error of measurement (SEM) derived from the model and by empirically comparing scores from multiple adaptive MSA measurements within the same participant. We measured participants’ MSA ability twice on the same day to investigate test–retest reliability under controlled laboratory conditions, and compared these measurements with a third set of MSA measurements obtained through online testing on a different day in the participants' home environments.

To identify potentially important factors that might explain individual differences between test-takers, we also assessed working memory, musical sophistication, and individuals’ hearing thresholds as indicators of the degree of hearing impairment. This approach enables us to examine the test–retest reliability of the MSA and to pinpoint potential factors contributing to variations in participants' abilities across different musical perceptual domains. Subsequent analyses will scrutinise correlations among these diverse tests to elucidate the latent factors influencing musical perception.

Methods

Test battery

Adaptive Musical Scene Analysis Test (MSA)

Based on the established B-GLMM from the first calibration experiment (model F), an adaptive version of the MSA was developed. Parameter estimates for the final model are given in Table A1. The final IRT model constitutes a classical four-parameter logistic model in which the discrimination parameter and the guessing and inattention parameters are constrained to be equal across all items. Item difficulty was estimated using the B-GLMM (model F), which applied a random-intercept mixed-effects structure for participant and excerpt, and fixed effects for (1) the choice of the target instrument, (2) the number of instruments in the mixture, and (3) the level ratio between the target and the mixture, each entering the model in interaction with (4) the presence of the target in the mixture. From these results, an estimate of the difficulty of each of the 40 parameter combinations was derived (see Fig. 2D). The B-GLMM fixed effects were converted to the metric of the desired item response model, and by additionally incorporating the random-effect structure of the excerpts, we obtained a unique difficulty estimate for each item. To set up the MSA task, we utilised psychTestR (v2.23.0), an R package that provides the underlying testing mechanisms for creating individual test packages that can be deployed in web-browser-based behavioural experiments. The adaptive version of the test was constructed using the psychTestRCAT (v1.6.0) package, which computes weighted-likelihood estimates of participant ability ranging from approximately −3 to +3. The discrimination parameter was set to the standard deviation of the participant intercept (that is, the estimated median of the person random-effect structure). The adaptive item selection procedure of the MSA follows Urry's rule (as cited in Magis & Gilles, 2012): moderately difficult items are presented first to estimate participants' ability level using IRT (de Ayala, 2009). The first items eligible for selection have a difficulty close to the item bank's mean difficulty, falling within a range of one standard deviation. Subsequently, more difficult items are presented to participants with higher ability and easier items to participants with lower ability. After each response, the participant's ability estimate is recalculated, and the next item selected is the one most closely aligned with the updated ability level; the same item is never presented twice. In this experiment, test–retest reliability was established from the results of the adaptive test with a length of 30 items. MSA version 2.4 was used.
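In standard notation, and under the constraints described above, the item response function corresponds to a four-parameter logistic (4PL) model in which the discrimination a, the guessing parameter c, and the inattention (upper-asymptote) parameter u are shared across items, while the difficulty b_i varies per item:

$$
P(y_{pi} = 1 \mid \theta_p) = c + (u - c)\,\frac{1}{1 + \exp\left(-a\,(\theta_p - b_i)\right)},
$$

where θ_p denotes the ability of person p. The selection step of Urry's rule can likewise be sketched in R (hypothetical item bank columns; a simplified illustration, not the psychTestRCAT implementation):

```r
# Select the unseen item whose difficulty is closest to the current
# weighted-likelihood ability estimate (Urry's rule, simplified).
select_next_item <- function(item_bank, theta_hat, administered) {
  candidates <- item_bank[!item_bank$item_id %in% administered, ]
  candidates[which.min(abs(candidates$difficulty - theta_hat)), ]
}
```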

Oldenburger Satztest (OLSA; Kollmeier et al., 2015; Wagener et al., 1999)

The OLSA is an adaptive speech-in-noise perception test. The participants' task is to verbally repeat five-word sentences, spoken by a male speaker, embedded in speech-shaped noise. The sentences were randomly generated from a predefined word-class structure (name-verb-numeral-adjective-object), with each word class drawn from a pool of 10 possible alternatives, resulting in sentences that were unpredictable for the listeners. While the masker level remained constant, the speech level was adaptively adjusted to determine individual 50% speech-reception thresholds (SRT). To increase reliability, the composite score of two test lists of 20 sentences each was used. The OLSA is essential for our experiment as it provides insight into participants' speech perception abilities in a challenging listening environment, reflecting their ASA skills in the speech domain as measured by a well-established adaptive test. A relevant relationship between OLSA and MSA performance would support the notion that the two tasks tap into similar auditory scene analysis abilities.

Frequency discrimination task (FDT)

In this test, participants are presented with a series of three tones, one of which differs in frequency, and are asked to identify the odd one out. Feedback on the correct response is provided on every trial. The test follows an adaptive 2-down-1-up procedure, and the score is calculated from two test sets of 40 items each as the geometric mean of the last six reversal points within each set. In the context of the MSA task, the ability to discriminate frequencies allows participants to identify and segregate auditory streams based on their frequency content and thus helps them to perceive and distinguish individual instruments within a complex musical scene. A relationship between FDT and MSA scores is therefore expected.
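The scoring rule lends itself to a short sketch (hypothetical input: the stimulus values at the adaptive track's reversal points):

```r
# Threshold estimate for one adaptive 2-down-1-up track: geometric mean
# of the last six reversal points.
threshold_from_reversals <- function(reversals) {
  last6 <- tail(reversals, 6)
  exp(mean(log(last6)))
}

# Example with hypothetical reversal values (e.g., frequency differences in Hz):
threshold_from_reversals(c(80, 40, 60, 30, 45, 25, 35, 28))
```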

Melodic discrimination test (MDT; Harrison et al., 2017)

For each question of this adaptive melodic working memory test, participants hear three versions of the same unfamiliar melody. Each successive version is transposed a semitone higher in pitch, and in one of the versions a note has been altered. The task is to detect the ‘odd one out' while ignoring the transposition in pitch. Using weighted-likelihood ability estimation, the MDT employs a similar adaptive item selection procedure to the MSA. The default number of 20 items was used in this experiment. The MDT is important as it evaluates a participant's working memory and their ability to discriminate auditory information, which relates to stages 2 and 3 of the MSA cognitive model. A strong correlation between MDT and MSA performance would indicate that working memory plays a role in the MSA task.

Mistuning perception test (MPT; Larrouy-Maestri et al., 2019)

This adaptive test assesses the ability to perceive mistuning in pieces of music. The task is to decide whether a vocalist is in tune or out of tune with the background music. The main output of the MPT is an ability score, estimated in the same way as for the MDT and MSA. The default number of 30 items was used in this experiment. As the MPT assesses a participant's ability to identify pitch deviations, it can inform their ability to perceive and segregate auditory streams (similar to the FDT), highlighting participants' sensitivity to tuning discrepancies. A relevant relationship between MPT and MSA performance would suggest that sensitivity to mistuning, that is, sensitivity to frequency differences in the context of music, contributes to the MSA task.

Timbre perception test (TPT; Lee & Müllensiefen, 2020)

This test examines timbre perception abilities. Participants use a slider to reproduce presented stimuli as closely as possible; the stimuli vary along three important dimensions of timbre: temporal envelope, spectral flux, and spectral centroid. For each dimension, a total of six items was presented. The final estimate represents the average score across all three blocks, ranging from 1 to 100 (higher scores indicate better performance). The TPT relates directly to participants' ability to perceive and segregate auditory streams based on timbre, a key aspect of the MSA task. A strong correlation between TPT and MSA performance would provide evidence for the importance of timbre perception in the MSA task.

Computerised Adaptive Beat Alignment Test (CA-BAT; Harrison & Müllensiefen, 2018)

The CA-BAT is an adaptive two-alternative forced-choice (2-AFC) test of beat perception ability. Participants are presented with musical excerpts accompanied by a click track, and the task is to decide whether the click track is on or off the beat. The main output of the CA-BAT is an ability score, computed from the underlying item response model and corresponding to the participant's ability estimate (similarly to the MDT). The full 25-item test, with psychometric parameters and adaptive procedure identical to those of the original study, was used. By encoding the temporal structure of musical pieces, beat perception can provide the auditory system with prior knowledge about upcoming musical events, thereby guiding stream segregation. In this sense, the ability to perceive the beat helps listeners organise auditory streams, making it easier to understand and follow the musical structure. A substantial relationship between CA-BAT and MSA performance would support the role of beat perception in the MSA task.

Backwards digit span memory test (BDS; e.g., Talamini et al., 2017; Weiss et al., 2016)

In this classical working memory test, participants remember sequences of digits. Participants are visually presented with digits one after another and are then asked to recall the digits in reverse order (e.g., for the sequence 1 2 3 4, the correct answer is 4 3 2 1). In total, two four-digit sequences (2 × 4), two five-digit sequences (2 × 5), four six-digit sequences (4 × 6), and four seven-digit sequences (4 × 7) were used. The final score represents the proportion of correctly recalled sequences. A strong correlation between BDS and MSA performance would further emphasise the importance of working memory in the MSA task.

Goldsmiths Musical Sophistication Index (GMS; Müllensiefen et al., 2014)

Whereas the calibration phase employed only the musical training and musical perception scales, the complete index (in German) was used during the validation experiment. A composite score ranging from 1 to 7 (with 7 being the highest possible score) was generated for each subscale. A robust association between MSA performance and GMS scores would highlight the critical role of listeners' musical sophistication in ASA, whereas a relationship with the subscales assessing musical training and musical perceptual abilities would provide evidence for the predictive validity of the test.

Pure-tone average audiometry (PTA)

To assess individuals' degree of hearing impairment, pure-tone audiometric thresholds were measured with an Interacoustics AD528 portable audiometer at 0.125, 0.25, 0.5, 1, 2, 4, and 8 kHz, using the standard clinical ascending–descending procedure in steps of 5 dB. Considering that elevated hearing thresholds at frequencies above 4 kHz have been shown to be meaningful in domains important for complex ASA listening scenarios (e.g., amplitude modulation detection thresholds and sound localisation; see, e.g., Moore, 2020; Narne et al., 2023), the PTA score was calculated as the average over all measured frequencies. Participants were classified as HI when the PTA of the better ear exceeded 20 dB HL (Humes, 2019). A relevant relationship between PTA and MSA performance would imply that hearing sensitivity plays a role in musical scene analysis.
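As a sketch, the PTA computation and HI classification described above amount to the following (hypothetical threshold values in dB HL for the seven measured frequencies):

```r
# Average thresholds over all measured frequencies (0.125–8 kHz),
# then classify by the better (lower-threshold) ear (Humes, 2019).
thresholds_left  <- c(10, 10, 15, 20, 30, 45, 55)  # hypothetical values
thresholds_right <- c(10, 15, 15, 25, 35, 50, 60)  # hypothetical values

better_ear_pta <- min(mean(thresholds_left), mean(thresholds_right))
is_hearing_impaired <- better_ear_pta > 20  # 20 dB HL criterion
```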

Procedure

Ethical approval was provided by the ethics committee of the University of Oldenburg, and all participants provided written informed consent. The test was administered in two separate parts. The first part was conducted in a controlled laboratory setting, where participants were seated in a soundproof booth. The calibrated equipment consisted of a computer, an RME Babyface sound card, and Sennheiser HD650 headphones. The long-term sound level was set to 75 dB SPL (A), measured with a Norsonic Nor140 sound-level meter using music-shaped noise as the excitation signal.

The first part of the test battery proceeded in the following order:

  • Block 1:

    a. Pure-tone audiometry.
    b. Speech-in-noise (OLSA).
    c. Frequency discrimination task.

  • Block 2:

    a. Demographics questionnaire.
    b. Huggins headphone screening.
    c. Mandatory pause of at least five minutes.
    d. Training phase of the MSA.
    e. First set of the adaptive MSA with 30 items.

  • Block 3:

    a. Melodic discrimination test.
    b. Mistuning perception task.
    c. Beat perception task.
    d. Second set of the adaptive MSA with 30 items.

Between each of the assessment blocks, participants were given brief breaks that included verbal interactions and instructions from the experimenter. In addition, participants were encouraged to take as many breaks as necessary between tests to ensure that they were able to complete the tasks to the best of their ability. The order of the tests was carefully designed to maximise the effectiveness of the testing, taking into account aspects like test difficulty, potential learning effects, and cognitive fatigue. Each task was administered according to established protocols.

The second part of the test battery was performed online by the same group of participants, at least 24 h after the first part. After adjusting their volume to a loud but comfortable level, participants completed the following tests:

  a. Full Gold-MSI self-report questionnaire.
  b. Third set of the adaptive MSA with 30 items.
  c. Backwards digit span test.
  d. Timbre perception test.

This online component of the test battery allowed us to assess participants' MSA abilities in an online setting. The laboratory part of the study took approximately 80–120 min, whereas the online part took between 25 and 40 min. Because of missing data, some of the following analyses were conducted on a subgroup of participants, which is specified where applicable.

Participants

The experiment included a total of 74 participants (32 male, 42 female), divided into four groups: 30 older adults with NH (age: M = 63.2, SD = 7.2), 19 older adults with HI (age: M = 70.5, SD = 7.2), 24 younger adults with NH (age: M = 25.4, SD = 4.2), and one young adult with HI (age = 26). A graphical illustration of individuals' pure-tone audiometric thresholds can be found in Figure A1. Participants were recruited via a call for participation in the local newspaper in Oldenburg, Germany, and were compensated for their time (€10 per hour).

Results

Test–retest reliability

The two-part, two-day administration design of the study allowed us to assess the consistency of MSA performance over time and to compare it across different playback environments. The test–retest reliability of the adaptive MSA was assessed using the intraclass correlation coefficient (ICC; see Koo & Li, 2016) with a two-way mixed-effects model and absolute agreement. For single measurements, which represent the reliability when the adaptive MSA is administered once, a moderate ICC(A,1) of 0.633 was observed (95% CI [0.46; 0.757], F(71, 54.9) = 4.78, p < 0.001; Pearson's r(70) = 0.67, p < 0.001). The ICC for the mean of multiple measurements (that is, the average MSA score when the test was administered twice) improved to a substantial level, with an ICC(A,2) of 0.775 (95% CI [0.625; 0.863], F(71, 51.2) = 4.78, p < 0.001). Accordingly, the ICC indicates moderate to good consistency across the two sets of 30 items when administered under controlled conditions on the same day (see Fig. 5A). The results further suggest that the MSA shows moderately accurate and stable reliability, even when comparing the combined MSA score of the controlled laboratory environment with that of the uncontrolled online environment (ICC(A,1) = 0.60, CI = [0.42; 0.73], p < 0.001; Pearson's r(66) = 0.65, p < 0.001). As test length increased, the model indicated a consistent decrease in the mean estimated standard error of measurement (SEM) for MSA ability, converging to a final SEM of 0.29 for the full test of 30 items, as illustrated in Fig. 5B. Across all participants with complete data (n = 65), the median scores for the first, second, and third MSA item sets were 0.21 (95% CI = [0.17; 0.26]), 0.28 (95% CI = [0.22; 0.33]), and 0.33 (95% CI = [0.24; 0.38]), respectively. A repeated-measures ANOVA assessing the impact of repeated testing on scores across the three MSA sets suggested a small improvement due to training effects (F(2, 128) = 3.26, p = 0.042, η² = 0.014).

Fig. 5 Test–retest reliability of the first and second set of the MSA as a function of test length (left) and the standard error of measurement (SEM) as a function of test length (right). The test–retest score (left) was assessed empirically and is based on Pearson's product-moment correlations across all participants; the SEM (right) was derived from the model's ability estimate.
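The ICC analysis described above maps directly onto the irr package (a sketch; msa_scores is a hypothetical participants-by-sessions matrix of MSA ability estimates):

```r
library(irr)

# Two-way mixed-effects model with absolute agreement:
icc(msa_scores, model = "twoway", type = "agreement", unit = "single")   # ICC(A,1)
icc(msa_scores, model = "twoway", type = "agreement", unit = "average")  # ICC(A,2)
```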

Music-related and psychoacoustic tests

In the present experiment, we found moderate correlations between the MSA test and timbre (r = 0.37), mistuning (r = 0.52), and beat perception (r = 0.46) scores (see Table 3). Notably, the strongest correlation was identified between the MSA and melody discrimination (r = 0.52). In addition, a moderate correlation was found between the MSA and the speech-in-noise measure (i.e., the SRT of the OLSA; r(63) = −0.39, p < 0.001), and after removing two severe outliers (more than three standard deviations below the median), the frequency discrimination task also revealed a moderate association with the adaptive MSA (r(61) = −0.43, p < 0.001).

Table 3 The correlation of the adaptive MSA with selected tests from the test battery

Individual differences factors

Similar to the calibration experiments, we found a negative relationship between the degree of hearing impairment (measured as elevated hearing thresholds in the better ear) and MSA scores (r = −0.37). We also found a negative correlation between age and MSA scores (r = −0.40); however, the two effects were confounded, as indicated by the correlation between age and hearing impairment (r(63) = 0.66, p < 0.001). Interestingly, among all music-related perceptual tests in the battery, the MSA showed the highest correlations with age and pure-tone audiometric thresholds (see Figure A2 for a complete correlation matrix). This relationship gives a first indication of the potential use of the MSA as a tool for diagnosing and profiling hearing impairment in a musical context. Nevertheless, the strongest relationship with PTA scores was observed for the OLSA speech-reception thresholds (r = 0.59; here, a positive correlation implies that a higher degree of hearing impairment leads to worse performance in the OLSA task). Participants' performance on the MSA tests demonstrated a notable correlation with the BDS (r = 0.39), suggesting a moderate association with general working memory capabilities. Our findings further indicate positive correlations of musical training and musical perception abilities with ASA abilities: the correlation between the MSA test and the musical training score was r = 0.46, and that with the musical perception score was r = 0.37. The strongest correlations with the general musical sophistication score were found for mistuning perception (r = 0.65, p < 0.001) and melody discrimination (r = 0.53, p < 0.001), followed closely by the MSA (r = 0.48).

Discussion

The aim of experiment 3 was to assess the psychometric properties of the adaptive version of the MSA test. Two reliability measures were calculated: the standard error of measurement (SEM), based on the ability estimate of the IRT model, and the intraclass test–retest reliability. Together, the two measures establish a comprehensive picture of reliability, as the SEM does not require participants to take the test twice, whereas the test–retest reliability does not rely on any model assumptions. Ideally, the optimal test length would be determined by the point at which both the test–retest reliability and the SEM estimates plateau; in our study, however, this was not achieved with a test length of 30 items. The final test–retest reliability was acceptable and comparable to previous measures of scene analysis ability (e.g., Kirchberger & Russo, 2015; Siedenburg et al., 2020). Increasing the test length would likely improve the test–retest reliability estimate further, although this remains to be confirmed in further empirical research.

To establish validity estimates, it is important to compare the MSA with other established tests that measure similar constructs. Unfortunately, there are no published tests specifically designed to assess ASA abilities with realistic musical excerpts. As a result, we must rely on comparisons with hypothetically associated psychoacoustic and music-related tests. The present experiment found moderate correlations between the MSA test and various other music-related tests, such as timbre, mistuning, and beat perception, with the strongest correlation found for melody discrimination. The relationship between the MSA and these tests is likely due to shared underlying cognitive processes and sensory representations. For example, in the mistuning perception task, listeners need to separate the vocal line from the accompaniment through auditory stream segregation and then assess the extent to which its pitch content conforms to the prototypical pitch distributions of the relevant musical style (Larrouy-Maestri et al., 2019). Just as in mistuning perception, the MSA task requires the separation of streams in the mixture. This similarity is also reflected in the comparable strength of the correlations of participants' frequency discrimination scores with the MSA and with the MPT (see Figure A2). The moderate correlation between the MSA task and the frequency discrimination task also highlights that the MSA task relies on rather fundamental acoustic cues such as pitch. Nonetheless, it also requires the ability to integrate multiple auditory cues and is affected by higher-level cognitive processes such as attention and memory. Similarly, the MSA and the CA-BAT (beat perception) share the requirement to process and analyse temporal aspects of musical sound, specifically the rhythms and patterns within a piece.

Another important result is the substantial association between the MSA and the MDT, which assesses the ability to discriminate between melodies. Melody discrimination abilities are often interpreted as reflecting more general cognitive traits and involve processes similar to those needed in auditory scene analysis: in both tasks, musical streams have to be perceptually encoded, stored in memory (including melodic and timbral memory), and compared to each other, followed by a decision-making process. Timbre, on the other hand, is a multidimensional attribute of sound; timbre perception characterises the ability to discriminate between musical sounds even when they are equal in loudness, tempo, and pitch. Accordingly, timbre plays a key role in the recognition of sound sources, and both the TPT and the MSA require the ability to identify and differentiate musical sounds based on their unique attributes.

In essence, the observed moderate relationships between the MSA test and specific music-related perceptual tasks, namely timbre identification, beat perception, melody discrimination, and mistuning perception, suggest that the MSA test measures similar, but not identical, cognitive and acoustic processing abilities. This provides compelling evidence for the task's convergent validity. In addition, the moderate correlation between the MSA task and a speech-in-noise measure further supports its construct validity, as both tasks measure the ability to process and parse complex auditory information (speech vs music), separate relevant sounds from competing sounds (white noise vs a mixture of instruments), and identify and interpret different elements within an acoustic scene. To further establish the validity of the MSA task, future research could focus on demonstrating its predictive validity; for instance, researchers could compare the scores of highly trained musicians who are assumed to possess excellent auditory scene analysis skills (such as conductors) with those of musically untrained listeners. This would provide additional support for the MSA task as a valid measure of auditory scene analysis in music.

General discussion

The objective of the current study was to establish a new test for ASA abilities in the context of music. Previous efforts to develop tests to assess musical perceptual abilities (Kirchberger & Russo, 2015; Siedenburg et al., 2020) were limited by relying on artificial stimuli and/or being unable to account for individuals with a broad range of perceptual abilities and diverse auditory profiles (e.g., hearing abilities, musical training, age, and cognitive abilities). In light of these limitations, our study aimed to develop an efficient and accessible test for ASA abilities in music. To accomplish this goal, we conducted three experiments.

In the first two experiments, we found that the target instrument, the number of instruments in the mixture, the target-to-mix level ratio, and the presence of localisation cues (stereo perception) had a relevant impact on an individual's ability to analyse and understand musical scenes, indicating that the specific characteristics of the stimuli used in ASA tasks can greatly influence performance. To standardise the MSA test, we conducted a calibration (experiments 1 and 2) with a sample of participants with varying degrees of hearing impairment and different musical backgrounds. In experiment 3, we built an IRT model based on the first three parameters of the already established B-GLMM. This allowed us to develop a reliable assessment tool for evaluating individuals' musical auditory scene analysis abilities. Our findings showed consistent MSA results when administering the test to the same individuals under different presentation conditions or to different individuals under the same conditions. We observed moderate correlations between the MSA and various aspects of musical perception, such as melody discrimination, timbre perception, mistuning perception, and beat perception. These findings emphasise the validity of the assessment and highlight the importance of considering ASA in understanding the interplay of the different facets of musical perception.

We also wanted to explore factors that might explain individual differences among test-takers, including musical background, working memory capacity, age, and degree of hearing impairment. Numerous studies have demonstrated that musical training or musical sophistication not only enhances basic perceptual abilities, such as pitch, timbre, and rhythm detection (Kannyo & DeLong, 2011), but also contributes to higher-order perceptual skills, including musical imagery (Gelding et al., 2021), mistuning perception (Larrouy-Maestri et al., 2019), auditory tracking (Madsen et al., 2019), and auditory stream segregation (Zendel & Alain, 2009). The results from experiment 3 align with this literature by showing a moderate relationship between musical sophistication and musical scene analysis abilities. In contrast, Bürgel et al. (2021) found no systematic difference between musicians and non-musicians using a similar instrument detection paradigm, which they attributed to the limited representation of musicians in their sample. Here, we recruited participants with a wide range of musical ability levels, yet observed only small correlations of musical training with MSA abilities in experiments 1 and 2, but moderate correlations in experiment 3, which remains intriguing. A potential explanation for these inconsistent findings lies in the way the musical training score of the Gold-MSI is operationalised: it effectively assesses a ‘lifetime’ training score, with the reported musical training potentially dating back many decades (e.g., ‘I have had formal training in music theory for __ years’). This could mean that someone who has trained intensively in music for the past 10 years may receive the same score as someone who played music extensively for 10 years but has not played for the last 30 years, diluting the measured effect of prior musical training on current perceptual performance due to skill decay. Recent musical training, in contrast, might have a stronger relationship with auditory scene analysis abilities than ‘lifetime’ musical training. Micheyl et al. (2006), for instance, showed that several hours of active training in a pitch discrimination task led non-musicians to achieve accuracy similar to that of musicians, suggesting that recent training might have a comparatively greater impact on ASA abilities. Accordingly, experiments 1 and 2 might have included ‘lifetime’ musicians who had not played much in the recent past, while a higher proportion of actively performing musicians participated in the validation experiment. This discrepancy must certainly be addressed in future research.

Furthermore, our findings revealed a notable negative correlation between age and MSA scores and a negative association with the degree of hearing impairment. However, similar to Kirchberger and Russo’s (2015) study, the effect of the degree of hearing impairment was confounded by the factor of age. This makes it difficult to determine whether the observed correlation between the degree of hearing impairment and MSA performance is due to the hearing impairment itself or to factors associated with ageing. Goossens et al. (2017) used a matched-pair design to disentangle the effects of age and hearing impairment on speech perception. Participants were matched on age and then divided into groups based on the degree of hearing impairment, which allowed for a direct comparison of speech perception in individuals with similar age levels but different levels of hearing impairment. Goossens et al. (2017) also excluded participants with potential cognitive impairments. The results indicated that even when audiometric thresholds are within normal limits and individuals show no indication of even mild cognitive impairment, masked speech perception declines by middle age and further decreases with increasing age. However, since the prevalence of hearing-impaired participants was particularly low among the young age group, we were not able to achieve a fully matched-pair design. This is an area for future research. Overall, our findings converge with the notion that even fundamental aspects of music perception such as ASA appear to be affected by working memory, age, and the degree of hearing impairment (cf. Micheyl et al., 2013; Susini et al., 2023).

A notable limitation of the current study is its cultural specificity. The MSA test and its underlying theoretical frameworks are largely rooted in Western-centric music perception research: the chosen stimuli (base tracks and genre), target instruments, and the sample of participants mainly cater to Western audiences. This inherent cultural bias raises questions about the test's cross-cultural applicability and limits the generalisability of our findings. While it may be tempting to aim for a universally applicable MSA test, the complexity and diversity of musical experiences and traditions across cultures make it more realistic and valid to develop culture-specific versions. For instance, a culture-specific version for Indian music could be developed by sourcing a multi-track database of Indian musical compositions and calibrating the test with Indian listeners. Although such culture-specific tests would feature different stimuli and item banks, they would remain conceptually comparable at the level of the latent variable, namely MSA ability. Such endeavours would make significant contributions to the field by broadening the scope of music perception research to be more inclusive and representative of global musical experiences (e.g., Jacoby et al., 2020).

Apart from cultural aspects, the test's ecological validity also merits qualification. While the MSA test employs realistic musical stimuli, the listening conditions are nonetheless constrained by the experimental context: the stimuli are presented over headphones, in a controlled setting, and in a monaural configuration. This may not fully capture the complexities of auditory scene analysis as experienced in more naturalistic settings, such as live concerts, where additional acoustic and attentional factors come into play. The term ‘ecologically valid' is therefore used here with the understanding that the test conditions are a simplification of real-world musical experiences.

In summary, the newly developed adaptive MSA test makes a valuable contribution to music perception research by efficiently measuring ASA abilities in a realistic musical context for individuals with a broad range of dispositions. The MSA test is sensitive to the effects of the degree of hearing impairment (i.e., elevated hearing thresholds) and musical background (musical sophistication and training) but is not limited by them. Future extensions of the current item bank could incorporate excerpts of classical music or excerpts from other musical genres and cultures. The MSA test is an easy-to-implement, user-friendly, enjoyable, and efficient tool for evaluating ASA performance in the context of music. With a combined training and test time of less than eight minutes, it is quick to administer and allows for flexibility through an adjustable test length (with known measurement accuracies) to meet the specific needs of a study or application. Additionally, it is open-source and freely available; detailed information on the operational R package, including an online demo, can be found in our GitHub repository (https://github.com/rhake14/MSA). By studying ASA in the context of music, researchers can gain insights that can inform the development of new technologies and techniques for improving music perception in individuals with hearing impairment.