Introduction

Noise is an unavoidable part of our auditory experience. We must pick out sounds of interest amid background noise on a daily basis – a speaker in a restaurant, a bird song in a windy forest, or a siren on a city street. Noise distorts the peripheral representation of sounds, but humans with normal hearing are relatively robust to its presence (Sarampalis et al., 2009). However, hearing in noise becomes more difficult with age (Ruggles et al., 2012; Tremblay et al., 2003) and for those with even moderate hearing loss (Bacon et al., 1998; Oxenham, 2008; Plack et al., 2014; Rossi-Katz & Arehart, 2005; Smoorenburg, 1992; Tremblay et al., 2003). Consequently, understanding the basis of hearing in noise, and its malfunction in hearing impairment, has become a major focus of auditory research (Kell & McDermott, 2019; Khalighinejad et al., 2019; Mesgarani et al., 2014; Moore et al., 2013; Rabinowitz et al., 2013; Town et al., 2019).

Hearing in noise can be viewed as a particular case of auditory scene analysis, the problem listeners solve when segregating individual sources from the mixture of sounds entering the ears (Bregman, 1990; Carlyon, 2004; Darwin, 1997; McDermott, 2009). In general, segregating sources from a mixture is possible only because of the regularities in natural sounds. Most research on the signal properties that help listeners segregate sounds has focused on situations where people discern concurrent sources of the same type, for example, multiple speakers (the classic ‘cocktail party problem’ (Assmann & Summerfield, 1990; Culling & Summerfield, 1995a; de Cheveigne et al., 1995; de Cheveigne, Kawahara, et al., 1997a; de Cheveigne, McAdams, & Marin, 1997b)), or multiple concurrent tones (as in music (Micheyl & Oxenham, 2010; Rasch, 1978)). Concurrent onsets or offsets (Darwin, 1981; Darwin & Ciocca, 1992), co-location in space (Cusack et al., 2004; Freyman et al., 2001; Hawley et al., 2004; Ihlefeld & Shinn-Cunningham, 2008), and frequency proximity (Chalikia & Bregman, 1993; Darwin & Hukin, 1997; Młynarski & McDermott, 2019) can all help listeners group sound elements and segregate them from other similar sounds in the background. Harmonicity – the property of frequencies that are multiples of a common ‘fundamental’, or f0 (Fig. 1a-b) – likewise aids auditory grouping. For example, harmonic structure can help a listener select a single talker from a mixture of talkers (Darwin et al., 2003; Josupeit et al., 2020; Josupeit & Hohmann, 2017; Popham et al., 2018; Woods & McDermott, 2015). And when one harmonic in a complex tone or speech utterance is mistuned so that it is no longer an integer multiple of the fundamental, it can be heard as a separate sound (Hartmann et al., 1990; Moore et al., 1986; Popham et al., 2018; Roberts & Brunstrom, 1998).

Fig. 1

Harmonicity. a Spectrograms of example natural harmonic sounds: a spoken vowel, a cow mooing, a note played on a trumpet, and a phone vibrating. The frequency components of such sounds are multiples of a fundamental frequency, and are thus regularly spaced across the spectrum. b Schematic spectrogram of a harmonic tone with an f0 of 250 Hz, along with its autocorrelation. The autocorrelation has a peak at the lag equal to the period of the f0 (and at multiples of this lag). c Schematic spectrogram of an inharmonic tone along with an example autocorrelation. In this example, the inharmonic tone was generated by jittering the frequencies of a 250 Hz harmonic tone. Jittering was accomplished by sampling a jitter value from the distribution U(-0.5, 0.5), multiplying by the f0, then adding the resulting value to the frequency of the respective harmonic, constraining adjacent components to be separated by at least 30 Hz (via rejection sampling) in order to avoid salient beating. All harmonics above the fundamental were jittered in this way. The autocorrelation functions of inharmonic tones do not exhibit strong peaks, indicating that they lack a fundamental frequency in the range of audible pitch

Less is known about the factors and mechanisms that enable hearing in noise (operationally defined for the purposes of this paper as a continuous background sound that does not contain audibly discrete frequency components, for example, white or pink Gaussian noise, and some sound textures). Previous research on hearing in noise has mainly focused on features of noise, such as stationarity, that aid its suppression (Kell & McDermott, 2019; Khalighinejad et al., 2019; Mesgarani et al., 2014; Moore et al., 2013; Rabinowitz et al., 2013) or separation (McWalter & McDermott, 2018; McWalter & McDermott, 2019) from signals such as speech. Here we instead study the aspects of a signal that enable it to be heard more readily in noise.

Harmonicity is one sound property that differentiates communication signals such as speech and music from noise (Fig. 1a). Although harmonicity is known to aid the segregation of multiple harmonically structured sounds, its role in hearing in noise is unclear. To explore whether harmonic frequency relations aid hearing of sounds in noise, we compared detection and discrimination of harmonic and inharmonic tones and speech embedded in noise. Inharmonic sounds were generated by jittering frequency components so that they were not integer multiples of the fundamental frequency (McPherson & McDermott, 2018; Roberts & Holmes, 2006). These inharmonic sounds are inconsistent with any single f0 in the range of audible pitch (Pressnitzer et al., 2001) (Fig. 1b & c). Harmonic and inharmonic tones have previously been used to probe the basis of pitch perception, where under some conditions, but not others, they reveal representations of f0 underlying pitch judgments (McPherson & McDermott, 2020).

Our first question was whether harmonicity would make sounds easier to detect in noise across a range of sounds and tasks. The one related prior study we know of found that ‘chords’ composed of three harmonically related pure tones were somewhat easier to detect in noise than non-harmonically related tones, but did not pursue the basis of this effect (Hafter & Saberi, 2001). Previous hypotheses regarding tone-in-noise detection based on energetic masking account for differences between pure and complex tones (Buus et al., 1997; Dubois et al., 2011; Green, 1958, 1960), but it remains unclear if they account for effects of harmonicity. To test these hypotheses with our stimuli, we instantiated a simple model of energetic masking and ran it in a simulated detection experiment using the same stimuli presented to our participants.

The second question was whether harmonicity would make sounds easier to discriminate in noise. We first measured the discrimination of single tones as well as extended melodies in noise, comparing performance for harmonic and inharmonic tones, asking whether harmonicity would aid discrimination in noise at supra-threshold SNRs. Pitch discrimination thresholds are known to be comparable for harmonic and inharmonic tones without noise, suggesting that listeners use a representation of the spectrum to make up/down discrimination judgments (Faulkner, 1985; McPherson & McDermott, 2018; McPherson & McDermott, 2020; Micheyl et al., 2012; Moore & Glasberg, 1990). But in noisy conditions it could be difficult to accurately encode the spectrum, making it advantageous to leverage the regularity provided by harmonic structure. Previous studies have found that it is easier to hear the f0 of harmonic sounds when there is background noise (Hall & Peters, 1981; Houtgast, 1976), but it was unclear whether such effects would translate to improved discrimination of tones and melodies in noise. One other study found harmonicity to aid the discrimination of frequency modulation in noise (Carlyon & Stubbs, 1989), but did not explore whether this effect could relate to detection advantages. We also assessed speech discrimination, resynthesizing speech with harmonic or inharmonic voicing, and measuring the discrimination of English vowels and Mandarin Chinese tones at a range of SNRs. One previous study had failed to see a benefit of harmonicity on speech intelligibility of English words in noise (Popham et al., 2018), but it seemed plausible that effects might be evident in contexts where pitch is linguistically important.

We found that harmonic sounds were consistently easier to detect in noise than inharmonic sounds. This result held for speech as well as synthetic tones. Although effects of harmonicity on speech discrimination in noise were modest, there were large effects on tone and melody discrimination, with thresholds considerably better for harmonic than inharmonic tones when presented in noise despite being indistinguishable in quiet. The results are consistent with the idea that harmonicity improves hearing in noise by providing a noise-robust pitch signal that can be used to detect and discriminate sounds.

Experiment 1. Detecting harmonic and inharmonic tones in noise.

The purpose of Experiment 1 was to examine the effect of harmonicity on the detection of sounds in noise. We conducted three sub-experiments. Experiment 1a was run online due to the COVID-19 pandemic. We validated this online experiment against data collected in the lab, both before the pandemic shutdown (Experiment 1b) and during the shutdown, using two of the authors as participants in order to obtain data from highly practiced listeners (Experiment 1c). In all three versions of the experiment, participants heard two noise bursts on each trial (Fig. 2a). A complex tone or a pure tone was embedded in one of the noise bursts, and participants were asked to choose which noise burst contained the tone. The complex tones could be harmonic or inharmonic, with constituent frequencies added in sine or random phase (example trials for this and other experiments are available at http://mcdermottlab.mit.edu/DetectionInNoise.html). Participants in Experiments 1a and 1b completed four adaptive measurements of the detection threshold for tones in each condition. Participants in Experiment 1c completed 12 adaptive measurements per condition.

Fig. 2

Harmonic advantage for detecting tones in noise (Experiments 1a and 1b). a Trial structure for Experiment 1. During each trial, participants heard two noise bursts, one of which contained a complex tone (left) or pure tone (right), and were asked to decide whether the tone was in the first or second noise burst. b Example waveforms of harmonic tones added in sine phase (left) and random phase (right). The waveform is ‘peakier’ when the harmonics are added in sine phase. c Results of Experiment 1a, shown as box-and-whisker-plots, with black lines for individual participant results. For this and other plots, the central mark in the box indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. Whiskers extend 1.5 times the interquartile range away from the 25th and 75th percentiles. Asterisks denote significance of a two-sided Wilcoxon signed-rank test: ***=p<0.001. d Harmonic and Pure Tone detection thresholds collected online (Experiment 1a) and in lab (Experiment 1b). The distributions were similar, suggesting that the online experimental conditions are sufficient to obtain results similar to those that would be obtained in-lab for this experiment

In addition to our online and in-lab experiments, we created a formal model of this task to compare our findings with previous theoretical predictions regarding detecting tones in noise (Buus et al., 1997; Dubois et al., 2011; Green, 1958, 1960). We compared model performance to the results of the highly practiced participants in Experiment 1c.

Method

All experiments (both online and in-lab) were approved by the Committee on the use of Humans as Experimental Subjects at the Massachusetts Institute of Technology and were conducted with the informed consent of the participants.

Participants: Online, Experiment 1a

This experiment was run online because of a lab closure due to the COVID-19 pandemic. 110 participants completed Experiment 1a on an online data collection platform (Amazon Mechanical Turk). Here and in all other online experiments in this paper, we limited participation to individuals with US-based IP addresses. All online experiments began with a set of screening questions that included a question asking the participant if they had any hearing loss. Anyone who indicated any known hearing loss was excluded from the study (across all the online experiments in this paper, 9.5% of participants who initially enrolled self-reported hearing loss; 89% of these individuals also failed the headphone screening, described below). All participants in this and other experiments in this paper thus reported normal hearing. Given the age distribution of participants, and use of self-report, it is possible that some participants in the study had mild hearing impairment. We include results from an analogous in-lab experiment with younger participants (Experiment 1b, see below) to assess whether this and other factors specific to the online format might have influenced the results.

12 participants were removed from analysis because their average threshold across conditions (using the first adaptive run of each condition) was over three standard deviations worse than the group mean across all conditions. This exclusion criterion is neutral with respect to the hypotheses being tested, and independent of the data we analyzed (only the subsequent 3 runs were included for analysis in the remaining participants, to avoid double-dipping). Therefore, our exclusion procedure allowed unbiased threshold estimates from those final three runs. In previous studies we have found that online results replicate in-lab results when such steps are taken to exclude the worst-performing participants (McPherson & McDermott, 2020; Woods & McDermott, 2018). Of the remaining 98 participants, 38 self-identified as female, 60 as male (binary choice), mean age=39.2 years, S.D.=10.6 years. We planned to analyze the effects of musicianship on tone detection and so recruited participants with a range of musical experience. This analysis is presented in Effects of Musicianship.

In this and other experiments, we determined sample sizes a priori based on pilot studies, and using G*Power (Faul et al., 2007). We ran a pilot experiment online that was similar to Experiment 1a. The only difference between this pilot experiment and Experiment 1a was that the frequencies of each Inharmonic note were jittered independently on each trial (in contrast to Experiment 1a, and the other experiments reported in this paper, in which each Inharmonic tone for a participant was made inharmonic in the same way across the entire experiment, as described below). We ran this pilot experiment in 43 participants, and observed a strong main effect of harmonicity (ηp2=.37 for an ANOVA comparing Harmonic vs. Inharmonic conditions). Because we considered it plausible that the effects of interest might depend on musicianship, we chose our sample size to be able to detect a potential musicianship effect that might be substantially weaker than the main effect of harmonicity (see Effects of Musicianship section below). Therefore, we sought to be well-powered to detect an interaction between musicianship and harmonicity 1/8 the size of the main effect of harmonicity at a significance level of p<.01, 95% of the time. This yielded a target sample size of 62 participants (31 musicians and 31 non-musicians). In practice, here and in all other online experiments we ran participants in batches, and then excluded them based on whether they passed the headphone check and our performance criteria, so the final sample was somewhat larger than this target.

Participants: In Lab, Experiments 1b&c

Experiment 1b was run in the lab before the COVID-19 lab closure. The Harmonic and Pure Tone stimuli and procedures in Experiment 1b matched those in Experiment 1a. 21 participants completed the experiment (13 self-identified as female, 7 as male, 1 as nonbinary, mean age=28.8 years, S.D.=8.8 years). All participants reported normal hearing. No participants performed over three standard deviations away from the mean on their first run, so none were excluded. Only the final three runs were used for analysis.

Experiment 1c was completed in the lab by the first two authors (female, 29 years old, 23 years of musical training, and male, 21 years old, 11 years of musical training).

Procedure: Online, Experiment 1a

Online experiments were conducted using Amazon’s Mechanical Turk platform. In-person data collection was not possible due to the COVID-19 pandemic. Prior to starting the experiment, potential participants gave informed consent, were instructed to wear headphones and ensure they were in a quiet location, and then used a calibration sound (1.5 seconds of Threshold Equalizing noise (Moore et al., 2000)) to set their audio presentation volume to a comfortable level. The experimental stimuli were normalized to 6 dB below the level of the calibration sound to ensure that they were never uncomfortably loud (but likely to be consistently audible). Participants were then screened with a brief experiment designed to help ensure they were wearing earphones or headphones, as instructed (Woods et al., 2017), which should help to attenuate background noise and produce better sound presentation conditions. If they passed this screening, participants proceeded to the main experiment. For all experiments in the paper, participants received feedback after each trial, and to incentivize good performance, they received a compensation bonus proportional to the number of correct trials.

We used adaptive procedures to measure detection thresholds. Participants completed 3-down-1-up two-alternative-forced-choice (‘does the first or second noise burst contain a tone?’) adaptive threshold measurements. Adaptive tracks were stopped after 10 reversals. The signal-to-noise ratio (SNR) per component was changed by 8 dB for the first two reversals, 2 dB for the subsequent two reversals, and .5 dB for the final six reversals. The threshold estimate from a track was the average of the SNRs at the final six reversals. Participants completed four adaptive threshold measurements for each condition. Complex tone conditions (random vs. sine phase tones, and harmonic vs. inharmonic tones) were randomly intermixed, and the four runs of the Pure Tone condition were grouped together, run either before or after all of the complex tone adaptive runs, chosen equiprobably for each participant.
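For concreteness, the adaptive logic can be sketched as follows (a minimal Python sketch; run_trial is a hypothetical stand-in for a function that presents one two-interval trial at a given per-component SNR and returns True if the response was correct):

```python
def measure_threshold(run_trial, start_snr=-8.0):
    """3-down-1-up adaptive track: step 8 dB for the first two reversals,
    2 dB for the next two, and 0.5 dB thereafter; stop after 10 reversals."""
    snr = start_snr
    correct_in_a_row = 0
    last_direction = None
    reversals = []
    while len(reversals) < 10:
        if run_trial(snr):                    # hypothetical trial function
            correct_in_a_row += 1
            if correct_in_a_row < 3:
                continue                      # no change until 3 correct in a row
            correct_in_a_row = 0
            direction = 'down'                # 3 correct -> lower SNR (harder)
        else:
            correct_in_a_row = 0
            direction = 'up'                  # 1 incorrect -> raise SNR (easier)
        if last_direction is not None and direction != last_direction:
            reversals.append(snr)             # record the SNR at each reversal
        last_direction = direction
        n = len(reversals)
        step = 8.0 if n < 2 else (2.0 if n < 4 else 0.5)
        snr += step if direction == 'up' else -step
    return sum(reversals[-6:]) / 6.0          # mean SNR at the final 6 reversals
```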

Procedure: In Lab, Experiments 1b-c

Experiment structure and adaptive procedure were the same for in-lab and online participants. In-lab participants sat in a soundproof booth (Industrial Acoustics) and heard sounds played from a Mac Mini computer via Sennheiser HD280 circumaural headphones. The audio presentation system was calibrated ahead of time with a GRAS 43AG Ear & Cheek Simulator connected to a Svantek SVAN 977 audiometer. This setup is intended to replicate the acoustic effects of the ear, measuring the sound level expected to be produced at the eardrum of a human listener and enabling sound presentation at a desired sound pressure level, which in these experiments was 70 dB SPL. All in-lab experimental stimuli were presented using the Psychtoolbox for MATLAB (Kleiner et al., 2007).

The experimental interface differed somewhat between online and in-lab experiments – online participants logged responses using a mouse or track-pad click, whereas in-lab participants used a keyboard. Like online participants, participants in the lab received feedback (correct/incorrect) after each trial, and completed four adaptive runs per condition. In Experiment 1c, the two participants each completed two sessions of two hours, and during each session completed 12 runs of each condition (3 conditions: harmonic and inharmonic tones added in random phase, and pure tones). The two sessions were completed on separate days within the same week.

Stimuli

Trials consisted of two noise bursts, one of which contained a tone. First, two 900ms samples of noise were generated, and one of these noise samples was randomly chosen to contain the tone. The tone was scaled to have the appropriate power relative to that noise sample; both stimulus intervals were then normalized to 70 dB SPL. Tones were 500ms in duration; the noise began and ended 200ms before and after the tone (Fig. 2a). The tones started 200ms after the noise to avoid an ‘overshoot’ effect, whereby tones are harder to detect when they start near the onset of noise (Carlyon & Sloan, 1987; Zwicker, 1965). The two noise bursts were separated by 200ms of silence.

The noise used in this and all other experiments was Threshold Equalizing (TE) noise (Moore et al., 2000). Noise was generated in the spectral domain to have the specified duration and cutoff frequency. Pilot experiments with both white and pink noise suggested that the harmonic detection advantage is present regardless of the specific shape of the noise spectrum provided the noise is broadband. In Experiment 1, noise was low-pass filtered with a 6th order Butterworth filter to make it more pleasant for participants. The cutoff frequency was 6000 Hz, chosen to be well above the highest possible harmonic in the complex tones. Noise in all experiments was windowed in time with 10ms half-Hanning windows.
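As an illustration, noise with the timing and filtering described above could be generated roughly as follows (a simplified Python sketch: white Gaussian noise is used as a stand-in, omitting the TE spectral shaping of Moore et al. (2000) and the spectral-domain generation used in the actual experiments):

```python
import numpy as np
from scipy.signal import butter, lfilter

def noise_burst(dur=0.9, sr=44100, cutoff=6000.0):
    """Gaussian noise, low-passed with a 6th-order Butterworth filter and
    windowed with 10 ms half-Hanning ramps."""
    noise = np.random.randn(int(dur * sr))
    b, a = butter(6, cutoff / (sr / 2))       # 6th-order low-pass Butterworth
    noise = lfilter(b, a, noise)
    n_ramp = int(0.01 * sr)                   # 10 ms ramps
    ramp = np.hanning(2 * n_ramp)[:n_ramp]
    noise[:n_ramp] *= ramp
    noise[-n_ramp:] *= ramp[::-1]
    return noise
```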

Complex tones contained ten equal-amplitude harmonics. Depending on the condition, harmonics were added in sine phase or random phase (Fig. 2b). The two phase conditions were intended to test whether any harmonic detection advantage might be due to amplitude modulation; tones whose components are added in sine phase have deeper amplitude modulations than tones whose components are added in random phase. F0s of the tones (both complex and pure – pure tones were generated identically to the f0 frequency component of the harmonic tones) were randomly selected to be between 200 and 267 Hz (log uniform distribution). Tones were windowed with 10ms half-Hanning windows, and were 500ms in duration. Tones and noise were sampled at 44.1 kHz.
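A minimal Python sketch of the tone synthesis, assuming the equal-amplitude components and windowing described above:

```python
import numpy as np

def complex_tone(f0, n_harmonics=10, dur=0.5, sr=44100, phase='sine'):
    """Equal-amplitude harmonic complex in sine or random phase,
    with 10 ms half-Hanning onset/offset ramps."""
    t = np.arange(int(dur * sr)) / sr
    phases = (np.zeros(n_harmonics) if phase == 'sine'
              else np.random.uniform(0, 2 * np.pi, n_harmonics))
    tone = sum(np.sin(2 * np.pi * (k + 1) * f0 * t + phases[k])
               for k in range(n_harmonics))
    n_ramp = int(0.01 * sr)
    ramp = np.hanning(2 * n_ramp)[:n_ramp]
    tone[:n_ramp] *= ramp
    tone[-n_ramp:] *= ramp[::-1]
    return tone

# f0 drawn log-uniformly from 200-267 Hz, as in the experiment
f0 = 200.0 * (267.0 / 200.0) ** np.random.rand()
tone = complex_tone(f0, phase='random')
```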

To make tones inharmonic, the frequency of each frequency component (other than the f0 component) was ‘jittered’ by up to 50% of the f0 value. Jittering was accomplished by sampling a jitter value from the distribution U(-0.5, 0.5), multiplying by the f0, then adding the resulting value to the frequency of the respective harmonic. Jitter values were selected by moving up the harmonic series, starting with the second harmonic, and for each harmonic repeatedly sampling jitter values until the jittered frequency was at least 30 Hz greater than that of the frequency component below it (to avoid salient beating). Jitter values varied across participants (described below), but for a given participant were fixed across the experiment (i.e., each inharmonic tone heard by a given participant had the same jitter pattern). These inharmonic tones do not have a clear pitch in the traditional sense (e.g., a pitch that listeners could match by singing), and have a bell-like timbre comparable to that of some pitched percussion instruments with inharmonic spectra (McLachlan et al., 2013). Previous experiments with such sounds have shown that this jitter is sufficient to yield substantial differences in performance on some tasks compared to that for harmonic sounds (McPherson & McDermott, 2018; McPherson & McDermott, 2020; Popham et al., 2018).
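The jittering procedure can be sketched as follows (Python; for simplicity the 30-Hz spacing constraint is checked at a single nominal f0, whereas jitter values themselves are expressed as proportions of the f0, as described above):

```python
import numpy as np

def make_jitter_pattern(n_harmonics=10, f0=200.0, min_sep_hz=30.0):
    """One jitter pattern: harmonic h's frequency becomes (h + jitter[h]) * f0.
    The fundamental (h=1) is unjittered; each higher harmonic's jitter is
    resampled from U(-0.5, 0.5) until the jittered component lies at least
    30 Hz above the component below it (rejection sampling)."""
    jitter = {1: 0.0}
    for h in range(2, n_harmonics + 1):
        while True:
            j = np.random.uniform(-0.5, 0.5)
            below = (h - 1 + jitter[h - 1]) * f0
            if (h + j) * f0 - below >= min_sep_hz:
                jitter[h] = j
                break
    return jitter
```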

Stimuli for in-lab participants were generated in real time. For technical reasons all stimuli for online experiments were generated ahead of time and were stored as .wav files on a university server, from which they could be loaded during the experiments. 20 stimuli were pre-generated for every possible difficulty level (SNR) within the adaptive procedure. The SNR was capped at +6 dB SNR per component. If participants in the experiment reached this cap the stimuli remained at this SNR until participants got three trials in a row correct. In practice, participants who performed poorly enough to reach this cap were removed post hoc by our exclusion procedure. Adaptive tracks were initialized at -8 dB SNR per component. For each trial within an adaptive track, one of the 20 stimuli for the current difficulty level within the adaptive track was selected at random.

To vary the jitters across participants, we generated 20 independent sets of possible stimuli, each with a different set of randomly selected jitter values for the Inharmonic trials. Each participant only heard trials from one of these sets (i.e., all the inharmonic stimuli they heard were jittered in the same way throughout the experiment). This was intended to make the inharmonic conditions comparable in their uncertainty to the harmonic conditions (which always used the same spectral pattern, i.e. that of the harmonic series). As some randomly selected jitter patterns can by chance be close to Harmonic, we randomly generated 100,000 possible jitter patterns, then selected the 20 patterns that minimized peaks in the autocorrelation function. The resulting 20 jitters were evaluated by eye to ensure that they were distinct. For Experiment 1c (in which the first two authors were the participants), two of these 20 jitters were randomly chosen (one for each author).
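The screening of candidate jitter patterns might look roughly like this (Python sketch; inharmonic_tone is a hypothetical helper that synthesizes a tone with a given jitter pattern, and the lag range searched for autocorrelation peaks is an assumption):

```python
import numpy as np

def max_autocorr_peak(x, sr=44100, min_lag_s=1/400, max_lag_s=1/30):
    """Largest normalized autocorrelation peak in a pitch-relevant lag range."""
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]   # non-negative lags
    ac = ac / ac[0]                                     # normalize by zero lag
    lo, hi = int(min_lag_s * sr), int(max_lag_s * sr)
    return ac[lo:hi].max()

# Generate 100,000 candidate patterns; keep the 20 with the smallest peaks.
candidates = [make_jitter_pattern() for _ in range(100_000)]
scores = [max_autocorr_peak(inharmonic_tone(225.0, jit))   # hypothetical helper
          for jit in candidates]
chosen = [candidates[i] for i in np.argsort(scores)[:20]]
```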

Statistical Analysis

Thresholds were calculated by averaging the SNR values of the final six reversals of the adaptive track. Data distributions were non-normal (skewed), so non-parametric tests were used in all cases (these are also more conservative than parametric tests). To compare performance across multiple conditions we used a non-parametric version of a repeated-measures ANOVA, computing the F statistic but evaluating its significance with approximate permutation tests. To do this, we randomized the assignment of the data points for each participant across the conditions being tested 10,000 times, re-calculated the F statistic on each permuted sample to build a null distribution, and then compared the original F statistic to this distribution. For ANOVAs that did not show significant main effects, we ran additional Bayesian ANOVAs to establish support for or against the null hypothesis.
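The permutation procedure might be implemented along these lines (Python sketch; f_stat is a stand-in for a function that computes the repeated-measures F statistic from a participants-by-conditions matrix):

```python
import numpy as np

def permutation_anova_p(data, f_stat, n_perm=10_000, seed=0):
    """Approximate permutation test for a repeated-measures F statistic:
    condition labels are shuffled independently within each participant."""
    rng = np.random.default_rng(seed)
    observed = f_stat(data)
    null = np.empty(n_perm)
    for i in range(n_perm):
        permuted = np.array([rng.permutation(row) for row in data])
        null[i] = f_stat(permuted)
    return (null >= observed).mean()    # F is inherently one-sided
```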

For post-hoc pairwise comparisons between dependent samples we used two-sided Wilcoxon signed-rank tests. For comparisons of independent samples (online vs. in-lab data) we used two-sided Wilcoxon rank-sum tests. These pair-wise comparisons were not corrected for multiple comparisons both due to the low number of planned comparisons (only harmonic vs. inharmonic conditions and pure vs. complex tones), and because they were preceded by ANOVAs that revealed significant main effects of harmonicity.

Model of energetic masking

The model performed the experiment task on the stimulus waveforms, instantiating the assumptions of the standard power spectrum model of masking. Although there is evidence that listeners do not rely exclusively on power per se when detecting tones in noise (Kidd Jr. et al., 1989; Lentz et al., 1999; Leong et al., 2020; Maxwell et al., 2020), power is plausibly correlated in many conditions with the cue(s) that listeners may be using. For each trial, we generated the two stimulus intervals (one with a tone, one without), using the exact parameters of the stimuli used with human participants, but without independently rms-normalizing each interval (noise was generated to be -20 dB rms re. 1 in a one-ERB wide band centered at 1000 Hz). Each interval was passed through a gammatone filter bank (Slaney, 1998) approximating the frequency selectivity of the cochlea. The resulting subbands were raised to a power of 0.3 to simulate basilar membrane compression, then half-wave rectified, then averaged over the duration of the stimulus to yield a measure of the average “energy” in each channel. To simulate internal noise, we added random noise to each channel’s average energy. This internal noise was drawn from a Gaussian distribution with a mean of 0 and a standard deviation of .0002 (this translated to internal noise that was, on average, 18.5 dB below the signal energy). This standard deviation was selected using a grid search over possible values, in steps of .00005, and chosen to minimize the mean-squared-error between the average performance on the pure tone condition for the model and for the human listeners from Experiment 1c. Based on previous results suggesting that listeners use an unweighted sum across an optimally selected set of frequency channels (Buus et al., 1986), we summed the energy across either the 28 filters that covered the entire range of frequencies that could occur in the complex tone signals (for conditions with complex tones) or the 2 filters that covered the frequency range of the pure tones (for trials with pure tones). The interval with the greater summed power was chosen as that containing the signal.
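The decision stage of the model can be sketched as follows (Python; scipy’s gammatone filter is used here as a stand-in for the Slaney (1998) filter bank, and the choice of channels to sum over is made outside the sketch):

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def channel_energies(stim, center_freqs, sr=44100):
    """Per-channel average 'energy': gammatone filtering, compression to the
    0.3 power, half-wave rectification, then averaging over the stimulus."""
    energies = []
    for cf in center_freqs:
        b, a = gammatone(cf, 'iir', fs=sr)
        sub = lfilter(b, a, stim)
        sub = np.sign(sub) * np.abs(sub) ** 0.3     # basilar-membrane compression
        sub = np.maximum(sub, 0.0)                  # half-wave rectification
        energies.append(sub.mean())
    return np.array(energies)

def model_trial(interval_1, interval_2, center_freqs, internal_sd=2e-4, sr=44100):
    """Choose the interval with the greater summed energy (plus internal
    noise) across the selected channels; returns 0 or 1."""
    totals = []
    for stim in (interval_1, interval_2):
        e = channel_energies(stim, center_freqs, sr)
        e = e + np.random.normal(0.0, internal_sd, size=e.shape)  # internal noise
        totals.append(e.sum())
    return int(np.argmax(totals))
```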

We ran 10,000 trials at each stimulus SNR ranging from -30 dB SNR to 0 dB SNR in .5 dB steps, then estimated the threshold by fitting logistic functions to the model results. The model threshold was defined as the point at which the fitted logistic function yielded 79.4% correct, corresponding to the performance target of the three-down-one-up thresholds measured in human listeners. We estimated confidence intervals by bootstrapping samples of the model data with replacement, fitting curves to each bootstrapped sample.
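The threshold-estimation step might be implemented as follows (Python sketch; the particular logistic parameterization, with a floor at chance performance, is an assumption):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_threshold(snrs, prop_correct, target=0.794):
    """Fit a logistic psychometric function (floored at chance, 0.5) to
    proportion-correct data and return the SNR at the 79.4% point."""
    def logistic(x, midpoint, slope):
        return 0.5 + 0.5 / (1.0 + np.exp(-slope * (x - midpoint)))
    (midpoint, slope), _ = curve_fit(logistic, snrs, prop_correct,
                                     p0=[-15.0, 1.0], maxfev=10000)
    # invert the fitted function at the target proportion correct
    return midpoint - np.log(0.5 / (target - 0.5) - 1.0) / slope
```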

Results & Discussion: Experiment 1a

As shown in Fig. 2c, detection in noise was better for the complex tone conditions than the Pure Tone condition (Z=6.90, p<.0001, mean performance for Inharmonic conditions vs. that for Pure Tones, two-sided Wilcoxon signed-rank test), as expected from signal detection theory given that the complex tones contained ten frequency components rather than one (Buus et al., 1997; Dubois et al., 2011; Florentine et al., 1978; Green, 1958, 1960). However, detection thresholds were substantially better for harmonic than inharmonic complex tones even though both types of complex tone had 10 frequency components (main effect of harmonicity, F(1,97)=101.00, p<.0001, ηp2=.51; significant differences in both sine and random phase conditions: sine phase, Z=7.44, p<.0001; random phase, Z=6.31, p<.0001, two-sided Wilcoxon signed-rank tests). We observed a 2.65 dB SNR advantage for Inharmonic tones compared to Pure Tones, and an additional 1.38 dB SNR advantage for Harmonic tones over Inharmonic tones (averaged across phase conditions).

These differences are large enough to have some real-life significance. For instance, if a harmonic tone could be just detected 10 meters away from its source in free field conditions, an otherwise identical inharmonic tone would only be audible 8.53 meters away from the source (using the inverse square law; for comparison, a pure tone at the same level as one of the frequency components from the complex tone would be audible 6.29 meters away).
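These distance estimates follow from the inverse-square law: a threshold difference of $\Delta L$ dB scales the maximum audible distance by $10^{-\Delta L/20}$. Using the threshold differences reported above (1.38 dB between Harmonic and Inharmonic tones, and 4.03 dB between Harmonic and Pure Tones):

$$10\,\text{m} \times 10^{-1.38/20} \approx 8.53\,\text{m}, \qquad 10\,\text{m} \times 10^{-4.03/20} \approx 6.29\,\text{m}.$$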

A priori it seemed plausible that a detection advantage for harmonic tones could be explained by the regular amplitude modulation of harmonic sounds, compared to inharmonic sounds. However, performance was similar for the sine and random phase conditions (the latter of which produces substantially less modulation, Fig. 2b). We observed no significant main effect of phase (F(1,97)=1.12, p=.29, ηp2=.01) and no interaction between harmonicity and phase (F(1,97)=0.26, p=.61, ηp2=.003). The Bayes factors (BFincl, specifying a multivariate Cauchy prior on the effects (Rouder et al., 2012)) were .13 for the effect of phase and .10 for the interaction between phase and harmonicity, providing moderate support for the null hypotheses in both cases. This result indicates that the observed harmonic advantage does not derive from amplitude modulation.

The results are also unlikely to be explained by distortion products. Although harmonic tones would be expected to produce stronger distortion products than inharmonic tones, these should be undetectable for stimuli that include all the lower harmonics (as were used here) (Norman-Haignere & McDermott, 2016; Pressnitzer & Patterson, 2001).

Results & Discussion: Experiment 1b

Although online data collection has some advantages relative to in-lab experiments and enabled this study to be completed despite the pandemic conditions, sound presentation is less controlled compared to in-lab conditions due to the variability of headphones and/or listening environments for home listeners. To validate the online results, we compared them to data collected under controlled conditions in the lab (Experiment 1b; using calibrated headphones and sound-attenuating booths).

As shown in Fig. 2d, in-lab results from Experiment 1b were qualitatively and quantitatively similar to those obtained online in Experiment 1a. We observed no significant differences between online and in-lab data in two of the three conditions (Harmonic, Random Phase, Z=1.77, p=.076; Pure Tone, Z=1.86, p=.063, two-sided Wilcoxon rank-sum tests), and a marginally significant difference in one condition (Harmonic, Sine Phase, Z=2.14, p=.033). However, this latter difference was modest (threshold of -20.82 dB SNR online compared to -19.95 dB SNR in-lab), and not significant after Bonferroni correction for three comparisons (corrected α value of .017). These results, combined with previous studies that have quantitatively replicated in-lab results with online experiments (Kell et al., 2018; McPherson et al., 2020; McPherson & McDermott, 2020; McWalter & McDermott, 2019; Traer et al., 2021; Woods & McDermott, 2018), suggest that the measures taken here to improve sound presentation quality (requiring participants to complete a brief headphone screening (Woods et al., 2017) and requesting that they situate themselves in a quiet room) and to eliminate non-compliant or inattentive participants are sufficient to obtain results comparable to what would be observed in a traditional laboratory setting. While there are undoubtedly some differences from participant to participant in the stimulus spectrum with online experiments, these are evidently not sufficient to substantially alter detection in noise. Moreover, the relatively tight correspondence between in-lab and online findings suggests that factors such as headphone quality and distractions in at-home experiment settings did not greatly influence our overall results.

Results & Discussion: Experiment 1c

Previous models of detection in noise predict a 5 dB improvement for detecting a complex tone with 10 harmonics compared to a pure tone (Buus et al., 1997; Dubois et al., 2011; Green, 1958, 1960). Previous results with human listeners approximately match this theoretical prediction for harmonic complex tones and pure tones. The 5 dB detection advantage predicted by such energetic masking models should in principle hold for both harmonic and inharmonic tones. Yet in our main experiment, we observed only a 4.03 dB advantage for harmonic tones over pure tones, and just a 2.65 dB advantage for Inharmonic tones over pure tones.

One possible explanation for this discrepancy is that listeners in our experiments were not highly practiced and therefore may not have used optimal strategies to perform the task. To test this possibility, the first and second authors completed two two-hour experimental sessions (Experiment 1c), each with 12 adaptive runs per condition (six times the number of runs completed by each participant in Experiments 1a-b) for three conditions: Harmonic (random phase), Inharmonic (random phase), and Pure Tone. During the first four runs of the first session the advantage for detecting Harmonic tones over Pure Tones was 3.21 dB for one author and 4.46 dB for the other. However, in the final four runs of the second session (after extensive practice), the advantage for detecting Harmonic tones over Pure Tones increased to 4.67 dB and 4.92 dB, respectively (plotted in Fig. 3a). These results roughly match previous findings comparing 10-component harmonic complex tones and pure tones. There was a similar practice effect for the Inharmonic conditions compared to Pure Tones (first four runs: 1.00 dB and 2.60 dB for MJM and RCG, respectively; final four runs: 2.57 dB and 3.65 dB); the difference between Harmonic and Inharmonic conditions replicated the 1.33 dB harmonic detection advantage observed for random phase tones in Experiment 1a (2.10 dB and 1.27 dB for the two authors).

Fig. 3

Harmonic advantage for tone-in-noise detection is present in practiced human listeners (Experiment 1c) but not in a model of energetic masking. a Detection results in trained participants (the first and second authors, indicated by initials). Error bars show standard error of the mean of the last four adaptive tracks of the experiment sessions. b Schematic of the energetic masking model. c Model results. Error bars show 95% confidence intervals, calculated via bootstrap

Results & Discussion: Model of Energetic Masking

To test whether these results could be explained by a simple model of energetic masking, we ran a model on a simulated version of the experiment. The model measured the power in each stimulus interval using an auditory filterbank, and chose the interval with the greatest power (Fig. 3b). As with earlier models, our model approximately replicated the difference in thresholds between harmonic complex tones and pure tones observed in humans (a 5.74 dB advantage for Harmonic tones over Pure Tones, Fig. 3c). However, the model did not reproduce the empirically observed effect of inharmonicity: the model’s thresholds were similar for harmonic and inharmonic tones (a 5.73 dB advantage for Inharmonic tones over Pure Tones). The model results confirm that the harmonic advantage exhibited by human listeners is not predicted by classical models of masking.

Taken together, the results of Experiments 1a-c suggest that 1) harmonic sounds are more readily detected than inharmonic sounds when presented in noise, 2) detection thresholds are similar online and in-lab, 3) our effects are quantitatively consistent with prior tone-in-noise detection experiments provided that listeners are sufficiently practiced, and 4) classical models of masking are not sufficient to explain our results. The results are consistent with the idea that detection is performed using a cue (something other than power) that differs for harmonic and inharmonic tones.

Experiment 2. Detecting harmonic and inharmonic tones in noise, with cueing

One potential explanation for the observed harmonic detection advantage is that people are accustomed to hearing harmonic spectra based on their lifetime of exposure to harmonic sounds, and that this familiarity could help listeners know what to listen for in a detection task. Experiment 2 tested this idea by assessing whether the harmonic advantage persists even when listeners are cued beforehand to the target tone. Participants heard two stimulus intervals, each containing a “cue” tone followed by a noise burst. One of the noise bursts contained an additional occurrence of the cue tone (Fig. 4a), and participants were asked whether the first or second noise burst contained the cued tone.

Fig. 4

Harmonic advantage persists when listeners know what to listen for (Experiment 2). a Schematic of the trial structure for Experiment 2. During each trial, participants heard two noise bursts, both of which were preceded by a ‘cue’ tone, and one of which contained a tone that was identical to the cue. Participants were asked to decide whether the first or second noise burst contained the cued tone. Example in schematic shows a trial with inharmonic tones. b Results from Experiment 2, shown as box-and-whisker-plots, with black lines plotting individual participant results. Asterisks denote significance, two-sided Wilcoxon signed-rank test: **=p<0.01

Method

Participants

66 participants completed Experiment 2 online. 2 participants were removed because their average performance across the first run of both conditions was over three standard deviations worse than the group mean. As in other experiments in this paper, only the subsequent 3 runs were used for analysis. 64 participants were included in the final analysis, 21 self-identified as female, 45 as male (binary choice), mean age=41.0 years, S.D.=11.9 years.

We used data from a pilot experiment to determine the sample size. The pilot experiment differed from Experiment 2 in two ways: it was run in the lab, and the frequencies of each Inharmonic tone were jittered independently on each trial. The pilot experiment was run on 17 participants. Since Experiment 2 had only two conditions, we intended to use a single two-sided Wilcoxon signed-rank test to assess the difference between Harmonic and Inharmonic conditions. The effect size for this comparison in the pilot experiment was dz=.76. A power analysis indicated that a sample size of 32 participants would enable us to detect an effect of harmonicity of the size observed in the pilot data with a .01 significance threshold, 95% of the time, using a two-sided Wilcoxon signed-rank test. We exceeded this target by 32 participants at the request of a reviewer who felt it would be appropriate to have a sample size on par with that of Experiment 1a.

Procedure

The instructions and adaptive procedure were identical to those used in Experiment 1a.

Stimuli

Participants heard a tone before each of the two noise bursts. This “cue” tone was identical to the tone embedded in one of the noise bursts (that participants had to detect). Each trial had the following structure: a 500ms tone, followed by 200ms of silence, the first 900ms noise burst, 400ms of silence, a 500ms tone, 200ms of silence, and finally, the second 900ms noise burst. The target tone was present in either the first or the second noise burst, starting 200ms into the noise burst and lasting for 500ms. Only tones with harmonics added in random phase were used. In all other respects, stimuli were identical to those of Experiment 1a.

Statistical Analysis

Thresholds were calculated by averaging the SNR values of the final six reversals of the adaptive track. A two-sided Wilcoxon signed-rank test was used to compare the Harmonic and Inharmonic conditions. A two-sided Wilcoxon rank-sum test was used to compare between Experiment 2 and Experiment 1a, followed by a Bayesian version of the same test to probe evidence for the null hypothesis.

Results & Discussion

As shown in Fig. 4b, the harmonic advantage persisted with the cue (Z=3.78, p<.001; mean Harmonic threshold=-18.60 dB SNR, median=-20.01 dB SNR; mean Inharmonic threshold=-17.20 dB SNR, median=-18.50 dB SNR; an average advantage of 1.40 dB for Harmonic tones over Inharmonic tones). Even when participants knew exactly what to listen for in the noise, there was still an added benefit when detecting harmonic tones. Two-sided Wilcoxon rank-sum tests showed that the harmonic advantage with a cue tone was indistinguishable from that without a cue tone (comparison of the difference between Harmonic-Random-Phase and Inharmonic-Random-Phase thresholds in Experiments 1a and 2, Z=.25, p=.81). The Bayes Factor (BFincl), using a Cauchy prior centered at zero with a scale of .707, was .17, providing moderate support for the null hypothesis that there was no difference in the effect size between the two experiments. This result suggests that the observed detection advantage for Harmonic over Inharmonic tones does not simply reflect knowledge of what to listen for in the noise.

Experiment 3. Detecting tones without resolved harmonics

Due to the increase in cochlear filtering bandwidth with frequency, only harmonics below about the 10th are believed to be individually discernible by the auditory system. These “resolved” harmonics dominate the perception of pitch, and aid the segregation of concurrent sounds (Grimault et al., 2000; Shackleton & Carlyon, 1994). To determine whether the harmonic detection advantage observed in Experiments 1 and 2 was driven by low-numbered harmonics that are individually “resolved” by the cochlea, we ran a follow-up experiment with the same task as Experiment 1a, but with tones filtered to only contain harmonics 12-21 (“unresolved” harmonics, Fig. 5a). Tones were again presented in either sine phase or random phase.

Fig. 5

Harmonic detection advantage is specific to resolved harmonics (Experiment 3). a Schematic of the trial structure for Experiment 3. During each trial, participants heard two noise bursts, one of which contained a complex tone with unresolved harmonics, and were asked to decide whether the first or second noise burst contained a tone. b Results from Experiment 3, shown as box-and-whisker-plots, with black lines plotting individual participant results. The central mark in the box plots the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. Whiskers extend to the most extreme data points not considered outliers

Method

Participants

62 participants were recruited online for Experiment 3. 7 participants performed over three standard deviations worse than the group mean on the first adaptive run and were excluded from analysis. Only the subsequent 3 runs were analyzed. 55 participants were included in the final analysis, 23 self-identified as female, 32 as male (binary choice), with a mean age of 41.3 years, S.D.=10.1 years.

We used the data from Experiment 1b to determine sample size. Based on prior work measuring other aspects of harmonicity-related grouping, we hypothesized that the effect of harmonicity might be reduced with unresolved harmonics (Hartmann et al., 1990; Moore et al., 1985). We planned to test for main effects of harmonicity and phase (using ANOVAs). We initially aimed to be able to detect an effect half the size of the main effect of harmonicity seen with resolved harmonics in Experiment 1b (ηp2=.37). This yielded a target sample size of 15 participants (to have a 95% chance of seeing the hypothesized effect with a .01 significance threshold). However, because we obtained a null result after collecting data from the first 15 participants, we continued data collection (in sets of approximately 8-12 participants) until Bayesian statistics converged on support for or against the null hypothesis. Unlike frequentist statistics, Bayesian statistics will converge on evidence for the null hypothesis with enough data (Rouder et al., 2009).

Procedure

The instructions and adaptive procedure were identical to those used in Experiment 1a.

Stimuli

Tones contained harmonics 12 to 21 at full amplitude, with a trapezoid-shaped filter applied in the frequency domain in order to reduce the sharp spectral edge that might otherwise be used to perform the task. On the lower edge of the tone, the 10th harmonic was attenuated to be 30 dB below the 12th harmonic, and the 11th harmonic to be 15 dB below. On the upper edge of the tone, the same pattern of attenuation was applied in reverse between the 21st and 23rd harmonics. All other harmonics were removed. Additionally, the cutoff frequency for the noise (TE-noise) was increased to 10,000 Hz (rather than 6,000 Hz used in Experiment 1), in order to cover the stimulus frequencies. Noise was filtered with a 6th order Butterworth filter. Other aspects of the stimuli (duration of tones, timing of tones in noise, etc.) were matched to parameters used in Experiment 1a.
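The trapezoidal spectral edge amounts to per-component amplitude scaling, sketched below (Python; the mirrored attenuation at the upper edge – the 22nd harmonic at -15 dB and the 23rd at -30 dB – is our reading of the description above, and phases and windowing are omitted for brevity):

```python
import numpy as np

# Harmonics 12-21 at full amplitude; trapezoidal edges at 10/11 and 22/23.
atten_db = {10: -30.0, 11: -15.0, 22: -15.0, 23: -30.0}
amps = {h: 10.0 ** (atten_db.get(h, 0.0) / 20.0) for h in range(10, 24)}

def unresolved_tone(f0, dur=0.5, sr=44100):
    """Sum of sinusoids at harmonics 10-23 with trapezoidal amplitudes."""
    t = np.arange(int(dur * sr)) / sr
    return sum(a * np.sin(2 * np.pi * h * f0 * t) for h, a in amps.items())
```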

Statistical Analysis

Statistical analysis was identical to that used in Experiment 1a.

Results and Discussion

As shown in Fig. 5b, detectability was comparable for harmonic and inharmonic stimuli when they contained only unresolved harmonics (F(1,54)=0.39, p=.53, ηp2=.007). There was also no main effect of phase (F(1,54)=0.48, p=0.49, ηp2=.009). The Bayes factor (BFincl, specifying a multivariate Cauchy prior on the effects (Rouder et al., 2012)) was .18, providing moderately strong support for the null hypothesis that there was no difference between the detectability of harmonic and inharmonic tones without resolved harmonics (JASP, Version 0.13.1, 2020). These results suggest that the harmonic detection advantage is specific to resolved harmonics.

Experiment 4. Discrimination thresholds in noise

In Experiments 4 and 5, we investigated whether harmonicity would facilitate other types of judgments about sounds in noise. We first examined the discrimination of tones. The discrimination of the “pitch” of two successive tones in quiet is known to be comparable for harmonic and inharmonic tones (Faulkner, 1985; McPherson & McDermott, 2018; McPherson & McDermott, 2020; Micheyl et al., 2012; Moore & Glasberg, 1990), and appears to be mediated by comparisons of their spectra (McPherson & McDermott, 2020). However, it seemed plausible that discrimination in noise might be better for harmonic tones. For instance, being better able to separate tones from noise might help listeners discriminate the tones at low SNRs. Alternatively, the pitch cue provided by the f0 (which is available for harmonic but not inharmonic tones) might be more noise-robust than that of the spectrum. Using an adaptive procedure, we measured up-down discrimination thresholds for Harmonic, Inharmonic, and Pure Tone conditions (Fig. 6a) at a range of SNRs.

Fig. 6

Harmonic advantage for discriminating tones in noise (Experiment 4). a Schematic of the trial structure for Experiment 4. During each trial, participants heard two noise bursts, each of which contained a complex tone (both tones were either harmonic or inharmonic), and were asked to decide whether the second tone was higher or lower than the first tone. b Results from Experiment 4. Error bars denote standard error of the mean. For conditions where we were unable to measure thresholds from all participants, the number of participants with measurable thresholds is indicated next to the data point. When unmeasurable, participants’ thresholds were conservatively recorded as 4 semitones (25.99%) for analysis. Exact threshold values are provided for thresholds under 10%. Asterisks denote statistical significance of a two-sided Wilcoxon signed-rank test between Harmonic and Inharmonic conditions: ***=p<0.001, **=p<0.01, *=p<0.05. c Discrimination thresholds from Experiment 4 adjusted based on the detection thresholds measured in Experiment 1. The x-axis plots SNR relative to the detection threshold for the three different types of tone

Method

Participants

81 participants were recruited online for Experiment 4. We excluded participants whose average threshold across all conditions (averaged across both runs of the experiment) was higher than 14.35%. This cutoff was based on a pilot experiment run in the lab – it was the average threshold across all conditions for 10 non-musician participants. We used this cutoff to obtain mean performance levels on par with those of compliant and attentive participants run in the lab. We used a fixed exclusion criterion derived from in-lab data, rather than excluding participants whose performance was 3 standard deviations worse than the mean (as in the other experiments), because adaptive tracks were capped at a 4-semitone pitch difference. If participants completed three trials incorrectly at this 4-semitone pitch difference, the adaptive track was ended early, and the mean threshold for that adaptive track was conservatively recorded as 4 semitones (25.99%) for analysis. Measures of variance in the obtained threshold estimates were thus under-representative of the actual variance in the sample. 29 participants were excluded from analysis using this in-lab criterion. This resulted in 52 participants (20 self-identified as female, 32 as male (binary choice), mean age=39.81 years, S.D.=11.66 years). We planned to analyze the effects of musicianship and therefore recruited participants with a range of musical experience. This analysis is presented in Effects of Musicianship.

We chose our sample size using the same pilot data used to determine the exclusion criteria. The pilot experiment, run in 19 participants, differed from the current experiment in a few respects. In addition to being run in the lab, the pilot experiment did not include a Pure Tone condition, the SNR values were shifted half a semitone higher, and in Inharmonic conditions, a different jitter pattern was used for each trial. We performed ANOVAs testing for effects of harmonicity and musicianship; the pilot data showed fairly large main effects of both harmonicity (ηp2=.77) and musicianship (ηp2=.45), suggesting that both these analyses would be well-powered with modest sample sizes. To ensure the reliability of planned analyses examining the inflection points and slopes of sigmoid functions fitted to the discrimination curves, we also estimated the sample size needed to obtain reliable mean thresholds. We extrapolated from our pilot data (via bootstrap) that an N of at least 36 would be necessary to have a split-half reliability of the mean measured threshold in each condition (assessed between the first and second adaptive runs of the experiment) greater than r=.95. This sample size was also sufficient for the ANOVA analyses (for example, to see an effect of musicianship 1/2 the size of that observed in our pilot experiment 95% of the time at a p<.01 significance level, one would need a sample size of 28). We thus aimed to recruit at least 36 participants.

Procedure

In Experiment 4 we measured classic two-tone up-down “pitch” discrimination, but with the tones presented in noise. As in Experiments 1-3, on each trial participants heard two noise bursts. However, in this experiment, a tone was presented in each of the two noise bursts, and participants judged whether the second tone was higher or lower than the first tone. The difference in the f0s used to generate the tones was initialized at 1 semitone and was changed by a factor of 2 through the first four reversals, and then by a factor of √2 through the final six reversals. We tested pitch discrimination at 6 SNRs for pure tones, 7 SNRs for inharmonic tones, and 8 SNRs for harmonic tones. This choice was motivated by pilot data showing that inharmonic tones were undetectable at the lowest SNR tested for harmonic tones, and that pure tones were undetectable at the two lowest SNRs. Because we expected that discrimination would be difficult (if not impossible) at the lowest SNRs tested in each condition, we capped the possible f0 difference of adaptive tracks at 4 semitones. As discussed in the Participants section, if participants completed three trials incorrectly at this f0 difference, the adaptive track was ended early. For these tracks, the threshold was conservatively recorded as 4 semitones (25.99%) for analysis. Participants performed 2 adaptive runs per condition.

Stimuli

The stimuli for Experiment 4 were identical to the random-phase complex tones used in Experiment 1, except that each of the two noise bursts contained a tone. Eight SNRs were used: -22 (only Harmonic tones were tested at this SNR), -20.5 (only Harmonic and Inharmonic stimuli were tested), -19, -17.5, -16, -14.5, -13 dB, and Infinite (no noise). The f0 of the first note in each trial was randomly selected between 200 and 267 Hz (log uniform distribution), and the f0 for the second note was randomly selected to be higher or lower than the first note by the amount specified by the adaptive procedure. For inharmonic trials, the same vector of jitter values was applied to each of the two notes used in a trial. As in previous experiments, we generated 20 sets of stimuli, each with a different jitter pattern, selected from 100,000 randomly generated jitter patterns as those with the smallest autocorrelation peaks. Each participant was randomly assigned one of these sets of stimuli, and for the inharmonic condition only heard one inharmonic ‘jitter’ pattern throughout the experiment.

Statistical Analysis

Thresholds were estimated by taking the geometric mean of the f0 differences (in semitones) from the final six reversals of the adaptive track. As in Experiment 1a, data distributions were non-normal (skewed), so we used non-parametric tests. To compare performance across multiple conditions or across musicianship we used non-parametric versions of repeated-measures ANOVAs (for within group effects) and mixed-model ANOVAs (to compare within and between group effects). We computed the F statistic and evaluated its significance with approximate permutation tests, randomizing the assignment of the data points across the conditions being tested 10,000 times, and comparing the F statistic to this null distribution. Two-sided Wilcoxon signed-rank tests were used for post-hoc pairwise comparisons between Harmonic and Inharmonic conditions that were matched in SNR. These seven comparisons were not corrected for multiple comparisons because they were preceded by an ANOVA that revealed a significant main effect of harmonicity.

We completed a secondary analysis to compare the results for the three stimulus conditions (Harmonic, Inharmonic, and Pure Tone) after accounting for differences in detectability between conditions. We replotted the pitch discrimination curves relative to the detection thresholds measured in Experiment 1a (-20.67, -19.34, and -16.72 dB SNR for the Harmonic-Random-Phase, Inharmonic-Random-Phase, and Pure Tone conditions, respectively; Fig. 6b, inset). To evaluate the statistical significance of the differences between conditions that remained once thresholds were adjusted for detectability, we bootstrapped over participants. We selected random subsets of participants with replacement and re-calculated averages of the detection-adjusted curves. For each bootstrap sample we fit a sigmoid (logistic) function to the averages for each condition (Harmonic/Inharmonic/Pure Tone). Sigmoid functions can be defined by the slope and x-coordinate of their inflection point; we compiled distributions of these two parameters across bootstrap samples. To facilitate the curve fitting we padded the data on either end of the SNR range: with -25.99 on the low end (the highest possible threshold that could be measured in the experiment, as if we had added one additional, lower SNR), and with zeros at the high end. We then compared the distributions of the slopes for the different conditions, and, separately, the distributions of the midpoints (inflection x-coordinates), to determine the significance of differences between conditions.
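
The bootstrap and curve fitting might be sketched as below. Thresholds are coded here as negative semitones so that the padding values (-25.99 at the low end, 0 at the high end) match those given above; this sign convention, and the 1.5-dB offsets of the padded points, are our assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, x0, k, lo, hi):
    # x0: x-coordinate of the inflection point; k: slope parameter
    return lo + (hi - lo) / (1 + np.exp(-k * (x - x0)))

def bootstrap_fits(snrs, thresholds, n_boot=1000, rng=None):
    """`thresholds`: (n_participants, n_snrs) detection-adjusted thresholds.
    Resample participants with replacement, average, pad both ends of the
    curve, fit the sigmoid, and keep (inflection point, slope)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.concatenate(([snrs[0] - 1.5], snrs, [snrs[-1] + 1.5]))
    out = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(thresholds), len(thresholds))
        y = np.concatenate(([-25.99], thresholds[idx].mean(axis=0), [0.0]))
        (x0, k, lo, hi), _ = curve_fit(sigmoid, x, y, maxfev=20000,
                                       p0=[np.median(x), 1.0, -26.0, 0.0])
        out.append((x0, k))
    return np.array(out)

# Significance of a between-condition difference in either parameter can then
# be read off as the fraction of bootstrap differences falling beyond zero.
```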

Results & Discussion

Replicating prior results (McPherson & McDermott, 2018; McPherson & McDermott, 2020), discrimination thresholds in quiet were statistically indistinguishable (Inf dB SNR; Z=1.57, p=.12) for Harmonic and Inharmonic tones (around 1.5% in both cases; rightmost conditions of Fig. 5b). However, at lower SNRs inharmonic discrimination thresholds were substantially higher than harmonic thresholds (significant differences at all SNRs between -20.5 and -13 dB; Z>2.71 and p<.05 in all cases, largest p value=.012). This difference produced a significant interaction between Harmonicity and SNR (F(6,306)=20.98, p<.0001, ηp2=.29; excluding the -22 dB SNR condition, for which only Harmonic thresholds were measured). There was also a main effect of harmonicity (between Harmonic and Inharmonic tones, again excluding the -22 dB SNR condition; F(1,51)=208.00, p<.0001, ηp2=.80).

When we accounted for differences in the detection thresholds for the three tone types (as measured in Experiment 1), we found that the inflection points of the sigmoid functions remained significantly different for the Harmonic and Inharmonic conditions (p=.017). The adjusted inflection point for the Pure Tone condition was not significantly different from that of either the Harmonic (p=.99) or Inharmonic (p=.32) condition, and the slopes of the three conditions were not significantly different from each other. The difference between Harmonic and Inharmonic discrimination after accounting for the detectability of the tones suggests that harmonic discrimination is better than would be expected from detectability alone (or, equivalently, that inharmonic discrimination is worse). Moreover, even at SNRs where people can detect both harmonic and inharmonic tones reliably, harmonicity aids discrimination in noise, perhaps because representations of the f0 can be used for discrimination.

Experiment 5. Discriminating pitch contours in noise

In Experiment 5, we tested whether the harmonic advantage observed for pitch discrimination in Experiment 4 would extend to the melodic contours that listeners might encounter in music. The goal was to assess the benefit of harmonicity for hearing in noise using a musical task with some ecological relevance.

On each trial we asked participants to judge whether two five-note melodies, composed of 1 and 2 semitone steps (5.9% and 12.2% shifts), were the same or different (Dowling & Fujitani, 1971). Previous experiments have shown similar levels of performance on this task for harmonic and inharmonic tones in quiet (McPherson & McDermott, 2018). The question was whether a harmonic advantage would be evident for tones in noise.

Method

Participants

75 participants passed the initial screening and completed Experiment 5. All had mean performance within three standard deviations of the group mean, so all were included in the final analysis. 35 participants self-identified as female, 38 as male, and 2 as nonbinary; mean age=37.1 years, S.D.=12.1 years.

We conducted a power analysis based on a pilot experiment with 74 participants. The pilot experiment differed from the current experiment in that the starting f0s for the first note were chosen from a set of 3, rather than from a uniform distribution over a small range, and in that melodies contained only 1-semitone steps, instead of 1- and 2-semitone steps. We observed a significant main effect of harmonicity in this pilot experiment, with a large effect size (ηp2=.29). A power analysis indicated that a sample size of 15 would enable us to detect an effect this size 95% of the time at a significance level of p<.01. However, we also sought stable estimates of performance in each condition; using the same pilot data, we determined that 60 participants yielded split-half reliability of the per-condition performance greater than r=.95, so we chose this number as our target.
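
The split-half reliability computation can be sketched as follows. The precise resampling scheme is an assumption: this version repeatedly subsamples participants, splits the subsample in half, and correlates the two halves' condition-wise means (without a Spearman-Brown correction).

```python
import numpy as np

def split_half_r(perf, n, n_splits=1000, rng=None):
    """Mean split-half correlation of condition-wise group averages for
    subsamples of `n` participants; `perf` is (n_participants, n_conditions)."""
    rng = np.random.default_rng() if rng is None else rng
    rs = np.empty(n_splits)
    for i in range(n_splits):
        idx = rng.permutation(len(perf))[:n]
        half1, half2 = idx[: n // 2], idx[n // 2:]
        rs[i] = np.corrcoef(perf[half1].mean(0), perf[half2].mean(0))[0, 1]
    return float(rs.mean())
```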

Procedure

The design of Experiment 5 was inspired by the classic melodic contour task of Dowling and Fujitani (Dowling & Fujitani, 1971). Participants were told they would hear two melodies in each trial, sometimes with background noise, and were asked to judge whether the melodies were the same or different (Fig. 7a). On half of the trials the two melodies were the same and on the other half they were different. There were four possible responses (‘Sure Different’, ‘Maybe Different’, ‘Maybe Same’, and ‘Sure Same’), and participants were asked to use all four responses throughout the experiment. For trials where the melodies were different, we counted both ‘Maybe Different’ and ‘Sure Different’ responses as correct (for the purposes of trial-by-trial performance feedback and bonuses, as well as for analysis); for trials where the two melodies were the same, we counted both ‘Maybe Same’ and ‘Sure Same’ responses as correct. Participants completed 20 trials for each combination of SNR and Harmonic/Inharmonic condition, for a total of 300 trials, presented in a random order for each participant.

Fig. 7
figure 7

Harmonic advantage for discriminating musical contours in noise (Experiment 5). a Schematic of the trial structure for Experiment 5. During each trial, participants heard two five-note melodies made of note-to-note steps of +/- 1 or 2 semitones, and were asked whether the two melodies were the same or different. In this example, the melodies are different (indicated by the red + and – signs). The second melody was always transposed up in pitch relative to the first by half an octave. Melodies were embedded in varying levels of masking noise. b Results from Experiment 5. Error bars denote standard error of the mean

Stimuli

Each trial contained two extended noise bursts lasting 2.4 seconds, each containing a 5-note melody. Each note was a tone like those used for the random-phase complex tones in Experiment 1. The notes were 400ms in duration and were presented back-to-back, with the first note of the melody beginning 200ms after the start of the noise burst (leaving 200ms of noise after the end of the last note). There was a 1-second silent gap between the two noise bursts. Eight SNRs were used: -22 (only Harmonic melodies were tested at this SNR), -20.5, -19, -17.5, -16, -14.5, -13 dB, and Infinite (no noise). The f0 of the first note of the first melody in each trial was randomly selected from a log-uniform distribution 2 semitones in width, centered on 200 Hz, and the f0 of the first note of the second melody was half an octave higher than that of the first melody. Melodies were generated randomly and could contain step sizes of +/- 1 or 2 semitones, chosen from a uniform distribution with replacement. On ‘same’ trials, the second melody was identical to the first apart from the half-octave transposition. On ‘different’ trials, the sign of one of the pitch changes in the second melody was reversed (for example, a melody could contain step sizes +1, +1, +2, -1, and the corresponding ‘different’ melody could be +1, +1, -2, -1; Fig. 7a). For Inharmonic trials, the same vector of jitter values was applied to all of the notes in all of the trials. As in Experiments 1a and 2-4, 20 different sets of stimuli were generated, each with a distinct jitter pattern for inharmonic stimuli. Participants were randomly assigned to one of the 20 stimulus sets.
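
The melody-generation logic can be summarized in a short sketch (function and variable names are ours; note f0s are returned in Hz):

```python
import numpy as np

def make_melody_pair(rng, n_notes=5, same=True):
    """Return the note f0s (Hz) for the two melodies of one trial."""
    # first note: log-uniform, 2 semitones wide, centered on 200 Hz
    start = 200.0 * 2 ** (rng.uniform(-1, 1) / 12)
    steps = rng.choice([-2, -1, 1, 2], size=n_notes - 1)  # semitone steps
    steps2 = steps.copy()
    if not same:
        i = rng.integers(0, n_notes - 1)   # flip the sign of one random step
        steps2[i] = -steps2[i]
    contour1 = np.concatenate(([0], np.cumsum(steps)))
    contour2 = np.concatenate(([0], np.cumsum(steps2)))
    f0s_first = start * 2 ** (contour1 / 12)
    f0s_second = start * 2 ** (6 / 12) * 2 ** (contour2 / 12)  # up half an octave
    return f0s_first, f0s_second
```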

Statistical Analysis

We used d-prime as the measure of performance on the task. Data passed the Lilliefors test at a 5% significance level, so parametric statistics were used to analyze the results. A repeated-measures ANOVA was used to test for a main effect of harmonicity, and post-hoc paired t-tests were used to compare Harmonic and Inharmonic conditions at matched SNRs. These seven comparisons were not corrected for multiple comparisons because they were preceded by an ANOVA that revealed a significant main effect of harmonicity.
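
For reference, a minimal d-prime computation for this same/different task, treating ‘different’ trials as signal trials and collapsing ‘sure’/‘maybe’ responses. The log-linear correction for perfect hit or false-alarm rates is an assumption; the text does not specify one.

```python
from scipy.stats import norm

def dprime(n_hit, n_miss, n_fa, n_cr):
    """d' from trial counts, with a log-linear correction (assumed)."""
    h = (n_hit + 0.5) / (n_hit + n_miss + 1)   # hit rate
    f = (n_fa + 0.5) / (n_fa + n_cr + 1)       # false-alarm rate
    return float(norm.ppf(h) - norm.ppf(f))

# e.g., 16 hits, 4 misses, 6 false alarms, 14 correct rejections:
# dprime(16, 4, 6, 14)  ->  ~1.29
```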

Results & Discussion

Replicating previous results (McPherson & McDermott, 2018), contour discrimination without noise was indistinguishable for the Harmonic and Inharmonic conditions (t=1.12, p=.27). But in noise, performance with inharmonic stimuli was worse than with harmonic stimuli (Fig. 7b). This difference produced a significant interaction between harmonicity and SNR (F(6,444)=3.27, p=.0037, ηp2=.04). There was also a significant main effect of harmonicity (F(1,74)=58.10, p<.0001, ηp2=.44). Post-hoc pairwise comparisons showed significant differences between Harmonic and Inharmonic conditions at SNRs ranging from -20.5 to -14.5 dB (t>2.24, p<.05 in all cases; maximum p value=.028; Cohen’s d ranged from .37 to .74). These results suggest that even when note-to-note pitch changes are well above threshold (with musically relevant intervals such as 1 and 2 semitones), there is a considerable advantage for discriminating harmonic tones in noise compared to inharmonic tones. Additionally, as in Experiment 4, inharmonic tones in noise remained more difficult to discriminate than harmonic tones even when well above their detection thresholds. The effect is large enough to matter in real-world conditions. For instance, harmonic performance at -17.5 dB SNR roughly matched inharmonic performance at -14.5 dB SNR (Z=1.38, p=.17), such that an inharmonic melody just discriminable at 5.96 meters from its source would remain discriminable at 10 meters if it were harmonic. Harmonicity makes it possible to hear musical structure when background noise would otherwise render it inaudible.

Experiment 6. Detecting speech in noise

The results of Experiments 1-5 with synthetic tones raise the question of whether the detection and discrimination advantages for harmonic sounds would extend to natural sounds such as speech. In Experiment 6, we addressed this question by measuring detection thresholds for spoken syllables embedded in noise (Fig. 8a).

Fig. 8
figure 8

Harmonic advantage for detecting speech in noise (Experiment 6). a Schematic of the trial structure for Experiment 6. During each trial, online participants heard two noise bursts, one of which contained a spoken syllable, and were asked to decide whether the first or second noise burst contained speech. Speech was resynthesized to be either harmonic or inharmonic. b Results of Experiment 6. Results are shown as box-and-whisker plots, with black lines plotting individual participant results. The central mark in the box plots the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. Whiskers extend to the most extreme data points not considered outliers. Asterisks denote significance, two-sided Wilcoxon signed-rank test: *=p<0.05

Method

Participants

78 participants completed Experiment 6 online. 2 were removed because their average performance across the first run of all conditions was over three standard deviations away from the group mean. As in the other detection experiments in this paper, only the subsequent 3 runs were used for analysis. 76 participants were included in the final analysis; 33 self-identified as female and 43 as male (binary choice); mean age=37.9 years, S.D.=12.5 years.

Experiment 6 only had two conditions, and we intended to use a single two-sided Wilcoxon signed-rank test to assess the difference between them. The effect size of harmonicity measured in a pilot version of Experiment 6 was moderate (dz=0.39, with an average difference between Harmonic and Inharmonic conditions of 0.89 dB SNR), plausibly because the natural stimuli used were more variable than the synthetic tones used in other experiments (in which we observed larger effects). The pilot experiment (run with 125 participants) was identical to Experiment 6 except that each Inharmonic trial contained harmonics that were jittered independently from the other trials. A power analysis indicated that we would need to run 76 participants to be 95% sure of detecting an effect size like that in the pilot data with a .05 significance threshold.

Procedure

We measured detection thresholds for single spoken syllables embedded in noise, resynthesized to be inharmonic or harmonic (Fig. 8a). Participants judged whether the first or second noise burst contained a word. Thresholds were estimated using the same adaptive procedure as Experiments 1a-c.

Stimuli

Speech was resynthesized using the STRAIGHT analysis and synthesis method (Kawahara & Morise, 2011; McDermott et al., 2012). STRAIGHT decomposes a recording of speech into voiced and unvoiced vocal excitation and vocal tract filtering. If the voiced excitation is modeled sinusoidally, one can alter the frequencies of individual harmonics and then recombine them with the unaltered unvoiced excitation and vocal tract filtering to generate inharmonic speech. This manipulation leaves the spectrotemporal envelope of the speech largely intact, and the intelligibility of inharmonic speech in quiet is comparable to that of harmonic speech (Popham et al., 2018). The frequency jitters for inharmonic speech were chosen in the same way as those for the inharmonic complex tones of Experiments 1-5. Speech and noise were sampled at 16 kHz. Code implementing the harmonic/inharmonic resynthesis is available on the senior author’s lab web page. We used syllables containing the vowels /i/, /u/, /a/ and /ɔ/, spoken by adult male and female speakers, from the Hillenbrand vowel set (Hillenbrand et al., 1995) (h-V-d syllables). These four vowels were selected because they bound the English vowel space.
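
The voiced-component manipulation can be illustrated schematically. This is not STRAIGHT itself: `amp_env` stands in for amplitudes read off the measured spectrotemporal envelope, and recombination with the unvoiced component is omitted.

```python
import numpy as np

def jittered_voiced_component(f0_track, amp_env, jitter, sr=16000):
    """Resynthesize the voiced component as a sum of sinusoids whose
    frequencies are jittered multiples of the per-sample f0 track (Hz).
    `amp_env(k)` returns the per-sample amplitude of harmonic k."""
    out = np.zeros(len(f0_track))
    for k, j in enumerate(jitter, start=1):
        freq = (k + j) * f0_track                  # jittered harmonic track
        phase = 2 * np.pi * np.cumsum(freq) / sr   # integrate frequency
        out += amp_env(k) * np.sin(phase)
    return out  # to be recombined with the unvoiced excitation and filtering
```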

Participants heard syllables embedded in threshold-equalizing noise. Noise bursts were 650ms in duration. Syllable recordings were truncated to 250ms (in practice, because the syllables themselves were typically shorter than 250ms, the terminal consonant remained present in all cases). Syllables were centered on the noise burst, such that there was 200ms of noise before the onset of the syllable and 200ms of noise after the syllable ended.

Stimuli were generated in advance, with 20 trials pre-generated for each possible SNR level. The adaptive procedure was initialized at 2 dB SNR and capped at 16 dB SNR. The same pattern of jitter was used throughout the entire syllable, and as in the other experiments in this paper, 20 different sets of stimuli were generated, each of which used a distinct jitter pattern for inharmonic stimuli. Participants were randomly assigned to one of the 20 stimulus sets.

Statistical Analysis

Thresholds were calculated by averaging the SNR values of the final six reversals of the adaptive track. Data distributions were non-normal (skewed), so a single two-sided Wilcoxon signed-rank test was used to compare the Harmonic and Inharmonic conditions.

Results & Discussion

As shown in Fig. 8b, harmonic vowels were easier to detect in noise than inharmonic vowels (difference of .88 dB SNR, two-sided Wilcoxon signed-rank test: Z=3.45, p<.001). This result demonstrates that the effect observed with complex tones generalizes somewhat to real-world sounds such as speech. However, the effect was smaller than with tones (.88 dB here vs. 1.38 dB for tones in Experiment 1a; Cohen’s d=0.35 vs. d=.77 in Experiment 1a, averaged across Sine and Random phase conditions). In addition, the harmonic advantage varied more across participants with speech than with tones; the standard deviation of the difference between harmonic and inharmonic thresholds was 1.35 dB SNR in Experiment 1a, but 2.40 dB SNR here. This variability may reflect the additional cues available in some speech exemplars, including concurrent modulation across frequency components (Culling & Summerfield, 1995b; McAdams, 1989), and onsets and offsets of consonants (Darwin, 1981), that may be used to different extents by different listeners. The persistence of the harmonic advantage despite these factors suggests that it could affect real-world listening; but the effect may be more modest than with musical sounds.

Experiments 7 and 8: Discriminating English Vowels & Mandarin Tones in Noise

In Experiments 7 and 8, we investigated whether the observed harmonic advantage for detecting vowels in noise might translate to speech discrimination. Experiment 7 assessed discrimination of English vowels. Experiment 8 assessed discrimination of Mandarin tones.

Method

Participants: Experiment 7

142 participants (55 self-identified as female, 87 as male, 0 as non-binary; mean age=38.2 years, S.D.=11.1 years) completed Experiment 7 online. All performed within 3 standard deviations of the mean across participants, so we did not remove any participants before analysis.

We chose our sample size using data from a pilot experiment run in 276 participants. This pilot differed from the current experiment in a few respects. In addition to a slightly different set of SNRs, a different jitter pattern was used for each trial (rather than the same jitter pattern being used across trials). An ANOVA showed a modest effect of harmonicity (ηp2=.03). We aimed to run enough participants to have a 95% chance of seeing an effect this size with a .01 significance threshold. This yielded a target sample size of 134 participants.

Participants: Experiment 8

71 participants completed the experiment online. The data had a bimodal distribution when averaged across all conditions, with modes at 24% (chance performance was 25%) and 70%, suggesting there was a group of participants who either were not Mandarin Chinese speakers (having ignored the initial instructions in which we indicated that fluency in Mandarin was a requirement), or were performing at chance for other reasons. To restrict participants to those who were able to perform the task, we set an exclusion criterion of 35% accuracy (averaged across all conditions). This left 46 participants (28 self-identified as female, 18 as male, 0 as non-binary, mean age=33.3 years, S.D.=9.3 years).

We chose our sample size using pilot data. The pilot experiment, run in 13 participants, was similar to Experiment 8 apart from using different SNRs. An ANOVA showed an effect of harmonicity (ηp2=.10). We aimed to run enough participants to have a 95% chance of seeing an effect this size with a .01 significance threshold. This yielded a target sample size of 44 participants.

Procedure: Experiment 7

Participants identified the vowel they heard, presented at varying levels of background noise, in a four-alternative forced-choice task (Fig. 9a). Before beginning, participants heard both harmonic and inharmonic examples of each vowel (without background noise, and with the genders of the speakers randomized). Participants were provided with feedback after each response.

Fig. 9
figure 9

Harmonic advantage for identifying words in noise (Experiments 7 and 8). a Schematic of the trial structure for Experiment 7. During each trial, participants heard a noise burst which contained an English syllable and identified the vowel. b Results from Experiment 7. Error bars denote standard error of the mean. c Task for Experiment 8. During each trial, participants heard a noise burst which contained a single Mandarin Chinese word and identified which of four tones they heard. d Results from Experiment 8. Error bars denote standard error of the mean

Procedure: Experiment 8

Participants heard single words in Mandarin Chinese presented with varying levels of background noise (Fig. 9c). Words were either harmonic or inharmonic. The task was to identify the tone spoken in each word, from 4 options given by the four primary tones in Mandarin Chinese: 1 – flat, 2 – rising, 3 – falling then rising, 4 – falling (there is a fifth ‘neutral’ tone that we did not include). Participants were provided with feedback after each response.

Stimuli: Experiment 7

As in Experiment 6, participants heard syllables embedded in threshold-equalizing noise, but noise bursts were 1500ms in duration and syllables were not truncated. Syllables began 200ms after the onset of the noise. Participants heard one syllable per condition. The vowels and resynthesis methods were otherwise identical to those of Experiment 6. Eight SNRs were used: -15, -12.5, -10, -7.5, -5, -2.5, 0 dB, and Infinite (no noise).

To avoid the possibility that participants might learn specific exemplars of the vowel set, participants completed only eight trials per condition, for a total of 128 trials. 20 different sets of stimuli were generated, each with a distinct jitter pattern for inharmonic stimuli, and participants were randomly assigned to one of the 20 stimulus sets.

Stimuli: Experiment 8

On each trial participants heard one of 32 single-syllable words spoken by a single female talker, chosen from the ‘Projet SHTOOKA’ database (http://shtooka.net/). The full list of words is available in Supplementary Table 1. As in Experiments 6 and 7, words were resynthesized to be harmonic or inharmonic using STRAIGHT (Kawahara & Morise, 2011; McDermott et al., 2012), and embedded in threshold-equalizing noise. Noise bursts were 2000ms in duration. Words began 200ms after the onset of the noise. Four SNRs were used: -17, -13, -9 dB, and Infinite (no noise). Participants completed 12 trials for each SNR and harmonicity condition, for a total of 96 trials. Participants heard each word in the set 3 times over the course of the experiment, with conditions randomized across words. The same pattern of jitter was used for the inharmonic conditions throughout the experiment for a given participant. As in previous experiments in this paper, 20 different sets of stimuli were generated, each with a distinct jitter pattern for inharmonic stimuli.

Statistical Analysis: Experiments 7-8

Across both experiments, data did not consistently pass the Lilliefors test at a 5% significance level, so we opted to use non-parametric statistics. We used non-parametric versions of repeated-measures ANOVAs identical to those used in Experiment 1a.

Results and Discussion

Both experiments showed a modest but significant harmonic advantage for discriminating speech in noise. There were statistically significant main effects of harmonicity both for identifying English vowels (Fig. 9b; F(1,141)=14.54, p=.0002, ηp2=.09) and for discriminating Mandarin Chinese tones (Fig. 9d; F(1,45)=13.49, p=.0006, ηp2=.23), though the effect size was larger for Mandarin than for English. One explanation is that the harmonic advantage is largely driven by a pitch cue provided by the f0, a cue that may be more useful for recognizing Mandarin tones in noise than for recognizing English vowels in noise.

Effects of Musicianship

Is the benefit of harmonicity influenced by musical experience? Musical training has been proposed to benefit hearing speech in noise (Clayton et al., 2016; Coffey et al., 2017; Parbery-Clark et al., 2011; Swaminathan et al., 2015), but evidence for such musicianship advantages has been inconsistent (Boebinger et al., 2015; Madsen et al., 2019). It seemed plausible that musicianship effects might relate to harmonicity. Harmonic structure is critical to music – most musical instruments have harmonic frequency spectra, and the frequency ratios of common intervals in standard Western scales (and other scales around the world) are shared with the harmonic series. Musical training has been associated with enhanced perceptual judgments related to harmonicity: musicians show lower pitch discrimination thresholds (Bianchi et al., 2016; Kishon-Rabin et al., 2001; McDermott, Keebler, et al., 2010a; McPherson & McDermott, 2018; Micheyl et al., 2006; Spiegel & Watson, 1984), and stronger preferences for harmonic over inharmonic sounds have been observed in musicians (Dellacherie et al., 2010; McDermott, Lehr, & Oxenham, 2010b; Weiss et al., 2019) and in individuals with lifelong exposure to Western music (McDermott et al., 2016; McPherson et al., 2020). Consequently, musical training might enhance sensitivity to harmonic structure.

To assess effects of musicianship on our harmonicity effect, we tested approximately equal numbers of musicians (individuals with four or more years of formal musical training) and non-musicians (individuals with less than four years of formal musical training) in Experiments 1a and 4, to have sufficient power to analyze the groups separately. In Experiment 1a, 45 participants had four or more years of musical training (our criterion for qualifying as a ‘musician’ for the purpose of analysis), with an average of 11.0 years of training, S.D.=8.7 years. The remaining 53 participants were classified as non-musicians. Only 8 of the 53 non-musicians reported any musical training, with an average of 1.9 years, S.D.=0.95. In Experiment 4, 25 participants were classified as musicians (again with four or more years of musical training), with an average of 10.3 years, S.D.=11.3. The remaining 27 participants were classified as non-musicians. Of these, only 3 reported any musical training at all, each reporting 2 years.

For Experiment 1a, we averaged across phase conditions and compared the harmonic detection advantage (Inharmonic thresholds minus Harmonic thresholds) for the two groups. The distributions of harmonic advantages were approximately normal (evaluated using the Lilliefors test at a 5% significance level), so we used parametric tests. We observed no significant difference between groups in the size of the harmonic advantage (Fig. 10a-b; musician mean advantage=1.29 dB, S.D.=0.92; non-musician mean advantage=1.27 dB, S.D.=1.32; t(96)=-0.15, p=.88). The Bayes factor BFincl, specifying a multivariate Cauchy prior on the effects (Rouder et al., 2012), was .24, providing moderate support for the null hypothesis (JASP, Version 0.13.1, 2020).

Fig. 10
figure 10

Effects of Musicianship on Detection and Discrimination in Noise. a, b Results of Experiment 1a, separated for musicians and non-musicians. The results are averaged across sine and random phase conditions, and shown as box-and-whisker plots, with black lines plotting individual participant results. The central mark in the box plots the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. Whiskers extend to the most extreme data points not considered outliers. ***=p<0.001. c Results from Experiment 4, separated for musicians and non-musicians. Error bars denote standard error of the mean

We also examined the effects of musicianship in Experiment 4 (measuring pitch discrimination). Consistent with many previous studies (Bianchi et al., 2016; Kishon-Rabin et al., 2001; McDermott, Keebler, et al., 2010a; McPherson & McDermott, 2018; Micheyl et al., 2006; Spiegel & Watson, 1984), pitch discrimination was better in musicians than in non-musicians (Fig. 10c). While these differences were significant by a sign test (mean thresholds were higher in non-musicians for 16 of 21 conditions, p=.03), they were modest, and did not reach significance in an ANOVA (excluding the -22 and -20.5 dB SNR conditions, for which we did not measure Pure Tone thresholds; F(1,42)=1.66, p=.20, ηp2=.04). We also did not observe a significant interaction between musicianship and harmonicity (examining only the Harmonic and Inharmonic conditions at -20.5 dB SNR and above; F(1,50)=0.04, p=.84, ηp2<.001). Moreover, the harmonic advantage for pitch discrimination was pronounced in both musicians and non-musicians (significant main effect in each group; musicians: F(1,24)=124.33, p<.0001, ηp2=.83; non-musicians: F(1,26)=83.11, p<.0001, ηp2=.76). Given the lack of a musicianship effect in these two experiments, we did not attempt to recruit equal numbers of musicians and non-musicians for the other experiments. Overall, the results suggest that the effects of harmonicity measured here are not strongly dependent on musical experience.

General Discussion

We examined the effects of harmonicity on the discrimination and detection of sounds in noise. Both detection and discrimination in noise were better for harmonic sounds, but the size of the harmonic advantage varied across tasks. The largest benefits were evident for tasks involving up-down “pitch” discrimination with synthetic tones: both classic two-tone discrimination and melodic contour discrimination showed marked advantages for harmonic compared to inharmonic tones when presented in noise, despite indistinguishable performance in quiet. We also found detection benefits for harmonic tones in noise, as well as modest benefits for detecting and discriminating harmonic speech. A model of energetic masking did not replicate the observed harmonic detection benefits. Rather than being accounted for by energy or other cues traditionally associated with detection in noise, our effects seem plausibly due to a noise-robust pitch signal.

Although effect sizes varied across tasks, they were large enough in several settings to have relevance for real-world hearing. The harmonic advantage can be quantified in terms of distance from a sound source in an environment with spatially uniform background noise. Our results indicate that if a listener can just detect a harmonic tone 10 meters away from its source in such a scene, they would have to move approximately 1.4 meters closer to the source to detect a similar inharmonic sound. Similarly, when discriminating melodies, they would have to move approximately 4 meters closer to the source to achieve comparable performance with inharmonic notes. The consistency of the effects across musicians and non-musicians further suggests their importance for everyday hearing. Altogether, these results highlight a neglected aspect of auditory scene analysis.
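
These distance equivalents follow from inverse-square attenuation: with spatially uniform noise, the SNR of a source falls by 20·log10 of the distance ratio. A one-line sketch (using the 10-m reference and the ~1.38 dB detection advantage of Experiment 1a, which reproduces the ~1.4 m figure above):

```python
def matched_distance(d_ref_m, snr_advantage_db):
    """Distance at which a sound lacking the harmonic advantage must be heard
    to match performance at `d_ref_m`, assuming inverse-square attenuation in
    spatially uniform background noise."""
    return d_ref_m * 10 ** (-snr_advantage_db / 20)

# 10 - matched_distance(10, 1.38)  ->  ~1.47 m closer (the ~1.4 m cited above)
```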

A noise-robust pitch representation

Our experiments build on a body of previous studies examining the basis of pitch discrimination. We replicate previous findings that discrimination of tones in quiet is comparable for harmonic and inharmonic tones (Faulkner, 1985; McPherson & McDermott, 2018; Micheyl et al., 2012; Moore & Glasberg, 1990), likely driven by frequency shifts between notes (Demany & Ramos, 2005). However, when tones were presented in noise, we found pronounced discrimination advantages for harmonic sounds compared to inharmonic sounds. These discrimination benefits were not obviously explainable by the detection advantage we saw for harmonic tones – they were present well above detection thresholds and remained evident even after the presentation SNR was expressed relative to the detection threshold (thus accounting for the harmonic advantage in detection). The results suggest that a pitch representation based on the f0 is more noise-robust than that based on spectral features.

A noise-robust f0-based pitch signal from harmonic sounds may help music and speech sounds stand out in noisy backgrounds by contributing to their salience (Patterson, 1990). Increased salience from such an f0-based pitch signal could account for the harmonic detection advantage we observed. A role for an f0-based pitch signal is also consistent with the effects we observed on speech intelligibility – a pitch cue might be more important for recognizing Mandarin tones than English vowels, consistent with the larger effect for Mandarin compared to English (Experiment 8 vs. 7).

The noise-robustness of f0-based pitch has been hinted at in several previous lines of work. Some studies found that noise can help listeners hear out the f0 of tones with differing spectral compositions, or with non-simultaneous frequency components, compared to when such tones are presented in quiet (Hall & Peters, 1981; Houtgast, 1976; Moore & Moore, 2003). The current study complements these findings by contrasting judgments of harmonic and inharmonic tones, and by using this contrast to show that representations of the f0 facilitate task performance in noisy conditions. We know of one study that compared discrimination of harmonic and inharmonic tones in noise, using an FM detection task (Carlyon & Stubbs, 1989), but that study did not explore whether the effect of harmonicity could be explained by its effect on the detectability of the tones. Others have noted the robustness of complex tone discrimination to noise (Gockel et al., 2006; Moore & Glasberg, 1991), but did not compare harmonic with inharmonic tones to isolate representations of the f0.

The observed harmonic discrimination advantages complement evidence for the importance of f0 in memory. A recent study found comparable harmonic and inharmonic pitch discrimination when tones were presented back-to-back, but better performance for harmonic tones if a short delay was inserted between tones, suggestive of a memory representation specific to harmonic sounds (McPherson & McDermott, 2020). Correlations between participants’ thresholds for different delay and harmonicity conditions indicated that listeners rely on a representation of the spectrum when comparing sounds presented back-to-back, but switch to using a representation of the f0 when sounds have to be stored over time, perhaps because the f0 provides an efficient representation. The current results suggest that hearing in noise is another domain in which a representation of a sound’s f0 helps to discriminate pitch.

(Non)-effects of phase and cueing

We designed several stimulus manipulations to examine features other than f0-based pitch that could possibly drive the observed harmonic advantage. Amplitude fluctuations seem unlikely to account for the results because detection thresholds were similar for sine and random phase tones (Experiment 1a). Moreover, the harmonic advantage appears to be absent for tones containing only unresolved harmonics, in which amplitude fluctuations should be maximally prominent (Experiment 3). It also appears that the results do not reflect listeners having a better sense of what to listen for on trials with harmonic sounds – the harmonic detection advantage persisted even when listeners were cued to the tone in noise (Experiment 2). The results of Experiments 1, 2, and 3 place constraints on the mechanisms underlying the harmonic detection advantage, and when combined with the observed advantage for tone and melody discrimination (Experiments 4 and 5), suggest that the harmonic advantage for detection may be driven by a pitch signal that enables harmonic tones to “pop out” from noise.

Comparisons to previous studies of harmonicity and speech

One previous paradigm for examining the effects of harmonicity on auditory scene analysis measured recognition of pairs of concurrent vowels synthesized to be either harmonic or inharmonic. These studies found that when one vowel (the ‘masker’) was higher in level than the other (the ‘target’), recognition of the target was better when the masker was harmonic rather than inharmonic. By contrast, recognition of the target (lower-amplitude) vowel did not depend on whether it was harmonic or not. This finding has been taken as support for the idea that the auditory system ‘cancels’ harmonic masking sounds in order to identify concurrent target sounds, rather than ‘enhancing’ harmonic targets themselves (de Cheveigne et al., 1995; de Cheveigne, Kawahara, et al., 1997a; de Cheveigne, McAdams, & Marin, 1997b). It is not obvious how to reconcile these findings with our effects showing benefits of harmonicity on the detection of target sounds, though we note that the setting is quite different (two concurrent vowels rather than a tone in noise), such that there is no explicit inconsistency. We also note that we failed to observe comparable masker/target asymmetries in pilot experiments with natural speech resynthesized to be harmonic or inharmonic. We found that participants presented with mixtures of harmonic and inharmonic talkers more readily recognized harmonic speech than inharmonic speech regardless of the harmonicity of the masker (i.e., the opposite of the result found in the original double-vowel experiments). It thus appears that harmonicity aids hearing in different ways depending on the setting. In some cases it allows distractor sounds to be more easily ignored or suppressed, whereas in others it aids target sound detection. Future work involving models optimized for natural auditory scene analysis may help to clarify the basis of these disparate effects.

The modest effects of harmonicity on the recognition of speech in noise observed here are consistent with two previous studies (Popham et al., 2018; Steinmetzger & Rosen, 2015). The first study measured the intelligibility of speech with either harmonic or noise excitation presented with various types of masking sounds (Steinmetzger & Rosen, 2015). They found benefits of masker harmonicity, akin to the double vowel experiments discussed above, but little effect of the harmonicity of the target speech. The second study found benefits of speech harmonicity when speech was presented in babble but not when it was presented in noise (Popham et al., 2018). This latter study used the same synthesis methods used here, and the results are consistent with the small benefit of harmonicity on English vowel recognition in noise observed here.

Models of detection in noise

We compared our measurements of human detection thresholds with those for a simple energetic masking model of tone detection. Although our model replicated the difference between harmonic complex tone and pure tone detection thresholds observed in practiced participants (Experiment 1c), and evident in previous work on masking (Buus et al., 1997; Dubois et al., 2011; Green, 1958, 1960), it did not replicate the effects of inharmonicity. Specifically, the model overestimated the detectability of inharmonic tones. This finding provides additional evidence that tone-in-noise detection is not entirely mediated by the energy cues formalized in this and previous models. Previous evidence that tone-in-noise detection is not reliant on energy cues comes from the finding that pure tone detection thresholds are relatively unaffected by level differences between the two stimulus intervals in a 2AFC paradigm (Kidd Jr. et al., 1989; Lentz et al., 1999; Leong et al., 2020; Maxwell et al., 2020). Our results suggest a cue that is aggregated differently across frequency channels for harmonic vs. inharmonic tones, plausibly related to f0-based pitch.
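
The general class of model can be illustrated with a crude power-spectrum sketch. This is not the implementation used in the study; ERB-spaced rectangular bands and a fixed energy-increment criterion are assumptions standing in for the model's actual filterbank and decision rule.

```python
import numpy as np

def erb_band_edges(lo=50.0, hi=8000.0, step=1.0):
    """Band edges spaced `step` ERBs apart (Glasberg & Moore ERB-number scale)."""
    erb_num = lambda f: 21.4 * np.log10(4.37 * f / 1000 + 1)
    inv = lambda e: (10 ** (e / 21.4) - 1) * 1000 / 4.37
    return inv(np.arange(erb_num(lo), erb_num(hi), step))

def band_energies(x, sr, edges):
    """Energy in each band, computed from the power spectrum."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / sr)
    return np.array([spec[(freqs >= a) & (freqs < b)].sum()
                     for a, b in zip(edges[:-1], edges[1:])])

def tone_detected(tone, noise, sr, edges, criterion_db=1.0):
    """'Detect' the tone if adding it raises the energy in any band by more
    than `criterion_db` (the criterion value is an assumption)."""
    e_noise = band_energies(noise, sr, edges)
    e_mix = band_energies(tone + noise, sr, edges)
    return bool(np.any(10 * np.log10(e_mix / e_noise) > criterion_db))
```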

Validity of online data collection

Due to the COVID-19 shutdown that occurred while we were completing this study, we relied heavily on online data collection. Online experiments facilitate the recruitment of large numbers of participants, but sacrifice control over experimental conditions. Compared with data collected in person in a laboratory setting, there are fewer safeguards to ensure that participants are not distracted and are complying with task instructions. Additional concerns for psychoacoustics in particular include lack of control over the participant’s sound presentation hardware and listening environment, lack of control over absolute sound levels, and the inability to measure audiograms to confirm normal hearing. Here and in our previously published online experiments we took several steps to mitigate these issues. First, we included a headphone check pre-test to help ensure that headphones or earphones were worn during the experiment (Woods et al., 2017). Use of headphones/earphones should improve sound presentation quality and attenuate background noise. The headphone check also serves as a basic test of task compliance; participants who failed it did not proceed to the main experiment (typically about a third of participants). Second, we asked participants to situate themselves in a quiet environment to avoid distraction. Third, the experiment began with a calibration step in which participants adjusted the level of a calibration sound to a comfortable level, helping to ensure that stimuli were audible. Fourth, participants were asked whether they had hearing loss, and were excluded from the main experiment if they self-reported as such (in this study, about 9% of participants were excluded on this basis, though most of them also failed the headphone check and would have been excluded regardless). Fifth, we excluded participants whose performance was so poor as to suggest that they had misunderstood the instructions or were otherwise non-compliant. Here and in our previously published online experiments, the exclusion criterion was neutral with respect to the hypotheses being tested, and independent of the data we analyzed, allowing unbiased threshold estimates for the participants who were not excluded.

Are these steps sufficient to reproduce results that one would obtain in the lab? Clearly there are experiments that would not make sense to run online even with these precautions, such as those that require precise control over absolute sound levels (Florentine, 1986; Hellman & Zwislocki, 1961; Traer et al., 2021). But in many cases, online experiments produce results that are indistinguishable from those obtained in more controlled conditions. Our lab has made regular use of online experiments since well before the pandemic, and has documented numerous examples of experiments that have been run both online and in the lab. In all cases we have found that online results replicate those obtained in the lab provided the precautions described above are taken to help ensure sound presentation quality and participant compliance. These paradigms include attentive tracking (Woods & McDermott, 2018), speech recognition in noise (Kell et al., 2018), ratings of subjective continuity (McWalter & McDermott, 2019), judgments of tonal fusion (McPherson et al., 2020), adaptive pitch discrimination thresholds (McPherson & McDermott, 2020), and environmental sound recognition (Traer et al., 2021), among others.

Detection-in-noise thresholds might be expected to be particularly vulnerable to variation in sound presentation in online settings, given that they depend on the spectral content of the target and masker. To validate our online threshold measurements, we compared them to measurements obtained in the lab prior to the COVID-19 shutdown. In-lab measurements for matched experimental conditions produced similar results to those obtained online (Fig. 2d). It appears that the inevitable participant-to-participant variation in stimulus spectrum and level that one faces with an online experiment does not have a large effect on detection thresholds in noise, as with many other aspects of auditory perception that we have measured in previous studies. We regard this general finding as indicative of a strength of our field – perceptual science often involves robust effects. Careful control of stimuli is desirable and important, but when one is forced to work through a pandemic, or is in need of a particularly large sample, online experiments appear to be an adequate substitute in many cases.

Future Directions

Our findings suggest that harmonic structure improves detection and discrimination of sounds in noisy auditory scenes by providing a noise-robust pitch signal. The behavioral effects of harmonicity evident in noisy conditions may be useful for studying representations of harmonicity. For instance, detection tasks might be easily adapted to non-human animal models of hearing, and could be used to further explore and understand cross-species similarities and differences in the representations of harmonic sounds (Feng & Wang, 2017; Kalluri et al., 2008; Norman-Haignere et al., 2019; Shofner & Chaney, 2013; Song et al., 2016; Walker et al., 2019). Another promising future direction may be to use the tasks developed here to search for neural signatures of harmonicity tuning (by searching for differences in response to harmonic and inharmonic tones in noise). It could also be informative to measure the harmonic detection advantage in individuals with listening disorders, as its presence or absence might help pin down the origins of commonly observed hearing-in-noise deficits (Boets et al., 2007; Cameron et al., 2006; Dole et al., 2012; Lagace et al., 2010; Ziegler et al., 2009).