Introduction

In many ecological environments, the naturalistic auditory scene is composed of several concurrent sounds whose spectral features overlap in both space and time. Humans can identify and differentiate overlapping auditory objects surprisingly well1. This ability was first described in the literature in the form of the “cocktail party problem”2, which is still one of the most successful paradigms in research on auditory perception. The original term was introduced to describe the particular situation of a multi-talker environment, like a cocktail party, in which a person has to select a particular speech signal while filtering out other, distracting sound signals. The challenge in such a cocktail party situation arises because all the sounds in the auditory scene sum together linearly into one single sound stream per ear. Only by segregating features originating from different spatial sources and by grouping together features originating from the same spatial source can a listener individuate the intended sound stream and then parse out the respective auditory objects from the mixture of the scene. The mechanism by which the single signal is segregated into different sound objects was termed sound segregation or “auditory scene analysis”3. According to McDermott4, the identification of different sounds in a complex auditory scene is mainly studied from two conceptually distinct perspectives: sound segregation (or “auditory scene analysis”)3 and attentional selection (first introduced by Cherry)2. According to the biased competition model5, selective attention is the central mechanism that biases processing toward behaviorally relevant stimuli by facilitating the processing of important information while at the same time filtering out or suppressing irrelevant information.

Auditory research has focused primarily on the segregation component6,7,8,9, and despite many efforts to better understand the interaction between auditory attention and segregation processes8,10,11,12,13,14,15, there is still debate about the mechanisms of auditory object formation and auditory selective attention16,17,18,19,20. However, attentional mechanisms have been described in much detail in other sensory modalities, in particular in vision. From visual attention research we have learnt how top-down attentional control can operate on visual space21,22,23,24,25,26,27,28,29, on low-level perceptual features30,31,32,33,34,35,36,37,38,39, and on high-level visual objects40,41,42,43,44,45,46,47.

Visual objects, in particular, have been described as the ‘units’ on which non-spatial attention operates in many natural settings43,45,48. In the auditory domain, we know much less about how selective attention can operate in a non-spatial manner. In particular, we lack a better understanding of how attention can facilitate object units49 and guide selection at the level of segregated sounds. This interaction of auditory selective attention and sound segregation remains an open issue4, preventing a more comprehensive understanding of both the cocktail party phenomenon and auditory scene analysis.

Within the domain of auditory selective attention, the experimental work that explicitly tried to tackle the interaction between top-down object-based attention and auditory scene analysis is relatively small compared to the experimental work on the stimulus-based psychophysics of sound perception. Early work exploited mainly the “dichotic listening” paradigm2. In this paradigm, participants listen to a different audio stream presented to each ear and are asked to pay attention to either one of them50,51,52, or sometimes to both53,54,55,56. However, dichotic listening paradigms always have a spatial component and therefore leave plenty of room for attentional lateralization confounds, which constitutes a major shortcoming when using them to investigate non-spatial attention. Later work used paradigms that manipulated specific features of the acoustic stimulus to demonstrate successful tracking of one sound signal over the other. Some studies modulated pitch19,57, others intensity level58 or spatial features such as location19. More recent studies focused on the mechanisms of the neural representation of speech signals, using neural recordings to precisely track speech signals59,60,61,62. Lastly, high-level attention modulation in a complex auditory scene was also investigated from the neural perspective with paradigms involving competing speech streams63,64, speech in noise58, and tone streams65,66.

Here, we introduce a novel stimulus set and task to study object-based attention in the auditory domain. In analogy to visual objects, we defined an auditory object as the aggregation of low-level features into grouped entities. Several auditory objects together can then constitute an auditory scene, or soundscape, e.g. the characteristic soundscape of a railroad station or a multi-talker conversation at a party. In such natural, complex auditory environments, auditory objects are temporally confined and bound entities, e.g., the words constituting a conversation or a train whistle in the soundscape of a railroad station. Notably, there have previously been various attempts to define what an auditory object is, e.g., by exploring the rules of its formation from a background of competing sounds, but without yet reaching a unanimous consensus on how the diverse mechanisms work together1,3,17. One influential operational definition was proposed by Griffiths and Warren1. Here, an auditory object is defined as something (1) that corresponds to things in the sensory world, (2) that can be isolated from the rest of the sensory world, and (3) that can be recognized or generalized beyond the single particular sensory experience. Further, object analysis may also involve generalization across different sensory modalities, such as the correspondence between the auditory and visual domain1. This operational definition has also been used to define the neural representation of auditory objects8. Our definition borrows from the previous ones and is in line with the concept of an acoustic stream, or ‘soundscape’, as a superordinate entity of individual objects67.

Again in analogy to visual paradigms used to study object-based attention68, we introduce an auditory repetition detection task, in which participants had to detect brief repetitions of auditory objects within the acoustic stream of a soundscape. The logic behind this new task is that such a repetition detection task requires the participants to fully process the acoustic stream to a cognitive level that allows them to recognize a certain, temporally extended set of low-level features as an object and to understand that this set of features was repeated. Importantly, this attention task cannot be solved by attending to a distinct low-level feature itself (e.g., a certain pitch). To also rule out spatial attention, we presented two overlapping auditory scenes (e.g., in Experiment 1 a foreign language conversation and a railroad station soundscape) from the same external speaker, attentionally cueing one or the other.

In every trial, a 750-ms long repetition is introduced in one of the two overlapping streams and participants are asked to detect any such repetitions of auditory objects as fast as possible. This task requires the processing of the acoustic stream up to the level of auditory objects and is specifically designed to investigate object-based mechanisms of selective attention, i.e., whether top-down selective attention can weigh incoming acoustic information at the level of segregated auditory objects through facilitation and/or inhibition processes.

Experiment 1: Attentional Weighting of Speech Versus Environmental Soundscenes

Methods

Participants

Eleven participants (6 females, 5 males, mean age 25.7 years, range 23–32 years, all of them right-handed and with normal hearing) took part in the behavioral experiment and were paid for their participation. All participants were naïve with respect to the purpose of the study and were not familiar with any of the languages used to create the speech stimuli. All participants provided written informed consent in accordance with the University of Trento Ethical Committee on the Use of Humans as Experimental Subjects. One participant had to be excluded from further analyses because he failed to follow the task instructions.

Stimuli

Speech and environmental sound signals: The experimental stimuli were auditory scenes, consisting of overlapping streams of (a) speech conversations embedded in (b) environmental sounds. All the speech signals were extracted from newscast recordings of various foreign languages: (1) African dialect, (2) Amharic, (3) Armenian, (4) Bihari, (5) Hindi, (6) Japanese, (7) Kurdish, (8) Pashto, (9) Sudanese, (10) Urdu, (11) Basque, (12) Croatian, (13) Estonian, (14) Finnish, (15) Hungarian, (16) Icelandic, (17) Macedonian, (18) Mongolian, and (19) Bulgarian. The environmental sound source signals were selected from soundscapes of public human places, recorded at (1) airports, (2) canteens, (3) malls, (4) markets, (5) railway stations, (6) restaurants, (7) streets, (8) trains, and (9) subways.

From each recording, we extracted a central part using the Audacity software, discarding the very beginning and end of the original signal. All recording segments were processed with custom Matlab functions to cut the sound segments to 5 seconds in length, convert them to mono by averaging both channels, and normalize them to −23 dB.
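As a rough illustration of this preprocessing, the following Matlab sketch cuts a central 5-s excerpt, averages the two channels, and normalizes the level. The RMS-based interpretation of the −23 dB target and the file names are assumptions for illustration, not the authors' exact implementation.

```matlab
% Minimal preprocessing sketch (assumed: 44.1 kHz recordings, RMS-based
% normalization to -23 dB re full scale; file names are illustrative).
[y, fs]   = audioread('recording.wav');          % original (stereo) recording
nSamp     = round(5 * fs);                       % 5-second excerpt
startIdx  = floor((size(y, 1) - nSamp) / 2);     % take a central part
seg       = y(startIdx + (1:nSamp), :);
seg       = mean(seg, 2);                        % stereo -> mono by channel averaging
targetRMS = 10^(-23/20);                         % -23 dB target level (assumed RMS reference)
seg       = seg * (targetRMS / sqrt(mean(seg.^2)));   % level normalization
audiowrite('segment_5s_norm.wav', seg, fs);
```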

Guided by the Urban Sound Taxonomy69 and Google’s Audio Set70, we chose the stimuli from high-quality YouTube recordings.

Enveloping: After these processing steps, speech signals and environmental signals still differed in their low-frequency rhythmicity and overall signal envelope: the analytical envelopes of the environmental sound epochs were rather stationary, whereas the speech signals were characterized by prominent quasi-rhythmic envelope modulations in the 4–8 Hz range. In order to further equalize the two sound streams and make them as comparable as possible, we dynamically modulated the envelope of the environmental sounds using envelopes randomly extracted from the speech signals. To do so, the envelopes of the speech signals were first extracted using the ‘envelope’ function in Matlab, which is based on spline interpolation over local maxima separated by at least 4410 samples, corresponding to 0.1 s at a sample rate of 44.1 kHz. This relatively large number of samples was chosen in order to keep the environmental sound clearly recognizable after applying the quasi-rhythmic temporal modulation.
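A sketch of this envelope transfer, under the stated assumptions (44.1 kHz sampling rate, the ‘peak’ method of Matlab’s envelope function with a 4410-sample minimum peak separation; variable names are illustrative), could look as follows:

```matlab
% Envelope equalization sketch (assumptions: fs = 44100 Hz, 'peak'
% envelope with 4410-sample peak separation; variable names illustrative).
fs = 44100;
speechEnv = envelope(speechSig, 4410, 'peak');       % spline interpolation over local maxima
speechEnv = max(speechEnv, 0);                       % guard against negative interpolated values
envMod    = envSig .* (speechEnv / max(speechEnv));  % impose speech-like envelope on environmental sound
envMod    = envMod * (sqrt(mean(envSig.^2)) / sqrt(mean(envMod.^2)));  % restore the original RMS level
```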

One-back repetition targets and overlay: In a next step, we inserted small segment repetitions to be used as repetition targets in our listening task (see Fig. 1A). For this, we randomly sampled and extracted a short sound epoch of 750 ms and repeated it immediately after the end of the segment that had been sampled. The length of the repetition targets was chosen to roughly correspond to a functional unit, like a typical acoustic event in the environmental sounds or a couple of syllables/words in the speech signals. To implement the repetition in Matlab, the initial sound signal was cut at a randomly selected sample; the original beginning, the 750-ms repetition, and the original end of the stream were then concatenated using linear ramping and cross-fading. The linear ramping uses a window of 220 samples, corresponding to 5 ms at a sample rate of 44.1 kHz. The cross-fading is achieved by adding the ramped-down end of the previous segment to the ramped-up beginning of the subsequent segment.
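A minimal Matlab sketch of this insertion step, assuming a mono 5-s column vector x at 44.1 kHz and the ramp and repetition lengths given above (the helper and variable names are illustrative, not the original code), is shown below:

```matlab
% Repetition-target insertion sketch (assumptions: fs = 44100 Hz, 750-ms
% repetition, 220-sample (~5 ms) linear ramps; x is a mono column vector).
fs      = 44100;
repLen  = round(0.750 * fs);                       % length of the repeated epoch
rampLen = 220;                                     % linear ramp (~5 ms)
cutIdx  = randi([repLen, length(x) - repLen]);     % random cut point (illustrative bounds)
segment = x(cutIdx - repLen + 1 : cutIdx);         % epoch that will be repeated
head    = x(1 : cutIdx);                           % original beginning
tail    = x(cutIdx + 1 : end);                     % original end

up   = linspace(0, 1, rampLen)';                   % ramps used for cross-fading
down = linspace(1, 0, rampLen)';

% Cross-fade two column vectors a and b: ramp a down, ramp b up, overlap-add.
xfade = @(a, b) [a(1:end-rampLen); ...
                 a(end-rampLen+1:end) .* down + b(1:rampLen) .* up; ...
                 b(rampLen+1:end)];

y = xfade(xfade(head, segment), tail);             % beginning + repetition + end
```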

Figure 1
figure 1

(A) Experimental acoustic stimuli. In each trial, the acoustic stimuli consisted of two soundscapes, one foreign language speech signal and one environmental sound signal (e.g., the soundscape of a train station), which were temporally and spatially overlaid and presented from the same centrally positioned speaker for a total of 5 seconds. The three subpanels show the time-frequency spectrogram and raw amplitude spectra of examples of the original sound streams (i.e., the speech signal and the environmental signal, respectively), as well as of the combined auditory scene that was presented to the participants (lower panel). In one of the two streams (here in the speech signal), a repetition target was embedded by replicating a 750-ms interval (see the solid red box) and repeating it directly after the original segment (see the dashed red box). Linear ramping and cross-fading algorithms were applied to avoid cutting artifacts and to render the transition between segments unnoticeable. The repetition targets had to be detected as fast and as accurately as possible. (B) Sequence of a typical trial. In each trial, a cue was presented indicating either the speech component of the signal (‘F’) or the environmental component (‘B’) or both (neutral cueing condition). Subjects were instructed to shift their attention to the cued channel and to detect any repetition targets as fast and as accurately as possible, while keeping central eye fixation throughout the trial. Cue validity was 70%.

Finally, for each trial’s audio presentation, one resulting speech signal and one environmental sound signal were overlapped to form an auditory scene, consisting of a speech conversation embedded in environmental sounds (see Fig. 1A, bottom panel). In each trial, only one of them could contain a repetition target. A set of the experimental stimuli can be freely downloaded at https://doi.org/10.5281/zenodo.1491058.
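In Matlab terms, this final overlay reduces to adding the two prepared mono signals sample by sample; the peak-based safeguard against clipping in the sketch below is an assumed extra step for illustration, not part of the reported procedure.

```matlab
% Scene construction sketch (both inputs are 5-s mono vectors at matched level).
scene = speechSig + envMod;                 % temporal and spatial overlay (same speaker)
peak  = max(abs(scene));
if peak > 1, scene = scene / peak; end      % assumed safeguard against clipping before playback
```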

Trial Sequence and Experimental Design: All stimuli were presented using Psychophysics Toolbox Version 371. Figure 1B provides an overview of a typical trial sequence. We implemented an attentional cueing paradigm with three cue validity conditions, i.e. valid, neutral, and invalid cues. Cue validity was 70%, 20% of cues were invalid, and 10% were neutral. At the beginning of each trial, a fixation cross appeared and subjects were instructed to keep central eye fixation throughout the trial (see Fig. 1B). After an interval of 1.0–2.0 s (randomly jittered), a visual cue was presented, directing auditory attention either to the “Speech” signal stream or to the auditory “Environment” stream, or to neither of them in the neutral condition. After another interval of 0.5–0.75 s (randomly jittered), the combined audio scene with overlapping speech and environmental sounds started playing for 5.0 s. The participants were instructed to pay attention to the cued stream and to respond with a button press as soon as they recognized any repetition in the sound stimuli. Accuracy and speed were equally emphasized during the instruction.
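The following Psychtoolbox-3 sketch illustrates how such a trial could be implemented. The window handle (win), cue label, audio variable, and the exact response-collection call are illustrative assumptions; only the timing parameters follow the description above.

```matlab
% Single-trial sketch with Psychtoolbox-3 (win, cueLabel, sceneAudio are
% assumed to exist; timing follows the trial description in the text).
InitializePsychSound(1);
pa = PsychPortAudio('Open', [], 1, 1, 44100, 1);        % mono playback at 44.1 kHz

DrawFormattedText(win, '+', 'center', 'center');        % fixation cross
Screen('Flip', win);
WaitSecs(1.0 + rand);                                   % 1.0-2.0 s jittered interval

DrawFormattedText(win, cueLabel, 'center', 'center');   % 'F', 'B', or neutral cue
Screen('Flip', win);
WaitSecs(0.5 + 0.25 * rand);                            % 0.5-0.75 s jittered interval

PsychPortAudio('FillBuffer', pa, sceneAudio(:)');       % 5-s combined auditory scene
onset = PsychPortAudio('Start', pa, 1, 0, 1);           % start playback, wait for onset time

keyTime = KbWait([], 2, onset + 5.0);                   % wait for a button press (max 5 s)
rt = keyTime - onset;                                   % reaction time relative to sound onset
```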

Before the actual data collection, participants were first familiarized with the sound scenes and had a chance to practice their responses to repetitions in one block of 100 trials. For practice purposes, we initially presented only one of the two sound streams at a time, so that participants had an easier time understanding which repetition targets to listen for. This training lasted 17 minutes in total.

Each subsequent testing block consisted of 100 trials, but now with overlapping sound scenes consisting of both a speech and an environmental sound stream and with the described attentional cueing paradigm. Each participant performed 3 experimental blocks, resulting in 300 experimental trials in total. Overall, our experimental design had two factors: (1) Cue validity with the conditions valid (70% of trials), neutral (10% of trials), and invalid (20% of trials), and (2) Position of the repetition target in either the speech (50% of trials) or the environmental (50% of trials) sound stream. All conditions were trial-wise intermixed.

Data Analysis: All data analyses were performed with custom scripts in MATLAB. A combination of built-in functions and custom code was used to conduct descriptive and inferential statistics. For each condition of our 2 × 3 factorial design, the mean and standard error of the mean (SEM) were calculated for both reaction times and response accuracies. Repeated-measures analyses of variance were computed on accuracy data, mean reaction times, signal detection sensitivity, and response biases. To further investigate systematic differences between individual conditions, we computed planned contrasts in the form of paired-samples t-tests between the repetition detection rates and reaction times in the valid versus invalid versus neutral cueing conditions (both in the speech and in the environmental sound stream).

However, differences in detection accuracy and reaction times can also result from changes in the response bias, for example, a tendency to reduce the amount of evidence that is required to decide whether a target had occurred. To better understand the stage of selection, i.e., whether increases in detection rate are due to changes in sensitivity or changes in the decision criterion, or both, we further computed signal detection theory (SDT) indices in the form of sensitivity indices (d’) and the response bias or criterion (c).
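Under the standard equal-variance Gaussian model, these indices can be obtained from the hit and false-alarm rates as in the sketch below; the variable names and the clamping of extreme rates are illustrative assumptions rather than the authors' exact procedure.

```matlab
% SDT indices sketch (equal-variance Gaussian model; hit/false-alarm
% counts per subject and condition are illustrative variable names).
pHit = nHits / nTargets;                         % hit rate
pFA  = nFalseAlarms / nCatchOpportunities;       % false-alarm rate

% Assumed correction: clamp rates of exactly 0 or 1 to avoid infinite z-scores.
clamp = @(p, n) min(max(p, 0.5/n), 1 - 0.5/n);
pHit  = clamp(pHit, nTargets);
pFA   = clamp(pFA, nCatchOpportunities);

dprime    = norminv(pHit) - norminv(pFA);             % sensitivity d'
criterion = -0.5 * (norminv(pHit) + norminv(pFA));    % decision criterion c
```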

Results

Accuracy

Figure 2A shows the average accuracy with which repetition targets were detected in both the speech and environmental sound stream, and Fig. 2B shows the corresponding reaction times with which the responses were given. Repetition targets were detected well above chance, but performance was clearly not at ceiling, with up to 85% correct responses in the valid cueing condition. In general, valid cues helped the participants detect repetition targets and also speeded up their responses by about 100 ms with respect to the neutral cueing condition. Invalid cues had the opposite effect. Table 1 provides an overview of the numeric values of mean detection accuracy and reaction times.

Figure 2
figure 2

Experimental results in the repetition detection task of Experiment 1 with overlaid speech and environmental-noise streams. (A) Detection performance (in percent correct) as a function of cue validity (valid, neutral, and invalid cueing condition), separately for both components of the acoustic stimuli, i.e., the environmental signal (blue) and the speech signal (red). The data are shown as means and SEM. (B) Reaction times (in seconds) as a function of cue validity (valid, neutral, and invalid cueing condition), separately for both components of the acoustic stimuli (blue: environmental signal, red: speech signal). The data are shown as means and SEM. (C) Sensitivity scores (d’) and (D) decision criteria (c) as a function of cue validity (valid, neutral, and invalid cueing condition), separately for both components of the acoustic stimuli, i.e., the environmental signal (blue) and the speech signal (red). The data are shown as means and SEM.

Table 1 Experiment 1 with overlaid speech and environmental sound streams.

A two-way repeated-measures analysis of variance (ANOVA) on the mean detection accuracy statistically confirmed a main effect of the factor Cue validity, with F(2,18) = 28.36, p < 0.001. There was also a significant main effect of the second factor Position of the repetition target (speech signal vs. environmental signal), with F(1,9) = 22.53, p = 0.001. Importantly, there was no significant interaction between the two factors, with F(2,18) = 1.61, p = 0.226, indicating that the attentional modulation by the cue validity worked similarly for both streams. Planned contrasts in the form of paired t-tests confirmed the expected direction of the attentional modulation effect: for speech and environmental sound targets combined, participants were significantly better at detecting the repetition targets in the valid than in the invalid cueing condition, t(9) = 7.5, p < 0.001. Participants also responded significantly better in the valid than in the neutral condition, t(9) = 2.83, p = 0.02, and worse in the invalid compared to the neutral condition, t(9) = −4.38, p = 0.002. Also for the speech and environmental sound stream targets separately, t-tests revealed that valid cues made participants respond more accurately compared to invalid cues, with t(9) = 7.13, p < 0.001 in the environmental signal and t(9) = 3.75, p = 0.005 in the speech signal. A significantly more accurate response was also found between the valid and neutral condition (i.e., facilitation), with t(9) = 2.32, p = 0.045 for the speech signal and t(9) = 2.34, p = 0.044 for the environmental signal. However, comparing the invalid versus neutral condition for the two different streams (i.e., suppression effects) revealed a significant accuracy cost only for the environmental signal, with t(9) = −4.07, p = 0.003, but not for the speech signal, with t(9) = −1.83, p = 0.1.

Comparing the detection accuracy under valid cueing conditions for speech signals versus environmental sound signals, a paired t-test revealed that it was somewhat harder to detect embedded repetition targets in the environmental signal than in the speech signal, with t(9) = 3.273, p = 0.010.

Reaction times

A data analysis similar to the one performed for accuracy was also conducted on reaction times, revealing congruent effects. The numeric values of the average reaction time performance in the six experimental conditions are also provided in Table 1.

Table 2 Experiment 2 with two overlaid speech streams.

A two-way repeated-measures analysis of variance (ANOVA) on mean reaction times revealed a main effect of the factor Cue validity, F(2,18) = 8.63, p = 0.002, and a main effect of the factor Position of the repetition target (speech signal vs. environmental signal), F(1,9) = 13.02, p = 0.005. Again, there was no significant interaction between the two factors, with F(2,18) = 0.51, p = 0.610.

To investigate the direction of the observed effects, planned contrasts in the form of paired t-tests were performed between the valid and invalid attention cues for the speech and background streams, combined as well as separately. Combining data from both sound streams, participants were significantly faster at identifying targets in the valid compared to the invalid cue condition, t(9) = −3.218, p = 0.010. There were also significant differences between the valid and neutral condition, t(9) = −2.41, p = 0.039 (i.e., facilitation), and between the invalid and neutral condition, t(9) = 2.44, p = 0.037 (i.e., suppression). Also for the speech and environmental sound stream targets separately, t-tests revealed that valid cues made participants respond faster compared to invalid cues, with t(9) = −3.85, p < 0.004 for targets in the environmental stream and t(9) = −2.62, p = 0.028 for repetition targets hidden in the speech stream. For the environmental signal, we found evidence for facilitation effects, i.e., faster responses in the valid than in the neutral condition, with t(9) = −2.48, p = 0.035. In the speech stream, however, we did not find any significant advantage of the valid over the neutral cueing condition, with t(9) = −1.83, p = 0.198. The opposite was true for suppression effects, i.e., comparing the invalid with the neutral cueing condition. Here, for the environmental signal, participants did not show any significant difference between invalidly and neutrally cued trials, with t(9) = 1.70, p = 0.123. Instead, participants were faster in the neutral condition if the repetition was in the speech stream, with t(9) = 2.55, p = 0.03. Finally, comparing the reaction times under valid cueing conditions for speech signals versus environmental sound signals, a paired t-test revealed that the repetition targets were detected faster in the speech signal than in the environmental signal, t(9) = −3.683, p = 0.005. These results of mean reaction times are therefore consistent with the analysis of the detection accuracy data.

Signal-detection theory (SDT) analyses

We also computed sensitivity indices (d’) using the method suggested by Macmillan and Creelman72,73. False alarms were defined as responses given before the presentation of the target. We first calculated sensitivity indices separately for each subject and each condition and then averaged the computed values separately for each of the six conditions in our 3 × 2 factorial design (with the factors Cue validity and Position of the repetition target).

Figure 2C shows the average sensitivity indices across all participants as a function of cue validity and the position of the repetition target. Participants were clearly more sensitive to repetition targets when they were validly cued. In comparison to the neutral cue condition, valid cues made participants more sensitive to repetition targets in both the speech and the environmental noise stream. Invalid cues had the opposite effect, hindering subjects’ sensitivity to these subtle auditory targets (see also Table 1 for an overview of the numeric values of d’ sensitivity and criterion). The signal detection sensitivity results were therefore congruent with both the accuracy and reaction time data.

A two-way repeated-measures analysis of variance (ANOVA) of the sensitivity scores statistically confirmed a main effect of the factor Cue validity, with F(2,18) = 21.3, p < 0.001. There was also a significant main effect of the factor Position of the repetition target (speech signal vs. environmental signal), with F(1,9) = 23.73, p < 0.001. There was no significant interaction between the two factors, with F(2,18) = 0.78, p = 0.47, indicating that the attentional modulation by the cue validity worked similarly for both streams. Planned contrasts in the form of paired t-tests confirmed the expected direction of the attentional modulation effect: for speech and environmental sound targets combined, participants were more sensitive to repetition targets in the valid than in the invalid cueing condition, t(9) = 7.19, p < 0.001. Comparing the valid and invalid conditions with the neutral condition, a significant facilitation effect was detected for the valid condition, with t(9) = 2.98, p = 0.02, and a suppression effect was found for the invalid condition, with t(9) = −3.39, p = 0.008. Also for the speech and environmental sound stream targets separately, t-tests revealed that valid cues made participants more sensitive than invalid cues, with t(9) = 6.19, p < 0.001 and t(9) = 5.53, p < 0.001 in the environmental signal and in the speech signal, respectively. Regarding the environmental signal, validly cued trials were not significantly different from trials with neutral cues, with t(9) = 1.83, p = 0.1, but there was a facilitation of sensitivity for the speech signal, with t(9) = 2.94, p = 0.02. An opposite pattern was observed comparing the invalid condition with the neutral one, revealing a significant difference when the target was in the environmental signal, with t(9) = −2.63, p = 0.03, but no significant difference for targets in the speech stream, with t(9) = −1.80, p = 0.11. Comparing the sensitivity under valid cueing conditions for speech signals versus environmental sound signals, a paired t-test revealed that sensitivity was in general higher for the speech signals than for the environmental noise signals, with t(9) = 4.56, p = 0.001.

Figure 2D shows the average criterion (c) as a function of the factors Cue validity and Position of the repetition target. Participants had a similar, relatively liberal response criterion in the valid cueing conditions for both the speech and the environmental stream. They became more conservative in the invalid cueing condition, especially when the target was embedded in the environmental signal.

A two-way repeated-measures analysis of variance of the criterion scores confirmed a main effect of the factor Cue validity, with F(2,18) = 18.17, p < 0.001, and of the factor Position of the repetition target, with F(2,18) = 18.09, p = 0.002. There was also a significant interaction between the two factors, with F(2,18) = 4.48, p = 0.03. Planned paired t-tests were conducted to test the direction of the observed effects. In general, there was a significantly more liberal decision criterion in the valid than in the invalid cueing condition, with t(9) = −5.70, p < 0.001. The difference in response criterion was also significant between the invalid and neutral conditions, with t(9) = 4.0, p = 0.003, but not between the valid and neutral conditions, with t(9) = −0.804, p = 0.44. Interestingly, for the speech and environmental signal streams separately, there was a significant liberalization of the response criterion for the environmental signal (i.e., contrasting the valid versus invalid cue condition, with t(9) = −4.86, p = 0.001), and a more conservative answering scheme when comparing the invalid and neutral conditions, with t(9) = 3.87, p = 0.004. No significant differences were observed in any other contrast.

Ethical approval and informed consent

All experiments of this study were performed in accordance with relevant guidelines and regulations and were approved by the responsible institutional review board, the University of Trento Ethical Committee on the Use of Humans as Experimental Subjects. All participants provided written informed consent.

Experiment 2: Attentional Weighting of Two Competing Speech Streams

In Experiment 1, we used an ecologically valid scenario of a speech signal being overlaid with environmental noise and asked participants to tune their attention to track one or the other input stream. Importantly, we equalized the low-level rhythmicity and the signal envelope; however, the possibility remains that some low-level differences persisted between the two types of stimuli and that any attentional weighting was based on such subtle differences alone. In other words, participants might have solved the task in Experiment 1 by focusing on lower-level features instead.

Therefore, we addressed the question of object-based attention in a second experiment in which we presented two overlaid sound streams from only one category (speech) that largely match in all low-level properties and thus require participants to fully attend to the higher-level properties. In Experiment 2, we therefore employed the same object-based repetition detection task as in Experiment 1, but had participants attend to one voice among other voices (both streams again overlaid spatially and temporally), i.e., a listening scenario that is more similar to the classic cocktail party problem but without spatial separability of the signal sources.

Participants

Ten participants (6 females, 4 males, mean age 27.5 years, range 25–33 years, all of them right-handed and with normal hearing) took part in Experiment 2. They were all naïve with respect to the purpose of the study, and none of them had participated in Experiment 1. They were not familiar with any of the languages used to create the speech stimuli. All participants provided written informed consent in accordance with the University of Trento Ethical Committee on the Use of Humans as Experimental Subjects.

Stimuli

Speech sound signals and overlay

In Experiment 2 we presented auditory scenes that consisted of two overlapping streams of speech conversation. There were no further embedded environmental sounds. The speech signals overlaid here were the same speech signals used also in Experiment 1. Again, a repetition segment of 750 ms was randomly embedded in either one of them, serving as a repetition target that had to be detected as fast and as accurately as possible. Both speech signals were presented from the same central position, making it impossible to use spatial information to solve the task. A set of the experimental stimuli can be freely downloaded at https://doi.org/10.5281/zenodo.1491058.

Trial Sequence and Experimental Design

As in the previous experiment, we had three cueing conditions, i.e. valid, neutral, and invalid cues. Cue validity was 70%, 20% of cues were invalid, and 10% were neutral. To direct the participants’ selective attention towards one or the other speech stream, we used an auditory cue, which consisted of the first 1.0-s segment of the isolated speech signal of one of the two speakers.

A typical trial sequence in Experiment 2 was very similar to that of the first experiment, but now attention was directed to one of the two speech streams by a short acoustic cue, which consisted of a short pre-play segment of one of the voices. At the beginning of each trial, a fixation cross appeared and subjects were instructed to keep central eye fixation throughout the trial. After a random interval of 1.0–1.5 s, the auditory cue was presented, directing auditory attention to one of the two speakers. In trials with the neutral cue condition, no cue was given at all. After another jittered interval of 1.0–1.5 s, the combined audio scene with both overlapping speech streams started playing and continued for 5.0 s. The participants were instructed to pay attention to the cued stream and to respond with a button press, as fast and as accurately as possible, whenever they recognized a repetition segment. Accuracy and speed were equally emphasized during the instruction.

Before the actual data collection, participants were first familiarized with the repetition segments by listening to ten individual example presentations and by then performing one short sample block of 20 overlaid sound scenes in order to practice their responses to repetitions. Each subsequent testing block consisted of 60 trials. Each participant performed five experimental blocks, resulting in 300 experimental trials in total. In this second experiment, only the factor Cue validity (with the three conditions valid, neutral, and invalid) was relevant for the behavioral analyses. All conditions were trial-wise intermixed.

Data analysis

For each condition of the factor Cue validity, the mean and standard error of the mean (SEM) were calculated for reaction times, response accuracies, and sensitivity indices. For the purpose of inferential statistics, repeated-measures analyses of variance were computed on all three behavioral measures. To further investigate systematic differences between individual conditions, we computed planned contrasts in the form of paired-samples t-tests between the repetition detection rates, reaction times, and sensitivity scores in the valid versus invalid cueing condition.

Results

Accuracy

Also in Experiment 2 with two competing speech signals, the repetition targets were detected well above chance. Figure 3A shows the average detection accuracy, and Fig. 3B shows the corresponding reaction times with which the responses were given. There was a cue-validity effect in the sense that valid cues helped the participants detect repetition targets more accurately and also speeded up their responses. Invalid cues, however, had a hindering effect compared to neutral cues (see Table 2 for all the numeric values of mean detection accuracy and reaction times).

Figure 3
figure 3

Experimental results in the repetition detection task of Experiment 2 with two overlaid speech streams. (A) Detection performance (in percent correct) as a function of cue validity (valid, neutral, and invalid cueing condition) for the second experiment with two speech sounds. The data are shown as means and SEM. (B) Reaction times as a function of cue validity (valid, neutral, and invalid cueing condition) for the second experiment with two speech sounds. The data are shown as means and SEM. (C) Sensitivity scores (d’) and (D) decision criteria (c) as a function of cue validity (valid, neutral, and invalid cueing condition) for the second experiment with two speech sounds. The data are shown as means and SEM.

A one-way repeated-measures analysis of variance (ANOVA) on the mean detection accuracy statistically confirmed a main effect of the factor Cue validity, with F(2,18) = 27.27, p < 0.001. We also calculated planned contrasts in the form of paired t-tests to confirm the direction of the attentional modulation effect: participants were significantly better at detecting repetition targets in the valid than in the invalid cueing condition, t(9) = 5.44, p < 0.001, and than in the neutral cueing condition, with t(9) = 5.18, p < 0.001. A paired t-test comparison between the invalid and the neutral cueing condition also revealed significantly better accuracy in detecting the target in the neutral condition, with t(9) = −4.48, p = 0.001.

Reaction times

An analysis of the response times revealed congruent cue-validity effects. The numeric values of the average reaction time performance in the three cueing conditions are provided in Table 2. A one-way repeated-measures analysis of variance (ANOVA) on mean reaction times revealed a main effect of the factor Cue validity, F(2,18) = 19.25, p < 0.001. To investigate the direction of the observed effects, planned contrasts in the form of paired t-tests were performed between the valid, invalid, and neutral cueing conditions. Participants were significantly faster at identifying targets in the valid compared to the invalid cue condition, with t(9) = −5.25, p = 0.001, and compared to the neutral cue condition, with t(9) = −5.34, p = 0.001. A paired t-test between the invalid and neutral conditions revealed no significant effect, with t(9) = 1.7, p = 0.12. These cue-validity effects on mean reaction times are therefore consistent with the analysis of the detection accuracy data.

Signal-detection theory (SDT) analyses

False alarms were defined as responses given before the presentation of the target. We first calculated sensitivity indices separately for each subject and each condition and only then averaged the computed values in each of the three cueing conditions. Figure 3C shows the sensitivity scores (d’). Participants became more sensitive to the subtle repetition targets when they were validly cued. Invalid cues, however, distracted attention and decreased sensitivity to repetition targets (see Table 2 for an overview of the numeric values of sensitivity indices and criteria). Overall, the sensitivity analyses revealed effects congruent with the accuracy and reaction time data.

A one-way repeated-measures analysis of variance (ANOVA) on the sensitivity scores statistically confirmed a main effect of the factor Cue validity, with F(2,18) = 16.06, p < 0.001. Planned contrasts in the form of paired t-tests confirmed the expected direction of the attentional modulation effect: participants were significantly more sensitive to repetition targets in the valid than in the invalid cueing condition, with t(9) = 4.56, p = 0.002, and also compared to the neutral cueing condition, with t(9) = 4.04, p = 0.003. No significant effect was found in a paired t-test between the invalid and neutral conditions, t(9) = −2.05, p = 0.07.

Again, we also calculated the response criterion (c) to better characterize the response bias adopted by the participants across conditions. Figure 3D shows the change in the response criterion between the three conditions, with a more liberal criterion in the validly cued trials and a more conservative response bias for the invalidly cued trials (both with respect to the neutral condition, which lies in between).

A one-way repeated-measures analysis of variance (ANOVA) on the response bias scores revealed a statistically significant effect of the factor Cue validity, with F(2,18) = 16.30, p < 0.001. Here, planned contrasts in the form of paired t-tests confirmed a significantly more liberal bias in the valid cueing condition compared to the invalid cueing condition, with t(9) = −4.71, p = 0.001, but also when comparing the valid cue condition with the neutral cueing condition, t(9) = −2.84, p = 0.02. Response biases were significantly more conservative in the invalid cueing condition compared to the neutral cueing condition, with t(9) = 3.62, p = 0.006.

Discussion

For the present study, we used novel sets of stimuli and a new repetition detection task to study object-based attention in the auditory domain. Our paradigm and stimuli were specifically conceived to tackle high-level, object-based mechanisms of selective voluntary attention, in analogy to attentional weighting paradigms used in the visual domain5,68. By presenting two spatially and temporally overlapping auditory scenes, we were able to overcome some shortcomings of previously used dichotic listening paradigms2,74 regarding the role of spatial information. In classical dichotic listening tasks, participants often listen to two temporally overlapping soundscapes, attending to one or the other, and it has been shown that the ability to focus attention on one particular stream depends on certain acoustic factors such as spatial separation, frequency distance, or the semantic level of representation75. However, in classical dichotic listening experiments the two streams are often spatially separable from each other because they are typically presented to the left vs. the right ear, respectively. This introduces potential confounds between high-level, e.g., object-based or semantic, processes and spatial attention processes. Notably, other recent studies have also addressed the problem of object formation and selective attention without using dichotic stimulation paradigms11,63,75,76,77,78,79,80,81.

Some of the more recent neuroimaging studies also made use of modified diotic paradigms (i.e. binaural listening) in which the same signal is presented to both ears. In these studies, participants selectively listened to one of the superimposed speech streams forming a multi-talker auditory scene18,64,82, to speech in synthetic noise58,59, or to tone rhythms66,83,84.

An important difference to these previous studies is that in Experiment 1 we combined two acoustic streams, a speech and a non-speech signal, in an ecologically valid way, as is typical of many everyday situations. Combining these different types of streams also brings advantages for the parsing of the auditory scene, in the sense that the two streams are less likely to be confused. In order to make the two overlaid acoustic streams comparable, we introduced a procedure that equalizes their envelope modulation, i.e. their coarse temporal dynamics, by extracting the analytic envelope from one type of signal and (across different trials) re-applying the extracted envelopes to the other type of signal. Through this envelope equalization process, the two signals became very comparable in their overall temporal structure, which allowed us to directly compare them within the same attentional weighting experiment. Although we made the two auditory streams in Experiment 1 as comparable as possible, e.g. by adjusting their respective rhythmicity and signal envelopes, some differences in difficulty remained between the speech and the environmental noise signal. This is most likely due to the fact that the human auditory system is very well tuned to processing human speech signals, resulting in inherent behavioral advantages for identifying targets in the speech stream85,86,87. Importantly, however, Experiment 2 demonstrated that the object-based attention effects could also be observed in a listening scenario in which two very similar speech streams are overlaid. In this way, the second experiment controls for both spatial and low-level feature-based attention (for features such as pitch or frequency), neither of which can be helpful in this specific task. Therefore, while the overlay of two different types of auditory streams clearly has advantages for the parsing of the scene, it is not a prerequisite for object-based attention to work.

Our approach of using a repetition detection task adds an important new behavioral variant to the set of diotic tasks used to study selective auditory attention. Similar repetition detection tasks are often found in working memory studies14,88. Here, the repetition detection task is implemented to study high-level object-based attention and was therefore based on a rather long integration window of 750-ms segments. In order to identify the repeating pattern in the auditory stream, both segments have to be processed to a relatively deep stage, presumably to a level at which auditory objects are formed and recognized and, at least to some degree, attributed some semantic interpretation. This is analogous to recent studies in the visual domain68,89 in which object-based attention was studied by having participants attend to one of two spatially overlapping streams of face and house stimuli. In this visual version of the object-based attention task, subjects, too, had to identify 1-back repeats in the respectively attended stream, i.e. the re-occurrence of the same face token or house token in two successive presentation cycles. Similar to our present stimuli in the auditory domain, the argument has been made that such a repetition task is logically only possible if the face stimuli have been analyzed at least to the level of face identification processes, which are known to involve comparably late stages of the visual hierarchy in high-level visual areas concerned with object recognition. Consequently, the attentional modulation by the task was also strongest in high-level visual areas in IT cortex. Similarly, in the current design, the two segments, i.e. the original sound segment and its repetition about 1 s later, have to be processed to a comparatively high level of sound recognition, at which at least some meaning or interpretation has been computed from the segments, in order to successfully compare them. Similar to the visual variant of the task, low-level features such as the pitch or spectral characteristics of an individual sound will not allow for a successful comparison and render the detection of a segment repetition very difficult. To accomplish this also from a technical point of view, we took special care with the cutting and clipping process involved in designing the stimulus material for this repetition recognition task: in order not to leave any clipping artifacts or other detectable low-level features in the acoustic sequence that could be exploited as low-level acoustic cues for the to-be-detected repetition targets, we employed amplitude cross-fading techniques that render the original cutting positions and the transitions between subsequent segments unnoticeable.

With these carefully designed stimuli and our repetition detection task, we tested mechanisms of selective attentional modulation and the effects of auditory attention on concurrent auditory streams. Following the biased competition theory5, selective attention is the central mechanism that biases the processing of perceptual stimuli by facilitating the processing of important information and, at the same time, filtering out irrelevant information. In the present study, top-down attention to either of the two acoustic streams (i.e., the speech stream versus the environmental stream in Experiment 1, or either one of the speech streams in Experiment 2) was hypothesized to facilitate behavioral performance in a high-level, object-based target detection task.

In both experiments, our results clearly showed the hypothesized cue validity effect: the faster and more accurate responses that were given to targets after a valid cue indicate a significant facilitation effect of top-down auditory attention. At the same time, we were also able to observe significant inhibition of the respectively non-attended stream (invalid cueing condition) in comparison to a third, neutral attentional condition, in which no cue was presented at all. This replicates many previous cue-validity results and is indicative of the notion that attentional weighting works in a very similar way on high-level, object-based auditory stimuli, presumably relying on the very same mechanisms as in other modalities or stimulus domains. Both the reaction time data and the detection accuracy showed the very same pattern of results and complemented each other. Moreover, analyses of signal detection sensitivity revealed effects congruent with the previous two measures: significantly higher d’ sensitivity indices were observed in the valid cueing condition compared to both the neutral and the invalid cueing condition. There was also a tendency to adjust the decision criterion for the cued versus un-cued auditory stream. Decision criteria were more liberal for the cued and more conservative for the un-cued auditory stream, as is typical also in classic Posner-type cueing paradigms, for example in vision90. These tendencies were present in both experiments, accompanying the attentional effects on perceptual sensitivity (d’). They can be explained by the fact that the experimental manipulation of cue validity requires unequal numbers of trials in the valid versus invalid (versus neutral) conditions, which changes the a-priori probabilities with which the target occurs in either stream. Apparently, participants can adopt independent decision criteria (i.e., adjust their response thresholds) for parts of the auditory stream that are more or less likely to contain the target. Since we designed the experiment with a cue validity of 70%, participants may have also adopted a response strategy of being more liberal in identifying a repetition target in the cued stream, thus also producing more false alarms than in the invalid cueing condition. The lower the probability with which a repetition target can occur in each of the competing sound streams, the more sensory evidence is required for a decision to report that repetition (and vice versa)90.

Top-down cueing effects comparable to ours have been observed behaviorally in tasks based on the Posner-cueing paradigm21,23,48,91,92,93,94,95,96. In a prototypical Posner-cueing paradigm, participants have to fixate a central point on the screen and to attend covertly to either side of the fixation point in order to detect the temporal onset of a brief target stimulus. Such Posner-cueing paradigms also exist for other, non-spatial attentional scenarios such as visual features30,32,33,34,35,36,37,39,97, auditory features62,65,98,99,100,101,102,103, and visual objects68,104,105,106, all of which exhibit reliable attentional facilitation effects. The robust finding of such ‘cue-validity effects’ in our study demonstrates that the concepts of attentional weighting and biased competition also hold for high-level attentional sets in the auditory domain and that the cueing paradigm, in combination with a high-level repetition detection task, can be used to study attentional facilitation at an object-based level of the auditory processing hierarchy.

The process of constructing auditory objects within the auditory processing hierarchy is complex, but clearly the formation of auditory objects has an inherent temporal dimension, which visual objects do not necessarily have: in audition, we build up representations of certain spectro-temporal regularities only as the sounds unfold over time, and on that basis we can then parse the complex auditory scene into discrete object representations107,108. In our present task, the repetition of a fairly long temporal segment needed to be detected, which was only possible if participants had their attention directed to the respective stream, allowing for deep enough processing to build up and register condensed information in the form of auditory objects, which could then be efficiently compared and matched across subsequent segments of the stream. Of course, working memory plays a crucial role in solving this kind of repetition detection task. As Conway and colleagues pointed out, auditory working memory poses important constraints on the process of object formation and the involved high-level selection processes109,110. Given that the temporal dimension of auditory signals is so inherently important for the parsing of object information, working memory is needed as a key component. In our task, for example, in order to detect repetitions in one stream, the parsed high-level object information needs to be stored and continuously updated in a working memory buffer so that any new incoming information can be sequentially matched against these stored templates.

In conclusion, our present study complements previous research that used behavioral paradigms to investigate high-level auditory attention, e.g., in multi-talker cocktail-party sound scenes, and offers two novel aspects compared to the previous literature. First, we combined a modified Posner paradigm with a repetition detection task in order to study the high-level, object-based aspects of selective attention in acoustic scenes. This attention task has the advantage that it cannot be solved based on the detection of simple low-level features; instead, it strictly requires deep, object-level or semantic-level processing of the auditory stream, allowing for the investigation of attentional weighting at higher levels of the auditory processing hierarchy. Second, we used speech streams in combination with field recordings of environmental sounds as competing sound objects, allowing us to study a particularly ecologically valid situation of competing, spatially overlapping soundscapes. Our results show robust cue-validity effects of object-based auditory attention.