Auditory Perceptual Organization
Keywords: Sound Source; Auditory System; Perceptual Organization; Interaural Time Difference; Temporal Coherence
The process of extracting acoustic features from sound waves and partitioning them into meaningful groups
Traveling pressure waves (i.e., sounds) are produced by the movements or actions of objects. Sounds therefore primarily convey information about what is happening in the environment. In addition, some information about the structure of the environment and the surface features of objects can be extracted by determining how the original (self-generated or exogenous) sounds are filtered or distorted by the environment (e.g., the notion of “acoustic daylight,” Fay 2009). In this entry we consider how the auditory system processes sound signals to extract information about the environment and the objects within it.
The auditory system faces a number of specific challenges which need to be considered in any account of perceptual organization: (1) sounds unfold in time; we can’t (normally) go back to reexamine them. Therefore, information must be extracted and perceptual decisions made in a timely manner. (2) The information contained within sounds generally requires processing over many timescales in order to extract their meaning (Nelken 2008). For example, a brief impulsive sound may tell the listener that two objects have been in collision, but a series of such sounds is needed in order for the listener to know that someone is clapping rather than walking. (3) Many objects of interest generate sounds intermittently. Therefore, some means for associating temporally discontiguous events are required. (4) Sound pressure waves are additive; what the ear receives is a combination of all concurrently active sound sources and their reflections off any hard surfaces. Many animals and birds communicate acoustically in large social groups, making the problem of source separation particularly tricky (Bee 2012). Despite these challenges, if the auditory system is to provide meaningful information about individual objects in the environment (e.g., potential mates or aggressors), it needs to partition the acoustic features into meaningful groups, a process known as auditory perceptual organization or auditory scene analysis (Bregman 1990).
Natural environments typically contain many concurrent sound sources, and even isolated sounds can be rather complex, e.g., animal vocalizations contain many different frequency components, and both the frequencies of the components and their amplitudes can vary within a single sound. The problem for the auditory system is to find some way of correctly associating the features which originate from the same sound source. The classical view of this process is that the cochlea decomposes the incoming composite sound waveform into its spectral components, generating a topographically organized array of signals which sets up the cochleotopic (or tonotopic) organization found throughout most of the auditory system, up to and including the primary auditory cortex (Zwicker and Fastl 1999). Other low-level features such as onsets, amplitude and frequency modulations, and binaural differences are extracted subcortically and largely independently within each frequency channel (Oertel et al. 2002). These acoustic features are bound together to form auditory events (Bertrand and Tallon-Baudry 2000; Zhuo and Yu 2011) or tokens (Shamma et al. 2011), i.e., discrete sounds that are localized in time and perceived as originating from a single sound source (Ciocca 2008). Events are subsequently grouped sequentially into patterns, streams, or perceptual objects.
Gestalt Grouping Principles
Good continuation: Smooth continuous changes in perceptual attributes favor grouping, while abrupt discontinuities are perceived as the start of something new. This principle can operate both within and between individual events.
Similarity: Similarity between the perceptual attributes of successive events (e.g., pitch, timbre, location) promotes grouping (Bregman 1990; Moore and Gockel 2002, 2012). Similar to the perception of visual motion (Weiss et al. 2002), it appears that it is not so much the raw difference that is important, but rather the rate of change; the slower the rate of change between successive sounds, the more similar they are judged (Winkler et al. 2012). In other words, in the auditory modality, similarity and good continuation may be equivalent.
Common fate: Correlated changes in features promote grouping, recently formalized as temporal coherence, i.e., feature correlations within time windows that span periods longer than individual events (Elhilali et al. 2009; Shamma et al. 2011).
Disjoint allocation (or belongingness): Refers to the principle that each element of the sensory input is only assigned to one perceptual object, e.g., exclusive border assignment in Rubin’s face-vase illusion. However, although generally true, this principle is sometimes violated in auditory perception, e.g., in duplex perception, the same sound component can contribute to the perception of a complex sound as well as being heard separately (Rand 1974; Fowler and Rosenblum 1990).
Closure: Objects tend to be perceived as whole even if they are not complete, e.g., a glide continuing through a masking noise if the glide offset is masked (Miller and Licklider 1950; Riecke et al. 2008). This applies more generally to the perception of global patterns (or “Gestalts”), e.g., individual notes are subsumed into a melodic pattern (McDermott and Oxenham 2008) and predictable individual speech sounds are perceived as present even if they are masked or missing (Warren et al. 1988). The auditory system is extraordinarily sensitive to repeating patterns and appears to readily use this cue to parse complex scenes (Winkler 2007; McDermott et al. 2011).
An important concept that emerges from the idea of a “Gestalt” as a pattern is that of predictability. In the case of auditory perception, this refers to expectancies about sound events that have not yet occurred. By detecting patterns (or feature regularities) in the acoustic input, the brain can construct representations that allow it to anticipate or “explain away” (Pearl 1988) future events. In this way Gestalt theory connects to the ideas of unconscious inference (Helmholtz 1885) and perception as hypothesis formation (Gregory 1980).
While visual objects are widely accepted as fundamental representational units, the notion of an auditory object is less well established, and there is as yet no universal agreement on how they should be defined, e.g., see Kubovy and Van Valkenburg (2001), Griffiths and Warren (2004), Winkler et al. (2006), Shinn-Cunningham (2008). Based on the Gestalt principles and ideas of perceptual inference, outlined above, Winkler et al. (2009) proposed a definition of an auditory perceptual object as a predictive representation, constructed from feature regularities extracted from the incoming sounds. These object representations are temporally persistent and encode distributions over featural and temporal patterns, determined by the current context. The consolidated object representation therefore refers to patterns of sound events; individual sound events are processed within the context of the whole to which they belong. This definition of an auditory perceptual object is compatible with the definition of an auditory stream, as a coherent sequence of sounds separable from other concurrent or intermittent sounds (Bregman 1990). However, whereas the term “auditory stream” refers to a phenomenological unit of sound organization, with separability as its primary property, the definition proposed by Winkler et al. (2009) emphasizes the extraction and representation of the unit as a pattern with predictable components (Winkler et al. 2012). While the usage of the term object is not universally accepted within the auditory domain, we will use it in this entry as defined by Winkler et al. (2009).
Auditory Scene Analysis
In order to determine the perceptual qualities of individual sound events, the brain must first bind their component features even though the number of concurrent auditory objects and which features belong to each is unknown a priori; this must be inferred incrementally from the ongoing sensory input. Therefore, it is clear that the auditory system needs to use (top-down) contextual information to guide its grouping decisions and needs some means for evaluating these decisions and revising them in the event that they prove to be incorrect. In the currently most widely accepted framework describing perceptual sound organization, auditory scene analysis, Bregman (1990) proposes two separable processing stages. The first stage is suggested to be concerned with partitioning sound events into potential groups based primarily on featural similarities and differences. The second stage, within which prior knowledge and task demands exert their influence, is a competitive process between candidate organizations that determines which one is perceived. Within this framework there are two types of grouping: simultaneous grouping based on concurrent cues and sequential grouping based on contextual temporal cues. For the reasons outlined above, these two are not really distinct: simultaneous cues are influenced by prior sequential grouping, e.g., Darwin et al. (1995) and Bendixen et al. (2010b), just as sequential grouping is influenced by the perceptual qualities of individual events, i.e., by simultaneous grouping (Bregman 1990). Nevertheless, they provide a useful starting point for models of auditory scene analysis.
In the absence of sequential grouping cues, there are some features which automatically trigger the formation of individual sound events; for reviews see Darwin and Carlyon (1995) and Ciocca (2008). Common onsets and offsets form clear temporal boundaries, and the strategy adopted by the auditory system is to match onsets to offsets (including similarities between features and temporal proximity) in order to segregate perceptual events (Nakajima et al. 2000). Harmonicity (i.e., the presence of frequency components which are integer multiples of a common fundamental frequency) is another important grouping cue (Darwin and Carlyon 1995). For example, when one component of a complex harmonic tone is mistuned, listeners perceive two concurrent sounds, a complex tone consisting of the harmonically related components and a pure tone, corresponding to the mistuned component (Moore et al. 1986). However, not all acoustic features trigger concurrent grouping, e.g., a location cue (common interaural time differences) between a subset of frequency components within a single sound event does not generate a similar segregation of component subsets within individual sound events (Culling and Summerfield 1995).
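The harmonicity cue can be illustrated with a minimal sketch. It is not one of the models cited in this entry; it simply checks which components of a complex tone lie close to an integer multiple of an assumed, already known fundamental frequency and flags the remainder as candidates for being heard out as a separate tone. The function name `split_by_harmonicity`, the 3 % tolerance, and the example frequencies are illustrative assumptions rather than values taken from the cited studies.

```python
def split_by_harmonicity(components_hz, f0_hz, tolerance=0.03):
    """Partition frequency components into harmonic and mistuned sets.

    A component counts as harmonic if it lies within `tolerance`
    (a fraction of the nearest harmonic frequency) of an integer
    multiple of f0_hz. The tolerance is an illustrative assumption.
    """
    harmonic, mistuned = [], []
    for f in components_hz:
        n = max(1, round(f / f0_hz))            # nearest harmonic number
        deviation = abs(f - n * f0_hz) / (n * f0_hz)
        if deviation <= tolerance:
            harmonic.append(f)
        else:
            mistuned.append(f)
    return harmonic, mistuned


# Hypothetical 200 Hz complex whose 4th harmonic (800 Hz) is shifted up by 8 %.
tone = [200.0, 400.0, 600.0, 864.0, 1000.0]
harmonic, mistuned = split_by_harmonicity(tone, f0_hz=200.0)
print("grouped as complex tone:", harmonic)
print("heard out separately:  ", mistuned)
```

In this toy example the shifted component is assigned to its own group while the remaining components stay together, mirroring the two concurrent sounds described above. A real system would also have to estimate the fundamental frequency, which is taken as given here.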
Another important strategy for segregating sound events is template matching. If people have prior knowledge of events, then it is possible to hear them out. This effect was exploited in the many double-vowel experiments used to test the influence of different acoustic features, e.g., Assmann and Summerfield (1990) and Summerfield and Assmann (1991), and even in the absence of featural differences, it was shown that known vowel sounds can be identified well above chance (Assmann and Summerfield 1989). This template-matching phenomenon appears to be rather general and applies to any sound that is repeated. The auditory system is very sensitive to repetition (Teki et al. 2011). If a previously unheard sound is repeated against a different background, then it can be segregated and identified significantly above chance, even with only a single repetition, and even if many of the usual grouping cues are absent (McDermott et al. 2011). Similarly, arbitrary repeated noise segments can be rapidly learnt within a few trials (Agus et al. 2010).
Models of Event Formation
Many models have been developed to investigate simultaneous grouping and the segregation of perceptual events, e.g., see models described in Wang and Brown (2006). A model of auditory saliency which used low-level cues of spectral and temporal contrast to highlight salient events in continuous noisy soundscapes predicted human event detection very well (Kayser et al. 2005). Temporal contrasts effectively highlight onsets and offsets, while spectral peaks carry information about the resonances of sound sources and to some extent their identity (von Kriegstein et al. 2007). The segregation of overlapping events using pitch cues has been widely explored (c.f. Pitch Perception, Models), e.g., for explaining enhanced double-vowel segregation (de Cheveigne et al. 1995). The segregation of events using repetition was shown to be possible in principle by using a combination of cross-correlation and averaging to incrementally build a representation of the repeated target (McDermott et al. 2011). Because of the importance of longer-term context on grouping, none of these models provide general solutions to the problem of auditory scene analysis; nevertheless, they provide important building blocks in this process.
Sequential grouping generally conforms to the Gestalt principles of similarity/good continuation and common fate. In contrast to concurrent grouping, sequential grouping is necessarily based on some representation of the preceding sounds; for reviews, see (Moore and Gockel 2002; Carlyon 2004; Haykin and Chen 2005; Snyder and Alain 2007; Ciocca 2008; Shamma and Micheyl 2010; Shamma et al. 2011; Moore and Gockel 2012). Most studies of this class of grouping have used sequences of discrete sound events to investigate the influences of acoustic features and temporal structure. In the most widely used experimental approach (termed the auditory streaming paradigm), sequences of alternating sound events differing in some feature(s) are presented to listeners (van Noorden 1975). When the feature separation is small and/or they are delivered at a slow pace, listeners predominantly hear a single integrated stream containing all the sounds. With large feature separation and/or fast presentation rates, listeners report hearing the sequence separate out into two segregated streams. In this there is a cue trade-off: smaller feature differences can be compensated with higher presentation rates and vice versa (van Noorden 1975). Differences in various auditory features, including frequency, pitch, loudness, location, timbre, and amplitude modulation, have been shown to support auditory stream segregation (Vliegen and Oxenham 1999; Grimault et al. 2002; Roberts et al. 2002). Thus it appears that sequential grouping is based on perceptual similarity, rather than on specific low-level auditory features (Moore and Gockel 2002, 2012). Temporal structure has also been suggested as a key factor in segregating streams either by guiding attentive grouping processes (Jones 1976; Jones et al. 1981; Large and Jones 1999) or through temporal coherence that binds correlated component features in the auditory input (Elhilali et al. 2009; Shamma and Micheyl 2010; Shamma et al. 2011, 2013).
Models of Auditory Streaming
Early models of auditory streaming, e.g., Beauvois and Meddis (1991), focused on the relationship between frequency differences and event rate and the proposal that streaming could be explained almost exclusively by peripheral channeling mechanisms (Hartmann and Johnson 1991) or the degree of overlap between neural responses to each of the alternating tones, e.g., McCabe and Denham (1997). In these models the perceptual decision was represented by levels of activation across a spatial array of neurons; see also Micheyl et al. (2005) for a similar interpretation of neural activity in primary auditory cortex. A different approach in which grouping is signaled by temporal correlations within network responses was proposed by Wang, Brown, and colleagues (Brown and Wang 2006; Wang and Chang 2008). For example, the model proposed by Wang and Chang (2008) consists of a 2-dimensional array of oscillators with one dimension representing frequency and the other external time. Units are connected by local excitatory connections and by global inhibition. Characteristic results of classical auditory streaming experiments (van Noorden 1975) are simulated by including strong local excitatory connections (encouraging synchronization) and weaker long-range connections (which are easily overcome by inhibition and therefore encourage desynchronization). Sensitivity to event rate is modeled by dynamic weight adjustments. However, while the representation of grouping is different from the models previously outlined, this model also depends on peripheral channeling and the degree of overlap in the incoming activity patterns to determine its grouping decision.
A similar focus on temporal coherence (in this case the average correlation within a sliding window 50–500 ms in duration) is seen in the model of streaming proposed by Elhilali and colleagues, e.g., Elhilali and Shamma (2008) and Shamma et al. (2011). (Note: Figs. 6 and 9 in this entry have incorrect colour scale labels, with 0 % and 100 % interchanged; see Shamma and Elhilali (2013).) The computational model developed by Elhilali and Shamma (2008) extracts multiple features from the incoming acoustic input including frequency, pitch, direction, and spectral shape and assigns the resulting activity patterns to one of two clusters which come to represent the properties of the events in each stream. The temporal coherence measure is used to determine which components should be grouped. The clusters compete to incorporate each event, and the winning cluster uses the event features (as determined by the grouping process) to refine its representation. These correlation-based models overcome a problem faced by the population separation account of streaming (Micheyl et al. 2005), which predicted that widely separated components would be segregated even if they overlapped in time, which is not the case (Elhilali et al. 2009). They also provide a means for binding the component features of an event, not considered in the earlier models. Later refinements to the temporal coherence account of streaming (Shamma et al. 2011, 2013) included the strong claims that (a) feature binding occurs only with attention, i.e., attention is responsible for grouping features that belong to the foreground object, cf. Treisman (1998), and (b) all other features remain ungrouped in an undifferentiated background.
However, the proposed role of attention in feature binding has long been debated in the visual domain, e.g., Duncan and Humphreys (1989), and it is not consistent with the results of experiments testing feature binding in the absence of attention by recording auditory event-related potentials (AERP) in response to rare feature combinations (Takegata et al. 2005; Winkler et al. 2005a).
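The core idea of a temporal coherence measure, i.e., averaging the correlation between feature channels over sliding windows, can be sketched in a few lines. This is a simplified illustration and not the model of Elhilali and Shamma (2008): it compares only two channels, represents each as a binary on/off envelope, and uses a plain Pearson correlation. The 10 ms frame length, the 300 ms window (chosen from within the 50–500 ms range mentioned above), and all function names are assumptions made for the example.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def temporal_coherence(env_a, env_b, window, hop):
    """Average windowed correlation between two channel envelopes."""
    scores = [pearson(env_a[i:i + window], env_b[i:i + window])
              for i in range(0, len(env_a) - window + 1, hop)]
    return mean(scores)


def tone_train(n_frames, period, onset, duration):
    """Binary envelope: 1.0 while a tone is on, 0.0 otherwise (one frame = 10 ms, assumed)."""
    return [1.0 if (t - onset) % period < duration else 0.0
            for t in range(n_frames)]


# Two frequency channels, 100 ms tones repeating every 200 ms, 2 s in total.
channel_a = tone_train(200, period=20, onset=0, duration=10)
synchronous_b = tone_train(200, period=20, onset=0, duration=10)   # tones coincide
alternating_b = tone_train(200, period=20, onset=10, duration=10)  # tones alternate

coh_sync = temporal_coherence(channel_a, synchronous_b, window=30, hop=10)
coh_alt = temporal_coherence(channel_a, alternating_b, window=30, hop=10)
print("synchronous:", round(coh_sync, 3), "alternating:", round(coh_alt, 3))
```

In this toy setting the synchronous channels are positively correlated and would be candidates for grouping, whereas the alternating channels are anticorrelated, regardless of how far apart the two channels are in frequency. This is the property that distinguishes coherence-based accounts from accounts based purely on population separation.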
Competition and Selection
The models described above all conform to the assumptions that in response to alternating two-tone sequences, (a) auditory perception always starts from the integrated organization and (b) that eventually a stable final perceptual decision is reached (Bregman 1990). However, it has been found, when listeners report their percepts continuously while listening to such sequences for long periods, that perception fluctuates between different perceptual organizations (Winkler et al. 2005b; Pressnitzer and Hupe 2006). Perceptual switching occurs in all listeners and for all combinations of stimulus parameters tested (Anstis and Saida 1985; Roberts et al. 2002; Denham and Winkler 2006; Pressnitzer and Hupe 2006; Schadwinkel and Gutschalk 2011; Denham et al. 2012), even combinations very far from the ambiguous region identified by van Noorden (1975). Furthermore, for stimuli with parameters that strongly promote segregation, participants often report hearing segregation first (Deike et al. 2012; Denham et al. 2012). It has also been found that perceptual organizations other than the classic integrated and segregated categories may be reported (Bendixen et al. 2010a, 2012; Bőhm et al. 2012; Denham et al. 2012; Szalárdy et al. 2012), showing that auditory perceptual organization in response to alternating two-tone sequences is multistable (Schwartz et al. 2012).
The notion of perceptual multistability is challenged by everyday subjective experience of a world perceived as stable and continuous and by experimental results obtained by averaging over the reports of different listeners, which generally show that within the initial 5–15 s of a two-tone sequence, the probability of reporting segregation monotonically increases (termed the buildup of auditory streaming) (but see Deike et al. (2012)). For these reasons it has been suggested that perceptual multistability observed in the auditory streaming paradigm may be simply a consequence of the artificial stimulation protocol used. However, there is a growing body of experimental data supporting the existence of multistability, and just as visual multistability has provided new insights into visual processing, e.g., Kovacs et al. (1996), it seems likely that understanding spontaneous changes in the perception of unchanging sound sequences will help throw new light on auditory perception.
Modeling Multistability in Auditory Streaming
Multistability of auditory perceptual organization cannot be explained by any of the theories or models outlined above, which all have essentially one fixed attractor. Models of visual multistability have a longer history, e.g., Laing and Chow (2002); Shpiro et al. (2009); van Ee (2009). These models typically contain three essential components (Leopold and Logothetis 1999): (a) mutual inhibition between competing stimuli to ensure exclusivity (i.e., perceptual awareness generally switches between the different alternatives rather than fusing them), (b) adaptation to ensure the observed inevitability of perceptual switching (the dominant percept cannot remain dominant forever), and (c) noise to account for the observed stochasticity of perceptual switching (successive phase durations are largely uncorrelated, and the distribution of phase durations resembles a gamma or log-normal distribution) (Levelt 1968). The questions for auditory multistability are what the competing entities are and what form this competition takes in order to explain the dynamic nature of perceptual awareness reported by listeners.
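The three components listed above can be combined in a very small simulation. The sketch below is a generic two-unit firing-rate model with mutual inhibition, slow adaptation, and additive noise, integrated with the Euler method; it is written in the spirit of the visual rivalry models cited above but does not reproduce any of them, and it makes no claim about what the competing entities are in audition. All parameter values, the sigmoid activation function, and the names `simulate_rivalry` and `phase_durations` are illustrative assumptions.

```python
import math
import random


def simulate_rivalry(duration_s=10.0, dt=0.001, seed=0,
                     drive=1.0, beta=1.0, g=1.0,
                     tau=0.01, tau_a=1.0, sigma=0.05,
                     theta=0.5, k=0.05):
    """Return the index (0 or 1) of the dominant unit at every time step.

    beta  : strength of mutual inhibition (exclusivity)
    g     : strength of slow adaptation (inevitability of switching)
    sigma : standard deviation of input noise (stochastic switching)
    All values are illustrative assumptions, not fitted parameters.
    """
    rng = random.Random(seed)

    def f(x):
        # Sigmoid activation with threshold theta and slope 1/k.
        return 1.0 / (1.0 + math.exp(-(x - theta) / k))

    r = [1.0, 0.0]   # firing rates; unit 0 starts dominant
    a = [0.0, 0.0]   # adaptation variables
    dominant = []
    for _ in range(int(duration_s / dt)):
        # Input = common drive - inhibition from rival - own adaptation + noise.
        x = [drive - beta * r[1 - i] - g * a[i] + sigma * rng.gauss(0.0, 1.0)
             for i in range(2)]
        r = [r[i] + dt * (-r[i] + f(x[i])) / tau for i in range(2)]
        a = [a[i] + dt * (-a[i] + r[i]) / tau_a for i in range(2)]
        dominant.append(0 if r[0] >= r[1] else 1)
    return dominant


def phase_durations(dominant, dt=0.001):
    """Durations (s) of completed dominance phases; the final, unfinished phase is dropped."""
    durations, run = [], 1
    for prev, cur in zip(dominant, dominant[1:]):
        if cur == prev:
            run += 1
        else:
            durations.append(run * dt)
            run = 1
    return durations


dominant = simulate_rivalry()
durations = phase_durations(dominant)
print("completed phases:", len(durations))
```

Setting `g=0.0` removes adaptation, in which case the initially dominant unit is expected to remain dominant, and setting `sigma=0.0` makes the alternation regular. Comparing such variants is one way to see why all three components are considered necessary.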
The computational model of auditory multistability proposed by Mill et al. (2013) is based on the idea that auditory perceptual organization rests on the discovery of recurring patterns embedded within the stimulus, constructed by forming associations (links) between incoming sound events and recognizing when a previously discovered sequence recurs and can thus be used to predict future events. These predictive representations, or proto-objects (Rensink 2000; Winkler et al. 2012), compete for dominance with any other proto-objects which predict the same event (a form of local competition) and are the candidate set of representations that have the potential to become the perceptual objects of conscious awareness. This model accounts for the emergence of, and switching between, alternative organizations; the influence of stimulus parameters on perceptual dominance, switching rate, and perceptual phase durations; and the buildup of auditory streaming. In a new sound scene, the proto-object that is the easiest to discover determines the initial percept. Since the time needed for discovering a proto-object depends largely on the stimulus parameters (i.e., to what extent successive sound events satisfy/violate the similarity/good continuation principle), the first percept strongly depends on stimulus parameters. However, the duration of the first perceptual phase is independent of the percept (Hupe and Pressnitzer 2012), since it depends on how long it takes for other proto-objects to be discovered (Winkler et al. 2012). The model also accounts for the different influences of similarity and closure on perception; the rate of perceptual change (similarity/good continuation) determines how easy it is to form the links between the events that make up a proto-object, while predictability (closure) does not affect the discovery of proto-objects, but can increase the competitiveness (salience) of a proto-object once it has been discovered (Bendixen et al. 2010a).
Neural Correlates of Perceptual Organization
Neural responses to individual sounds are profoundly influenced by the context in which they appear (Bar-Yosef et al. 2002). The question is to what extent the contextual influences on neural responses reflect the current state of perceptual organization. This question has been addressed by a number of studies ranging in focus from the single neuron level (c.f. stimulus-specific adaptation) to large-scale brain responses (c.f. auditory evoked potentials), and the results provide important clues about the processing strategies adopted by the auditory system.
Studies investigating single neuron responses to alternating tone sequences, e.g., Fishman et al. (2004), Bee and Klump (2005), Micheyl et al. (2005), and Micheyl et al. (2007), have shown an effect called differential suppression, i.e., at the start of the sequence, the neuron responds to both tones, but with time the response to one of the tones (typically corresponding to the best frequency of the cell) remains relatively strong, while the response to the other tone diminishes. Since neuronal sensitivity to frequency difference and presentation rate was found to be consistent with the classical van Noorden (1975) parameter space, it was claimed that differential suppression was a neural correlate of perceptual segregation (Fishman et al. 2004). This was supported by the finding that spike counts from neurons in primary auditory cortex predict an initial integration/segregation decision closely matching human perception (Micheyl et al. 2005; Bee et al. 2010). However, differential suppression does not account for perceptual multistability or for the perception of overlapping tone sequences (Elhilali et al. 2009); therefore, while differential suppression may be a necessary component of the auditory streaming process, it does not provide a complete explanation.
Auditory event-related brain potentials (AERPs) represent the synchronized activity of large neuronal populations, time locked to some auditory event. Because they can be recorded noninvasively from the human scalp, they have been widely used to study the brain responses accompanying auditory stream segregation; cf. auditory event-related potentials, especially long-latency AERP responses. Three AERP components are of particular relevance in this regard: (a) the “object-related negativity” (ORN) which signals the automatic segregation of concurrent auditory objects (Alain et al. 2002), (b) the amplitude of the auditory P1 and N1 which varies depending on whether the same sounds are perceived as part of an integrated or segregated organization (Gutschalk et al. 2005; Szalárdy et al. 2013), and (c) the mismatch negativity (MMN; Näätänen et al. 1978) which has been used as an indirect index of auditory stream segregation, e.g., Sussman et al. (1999); Nager et al. (2003); Winkler et al. (2003a); Gutschalk et al. (2005).
The detection and representation of regularities by the brain, as indexed by the MMN, provided the basis for the definition of an auditory object proposed by Winkler et al. (2009). Using evidence from a series of MMN studies, they defined an auditory object as a perceptual representation of a possible sound source, derived from regularities in the sensory input (Winkler 2007, 2010) that has temporal persistence (Winkler and Cowan 2005) and can link events separated in time (Näätänen and Winkler 1999). This representation forms a separable unit (Winkler et al. 2006a) that generalizes across natural variations in the sounds (Winkler et al. 2003b) and generates expectations of parts of the object not yet available (Bendixen et al. 2009).
It should be pointed out that while traditional psychological accounts of auditory perceptual organization implicitly or explicitly refer to representations of objects, there are models of auditory perception which are not concerned with positing a representation directly corresponding to auditory objects. The hierarchical predictive coding model of perception, e.g., Friston and Kiebel (2009), includes predictive memory representations, which are in many ways compatible with the notion of auditory object representations (Winkler and Czigler 2012), but no explicit connection with object representations is made. Shamma and colleagues’ temporal coherence model of auditory stream segregation (Elhilali and Shamma 2008; Elhilali et al. 2009; Shamma et al. 2011, 2013) provides another way to avoid the assumption that object representations are necessary for determining sound organization; instead it is proposed that objects are essentially whatever occupies the perceptual foreground and exist only insofar as they do occupy the foreground. In summary, there is currently little consensus on the role of auditory object representations in perceptual organization, and the importance placed on object representations by the various models and theories differs markedly.
fMRI studies of auditory streaming have found neural correlates in a number of brain regions. In one of the earliest studies, Cusack (2005) failed to find differential activity in auditory cortex corresponding to perceptual organization into one or two streams, but he did find such activity in the intraparietal sulcus, an area associated with cross-modal processing and object numerosity. Shortly afterwards Wilson et al. (2007) showed that auditory cortical activity increased with increasing frequency difference and that as the frequency difference increased, the cortical response changed from being rather phasic (i.e., far stronger at the onset of the sequence) towards a more sustained response throughout the stimulus sequence. Taking a closer look at the dynamics of cortical activity associated with perceptual switching, Kondo and Kashino (2009) showed that both auditory cortex and thalamus are involved, with an increase in thalamic activity preceding that in cortex associated with a switch from the nondominant to the dominant percept and, conversely, an increase in cortical activity preceding that in thalamus associated with a switch from the dominant to the nondominant percept. They also found differential activation in posterior insular cortex and in the cerebellum. Interestingly, activations in the cerebellum and thalamus are negatively correlated in auditory streaming, with the left cerebellar activation level increasing with the rate of perceptual switching and thalamus (medial geniculate) decreasing (Kashino and Kondo 2012). 
Consistent with these findings, Schadwinkel and Gutschalk (2011), using a different stimulus paradigm which allowed them to influence the timing of perceptual switching, found transient auditory cortical activation associated with perceptual switching and a further transient activation in inferior colliculus, although whether the inferior colliculus is responsible for triggering switching or simply reflects the transient switching activation in cortex is not clear. In summary, neural correlates of auditory streaming have been found in many areas within the auditory system and beyond, suggesting that creating and switching between alternative perceptual organizations involve a broadly distributed network within the brain.
Conclusions and Open Questions
The Gestalt principles and their application to auditory perception instantiated in Bregman’s (1990) two-stage auditory scene analysis framework provided the initial basis for understanding auditory perceptual organization, and recent proposals have extended this framework in interesting ways. Nevertheless, there remain many unanswered questions and there have been few, if any, attempts to build neuro-computational models capable of dealing with the complexity of real auditory scenes in which grouping and categorization cues are not immediately available; however, see (Yildiz and Kiebel 2011). Feedback connections are pervasive within the auditory system, including all stages of the subcortical system, yet to our knowledge no models include such connections. Although fMRI results are useful for identifying regional involvement, detailed understanding of the neural circuitry involved in auditory perceptual organization is sketchy, and the neural representations of auditory objects and perceptual organization are unknown. Even the role of primary auditory cortex remains something of a mystery, e.g., see Nelken et al. (2003) and Griffiths et al. (2004); perhaps studying the switching of perceptual awareness between different representations in awake behaving animals will help to elucidate the representations and processing strategies adopted by cortex.
- Anstis S, Saida S (1985) Adaptation to auditory streaming of frequency-modulated tones. J Exp Psychol Hum Percept Perform 11:257–271
- Bendixen A, Bőhm TM, Szalárdy O, Mill R, Denham SL, Winkler I (2012) Different roles of similarity and predictability in auditory stream segregation. J Learn Percept (in press)
- Bőhm TM, Shestopalova L, Bendixen A, Andreou AG, Georgiou J, Garreau G, Pouliquen P, Cassidy A, Denham SL, Winkler I (2012) Spatial location of sound sources biases auditory stream segregation but their motion does not. J Learn Percept (in press)
- Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. MIT, Cambridge, MA
- Brown GJ, Wang DL (eds) (2006) Neural and perceptual modelling. Computational auditory scene analysis: principles, algorithms, and applications. Wiley/IEEE Press, Chichester
- Darwin CJ, Carlyon RP (1995) Auditory grouping. In: Moore BCJ (ed) The handbook of perception and cognition: hearing, vol 6. Academic, London, pp 387–424
- Denham SL, Gyimesi K, Stefanics G, Winkler I (2012) Multistability in auditory stream segregation: the role of stimulus features in perceptual organisation. J Learn Percept (in press)
- Hartmann WM, Johnson D (1991) Stream segregation and peripheral channeling. Music Percept 9(2):153–183
- Helmholtz H (1885) On the sensations of tone as a physiological basis for the theory of music. Longmans, Green, London
- Köhler W (1947) Gestalt psychology: an introduction to new concepts in modern psychology. Liveright Publishing Corporation, New York
- Large EW, Jones MR (1999) The dynamics of attending: how people track time-varying events. Psychol Rev 106:119–159
- Levelt WJM (1968) On binocular rivalry. Mouton, Paris
- McCabe SL, Denham MJ (1997) A model of auditory streaming. J Acoust Soc Am 101(3):1611–1621
- Mill R, Bőhm T, Bendixen A, Winkler I, Denham SL (2013) Competition and cooperation between fragmentary event predictors in a model of auditory scene analysis. PLoS Comput Biol (in press)
- Miller GA, Licklider JCR (1950) The intelligibility of interrupted speech. J Acoust Soc Am 22:167–173
- Moore BCJ, Gockel HE (2002) Factors influencing sequential stream segregation. Acta Acust 88:320–333
- Näätänen R, Gaillard AWK, Mäntysalo S (1978) Early selective attention effect on evoked potential reinterpreted. Acta Psychol 42:313–329
- Oertel D, Fay RR, Popper AN (2002) Integrative functions in the mammalian auditory pathway. Springer, New York
- Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, San Mateo
- Shamma SA, Elhilali M (2013)
- Szalárdy O, Bendixen A, Tóth D, Denham SL, Winkler I (2012) Modulation-frequency acts as a primary cue for auditory stream segregation. J Learn Percept (in press)
- Szalárdy O, Bőhm T, Bendixen A, Winkler I (2013) Perceptual organization affects the processing of incoming sounds: an ERP study. Biol Psychol (in press)
- van Noorden LPAS (1975) Temporal coherence in the perception of tone sequences. Doctoral dissertation, Technical University Eindhoven
- von Ehrenfels C (1890) Über Gestaltqualitäten (English “On the qualities of form”). Vierteljahrsschr Wiss Philos 14:249–292
- von Kriegstein K, Smith DR, Patterson RD, Ives DT, Griffiths TD (2007) Neural representation of auditory size in the human voice and in sounds from other resonant sources. Curr Biol 17(13):1123–1128
- Wang DL, Brown GJ (2006) Computational auditory scene analysis: principles, algorithms, and applications. Wiley/IEEE Press, New York
- Wertheimer M (1912) Experimentelle Studien über das Sehen von Bewegung. Z Psychol 60
- Winkler I (2007) Interpreting the mismatch negativity. J Psychophysiol 21:147–163
- Winkler I (2010) In search for auditory object representations. In: Winkler I, Czigler I (eds) Unconscious memory representations in perception: processes and mechanisms in the brain. John Benjamins, Amsterdam, pp 71–106
- Zhuo G, Yu X (2011) Auditory feature binding and its hierarchical computational model. In: Third international conference on artificial intelligence and computational intelligence. Springer
- Zwicker E, Fastl H (1999) Psychoacoustics. Facts and models. Springer, Heidelberg/New York