Auditory rhythms are systemically associated with spatial-frequency and density information in visual scenes

Sherman, Aleksandra; Grabowecky, Marcia; Suzuki, Satoru

doi:10.3758/s13423-013-0399-y

Brief Report
Published: 20 February 2013

Volume 20, pages 740–746, (2013)
Psychonomic Bulletin & Review

Abstract

A variety of perceptual correspondences between auditory and visual features have been reported, but few studies have investigated how rhythm, an auditory feature defined purely by dynamics relevant to speech and music, interacts with visual features. Here, we demonstrate a novel crossmodal association between auditory rhythm and visual clutter. Participants were shown a variety of visual scenes from diverse categories and asked to report the auditory rhythm that perceptually matched each scene by adjusting the rate of amplitude modulation (AM) of a sound. Participants matched each scene to a specific AM rate with surprising consistency. A spatial-frequency analysis showed that scenes with greater contrast energy in midrange spatial frequencies were matched to faster AM rates. Bandpass-filtering the scenes indicated that greater contrast energy in this spatial-frequency range was associated with an abundance of object boundaries and contours, suggesting that participants matched more cluttered scenes to faster AM rates. Consistent with this hypothesis, AM-rate matches were strongly correlated with perceived clutter. Additional results indicated that both AM-rate matches and perceived clutter depend on object-based (cycles per object) rather than retinal (cycles per degree of visual angle) spatial frequency. Taken together, these results suggest a systematic crossmodal association between auditory rhythm, representing density in the temporal domain, and visual clutter, representing object-based density in the spatial domain. This association may allow for the use of auditory rhythm to influence how visual clutter is perceived and attended.

Previous research has demonstrated a variety of perceptual correspondences between auditory and visual features. Most of these associations are based on auditory loudness mapping to visual brightness; auditory pitch (or pitch change) mapping to visual lightness, elevation, and size; and auditory timbre (often conveyed by speech sounds) mapping to sharpness/smoothness of visual contours or shapes (e.g., Bernstein & Edelstein, 1971; Evans & Treisman, 2010; Marks, 1987; Mossbridge, Grabowecky, & Suzuki, 2011; Ramachandran & Hubbard, 2001; Sweeny, Guzman-Martinez, Ortega, Grabowecky, & Suzuki, 2012).

Few studies have investigated how rhythm, an auditory feature defined purely by dynamics, may interact with visual features. Rhythm is a fundamental auditory feature coded in the auditory cortex, it plays an integral role in providing information about objects and scenes (Liang, Lu, & Wang, 2002; Schreiner & Urbas, 1986, 1998), and it conveys affective and linguistic information in music and speech (e.g., Bhatara, Tirovolas, Duan, Levy, & Levitin, 2011; Juslin & Laukka, 2003; Scherer, 1986). Intuitively, a faster auditory rhythm should be associated with visual properties that imply rapid dynamics. Consistent with this idea, Shintel and Nusbaum (2007) showed that listening to a verbal description of an object spoken at an atypically rapid rate speeded recognition of a subsequently presented picture when the picture depicted an object in motion relative to when it depicted the same object at rest. This suggests that auditory rhythm can interact with the perception of visual dynamics.

Recently, Guzman-Martinez, Ortega, Grabowecky, Mossbridge, and Suzuki (2012) have demonstrated that auditory rhythm is also associated with visual spatial frequency, a fundamental visual feature initially coded in the primary visual cortex (De Valois, Albrecht, & Thorell, 1982; Geisler & Albrecht, 1997) that is relevant for perceiving textures, objects, hierarchical structures, and scenes (Landy & Graham, 2004; Schyns & Oliva, 1994; Shulman, Sullivan, Gish, & Sakoda, 1986; Sowden & Schyns, 2006). They used a basic form of auditory rhythm conveyed by an amplitude-modulated (AM) sound (a white noise whose intensity is modulated at a fixed rate) and a basic form of visual spatial frequency conveyed by a Gabor patch (a repetitive grating-like pattern whose luminance is modulated at a fixed spatial frequency). They found that participants matched faster AM rates to higher spatial frequencies in an approximately linear relationship. This crossmodal relationship is absolute (rather than relative), in that it is equivalent whether each participant found an auditory match to only one Gabor patch, or found auditory matches to multiple Gabor patches of different spatial frequencies. The relationship is perceptual, in that it is not based on general magnitude matching in an abstract numerical representation or on matching the number of “bars” in Gabor patches to AM rates. It was further shown that the relationship is functionally relevant, in that an AM sound can guide attention to a Gabor patch with the corresponding spatial frequency. These results suggest a fundamental relationship between the auditory processing of rhythm (AM rate) and the visual processing of spatial frequency.

Although it is necessary to characterize a crossmodal relationship using simplified visual stimuli, it is also important to understand how the relationship is relevant to perception in the real world. In the natural environment, we encounter complex scenes that are characterized by many spatial frequencies. It has been shown that the responses of spatial-frequency-tuned neurons to natural scenes are not readily predictable from their responses to Gabor patches (e.g., Olshausen & Field, 2006). In the present study, we thus investigated how the basic perceptual relationship between auditory rhythm and isolated visual spatial frequencies generalized to a perceptual relationship between auditory rhythm and natural scenes composed of multiple spatial-frequency components. This investigation would also elucidate how auditory rhythm may systematically influence the processing of complex visual scenes.

Experiment 1

We first determined whether people would consistently match a variety of complex visual scenes to specific auditory AM rates. Namely, we asked, does a natural scene have an implied auditory rhythm? For example, a cluttered indoor scene might be matched to a faster AM rate than would a less cluttered indoor scene, an urban scene to a faster AM rate than a beach scene, a mountain scene to a slower AM rate than a forest scene, and so on. Indeed, we found that people consistently matched each scene to a specific AM rate. We then analyzed the spatial-frequency components of the images in order to investigate the source of this crossmodal association.

Method

Participants

The participants in all of our experiments were Northwestern University undergraduate students, who gave informed consent to participate for partial course credit, had normal or corrected-to-normal visual acuity and normal hearing, and were tested individually in a dimly lit room. A group of 20 (nine female, 11 male) students participated in Experiment 1.

Stimuli and procedures

The participants determined auditory-AM-rate matches to 24 natural scenes (see the Supplementary Materials) and three Gabor patches (0.50, 2.20, and 4.50 cycles/cm in physical spatial frequency, corresponding to 0.75, 3.30, and 6.79 cycles/degree [c/deg] in retinal spatial frequency); see Fig. 1 for the trial information. All images were randomly presented three times, totaling 81 trials. Participants were given three practice trials prior to the experiment, in which they determined AM-rate matches to geometrical patterns. All images were displayed full-screen on a 22-in. color CRT monitor (1,152 × 870 pixels, 85 Hz). An integrated head-and-chin rest was used to stabilize the viewing distance at 84 cm.

After the auditory–visual matching trials were completed, participants judged whether each scene (not including the three Gabor patches) was dense or sparse, with a forced-choice response. The image was displayed slightly smaller (20.5° × 16.0° of visual angle) in order to present the choice words “dense” and “sparse” below the image. Each scene was presented in random order in two separate blocks. Left/right placement of the two words was counterbalanced (e.g., if the word “dense” appeared on the left in the first block, it appeared on the right in the second block, or vice versa). The dense/sparse judgments provided a measure of perceived clutter.

The experiment was controlled using MATLAB software with Psychophysics Toolbox extensions (Brainard, 1997; Pelli, 1997).

Results

Participants matched specific auditory AM rates to the 24 scenes from diverse categories (nature, urban, and indoor) with surprising consistency (Fig. 2, black circles), F(23, 437) = 17.43, p < .0001 (main effect of images). The AM-rate matches to the intermixed Gabor patches were similar to those reported by Guzman-Martinez et al. (2012), in that no main effect of experiment emerged [F(1, 25) = 2.80, n.s.], although we did find a marginal interaction between experiment and spatial frequency [F(2, 50) = 2.81, p = .07] in which our participants assigned slightly lower AM rates to the highest-spatial-frequency Gabor patch (see Table 1 for the AM-rate matches in the two studies). We replicated a robust linear relationship between spatial frequency and AM rate [t(19) = 6.10, p < .0001, for the linear contrast]. The fact that our participants, who saw complex scenes as well as Gabor patches, and Guzman-Martinez et al.’s participants, who only saw Gabor patches, similarly matched Gabor patches of different spatial frequencies to specific AM rates suggests that people tend to use a consistent strategy to match AM rates to both simple Gabor patches and complex scenes. Because Gabor patches primarily carry single spatial frequencies, we hypothesized that our participants used spatial-frequency information to match AM rates to visual scenes.

Table 1 AM-rate matches to single visual spatial frequencies (SFs) presented as Gabor patches in Experiment 1, as compared with Guzman-Martinez et al.’s (2012) results for the same spatial frequencies

To evaluate this hypothesis, we applied a two-dimensional Fourier transform to each scene and obtained its spatial-frequency profile with respect to 12 spatial-frequency bins, ranging from 0.05 to 12.8 c/deg (see Table 2).

Table 2 For each spatial-frequency (SF) bin, the lower and upper boundaries are indicated in both cycles per degree and cycles per pixel
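
The spatial-frequency profile described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the analysis code used in the study: the function name `sf_profile`, the `pixels_per_degree` value, the log-spaced bin edges, and the use of summed spectral power as "contrast energy" are all assumptions made for the example (the actual bin boundaries are those listed in Table 2).

```python
import numpy as np

def sf_profile(image, pixels_per_degree, bin_edges_cpd):
    """Sum the spectral power of a grayscale image within radial
    spatial-frequency bins given in cycles per degree (c/deg)."""
    img = np.asarray(image, dtype=float)
    img = img - img.mean()                      # drop the mean-luminance (DC) term
    power = np.abs(np.fft.fft2(img)) ** 2       # 2-D power spectrum
    # Frequency of every FFT coefficient, in cycles per pixel, along each axis
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    # Radial frequency, converted from cycles/pixel to cycles/degree
    radial_cpd = np.hypot(fx, fy) * pixels_per_degree
    edges = np.asarray(bin_edges_cpd, dtype=float)
    energy = np.empty(len(edges) - 1)
    for i in range(len(edges) - 1):
        in_bin = (radial_cpd >= edges[i]) & (radial_cpd < edges[i + 1])
        energy[i] = power[in_bin].sum()
    return energy

# Hypothetical settings: 12 log-spaced bins spanning 0.05-12.8 c/deg
edges = np.geomspace(0.05, 12.8, 13)
rng = np.random.default_rng(0)
demo_image = rng.random((256, 256))             # stand-in for a scene photograph
profile = sf_profile(demo_image, pixels_per_degree=40.0, bin_edges_cpd=edges)
```

Each scene then yields one 12-element vector, which is the quantity correlated with the AM-rate matches in the next step.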

For each participant, we computed the correlation between AM-rate matches and contrast energy for each spatial-frequency bin. For example, if AM-rate matches were slower for scenes with more energy in lower-spatial-frequency components, the correlations would be negative for lower-spatial-frequency bins. If AM-rate matches were faster for scenes with more energy in higher-spatial-frequency components, the correlations would be positive for higher-spatial-frequency bins. Outlier images were removed from each correlation (across observers) using a 95% confidence ellipse (5% of the images were removed, on average). The average correlation coefficient, r, is shown as a function of spatial-frequency bin in Fig. 3 (black curve). The function is peaked. That is, we did not obtain a simple crossmodal relationship in which the contrast energy in higher-spatial-frequency components drove faster AM-rate matches. Instead, the results suggest that the faster AM-rate matches were driven by the energy in the specific midrange spatial frequencies (0.3–1.25 c/deg).
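
The per-participant correlation analysis can be illustrated as follows. This is a sketch under stated assumptions: the helper name `per_bin_correlations` and the random placeholder data are hypothetical, Pearson correlation is assumed, and the confidence-ellipse outlier removal described above is omitted for brevity.

```python
import numpy as np

def per_bin_correlations(am_rates, energy):
    """Pearson r between one participant's AM-rate matches and the
    contrast energy in each spatial-frequency bin.

    am_rates: shape (n_images,)
    energy:   shape (n_images, n_bins)
    """
    am = np.asarray(am_rates, dtype=float)
    en = np.asarray(energy, dtype=float)
    am_c = am - am.mean()                       # center the AM-rate matches
    en_c = en - en.mean(axis=0)                 # center each bin across images
    num = am_c @ en_c                           # covariance numerator per bin
    den = np.sqrt((am_c ** 2).sum() * (en_c ** 2).sum(axis=0))
    return num / den

# Placeholder data only: 20 participants, 24 scenes, 12 bins
rng = np.random.default_rng(1)
n_participants, n_images, n_bins = 20, 24, 12
energy = rng.random((n_images, n_bins))
matches = rng.random((n_participants, n_images))

r_by_participant = np.array([per_bin_correlations(m, energy) for m in matches])
mean_r = r_by_participant.mean(axis=0)          # one average r per bin
```

Plotting `mean_r` against bin center would give a curve of the kind shown in Fig. 3; with real data, a peak in the midrange bins corresponds to the pattern reported here.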

In order to gain insight into why the midrange spatial frequencies were strongly associated with faster AM-rate matches, we filtered each image within this window of spatial frequency (0.3–1.25 c/deg). Representative examples are shown in Fig. 4. An inspection of these images suggests that scenes with stronger contrast energy in this spatial-frequency range tend to have more object boundaries and contours (e.g., the top image in Fig. 4), whereas scenes with weaker contrast energy in this spatial-frequency range tend to have fewer object boundaries (e.g., the bottom image in Fig. 4). This may suggest that AM-rate matches to visual scenes are based on the numerosity of object boundaries and contours. Consistent with this idea, we found a strong aggregate correlation between the average perceived clutter rating and the average AM-rate match across the 24 scenes (r = .62), t(22) = 3.68, p = .001. Significantly positive correlations were also attained when they were computed separately for each participant, t(19) = 3.89, p = .001. This supports the hypothesis that greater contrast energy in the midrange spatial frequencies drives a faster AM-rate match, because it makes a scene appear more cluttered.
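
Bandpass filtering of this kind can be sketched with a hard-edged mask in the Fourier domain. The filter shape used in the study is not specified here, so the ideal (all-or-none) filter, the function name `bandpass_filter`, and the `pixels_per_degree` value below are assumptions made for illustration; a smoother filter would reduce ringing at edges.

```python
import numpy as np

def bandpass_filter(image, pixels_per_degree, low_cpd, high_cpd):
    """Keep only the Fourier components whose radial spatial frequency
    lies between low_cpd and high_cpd (cycles per degree)."""
    img = np.asarray(image, dtype=float)
    spectrum = np.fft.fft2(img)
    fy = np.fft.fftfreq(img.shape[0])[:, None]  # cycles/pixel, vertical
    fx = np.fft.fftfreq(img.shape[1])[None, :]  # cycles/pixel, horizontal
    radial_cpd = np.hypot(fx, fy) * pixels_per_degree
    keep = (radial_cpd >= low_cpd) & (radial_cpd <= high_cpd)
    # The mask is radially symmetric, so the inverse transform is real
    # up to rounding error; np.real discards the residual imaginary part.
    return np.real(np.fft.ifft2(spectrum * keep))

# Demo: a 1 c/deg grating plus a 4 c/deg grating at a hypothetical 32 pixels/deg
x = np.arange(256)
low_grating = np.tile(np.sin(2 * np.pi * 8 * x / 256), (256, 1))    # 1 c/deg
high_grating = np.tile(np.sin(2 * np.pi * 32 * x / 256), (256, 1))  # 4 c/deg
filtered = bandpass_filter(low_grating + high_grating, 32.0, 0.3, 1.25)
```

In the demo, only the 1 c/deg grating falls inside the 0.3–1.25 c/deg window, so the filtered image retains it and discards the 4 c/deg component.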

To determine whether spatial-frequency information contributed to AM-rate matches over and above perceived clutter, we computed the correlation between the clutter rating and the contrast energy in each spatial-frequency bin. If AM-rate matches were completely mediated by perceived clutter, the midrange spatial frequencies that strongly drive faster AM-rate matches should also strongly drive higher clutter ratings. As can be seen in Fig. 3 (gray curve), although the functions for perceived clutter and AM-rate matches are both broadly elevated within similar ranges of spatial frequencies, the peak (see Footnote 1) occurs at a significantly higher spatial frequency for perceived clutter (M = 1.30 c/deg, SD = 1.93) than for AM-rate matches (M = 0.44 c/deg, SD = 4.09), t(17) = 3.62, p < .01. Thus, although spatial frequency and perceived clutter similarly contribute to AM-rate matches, perceived clutter depends on a slightly higher range of spatial frequencies than does the crossmodal association. This suggests that the spatial-frequency profiles of visual scenes contribute to AM-rate matches over and above perceived clutter.

Experiment 2

The goal of this experiment was to determine whether the spatial-frequency-mediated crossmodal association between visual scenes and auditory AM rate was based on retinal, physical, or object-based spatial frequency. In the case of single-spatial-frequency texture patches (Gabor patches), the association is based on physical spatial frequency (Guzman-Martinez et al., 2012). For texture perception, physical spatial frequency is informative because it allows for anticipation of the felt texture prior to touching a surface. For scene perception, however, object-based spatial frequency (i.e., number of cycles per object) would be particularly informative, because it conveys information about object structure and scene features irrespective of the viewing distance or scaling of photographs (e.g., Parish & Sperling, 1991; Sowden & Schyns, 2006). Thus, it is possible that AM-rate matches to natural scenes may be based on object-based (rather than physical) spatial frequency.

We tested this hypothesis by reducing the size of the images by half, thus doubling the physical/retinal spatial frequencies (physical and retinal spatial frequencies are indistinguishable at a fixed viewing distance), without affecting the object-based spatial frequencies. If the crossmodal matches depend on physical/retinal spatial frequencies, the AM-rate matches to the individual images should change, but the critical spatial frequencies (which are strongly correlated with AM-rate matches) should remain the same in cycles per degree. In contrast, if the crossmodal matches depend on object-based spatial frequencies, the AM-rate matches to the individual images should remain the same, but the critical spatial frequency should double in cycles per degree, because an identical object-based spatial frequency corresponds to a doubled physical/retinal spatial frequency when image size is halved.
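
The scaling logic behind this prediction can be made concrete with a short calculation. The function name and the numbers below (an object spanning 4° that carries 8 cycles) are purely illustrative and are not taken from the stimuli.

```python
def retinal_sf(cycles_per_object, object_size_deg):
    """Retinal spatial frequency (c/deg) of a component that completes
    `cycles_per_object` cycles across an object subtending
    `object_size_deg` degrees of visual angle."""
    return cycles_per_object / object_size_deg

# Illustrative values only: an object carrying 8 cycles across its extent
full_size = retinal_sf(cycles_per_object=8, object_size_deg=4.0)   # 2.0 c/deg
half_size = retinal_sf(cycles_per_object=8, object_size_deg=2.0)   # 4.0 c/deg
```

Halving the image leaves the cycles per object unchanged while doubling the retinal frequency, so an object-based association predicts that the critical band shifts up by a factor of two in c/deg.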