Music can elicit a visual motion aftereffect
Motion aftereffects (MAEs) are thought to result from the adaptation of both subcortical and cortical systems involved in the processing of visual motion. Recently, it has been reported that the implied motion of static images in combination with linguistic descriptions of motion is sufficient to elicit an MAE, although neither factor alone is thought to directly activate visual motion areas in the brain. Given that the monotonic change of musical pitch is widely recognized in music as a metaphor for vertical motion, we investigated whether prolonged exposure to ascending or descending musical scales can also produce a visual motion aftereffect. After listening to ascending or descending musical scales, participants made decisions about the direction of visual motion in random-dot kinematogram stimuli. Metaphoric motion in the musical stimuli did affect the visual direction judgments, in that repeated exposure to rising or falling musical scales shifted participants’ sensitivity to visual motion in the opposite direction. The finding that music can induce an MAE suggests that the subjective interpretation of monotonic pitch change as motion may have a perceptual foundation.
Keywords: Music cognition; Sound recognition; Motion integration; Visual perception
Visual motion can be described as a space–time correlation (Watson & Ahumada, 1983), which in the real world is typically produced by consistent displacement of an object in space over time. The perception of visual motion yields much survival-relevant information to an organism, enabling cognitive functions such as image segmentation, breaking of camouflage (Nakayama & Loomis, 1974), heading perception, and navigation (Warren, Blackwell, Kurtz, Hatsopoulos, & Kalish, 1991), among many others.
However, we can infer visual motion even when none is present in the stimulus. For example, observers infer motion even if they are only presented with two subsequent static images (Freyd, 1983). The sense of motion experienced from these static stimulus displays is similar to that evoked by visual motion, and some evidence has indicated shared neural pathways in real and implied motion perception (Kourtzi & Kanwisher, 2000; Lorteije et al., 2006; Peuskens, Vanrie, Verfaillie, & Orban, 2005; Senior et al., 2000). These listed cases all featured variation in contrast across static images, but even that is not necessary. In a study by Winawer, Huk, and Boroditsky (2008), participants viewed static images that each individually implied motion in a specific direction (e.g., from left to right), such as a runner pictured frozen in motion. After prolonged exposure to a sequence of similarly static pictures, each depicting motion in the same direction, participants showed systematically biased forced choice responses to random-dot kinematogram (RDK) motion in the opposite direction, which is similar to the well-known motion aftereffect (MAE; Anstis, Verstraten, & Mather, 1998; Purkinje, 1820, 1825; Wohlgemuth, 1911). Motion is not literally perceived in the static adaptors, yet the consequences for judgments of visual motion are consistent with the perception of motion.
Beyond this effect of static images suggesting motion, there is evidence that metaphoric movement—patterns that are not necessarily correlated with visual motion but that provide information about motion—can produce similar effects. Dils and Boroditsky (2010) reported that verbal descriptions of motion can produce contrastive changes in judgments of visual motion, suggesting that even symbolic descriptions of motion can affect motion perception in much the same way as real or implied visual motion. It thus seems that nonvisual information can affect motion judgments in a way that is similar to physical visual motion.
In music, composers have long used the metaphoric mapping of pitch change to vertical motion. For example, Symphony No. 6 by Franz Joseph Haydn (1761), known as Le Matin or “Morning,” uses an ascending melodic line to convey the rising of the sun. In fact, the metaphor that equates frequency changes with vertical movement is so strong in Western culture that it often pervades the everyday lexicon used to describe musical events (e.g., the falling bass line/the rising melody). When referring to a rising melody played on a keyboard instrument, nothing physically rises; the hands move in a left-to-right fashion. An even more striking example of the pitch–verticality metaphor is found in the language describing the movements of large stringed instruments, such as the cello or double bass. For these instruments, playing a rising melody requires one to move down in vertical space, yet the gesture is often called “moving up the fingerboard.”
While the origin of these musical metaphors remains unclear, likely incorporating cultural norms and knowledge (e.g., Eitan & Tubul, 2010) as well as embodied experience (Turner, 1987), the fact remains that these conceptual metaphors are pervasive in the understanding of musical meaning. Thus, scales and melodies that increase and decrease in frequency are classified as rising and falling, respectively, yet musical pitch changes are correlates neither of movement in the vertical axis nor necessarily of physical consequences of the movement. Since understanding musical change as movement is an important way in which music has been thought to convey information, auditory metaphors in descriptions of music are used extensively to convey vast amounts of information about dynamic processes, such as movement (Gjerdingen, 1994; Johnson & Larson, 2003; Todd, 1992). Moreover, listeners can certainly understand some of these musical devices as intended: Eitan and Granot (2006) asked participants to imagine movement as they heard several short music motifs, and they found consistent mappings between the manipulation of certain musical parameters (e.g., pitch, acceleration) and bodily movement.
Given the strength and pervasiveness of the pitch–verticality metaphor, one might hypothesize that specific frequency changes can cross-modally produce changes in motion judgments. Maeda, Kanai, and Shimojo (2004) demonstrated that continuously ascending or descending pitch glides could disambiguate visual motion, in that ambiguous visual motion presented with an ascending auditory stimulus was more likely to be judged as moving up, and ambiguous visual motion presented with a descending auditory stimulus was more likely to be judged as moving down. Furthermore, metaphoric auditory signals have been shown to facilitate the comprehension of implied visual motion (Hedger et al., 2011) as well as real motion (Sadaghiani, Maier, & Noppeney, 2009). Thus, if congruent metaphoric auditory motion can facilitate comprehension of both implied and real visual motion, it also may be possible that prolonged exposure to a particular implied direction of motion in either audition or vision could shift sensitivity to congruent motion in the other modality, in a manner similar to an MAE. An auditorily induced MAE would more directly implicate a basic visual mechanism than it would top-down changes in visual judgment.
This is the question that Kitagawa and Ichihara (2002) addressed by adapting participants to either auditory or visual motion, and then measuring the degree to which adaptation in one modality elicited an MAE in the other modality. The authors found that an auditory aftereffect is elicited from adaptation by visual motion, but not the converse, providing evidence for visual dominance in cross-modal aftereffects. However, this asymmetry might have had to do with the particular stimuli, for the authors investigated expanding/contracting visual stimuli and perceived dynamics for auditory stimuli. It is thus possible that the pitch–verticality correspondence is a stronger metaphoric mapping, and thus can elicit a visual MAE from purely auditory adaptation. This is one of the hypotheses tested by Jain, Sally, and Papathomas (2008). They found that after adaptation to continuously rising or falling auditory motion, participants were more likely to judge ambiguous sinusoidal gratings as moving in the opposite direction. This was the only auditory-to-visual adaptation that produced a significant MAE (adaptation to rightward/leftward panned auditory motion, as well as looming/receding auditory motion, did not result in a visual MAE). This suggests that experiencing a continuous pitch increase as a pitch rise can affect our perception of visual motion.
In music, the relationship between pitch changes and vertical motion might not be as straightforward as it might seem from the foregoing discussion. More specifically, in the previously outlined studies, the auditory stimuli consisted of continuous acoustic frequency changes, whereas notes are generally presented as discrete steps in music. In this manner, pitch changes in music are more similar to apparent visual motion (e.g., a dot being flashed on opposite sides of a screen, with no physical signal linking the perceived motion). Furthermore, representing pitch change as music might also shift attention away from motion information (e.g., one might attend to melodic, timbral, or scalar information, without directly attending to contour).
To test whether an MAE could be elicited by music, we measured the effects of prolonged listening to ascending and descending musical scales on visual perception. If the perceptual metaphor of pitch change in music selectively adapts direction-selective neurons in motion-processing areas, as physical motion does, we would predict similar changes in perception. Thus, taking the results obtained by Winawer et al. (2008) to their logical conclusion, we predict a cross-modal MAE, in that repeated exposure to ascending musical scales would result in a systematic shift that would favor downward motion, while repeated exposure to descending musical scales would result in a systematic shift favoring upward motion. The present study expands upon the previous literature by testing whether auditory signals that are constrained by the parameters of Western music are sufficient to elicit cross-modal MAEs.
A group of 43 University of Chicago undergraduates were recruited to participate in the experiment. All of the participants were naive to the purposes of the experiment, did not have any hearing problems, and had normal or corrected-to-normal vision. Three of the participants did not perform adequately on the baseline task (see the Baseline Task section for details), and thus did not participate in the main task.
The experiment was displayed on a cathode-ray-tube (CRT) monitor with a screen resolution of 1,024 × 768 pixels and a refresh rate of 75 Hz. The music stimuli, which consisted of a sine-wave timbre, included ten different musical scales (Ionian, Aeolian, Dorian, Phrygian, Mixolydian, chromatic, whole-tone, octatonic, hexatonic, and pentatonic) starting on every note within an octave range, resulting in 120 music stimuli for both upward and downward motion. Each scale consisted of eight notes, meaning that some scales (e.g., chromatic) did not span an octave range, while others (e.g., whole-tone) spanned slightly more than an octave range. The duration of each musical scale was 500 ms (thus, each note lasted 62.5 ms), and all of the stimuli were normalized to an average of 70 dB SPL. While discrete notes lasting only 62.5 ms might seem too fast to be perceived musically, there are numerous examples of musical events (e.g., trills) that occur at this speed. Moreover, in a separate discrimination study, we found that participants were able to discriminate between types of scales with over 95 % accuracy.
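The arithmetic of the scale stimuli can be made concrete with a minimal sketch. It is not the authors' stimulus code: the 440-Hz reference pitch, the 44.1-kHz sample rate, and all function names are assumptions introduced for illustration, and only three of the ten scale types are included because their step patterns are unambiguous.

```python
import math

# Semitone steps between successive notes (standard music-theory patterns).
# Only three of the ten scale types are shown here.
STEP_PATTERNS = {
    "ionian": [2, 2, 1, 2, 2, 2, 1],
    "chromatic": [1] * 7,
    "whole_tone": [2] * 7,
}

NOTES_PER_SCALE = 8
SCALE_DURATION_MS = 500.0
NOTE_DURATION_MS = SCALE_DURATION_MS / NOTES_PER_SCALE  # 62.5 ms per note


def scale_semitones(name, ascending=True):
    """Semitone offsets (relative to the lowest note) of an eight-note scale."""
    offsets = [0]
    for step in STEP_PATTERNS[name]:
        offsets.append(offsets[-1] + step)
    return offsets if ascending else offsets[::-1]


def note_frequency(semitone_offset, base_hz=440.0):
    """Equal-tempered frequency; base_hz is an assumed reference pitch."""
    return base_hz * 2.0 ** (semitone_offset / 12.0)


def synthesize(name, start_semitone=0, ascending=True, sample_rate=44100):
    """Concatenate pure sine tones, one per note, into a list of samples."""
    samples_per_note = int(sample_rate * NOTE_DURATION_MS / 1000.0)
    samples = []
    for offset in scale_semitones(name, ascending):
        freq = note_frequency(start_semitone + offset)
        samples.extend(
            math.sin(2.0 * math.pi * freq * n / sample_rate)
            for n in range(samples_per_note)
        )
    return samples
```

Under these patterns, an eight-note chromatic scale spans 7 semitones (less than an octave), the Ionian scale spans exactly 12, and the whole-tone scale spans 14, which is consistent with the description above. Ten scale types starting on each of 12 notes give the 120 stimuli per direction.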
The RDK stimuli used in this task were based on those used by Newsome and Paré (1988) and had previously been utilized to measure MAEs from adaptation to real (Blake & Hiris, 1993) as well as implied (Winawer et al., 2008) motion. Using RDK stimuli marks an important deviation from the previous literature on cross-modal MAEs (e.g., Jain et al., 2008), as RDK stimuli lack any trackable visual features and have a much richer spatiotemporal frequency content. Each RDK stimulus consisted of 100 moving dots. This stimulus display took up approximately one-third of the screen, or about 12º × 9º of visual angle. Participants were seated approximately 70 cm away from the computer screen. On each trial, a subset of the dots moved coherently either up or down by approximately 0.05º per frame. All of the other dots disappeared and reappeared at random points within the rectangular window. Each trial lasted 1 s, consisting of 25 stimulus frames of 40 ms each.
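The dot-update rule described above can be sketched as follows. This is an illustration under stated assumptions, not the original display code: the text does not say whether the coherent subset is redrawn on every frame or how dots leaving the window are handled, so the sketch redraws the subset each frame and wraps dots around the vertical edge. All names are hypothetical.

```python
import random

WIDTH_DEG, HEIGHT_DEG = 12.0, 9.0   # approximate window size from the text
N_DOTS = 100
STEP_DEG = 0.05                     # displacement of coherent dots per frame
N_FRAMES = 25                       # 25 frames x 40 ms = 1 s


def random_position(rng):
    return (rng.uniform(0, WIDTH_DEG), rng.uniform(0, HEIGHT_DEG))


def next_frame(dots, coherence, direction, rng):
    """Advance one frame; direction is +1 for upward, -1 for downward motion."""
    n_signal = round(coherence * len(dots))
    # Assumption: the coherent subset is re-selected on every frame.
    signal_idx = set(rng.sample(range(len(dots)), n_signal))
    new_dots = []
    for i, (x, y) in enumerate(dots):
        if i in signal_idx:
            # Assumption: coherent dots wrap around the window edge.
            new_dots.append((x, (y + direction * STEP_DEG) % HEIGHT_DEG))
        else:
            # Noise dots disappear and reappear at a random location.
            new_dots.append(random_position(rng))
    return new_dots


def make_trial(coherence, direction, seed=0):
    """Return the list of frames (each a list of dot positions) for one trial."""
    rng = random.Random(seed)
    frames = [[random_position(rng) for _ in range(N_DOTS)]]
    for _ in range(N_FRAMES - 1):
        frames.append(next_frame(frames[-1], coherence, direction, rng))
    return frames
```

For example, `make_trial(0.4, +1)` would produce a 1-s upward trial at 40 % coherence, in which 40 of the 100 dots step upward on each frame transition.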
Before participating in the main experiment, participants completed a baseline measure to estimate their motion detection thresholds. The baseline task consisted of 390 randomly ordered 1-s RDK stimuli (195 with upward motion, and 195 with downward motion). The dot-motion coherence levels in the baseline task ranged from 5 % to 65 %, in 5 % increments. A logistic function was fit to the responses, in which “downward” was arbitrarily coded as negative, and “upward” as positive. Asymptotic performance in both directions was calculated for each participant on the basis of the fitted logistic function. The point at which participants reached asymptotic performance was defined as the maximal coherence unit for each participant. In the main task, we tested motion judgments for RDKs with the maximal coherence unit, 50 % of the maximal coherence unit, and 25 % of the maximal coherence unit as test stimuli, resulting in six types of visual stimuli per participant (three levels of upward motion and three levels of downward motion). For instance, if a participant reached asymptotic performance at 40 % coherence, their coherence levels would be 40 %, 20 %, and 10 % for the main task. Three participants were likely unable to detect the motion contained within the RDK stimuli, as they performed so poorly on the baseline task that their normalized coherence unit would have been over 100 %. Consequently, they did not continue on to perform the main task.
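The way the main-task coherence levels follow from the baseline estimate can be illustrated with a short sketch. The logistic fit itself is omitted, since the exact criterion for asymptotic performance is not specified above; the function names and the signed coding of the output are illustrative assumptions.

```python
def baseline_levels():
    """Baseline coherence levels in percent: 5 % to 65 % in 5 % steps."""
    return list(range(5, 70, 5))


def main_task_coherences(max_unit_pct):
    """Signed test coherences (percent) for one participant.

    Negative values code downward motion and positive values upward motion,
    following the coding used for the logistic fit. Returns None when the
    maximal coherence unit exceeds 100 %, mirroring the exclusion rule.
    """
    if max_unit_pct > 100:
        return None
    magnitudes = [max_unit_pct, 0.5 * max_unit_pct, 0.25 * max_unit_pct]
    return [-m for m in magnitudes] + magnitudes
```

The baseline therefore contains 13 coherence levels; if trials were divided evenly across levels (an assumption), the 195 trials per direction would correspond to 15 repetitions per level. A participant with a maximal coherence unit of 40 % would be tested at 40 %, 20 %, and 10 % in each direction.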
Following the baseline task, participants completed the main portion of the experiment, which consisted of four blocks. Each of these blocks consisted of 30 trials. A trial consisted of a presentation of music followed by a visual RDK display for which an “up” or “down” decision was made. The presentation of each block was structured so that participants listened to a set of either ascending or descending musical scales (one direction only per block) for either 60 s (prior to the first RDK test trial) or 6 s (all subsequent RDK test trials). Thus, two ascending and two descending music blocks were presented, with no adjacent blocks featuring the same musical direction, and the first block being randomly determined across participants. As each musical scale lasted 500 ms, before the first RDK test trial, participants heard all of the musical scales (120) in a randomized order, and for each of the subsequent trials they heard a randomly selected subset of the musical scales (12), also presented in a randomized order, prior to each visual direction judgment.
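The block and trial structure can be summarized in a small scheduling sketch. It is illustrative only: the identifiers are hypothetical, and drawing the 12-scale subsets without replacement is an assumption, as the text states only that the subset was randomly selected.

```python
import random

SCALE_MS = 500      # duration of one musical scale
N_SCALES = 120      # scales available per direction
TOP_UP_SCALES = 12  # scales heard before each subsequent test trial


def block_directions(rng):
    """Four blocks with alternating direction and a random first block."""
    first = rng.choice(["ascending", "descending"])
    other = "descending" if first == "ascending" else "ascending"
    return [first, other, first, other]


def adaptor_for_trial(trial_index, rng):
    """Indices of the scales played before the RDK test on a given trial."""
    ids = list(range(N_SCALES))
    if trial_index == 0:
        rng.shuffle(ids)                   # all 120 scales: 60 s
        return ids
    return rng.sample(ids, TOP_UP_SCALES)  # 12 scales: 6 s (no repeats assumed)


def block_schedule(rng, n_trials=30):
    return [adaptor_for_trial(t, rng) for t in range(n_trials)]


def adaptation_seconds(schedule):
    """Total music exposure within one block, in seconds."""
    return sum(len(scales) for scales in schedule) * SCALE_MS / 1000.0
```

With these numbers, each block contains 60 s of initial adaptation plus 29 top-ups of 6 s, or 234 s of music in total.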
After completing the main task, each participant filled out two questionnaires (see the Appendix). The first questionnaire addressed musical background (e.g., how many instruments one played and for how long), while the second questionnaire addressed previous knowledge about the MAE (e.g., whether participants had ever heard of MAEs). After completing the questionnaires, participants were debriefed and compensated.
Since participants’ responses were binary (“up” or “down”), we ran a generalized linear mixed-effects model with a logistic link (e.g., Agresti, 2007; Baayen, Davidson & Bates, 2008). The fixed effects of the model were coherence level (an ordered factor) and music direction (ascending or descending), while video number (trial) and participants were treated as random effects.
To measure model fits, we used the corrected Akaike information criterion for finite sample sizes (AICc). The AICc provides a goodness-of-fit measurement for models while imposing a penalty for extra parameters. To compare models, we used a delta AICc measure (Δi), which is obtained by subtracting the AICc of the best model from the AICc of the tested model. As a general rule of thumb, a Δi value less than 2 suggests substantial evidence for the model, a Δi between 3 and 7 indicates that the model has considerably less support, and a Δi over 10 suggests that the model is highly unlikely (Burnham & Anderson, 2002).
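For readers unfamiliar with the measure, the standard AICc formula and the Δi comparison can be sketched as follows. The function names are illustrative, and any log-likelihoods or model labels passed to them are placeholders rather than values from this study.

```python
def aicc(log_likelihood, k, n):
    """Corrected AIC for a model with k parameters fit to n observations."""
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def delta_aicc(scores):
    """Delta_i for each model: its AICc minus the AICc of the best model."""
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}
```

The best model always receives a Δi of 0, and larger values indicate progressively weaker support relative to it.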
These results converge with more traditional psychophysical measurements, such as points of subjective equality (PSEs). In this study, the PSE was defined as the coherence level that was required to obtain the 50 % point on the ordinate (i.e., where the fitted logistic function reached the point at which participants were equally likely to judge the visual stimulus as ascending or descending). Since the coherence levels were personally calculated for each participant, if the music had no effect on visual perception, we would expect both the ascending and descending music conditions to yield a PSE of 0 normalized coherence units. However, contrary to this null hypothesis, we found a significant shift in the PSE based on the metaphoric direction of the music (i.e., ascending or descending). Averaged across participants, the PSEs were –.15 normalized units of coherence for ascending musical stimuli, and .23 normalized units of coherence for descending musical stimuli (with downward and upward visual motion being arbitrarily coded as negative and positive, respectively). This difference in PSEs was significant [t(39) = 2.87, p < .01]. In other words, after prolonged exposure to metaphorically descending music, participants’ PSEs were shifted toward making more ascending responses for the RDK stimuli, and after prolonged exposure to metaphorically ascending music, participants’ PSEs were shifted toward making more descending responses.
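The PSE definition used here follows directly from the logistic function: if the probability of an "up" response is modeled as a logistic function of signed coherence, the 50 % point is where the linear predictor equals zero. A minimal sketch, with hypothetical parameter names and no values taken from the fitted data:

```python
import math


def p_up(coherence, intercept, slope):
    """Logistic probability of an 'up' response at a signed coherence level."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * coherence)))


def pse(intercept, slope):
    """Coherence at which 'up' and 'down' responses are equally likely."""
    return -intercept / slope
```

A fit with a zero intercept yields a PSE of 0 normalized coherence units, which is the null expectation described above; any nonzero intercept displaces the PSE away from zero.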
In order to assess whether the impact of the music adaptor was moderated by individual musical experience, we added instrument experience as a covariate in the model. We added this covariate because of the possibility that certain instruments might support the metaphoric mapping of pitch to vertical space more than others. Specifically, we looked at the participants’ years of experience with instruments that have a congruent metaphoric mapping between pitch and vertical space (e.g., clarinet, oboe), in which ascending physical motions produce higher pitches, as we thought that participants who had extensive experience on these instruments might show a larger MAE. However, the best models identified by an exhaustive search over generalized logistic mixed-effects models did not include music experience as a covariate.
While the metaphoric mapping between pitch and vertical space is pervasive in Western culture, the genesis of this metaphor and the mechanism that mediates its interpretation are less clear. Specifically, the understanding of this metaphor could either be an inferential effect, based on cultural norms and knowledge, or it could be determined by perceptual effects that are rooted in the same mechanism by which we perceive visual motion. The present results provide evidence for the latter interpretation.
Paradigms such as the visual motion aftereffect have long been interpreted in terms of the nature of visual neural circuits (e.g., Barlow & Hill, 1963). Specifically, the visual MAE is taken as reflecting the reduction in neural sensitivity of direction-tuned opponent mechanisms from prolonged exposure to one direction of motion, thus increasing sensitivity to the other direction. However, the present study shows that visual motion perception is affected after listening to metaphoric musical motion, in a manner consistent with real and static visual motion adaptation. This suggests that the metaphoric motion information conveyed by music does not merely affect postperceptual reasoning. Taken in the context of the general interpretation of the MAE, this cross-modal transfer of adaptation from metaphoric auditory motion to real visual motion provides evidence that ascending and descending musical scales may well be stimulating visual direction-selective neurons. This conclusion is supported by the fact that no participant correctly identified the purpose of the study, and only a quarter of the participants had even heard of the MAE. Furthermore, previous knowledge of the MAE did not predict the observed effect, and upon debriefing, most participants described the music as “video game music,” rather than rising or falling.
Even though we have provided evidence that the perception of metaphoric motion in music may rely on the same perceptual mechanisms by which we process visual motion, this does not necessarily mean that the auditory frequency–vertical space metaphor is innate. Specifically, although the auditory frequency–vertical space metaphor is pervasive throughout Western music, it is by no means the only metaphoric mapping used to understand auditory frequency (e.g., Eitan & Timmers, 2010). Indeed, the prevalent metaphoric mapping in Bali and Java equates auditory frequency with physical size, where “low” frequencies are “large” and “high” frequencies are “small” (Van Zanten, 1986), while the Suyá in the Amazon equate auditory frequency with age, with “low” frequencies being “old” and “high” frequencies “young” (Seeger, 1987). As we assume that the effects that we report are indeed driven by metaphors, we predict that such cultures would not experience a visual aftereffect in the same fashion that we observed.
Indeed, previous studies have demonstrated that musical metaphors—as culturally driven phenomena—are not as strongly developed in children (e.g., Eitan & Tubul, 2010). Yet, other research has provided evidence for a pitch–verticality correspondence in infants (e.g., Walker et al., 2010), suggesting that this metaphoric mapping is not necessarily a byproduct of verbal metaphor or experienced correlations. Further studies should thus address the developmental and cultural components of this effect, as this would help disambiguate the extent to which the present findings were culturally derived.
The results of this study can perhaps best be explained by means of cross-modal associations that, over time, have become internalized and automatized. From the Western musical notation system, which uses vertical space to represent notes, to the common use of pitch changes to metaphorically convey object motion, it is likely that through repeated exposure in the environment, we have come to automatically associate decreasing and increasing musical pitch with descending and ascending objects, respectively. Indeed, the importance of statistical regularities on perception has long been discussed, and it is clear that observers do take advantage of this information (e.g., Barlow, 1959; Craik, 1943; Weiss, Simoncelli, & Adelson, 2002). From a neural perspective, the repeated pairing of specific auditory events with real motion information has possibly resulted in neurons that respond to both real and metaphoric motion in a direction-selective manner.
A related possibility is that the auditory metaphors did not act through an automatized representation, but rather through visual mental imagery. The metaphor may have evoked a visual image in the minds of the observers, which then produced the observed adaptation effects. This is plausible, as a comparable effect of imagery has recently been demonstrated (Winawer, Huk, & Boroditsky, 2010). One way to distinguish these possibilities would be to administer a test of visual mental imagery (e.g., the VVIQ) to participants and see whether visual imagery scores predicted the magnitude of the effect.
Recent work on the role of analog acoustic expression (AAE) in speech comprehension may also be instructive in interpreting these results. Indeed, the first empirical investigation of AAE demonstrated that people modulate the pitch of their voice while speaking in order to analogically convey vertical motion information about an object (Shintel, Nusbaum, & Okrent, 2006). While it is unclear whether music mimics the acoustic patterns in speech, or vice versa, the evidence for this metaphor in speech as well as music demonstrates the strength of this association in Western culture.
Finally, our results fit nicely into the wider recent literature on multimodal interactions between auditory and visual motion. For instance, it has been shown that some synesthetes immediately perceive specific sounds when presented with particular kinds of visual motion (Saenz & Koch, 2008) and that the perceived speed of visual motion influences judgments of auditory tempo (Su & Jonikaitis, 2011). These couplings do seem to go both ways, as according to some reports music can influence visual perception in a top-down fashion, either directly (Hidaka et al., 2011) or indirectly, for instance via emotion (Jolij & Meurs, 2011) or by communicative gestures that have been shown to improve motion perception (Manera, Becchio, Schouten, Bara, & Verfaillie, 2011). Sounds have even been shown to induce motion perception for static displays (Teramoto et al., 2010). As in our study, the meaning of the auditory signals does matter; for instance, in tap dancing, auditory information interacts with visual information in a manner that depends on how synchronized the information sources are (Arrighi, Marini, & Burr, 2009). Taken together, these results suggest a plethora of interactions between the auditory system and the perception of visual motion.
In conclusion, we found a cross-modal MAE in which direction-selective motion adaptation resulted from listening to metaphorically moving musical scales. These behavioral results suggest that extracting motion information from music involves the use of direction-selective motion mechanisms, and is not solely the result of top-down or inferential effects. While it remains unclear whether or not these metaphoric auditory events recruit the same neural networks involved in the processing of auditory and visual motion, these results provide empirical evidence that the processing of metaphoric musical motion shares perceptual mechanisms with the perception of visual motion.
We thank Anders Hogstrom for his support in data collection, and Michael Kubovy for his help and support in the data analysis. This study was supported by a New Directions Fellowship from the Andrew W. Mellon Foundation to B.H. and by NEI Grant No. F32EY019833 to P.W.