Introduction

Cross-modal correspondences are tendencies to match perceptual features from different sense modalities (Deroy & Spence, 2013; Spence, 2011). A famous example is the bouba-kiki phenomenon, in which people associate pseudowords like “kiki” and “takete” with spiky shapes, and pseudowords like “bouba” and “maluma” with curved shapes (Köhler, 1947; Ramachandran & Hubbard, 2001). These nonwords combine multiple consonant and vowel features that contribute to the effect: the voiced labial stop [b], the sonorants [m, l], and the back rounded vowels [u, o] are consistently matched to curvy shapes, whereas the voiceless alveolar and velar stops [t, k] and the front vowels [i, e] are consistently matched to spiky shapes (D’Onofrio, 2014; McCormick et al., 2015; Nielsen & Rendall, 2013). The bouba-kiki effect has been documented across different cultures (Bremner et al., 2013; Styles & Gawne, 2017) and throughout development (Lockwood & Dingemanse, 2015; Maurer, Pathman, & Mondloch, 2006), although the relative importance of the involved stimulus properties can vary with culture (Chen, Huang, Woods, & Spence, 2016) and developmental stage (Chow & Ciaramitaro, 2019). It is also an instance of sound symbolism, i.e., a consistent, non-arbitrary relationship between phonetic and perceptual or semantic elements that establishes sound-meaning association biases in language (Blasi, Wichmann, Hammarström, Stadler, & Christiansen, 2016; Sidhu & Pexman, 2018).

An issue that deserves careful attention is what sorts of speech-sound representations – e.g., auditory, phonological, articulatory, lexical (Hickok & Poeppel, 2007; Monahan, 2018) – drive the bouba-kiki effect. Parise and Spence (2012) suggest that it reflects structural similarities between physical features of auditory and visual stimuli. Spiky versus curved shapes have been found to correspond with higher versus lower pitch (Marks, 1987; O’Boyle & Tarte, 1980; Walker et al., 2010), square-wave versus sinusoidal tones (the former are richer in higher frequencies), and sharp-attack versus gradual-attack musical timbres (Adeli, Rouat, & Molotchnikoff, 2014). Accordingly, “spikiness” in speech sounds has been attributed to the abrupt energy changes of voiceless compared to voiced consonants and to the high second-formant frequency of front vowels; “curviness” has been related to the low-frequency energy introduced by consonant voicing (e.g., by smoothing consonant-vowel transitions) and to the lower second formant of back vowels (Fort, Martin, & Peperkamp, 2015; Knoeferle, Li, Maggioni, & Spence, 2017; Nielsen & Rendall, 2011).

Explanations based on speech-specific rather than general auditory processing have also been offered. In particular, properties of articulatory gestures that mimic jagged or smooth visual contours, such as sharp inflections of the tongue on the palate (Ramachandran & Hubbard, 2001) or lip rounding/stretching (Maurer et al., 2006), are thought to mediate correspondences between speech sounds and shapes. Further, Styles and Gawne (2017) suggest that failures to observe the bouba-kiki effect in speakers of Syuba and Hunjara are due to the tested pseudowords being phonologically illegal in those languages. This would imply that the effect requires sounds to be mapped onto language-specific phonological structures, which seems at odds with “general auditory” explanations and with evidence of sound symbolic effects in preverbal infants as young as 4 months (Ozturk, Krehm, & Vouloumanos, 2013; but see Fort, Weiß, Martin, & Peperkamp, 2013).

The present study aimed to investigate the roles of speech-specific and general auditory processes in the bouba-kiki phenomenon by comparing two conditions: one in which the auditory stimuli are heard as speech, and one in which the same stimuli are heard as non-speech sounds. For this, we used sine-wave speech (SWS), a spectrally reduced form of speech that can be heard as speech or non-speech depending on whether or not the listener attends to the speech-likeness of the sounds. SWS consists of sinusoidal tones (usually three) imitating time-varying properties of vocal-tract resonances (Remez, Rubin, Pisoni, & Carrell, 1981). Due to the absence of the harmonic and broadband formant structure characteristic of natural vocalizations, it sounds quite different from human speech and generally elicits no phonetic perception in naïve listeners. However, through proper instruction, listeners can direct attention to phonetic information in SWS that is sufficient to support perception of the linguistic message (Remez, Rubin, & Pisoni, 1983; Remez, Rubin, Pisoni, & Carrell, 1981). As proposed by Remez and Thomas (2013), while the vocal timbre of natural speech directs listeners’ attention to modulations caused by articulatory gestures, which engages a perceptual organization of the signal into a speech stream, SWS is not sufficient to summon such attentional setting, usually requiring further information, such as instructions. Interestingly, hearing SWS as speech versus non-speech has been found to involve functionally distinct perceptual processes and brain networks (Dehaene-Lambertz, 2005; Khoshkhoo, Leonard, Mesgarani, & Chang, 2018; see also Baart, Stekelenburg, & Vroomen, 2014).

In the current experiment, participants who were trained to hear SWS stimuli as speech were compared to participants who were not informed about the nature of the same stimuli. Comparisons were made in terms of performance on an implicit association task (IAT; Parise & Spence, 2012) and a subsequent explicit cross-modal matching (CMM) task, both used to assess sound-shape correspondences. In both cases, the tested hypothesis was that, while the correspondence effect would occur in both groups due to auditory-visual associations, it would be stronger for the speech-mode group than for the non-speech group, owing to associations at speech-specific processing levels.

Material and methods

Participants

Fifty-four native speakers of Brazilian Portuguese (mean age: 26.4, SD: 5.0; range: 18–34 years; 29 females) participated as volunteers. They all provided written informed consent and reported no history of hearing or neurological problems. The study was approved by the local ethics committee and was conducted in accordance with the Declaration of Helsinki. Regarding the linguistic and cultural background of the participants, it is worth noting that testing languages other than English is beneficial for the field of cross-modal correspondences. We are aware of only one other study on the bouba-kiki effect in speakers of Portuguese (Godoy et al., 2018).

Stimuli

Sine-wave versions of the legal pseudowords maluma and taketa were used as auditory stimuli in the experimental tasks. The choice of these pseudowords was based on Köhler’s (1947) seminal work and subsequent replications of the “maluma-takete” effect. Here, taketa was used instead of takete because the latter would be pronounced as [taketʃɪ] in most variants of Brazilian Portuguese. Three maluma exemplars and three taketa exemplars spoken by a male native speaker of Brazilian Portuguese were recorded. For the preparatory task, one exemplar of each of eight legal pseudowords was recorded. These pseudowords were formed by recombining syllables in maluma and taketa (half with two syllables from the former; half with two syllables from the latter): ketalu, keluma, kemata, luketa, lumata, maketa, maluta, tamalu. For each pseudoword exemplar, a SWS sound composed of three time-varying sinusoids corresponding to the three lower formants of the natural utterance was generated using a script for the software Praat (Boersma & Weenink, 2013) written by Darwin (2003). All SWS stimuli were normalized for equal root mean square intensity (70 dB SPL) and presented binaurally through TDH-39 headphones.
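For illustration, the sketch below shows in R how a sine-wave-speech-like signal can be built by summing three sinusoids that follow formant-frequency contours and then normalizing the result to a fixed RMS value. The sampling rate, duration, formant tracks, and relative amplitudes are placeholders chosen for the example; the actual stimuli were generated with Darwin’s (2003) Praat script.

# Minimal sketch (not the Praat script actually used): build a sine-wave-
# speech-like signal from three hypothetical formant-frequency contours
# and normalize it to a common RMS level.
fs  <- 22050                              # sampling rate (Hz), assumed
dur <- 0.6                                # duration (s), placeholder
t   <- seq(0, dur - 1 / fs, by = 1 / fs)

# Placeholder formant contours (Hz); real contours come from formant tracking
f1 <-  500 + 150 * sin(2 * pi * 3 * t)
f2 <- 1500 + 400 * sin(2 * pi * 3 * t + 1)
f3 <- 2500 + 200 * sin(2 * pi * 3 * t + 2)

# Each sinusoid follows its contour via the accumulated (integrated) phase
phase <- function(f) 2 * pi * cumsum(f) / fs
sws   <- sin(phase(f1)) + 0.5 * sin(phase(f2)) + 0.25 * sin(phase(f3))

# Scale to a target RMS value (the 70-dB SPL playback level is set in hardware)
target_rms <- 0.1
sws <- sws * target_rms / sqrt(mean(sws^2))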

Visual stimuli were six abstract shapes presented on a computer monitor positioned approximately 1.0 m from the participant’s eyes. Three of them were spiky; the others were curved (Fig. 1). The shapes were presented in dark gray within a white rectangle subtending approximately 7.15 × 6.30° of visual angle at the center of the screen, against a black background.
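For reference, the physical size corresponding to these visual angles follows from size = 2 · distance · tan(angle/2); the two-line R computation below assumes the 1.0-m viewing distance stated above.

# Physical size (cm) of a region subtending a given visual angle at 100 cm
deg2cm <- function(theta_deg, dist_cm = 100) 2 * dist_cm * tan((theta_deg / 2) * pi / 180)
deg2cm(7.15)   # ~12.5 cm (approximate width of the white rectangle)
deg2cm(6.30)   # ~11.0 cm (approximate height)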

Fig. 1

Visual and auditory stimuli. (a) The spiky and curved shapes used as visual stimuli in the explicit task. The first and fourth shapes (from left to right) were used in the implicit association task. (b) Spectrograms of an original maluma utterance (left) and an original taketa utterance (right). (c) Sine-wave speech versions of the utterances shown in (b)

Design and procedures

Once participants learn to hear SWS as speech, there is no known way to induce them to revert to a non-speech mode of perception. Hence, a between-participants design was necessary to avoid the confounds that would result from testing participants in a speech-mode condition after a non-speech condition. To induce one group of participants into the speech mode and the other into the non-speech mode, assigning them to different preparatory conditions before the main experimental tasks was unavoidable. Simply telling one group about the speech-like nature of the SWS stimuli used in those tasks could be insufficient to induce most participants into the speech mode (see Remez et al., 1981), whereas informing them specifically about the original maluma and taketa utterances could bias performance. Another possibility would be to present SWS stimuli paired with the original utterances to the speech-mode group while presenting only SWS stimuli to the non-speech group (e.g., Vroomen & Stekelenburg, 2011), in which case the set of auditory stimuli would differ between groups. Here, we prioritized exposing both groups to the exact same stimuli under conditions requiring active listening.

Participants were randomly assigned to two groups (n = 27 each) that differed only in the preparatory task they performed before the main experimental tasks. The speech-mode group performed a pseudoword identification task, while the non-speech group performed a sound localization task. The exact same SWS stimuli were used in both tasks, but only the former directed attention to the speech-like nature of the sounds. We assume that the only way the two preparatory tasks could affect the subsequent tasks differently is by inducing, or failing to induce, the speech mode of perception. After completing the preparatory task, participants performed the IAT followed by the CMM task. In all tasks, the auditory or visual stimulus was presented 1.00 s after the response to the previous trial. Feedback was given on all trials of the preparatory tasks and the IAT. At the end of the experiment, participants were asked to describe what kind of sound they had heard during the tasks and, after that response, whether they had heard the sounds as sequences of spoken syllables. Presentation software (Neurobehavioral Systems, Albany, CA, USA) was used to program and run all tasks.

Preparatory tasks

In both preparatory tasks, the SWS pseudowords ketalu, keluma, kemata, luketa, lumata, maketa, maluta, and tamalu were used as stimuli, each presented twice in a randomized sequence. The sound localization task given to the non-speech group consisted of 16 trials; in each trial, a SWS pseudoword was presented and the participant pressed a button to indicate whether the sound came from the left or the right. On each trial, either the left or the right channel was attenuated by either 10 or 15 dB. Attenuation side and degree were counterbalanced and randomly assigned across trials.
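One way to construct such a trial list is sketched below in R; the exact counterbalancing scheme and the data layout are our assumptions for illustration, not the authors’ code.

# Sketch of a sound-localization trial list: each of the 8 SWS pseudowords
# occurs twice; attenuation side (left/right) and degree (10/15 dB) are
# balanced across the 16 trials and assigned at random.
set.seed(1)
pseudowords <- c("ketalu", "keluma", "kemata", "luketa",
                 "lumata", "maketa", "maluta", "tamalu")
trials <- data.frame(
  pseudoword = rep(pseudowords, times = 2),                # 16 trials
  side       = sample(rep(c("left", "right"), each = 8)),  # 8 left, 8 right
  atten_dB   = sample(rep(c(10, 15), each = 8))            # 8 trials per degree
)
trials <- trials[sample(nrow(trials)), ]                   # random trial order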

The speech-mode group performed a pseudoword identification task. Along with the auditory stimulus, a pair of written pseudowords appeared in the center of the screen (one above the other) in each trial. One of the written pseudowords corresponded to the accompanying SWS stimulus; the other was drawn randomly from the remaining seven written pseudowords. The participant was required to press a button indicating which written pseudoword matched the auditory stimulus.

The auditory stimuli were the same in the two preparatory tasks (including the left-right amplitude differences). While both tasks required active listening to the stimulus set, they critically differed in that only pseudoword identification required mapping sounds onto linguistic units and involved orthographically presented pseudowords – two features that can reasonably be assumed to prime participants to attend to the speech-likeness of the stimuli.

The average proportion of correct responses was 95.5% for the non-speech group (in the sound localization task) and 91.5% for the speech-mode group (in the pseudoword identification task). This difference between groups was significant (Wilcoxon rank sum test: W = 149; p = .013). More importantly, the high proportions of correct responses suggest that participants of both groups had little difficulty in performing the preparatory task and were successfully primed to attend to either sound location or speech-related features.
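This group comparison corresponds to a rank-sum test on per-participant proportions of correct responses; a sketch in R is given below, with simulated proportions that are illustrative only and not the real data.

# Rank-sum (Wilcoxon/Mann-Whitney) comparison of preparatory-task accuracy;
# the per-participant proportions below are simulated for illustration only.
set.seed(2)
prep <- data.frame(
  group        = rep(c("non-speech", "speech-mode"), each = 27),
  prop_correct = c(rbeta(27, 38, 2), rbeta(27, 22, 2))
)
wilcox.test(prop_correct ~ group, data = prep)   # reports the W statistic and p-value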

Implicit association task (IAT)

Two visual shapes (curved and spiky) and two SWS pseudoword exemplars (maluma and taketa) were used as stimuli (Fig. 1). The participant was asked to keep their index fingers resting on two response keys. Each of the 12 blocks of this task comprised three phases: teaching, training, and test. In each of four teaching trials, an auditory or visual stimulus was presented along with an arrow indicating the corresponding response key. An arrow pointing to the left (right), shown in the lower left (right) corner of the screen, indicated that the left (right) key should be pressed in response to the current stimulus. Presentation order was randomized with the constraint that auditory and visual stimuli alternate.

The training and test phases within a block were identical except for the number of trials (eight and 16, respectively) and the instruction, given for the test only, to respond as quickly as possible without sacrificing accuracy. In each trial, an auditory or visual stimulus was presented and the participant had to respond according to the stimulus-response mapping specified in the teaching phase. Presentation order was randomized, with no stimulus presented on two consecutive trials. Accuracy and response time were recorded only during the test phase. There were three blocks for each of the four possible mappings in which one auditory and one visual stimulus were assigned to each response key. The 12 blocks were presented in randomized order. In the six congruent blocks, the sound taketa (maluma) was mapped onto the same response key as the spiky (curved) shape. In the other six, incongruent, blocks, taketa (maluma) was mapped onto the same key as the curved (spiky) shape. A brief pause was allowed between blocks.
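The block structure can be summarized as in the R sketch below; the mapping labels and data layout are ours, for illustration, and do not reproduce the authors’ experiment script.

# Sketch of the 12 IAT blocks: four key mappings, each run in three blocks.
# Mappings that put taketa and the spiky shape (and maluma and the curved
# shape) on the same key are congruent; the other two mappings are incongruent.
mappings <- data.frame(
  left_key   = c("taketa + spiky", "maluma + curved", "taketa + curved", "maluma + spiky"),
  right_key  = c("maluma + curved", "taketa + spiky", "maluma + spiky", "taketa + curved"),
  congruency = c("congruent", "congruent", "incongruent", "incongruent")
)
blocks <- mappings[rep(1:4, each = 3), ]   # 3 blocks per mapping = 12 blocks
blocks <- blocks[sample(nrow(blocks)), ]   # randomized block order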

Cross-modal matching (CMM) task

Six trials were presented in which the participant heard a SWS pseudoword and used a visual analog scale to indicate how well it matched each of two shapes presented side by side on the screen. Unlike traditional binary forced-choice paradigms, this task is sensitive to gradient, subcategorical detail in the stimuli (Schellinger, Munson, & Edwards, 2017) and allows participants to respond “none of the shapes.” Two keyboard keys were used to scroll a gray square along a horizontal white bar at the bottom of the screen, indicating to what degree the participant thought the sound matched the left or the right shape (the middle meaning “indifferent”). Three maluma and three taketa exemplars were used as auditory stimuli. Six curvy-spiky pairs formed from the shapes shown in Fig. 1 were used as visual stimuli. Spiky and curved shapes each appeared an equal number of times (three) on the right and on the left. Presentation order was randomized; three consecutive exemplars of the same pseudoword were not allowed.

Results

In the speech-mode group, most participants reported hearing pseudowords or Portuguese words, whereas most participants in the non-speech group reported hearing whistles and/or “electronic sounds,” with taketa often described as “treble” relative to maluma. Data from four participants of the speech-mode group were excluded, either for giving fewer than 11 correct responses in the preparatory task or for not reporting hearing the auditory stimuli as well-defined sequences of spoken syllables. Data from four participants of the non-speech group were excluded because they reported hearing the auditory stimuli as well-defined spoken syllables.

Implicit association task (IAT)

Mixed-effects models with Group (speech-mode vs. non-speech), Congruency (congruent vs. incongruent), and Stimulus Modality (auditory vs. visual) as fixed effects were used to analyze both accuracy and response-time data. By-participant random intercepts and random slopes for Congruency, Modality, and their interaction were specified. To account for learning/fatigue effects, random intercepts for block (1, 2, … 12) and for trial (1, 2, … 16) nested within block were also included. Response times (for correct responses) between 300 and 3,000 ms were ln-transformed and entered into a linear model. Accuracy was entered as a binary response variable into a logit model (Jaeger, 2008). The main interest was to test the Group × Congruency interaction, since the tested hypothesis was that cross-modal correspondences, as assessed by the effect of Congruency, would be stronger in the speech-mode than in the non-speech group. Fixed-effect coefficients in the “response time” and “accuracy” models are shown in Tables 1 and 2, respectively. Mean response times and accuracy for the two groups in congruent and incongruent blocks are depicted in Fig. 2a and b. To calculate p-values for fixed effects, restricted models – each omitting one model term – were tested against the full model. For the “response time” linear model, this was done using a Type-III ANOVA with Satterthwaite’s approximation for degrees of freedom; for the “accuracy” logit model, likelihood-ratio tests were performed.
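A sketch of the two models in lme4/lmerTest syntax is given below. The data frame iat and its column names are assumptions made for illustration; this is not the authors’ analysis script, although the model formulas follow the specification described above.

library(lme4)
library(lmerTest)   # Satterthwaite degrees of freedom for the Type-III ANOVA

# Assumed data frame 'iat': one row per test trial, with columns rt (ms),
# correct (0/1), group, congruency, modality, participant, block, and trial.

# Response-time model: correct responses between 300 and 3,000 ms, ln-transformed
rt_data <- subset(iat, correct == 1 & rt >= 300 & rt <= 3000)
m_rt <- lmer(log(rt) ~ group * congruency * modality +
               (congruency * modality | participant) +
               (1 | block / trial),
             data = rt_data, REML = TRUE)
anova(m_rt, type = "III")          # Type-III tests with Satterthwaite approximation

# Accuracy model: mixed logit; p-values via likelihood-ratio tests against
# restricted models omitting one term (here, the Group x Congruency term)
m_acc <- glmer(correct ~ group * congruency * modality +
                 (congruency * modality | participant) +
                 (1 | block / trial),
               data = iat, family = binomial)
m_red <- update(m_acc, . ~ . - group:congruency)
anova(m_red, m_acc)                # chi-square likelihood-ratio test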

Table 1 Summary of the fixed effects in the mixed linear model (fitted with REML) on log response times in the implicit association task (7,910 observations)
Table 2 Summary of the fixed effects in the mixed logit model on proportion correct data in the implicit association task (8,636 observations)
Fig. 2

Implicit association task (IAT) performance for auditory and visual stimuli in congruent and incongruent blocks. (a) Mean response times. (b) Proportions of correct responses. The non-speech (green triangles) and the speech-mode (red circles) groups showed similar performance improvements in congruent compared to incongruent blocks. Bars represent standard errors after adjusting values for within-participant designs (Morey, 2008). (c) Left panel: violin plots showing kernel estimates of response time distributions for correct and incorrect responses in the non-speech (green) and speech-mode (red) groups. Boxplots are shown in black. Circles represent the median. Right panel: proportion correct as a function of response time. For illustration purposes, response times were binned. For each participant, modality, and congruency level, bin 1 contains the 25% fastest responses; bin 2 contains the next 25% fastest responses, and so on. The left panel in (c) was created using the vioplot R package (Adler, 2019). All other plots were generated via ggplot2 (Wickham, 2009)

The ANOVA revealed that, as in Parise and Spence (2012), responses were faster in congruent than in incongruent blocks (F(1, 44.3) = 28.42; p < .001) and in visual than in auditory trials (F(1, 43.5) = 115.8; p < .001). The absence of significant interactions involving Congruency and Group (F < 1) indicates that the speech-mode and non-speech groups showed a similar response-time advantage in congruent blocks – and therefore provides no support for the tested hypothesis.

Likelihood-ratio tests revealed significant Congruency (χ2(1) = 13.45; p < .001) and Modality (χ2(1) = 62.96; p < .001) effects, but neither a significant Group × Congruency (χ2(1) = 0.00; p = .99) nor a significant Group × Congruency × Modality (χ2(1) = 0.55; p = .46) interaction, indicating that the congruency effect was similar between groups. Importantly, as can be seen in Fig. 2a and b, the response-time and accuracy measures across conditions are not even numerically consistent with the tested hypothesis, indicating that the present negative results are not due to a lack of statistical power.

As pointed out by an anonymous reviewer, examining the speed-accuracy relation is relevant for the interpretation of the IAT results. Neither group showed a speed-accuracy tradeoff; rather, accuracy decreased with response time (Fig. 2c). After centering and scaling the ln-transformed response times to unit variance, we added a “Response Time” fixed effect and the corresponding by-participant random slopes to the above-described “accuracy” logit model (see Davidson & Martin, 2013). Likelihood-ratio tests revealed significant main effects of Response Time (χ2(1) = 49.00; p < .001), Congruency (χ2(1) = 8.91; p = .003), and Modality (χ2(1) = 25.07; p < .001). No other model term reached or approached significance. In particular, the absence of a significant Response Time × Group interaction (χ2(1) = 0.13; p = .72) suggests that the decrease in accuracy with response time was similar between groups.
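In terms of the accuracy model sketched earlier, this amounts to adding a scaled log response-time predictor and its by-participant slope; the fixed-effect interaction structure below is an assumption, again using the hypothetical iat data frame.

# Speed-accuracy analysis: add centered/scaled ln(RT) to the accuracy model
# (interaction structure assumed; not the authors' exact specification)
iat$rt_z <- as.numeric(scale(log(iat$rt)))
m_sat <- glmer(correct ~ rt_z * group * congruency * modality +
                 (rt_z + congruency * modality | participant) +
                 (1 | block / trial),
               data = iat, family = binomial)
anova(update(m_sat, . ~ . - rt_z:group), m_sat)   # LRT for the Response Time x Group term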

Cross-modal matching (CMM)

Responses were coded as values from -1 to +1 such that positive and negative values represent, respectively, “spiky” and “curved” responses. Boxplots in Fig. 3 represent the responses of the two groups to the two pseudowords. For both groups, the SWS pseudowords taketa and maluma appear consistently associated with spiky and curvy shapes, respectively. However, the separation between responses to taketa and to maluma is clearer in the speech-mode than in the non-speech group, suggesting a stronger sound-shape association in the former. Since the response variable was bounded, a nonparametric ANOVA was conducted on aligned rank transformed data (Wobbrock et al., 2011) with Pseudoword (maluma vs. taketa) and Group as fixed effects and Participant as a random effect. A significant interaction was found between Group and Pseudoword (F(1, 228) = 18.11; p < .001), reflecting the stronger sound-shape association in the speech-mode group and, therefore, supporting the tested hypothesis. Separate (Bonferroni-corrected) ANOVAs for each group revealed highly significant effects of Pseudoword for both the non-speech (F(1, 114) = 66.74; p < .001) and the speech-mode group (F(1, 114) = 209.71; p < .001).
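This analysis can be sketched with the ARTool R package as shown below, assuming a hypothetical data frame cmm with one row per trial (response in [-1, 1], pseudoword, group, participant); it is not the authors’ script.

library(ARTool)

# Aligned-rank-transform ANOVA on the CMM responses ('cmm' and its columns
# are assumed for illustration)
cmm$pseudoword  <- factor(cmm$pseudoword)
cmm$group       <- factor(cmm$group)
cmm$participant <- factor(cmm$participant)
m_art <- art(response ~ pseudoword * group + (1 | participant), data = cmm)
anova(m_art)   # F tests for Pseudoword, Group, and their interaction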

Fig. 3

Boxplots for responses of the speech-mode and non-speech groups in the explicit task. Positive values represent “spiky” responses; negative values represent “curved” responses. Both groups consistently associated pseudowords maluma and taketa with curved and spiky shapes, respectively. However, this association was more pronounced for the speech-mode group

Discussion

Both groups showed the expected sound-shape correspondence effects in both tasks, confirming that the bouba-kiki effect does not require that listeners perceive sounds as speech. In the IAT, correspondence effects were not affected by whether listeners were in the non-speech or in the speech-mode group. This provides no support for the hypothesis of an important role of speech-specific processing in the bouba-kiki effect. In the explicit CMM task, the correspondence effect was significant for both groups, but stronger for the speech-mode group, indicating that speech-specific processing did affect sound-shape correspondences. Thus, the latter task seems to tap into aspects of shape-sound correspondences that are not reflected in IAT performance. However, no specific hypothesis had been advanced regarding differences between tasks. Particularly, the present study was not designed to compare implicit versus explicit processing. Thus, this aspect of our results requires cautious interpretation.

The observed correspondence effects for SWS stimuli that were not heard as speech can be explained by associations of visual contours with auditory stimulus attributes, rather than with speech-specific representations of articulatory gestures or abstract phonological units. Consistent with findings on non-verbal sound-shape correspondences (Adeli, Rouat, & Molotchnikoff, 2014; Liew, Lindborg, Rodrigues, & Styles, 2018; Marks, 1987; O’Boyle & Tarte, 1980; Parise & Spence, 2012; Walker et al., 2010), it has been conjectured that differences in frequency content and waveform envelope are key to the bouba-kiki effect (Fort et al., 2015; Nielsen & Rendall, 2011). Regarding the taketa-maluma pair, the vowel [e] sounds brighter than [u] due to its higher formant frequencies – the second-formant frequency has been identified as a major contributor to sound-shape correspondences (Knoeferle et al., 2017) – and the energy changes associated with the voiceless obstruents [t, k] are sharper than those associated with the sonorants [m, l] (see Fig. 1b and c). Indeed, without syllables with which to describe the stimuli, participants in the non-speech group reported hearing “taketa” as treble relative to “maluma”. Of note, the present findings show that speech stimuli reduced to the three lower formant contours, which excludes the fundamental frequency and noise-related features, are sufficient for the bouba-kiki effect to occur. Further experimental research manipulating and controlling for multiple acoustic and visual features (as recently reported for sound-color correspondences; Anikin & Johansson, 2019) is necessary to refine the picture of the sensory dimensions involved in sound-shape correspondences.

Hearing SWS stimuli as speech did not increase the correspondence effect in the IAT, suggesting that speech-specific processing plays little or no role in sound-shape correspondences as assessed by this task. This is consistent with Parise and Spence’s (2012) interpretation of IAT results on five types of audiovisual correspondences, involving both speech and non-speech sounds, as reflecting a single automatic mechanism that deals with associations between auditory and visual features. Of course, one could entertain the less parsimonious possibility that distinct mechanisms underlie the statistically equivalent correspondence effects for the non-speech and speech-mode groups, and that a putative language-related mechanism alone could account for the effect in the latter group. Although this possibility cannot be ruled out, it finds no support in IAT results – no main effect or any interaction involving the factor Group was found. Not only did the groups not differ significantly in response time and accuracy, but they also both showed a similar decrease in accuracy with increasing response time. This also indicates that IAT performance was not appreciably affected by whether the participant had previously performed the sound localization task (non-speech group) or the pseudoword identification task (speech-mode group), and hence that SWS stimuli were as distinguishable for participants of one group as for participants of the other.

In the CMM task, hearing the stimuli in speech mode seems to strengthen participants’ grasp of cross-modal compatibilities. Thus, the two tasks employed here seem to be differentially sensitive to distinct mechanisms underlying the bouba-kiki phenomenon. In the related but distinct field of intersensory integration, studies indicate that listening to SWS stimuli as speech is crucial for the integration of visual (lipread) and auditory information in sound identification tasks, but has no effect on the visually enhanced detection of SWS in noise (Eskelund, Tuomainen, & Andersen, 2011) or on judgments of synchrony and temporal order between SWS and lipread stimuli (Vroomen & Stekelenburg, 2011). This has been interpreted as evidence for two distinct mechanisms that improve speech intelligibility via audiovisual integration: one based on speech-specific content in auditory and visual signals, and a more general mechanism that exploits cross-modal covariation to enhance the auditory signal-to-noise ratio. Something analogous may occur with cross-modal correspondences between speech sounds and shapes – i.e., both speech-specific and general perceptual mechanisms may contribute to them. Possibly, while the IAT taps auditory-visual associations shared by cross-modal correspondences involving speech and non-speech sounds, CMM results reflect those associations combined with mappings between shapes and higher-level, speech-specific representations, which might include articulatory features (Maurer et al., 2006; Ramachandran & Hubbard, 2001), language-specific phonological units (Shang & Styles, 2017; Styles & Gawne, 2017), and the corresponding orthographic forms (Cuskley, Simner, & Kirby, 2017). This is consistent with the idea that sound-symbolism effects in adults result from both pre-linguistic and language-related biases (Ozturk et al., 2013).

While the IAT provides indirect performance measures of task-irrelevant associations reflecting presumably automatic processes (Greenwald, Poehlman, Uhlmann, & Banaji, 2009; Parise & Spence, 2012), CMM relies on introspective reports. We speculate that the requirement to judge how well pseudowords and shapes match led participants of the speech-mode group to consider similarities based on speech-specific representations in addition to basic auditory qualities. However, CMM and the IAT differ in many other ways. The IAT requires participants to keep the stimulus-response mapping for the current block in memory and to respond correctly and quickly to each unimodal (either visual or auditory) stimulus; feedback was provided to keep performance at reasonable levels, and multiple trial blocks were necessary to assess correspondence effects. The CMM task is much less effortful: correspondences can be accessed directly in a few trials, each of which contains both visual and auditory stimuli; responses are not speeded, and there is no sense in classifying them as correct or incorrect. Moreover, it is not clear how performing the IAT could affect CMM, which was always the last task to be performed in order to avoid directing attention to sound-shape correspondences before the IAT.

The present findings warrant future studies designed specifically to investigate which features of different tasks are associated with their sensitivity to different mechanisms underlying sound symbolism. To test whether the contribution of speech-specific processing is indeed contingent on the explicit assessment of correspondences, one would need to devise a more closely comparable “implicit-explicit” pair of tasks. Also of interest would be a replication of the present study using the implicit speeded classification task of Evans and Treisman (2010), which detects correspondence effects at the perceptual level rather than at the level of response selection, as the IAT does (Parise & Spence, 2012). Testing different instances of sound symbolism under different task requirements can be of great help in elucidating the nature of the representations and processes involved.