Cross-modal associations have been proposed as a key mechanism underlying sound-symbolic phenomena relevant to language learning and development (Dingemanse, Blasi, Lupyan, Christiansen, & Monaghan, 2015; Imai & Kita, 2014) as well as to language evolution (Cuskley & Kirby, 2013). Yet, the relationship between cross-modal associations and the much rarer phenomenon of synesthesia is still an open area of study. Understanding the relationship between synesthesia and cross-modal associations is a key step in understanding the mechanisms underlying both phenomena, and in particular, their strong relationships to learned categories and language (Cuskley & Kirby, 2013; Simner, 2007). Synesthesia is a relatively rare phenomenon, occurring in approximately 5% of the population (Simner et al., 2006), whereas cross-modal associations are much more widespread. Therefore, capturing the relationship between these phenomena requires examining large samples of participants. This article reports a large-scale study examining the associations between vowel sounds, graphemes, and colors, collected online from over 1,000 participants, simultaneously revealing salient patterns in cross-modal associations and identifying a couple hundred synesthetes in the sample. We begin with an overview of previous studies that have contrasted the associations of synesthetes and nonsynesthetes, with particular relevance to linguistic cross-modal phenomena.

Synesthesia and cross-modality

Synesthesia, from the Greek syn- (together) and -aisthes (feeling), is a phenomenon wherein a stimulus in one sensory modality (known as the inducer; e.g., a sound) elicits a response not only from that modality (e.g., hearing the sound) but also in another (e.g., seeing the sound, known as the concurrent). Synesthesia is an involuntary, automatic sensory experience and can occur in many forms, ranging from shaped tastes to smelled colors. Although about a hundred distinct forms of synesthesia have been documented, over 88% of attested forms are linguistic in nature (Simner et al., 2006), with the most widely reported and well-studied form being colored graphemes. In recent years, the study of forms of synesthesia with linguistic inducers has shed light on language processing more generally, showing distinct semantic and phonological effects (Asano & Yokosawa, 2011, 2012; Simner, 2007).

Unlike synesthesia, cross-modal associations are not involuntary or automatic, but nonetheless demonstrate strong relationships between sensory modalities. For example, people reliably associate brighter colors with higher pitch and darker colors with lower pitch (Martino & Marks, 2001). Cross- modal associations are often elicited explicitly (e.g., by asking participants to match a sound to a taste; Simner, Cuskley, & Kirby, 2010), though they have also been demonstrated in implicit contexts (e.g., Ward, Huckstep, & Tsakanikos, 2006). Cross-modal associations are widespread and often shared across a population, though some may vary cross-culturally (Dolscheid, Shayan, Majid, & Casasanto, 2013; Styles & Gawne, 2017).

Despite the commonalities between synesthesia and cross-modal associations, relatively few studies have directly contrasted the two phenomena. In large part, these studies have shown that cross-modal associations in nonsynesthetes show commonalities with inducer–concurrent patterns in synesthetes. Here we focus on a few studies especially relevant to linguistic forms of synesthesia.

An important early study tackling the similarities between synesthesia and cross-modal associations was Simner et al. (2005), which tested associations between colors and graphemes (A–Z as well as 0–9) among English- and German-speaking synesthetes and nonsynesthete controls. Using both forced choice (choosing colors from an array) and open answer (name the color that best matches the grapheme) methodologies, they showed that synesthetes and nonsynesthetes shared many trends in their choices. For instance, both overwhelmingly identified the grapheme A as red (cf. Root et al., 2018; Rouw, Case, Gosavi, & Ramachandran, 2014, for recent cross-linguistic confirmations). Using temporally spaced testing (one to three weeks for controls, and two to four months for synesthetes), they also showed that synesthetic participants were far more temporally consistent in their color choices (92%) than controls (32%), despite the longer time interval for testing synesthetes. This study marked the first large-scale demonstration that synesthetes and nonsynesthetes share trends in cross-modal mappings, and it reinforced the idea that temporal consistency is an important feature of synesthesia.

Although graphemes may be the best-known linguistic inducer of synesthesia, it has long been known that phonemes can also play a role, though this remains underexplored (Simner, 2007). In a study of cross-modal associations between vowel sounds and colors, Moos, Smith, Miller, and Simmons (2014) showed that the acoustic properties of F1 and F2 in vowels were significant predictors of color choices in both synesthetes and nonsynesthetes, although more extremely in the former group. F1 and F2 are vowel formants independent of voice pitch (although they are also measured in Hertz), which vary as a result of changing how the vocal tract filters the source sound provided by the vibrating vocal folds. Although not entirely deterministic, changes in F1 and F2 vary predominantly with tongue position in vowel production. The value of F1 is lower when the tongue is higher in the mouth (i.e., high vowels; e.g., the /i/ in beet), and higher when the tongue is lower in the mouth (i.e., low vowels; e.g., the /a/ in bot). The value of F2, on the other hand, is lower when the tongue is farther back in the mouth, and higher when the tongue is farther forward.Footnote 1 Moos et al. used 16 synthesized vowels that spanned the F1–F2 space and tested 11 English-speaking synesthetes and 20 English-speaking controls. Over hundreds of trials, participants responded to each vowel sound multiple times by choosing one of 16 color swatches, which varied in the dimensions of lightness, green–red, or yellow–blue. They found generally that lower values of F1 (i.e., higher vowels) were greener and yellower, but lower values of F2 (i.e., more back vowels) were more red and blue, and that these trends were much stronger for synesthetes than for controls. Moos et al. concluded that acoustic factors play a privileged role in vowel–color associations over graphemic factors. Similar acoustic-led trends in color choices have been found among nonsynesthetic English–Korean bilinguals (Kim, Nam, & Kim, 2018).

The goal of the present study was to examine the relationship between cross-modal associations and synesthesia in the domain of vowel–color associations among a large online sample of Dutch speakers. Using the same auditory stimuli as Moos et al. (2014), we make several novel contributions to our understanding of cross-modal associations and synesthesia. We aimed to (i) replicate prior results in a larger sample using more fine-grained color measures, (ii) test the novel question of whether categorical perception plays a role in shaping vowel–color associations, and (iii) contribute a new measure of structural isomorphism in cross-modal associations.

In the first instance, we aimed to replicate and extend the results reported in Moos et al. (2014). We used a larger sample (n = 1,164) of Dutch speakers and a more fine-grained color response space. Within our sample, we used consistency measures adapted from Rothen, Seth, Witzel, and Ward (2013) to identify synesthetes. Large-scale, online approaches are essential for the study of cross-modality and synesthesia, for several reasons. The large sample allowed us to identify synesthetes behaviorally and to sample more randomly from the population (as opposed to targeted recruitment of synesthetic volunteers, as in, e.g., Baron-Cohen, Burt, Smith-Laittan, Harrison, & Bolton, 1996). This, in turn, provided us with a larger cohort of synesthetes and nonsynesthetes to compare, a distinct advantage for understanding how synesthesia and cross-modality function across populations.

Large sample sizes have been particularly important in earlier studies that have upended our understanding of both the overall prevalence of synesthesia and the skewed sex ratios in the phenomenon. Studies such as Simner et al. (2006) showed that most earlier estimates of the prevalence of synesthesia were off by several orders of magnitude. Early volunteer-recruited studies with small samples had such high F:M ratios that some have suggested that synesthesia might even be X-linked (Baron-Cohen et al., 1996; Smilek et al., 2002). However, recent work with larger, random samples has shown this to be largely an artifact of the volunteer recruitment strategies in earlier studies (Simner & Carmichael, 2015). In short, since the variation across the population is a key issue in the study of cross-modality and synesthesia, a large-scale online approach is essential for a more complete understanding of the phenomena.

The large-scale online approach also presents some challenges. Tasks generally need to be shorter in order to maximize the number of participants completing the task, and the circumstances of participation are less controlled than in a lab setting. Unlike in Moos et al. (2014), the task reported here consisted of only three trials per vowel item, and we expected that our data might be noisier as a result of less controlled participation conditions. To the extent that the methods introduced noise, this should stack the deck against our hypotheses of finding acoustic, categorical, and cross-modal structure in vowel–color associations, rendering any findings of structure more robust. Also, as in other online experiments in behavioral science, we expected increased noise in the data to be offset by the considerable advantages of having such a large sample (Crump, McDonnell, & Gureckis, 2013).

A second aim of our study was to look beyond acoustic or graphemic factors and examine how vowel phoneme category might influence associations in both synesthetes and nonsynesthetes. While we expected to replicate Moos et al.’s (2014) results regarding acoustic factors, a key focus of our study was the role of categorical perception. We hypothesized that since vowel perception is predominantly categorical (Rosner & Pickering, 1994), vowel category may be a better predictor of color choices than are acoustic factors. We predicted that acoustic factors would be a good predictor of color choices insofar as changes in acoustic structure correlated with changes in vowel category. Given the prevalence of grapheme–color synesthesia, the best-approximated vowel grapheme might also form a strong predictor, especially for participants who are grapheme–color synesthetes. However, grapheme category might also play an influential role for nonsynesthetes, since auditory information can automatically invoke graphemic form (Cuskley, Simner, & Kirby, 2017; Ziegler & Ferrand, 1998).

Finally, we presented a novel measure of isomorphic structure in cross-modal associations using the Mantel test (Mantel, 1967). We argue that this measure forms a useful complement to existing consistency tests used to identify synesthetes, especially in large samples. This measure quantifies the extent to which an individual’s associations are isomorphically structured (i.e., similar sounds are matched with similar colors) or unstructured (i.e., similar sounds are matched with dissimilar colors). We used this measure to show that while the specific nature of cross-modal mappings may exhibit considerable individual variation, the capacity for making structured mappings is shared to a much broader extent in the population. This new measure of structure in cross-modal associations provides a promising way to probe structural isomorphism across domains or sensory modalities generally.



A total of 1,164 adult participants volunteered to take part in an online vowel–color association task as part of a larger survey on cross-modal associations advertised in the Dutch national press, popular media, and Dutch national TV as the “Groot Nationaal Onderzoek” (“Large National Survey”; van Leeuwen & Dingemanse, 2016). This is an annual public engagement initiative of the national broadcaster NTR, aiming to actively involve the general public in research. The funding attached to this initiative enabled us to design and develop a dedicated web application, and widespread advertisement in the national media facilitated the large sample size.

Participants provided informed consent prior to participation and were not required to provide any personal information. About 85% of participants provided some information in a voluntary pretask survey that could be answered anonymously. The reported age range was 18–88 years old (median = 46, SD = 16); reported gender was 667 female, 206 male, and 282 who did not select a gender.Footnote 2 A subset of 398 participants additionally carried out a grapheme–color association task.


The stimuli in the vowel–color association task were 16 vowel sounds selected to represent points spread through acoustic vowel space. Moos et al. (2014) reported the method for creating these stimuli as follows:

Recordings of the eight primary cardinal vowels were made in a high-specification sound studio. . . . To create a richer vowel continuum, eight intermediate vowels were made by morphing each neighbouring pair. . . . The 16 vowels were adjusted in intensity (to 80 dBSPEL) and duration (to 1049 ms, the mean duration of the original stimuli) using Praat’s PSOLA function Boersma and Weenink (2011). F0 [which corresponds to voice pitch] varied minimally, from 120 to 124 Hertz, and was not equalised. (Moos et al., 2014, p. 134)

The stimuli in the grapheme–color association task were the characters A–Z (capital) and 0–9 presented in black sans-serif capitals on a gray background. In each task, items were presented in randomized order and each item occurred three times, making for a total of 48 trials in the vowel task and 108 trials in the grapheme task. Consistency scores across trials per item and task were calculated in order to identify synesthetes in the sample (described in further detail below).

Color responses were recorded using an RGB color picker following widely used methods in online test batteries of synesthesia (Eagleman, Kagan, Nelson, Sagaram, & Sarma, 2007; Rothen et al., 2013). Even though RGB values are device-specific, data collected using RGB color pickers are robust enough for the detection of synesthesia, especially when converted to the more perception-veridical CIE color space (Rothen et al., 2013) and analyzed in terms of relative distances. For the model analyses below, we converted the RGB color responses to CIELuv space (as in Moos et al., 2014) using the standard illuminant D65, particularly in order to use the CIELuv-based consistency measures from Rothen et al.’s study.


Both association tests were available online. To maximize participation while assuring high-quality data, we piloted the tasks across multiple platforms, making sure they would run smoothly across devices without loss of functionality or data. The instructions were kept as clear and concise as possible and were similar to those provided for similar tasks in the lab and on other synesthesia-screening websites. Full code for the web application used to collect the data is available at

For the vowel–color task, the instructions were as follows: “In this test you associate colors to sounds. You hear a sound, then choose a color that you feel fits best. Synesthetes always see the same color for the same sound: is that the same for you? Try to respond as intuitively as possible.” For the grapheme–color task, the instructions were as follows: “In this test you associate colors to letters in a precise way. You see a letter or digit, then choose a color that you feel fits best. Synesthetes always see the same color for a letter or digit. Does this hold for you, too? Try to respond as intuitively as possible.”Footnote 3

In the vowel–color task, participants first were presented with a brief sound test to check their audio volume. Next, the 16 vowel items were presented in a random order three times, for a total of 48 trials. Each trial began with the audio file autoplaying. After the file had played once, participants could relisten to each audio file as many times as they wanted before choosing a color. In the grapheme–color task, the 36 items (the letters A–Z and digits 0–9) were presented in a random order three times, for a total of 108 trials. In both tasks, participants clicked on the “Next” button after choosing their desired color in order to continue to the next trial. The interface for the experiment during a vowel trial is shown in Fig. 1.

Fig. 1
figure 1

Screenshot of the interface for the vowel–color association task

In both tasks, the RGB spectrum of the color picker was randomly shifted on the horizontal axis after each trial, making it difficult to achieve high consistency just by clicking in the same region of the color picker (i.e., making a spatial association). A separate bar allowed participants to adjust the lightness of the chosen color. On each trial, it was possible to choose “no color.” This option was used in less than 5% of trials in the vowel color task, and in around 11% of trials in the grapheme–color task.

Data preparation

The test data were logged locally in the participant’s browser and transmitted to an SQL database on a secure server upon task completion. The results were postprocessed offline in R. To safeguard against possible duplicate data entries from the same individuals, responses from the same email address or same name, or from the same IP address without differential name or email address information, were removed. Overall, 11 completed tests were detected as duplicates and removed. Immediately after these quality control steps, the data were anonymized, keeping a random identifier as the sole link between the anonymized demographic metadata and test results.

The color responses in both tasks were converted to CIELuv values as in Moos et al. (2014), using Python’s colormath package (Taylor, 2017). CIE spaces are generally preferable to RGB space because they are based on how people perceive color (rather than, e.g., how a computer should render it). CIELuv distances in particular have been shown to be more accurate in detecting synesthetes (Rothen et al., 2013) than RGB distances (Eagleman et al., 2007). In CIELuv space, L corresponds to lightness, u corresponds to a green–red continuum, and v corresponds to a blue–yellow continuum.

Following Rothen et al. (2013), we calculate a consistency score for each participant as follows. First, we calculated d (Eq. 1) as the sum of the paired Euclidean distances in CIELuv space between the three color choice trials for each vowel or grapheme (see the equation below, adapted from Eq. 2 in Rothen et al., 2013, p. 158). The d values for each item were used to create a mean c across items for each participant, which served as their CIELuv consistency score. Following Rothen et al., we defined synesthetes as those with consistency c < 135.3.

$$ d={\sum}_{\left(i=1,2,3\right)}\sqrt{{\left({L}_1-{L}_2\right)}^2+{\left({u}_1-{u}_2\right)}^2+{\left({v}_1-{v}_2\right)}^2}, $$

A total of 34 participants were removed from the sample because they chose “No color” for more than half of the items in the vowel association task, making it impossible to calculate a valid vowel consistency score. In the remaining participants, we identified 365 vowel–color synesthetes, with a mean c = 95.60 (SD = 28.00, 95% CI = 2.88). The nonsynesthetes had a mean of c = 224.33 (SD = 63.47, 95% CI = 4.5). Figure 2 shows the overall distribution of consistency scores, with the dashed red line indicating the consistency cutoff used in Rothen et al. (2013).

Fig. 2
figure 2

Density plot of all consistency scores for vowel–color associations. The red dashed line indicates the cutoff for synesthesia from Rothen et al. (2013), with all participants to the left of the line being classified as vowel–color synesthetes

Since we lacked grapheme consistency data for much of the sample, the participants who completed the grapheme association task but did not provide data sufficient for consistency calculations were retained in the sample and classified as unknown for grapheme color synesthesia, along with participants who did not complete the grapheme task. The grapheme synesthetes had a mean c = 82.45 (SD = 38.91, 95% CI = 9.21) on the grapheme task, a value only slightly lower (higher consistency) than that reported by Rothen et al. (2013, p. 160; c = 85.51, SD = 58.27, CI not reported). Nonsynesthetes had a mean c = 243.19 (SD = 64.83, 95% CI = 10.03), a value slightly higher than the one reported by Rothen et al. (p. 160; c = 219.38, SD = 68.87, CI not reported).

Table 1 shows the synesthetic status of the 1,130 participants based on the c cutoff value in each task. The rate of synesthesia in our sample, for both grapheme–color and vowel–color synesthesia, is significantly higher than would be expected in a random sample (e.g., Simner & Carmichael, 2015; Simner et al., 2006). This is likely because the task was specifically advertised for people interested in cross-sensory associations and synesthesia, resulting in higher participation rates of synesthetes than in a truly random sample (Dingemanse & van Leeuwen 2015).

Table 1 Synesthetes and nonsynesthetes in the sample, as defined by the consistency score cutoffs reported in Rothen et al. (2013)

Although the overall prevalence of synesthesia in our sample was likely skewed, we did not find a significantly higher proportion of female synesthetes, though we did have more female participants overall (Fig. 3). This echoes recent studies showing that the early reports of much higher rates of synesthesia in females were likely due to sampling methods that relied on targeted recruitment of self-reported synesthetes (Simner & Carmichael, 2015), as contrasted with the present study, which invited participants interested in cross-modality associations generally. The use of consistency scores to automatically identify synesthetes without self-report may also have resulted in false positives, an issue we will address in part with our structure measure.

Fig. 3
figure 3

Rates of synesthesia, according to consistency scores, among female (F), male (M), and other (N.S., not selected) participants in our sample. The inset shows the proportions for each group, showing that despite the greater number of females overall, the rates of synesthesia among the three groups were comparable

To examine the role that categorical perception might play in respondents’ color choices relative to acoustic measures, we used the F1 and F2 values of the vowel stimuli provided by Moos et al. (2014) as our acoustic predictor. For categorical phoneme perception, each of the stimuli was categorized as a Dutch vowel phoneme. Using F1–F2 values from Dutch vowels taken from Adank, van Hout, and Smits (2004), each stimulus was categorized as an instance of its nearest Euclidean neighbor in F1–F2 space, on the basis of the overall mean F1–F2 values for each monophthong Dutch vowel.

Figure 4 shows each Dutch vowel phoneme plotted in F1–F2 space. We used a Voronoi tesselation (dashed lines), such that every point within each cell is closer to the central phoneme of the cell than to any other phoneme (measured in Euclidean distance). Vowel stimuli were categorized according to the cell they fell into, detailed in Table 6 in Appendix 1, which also details their F1–F2 values and grapheme category (according to the general Dutch orthography conventions from Nunn, 2006, p. 15).

Fig. 4
figure 4

Canonical Dutch monophthong vowels plotted in F1–F2 space, taken from Adank et al., 2004. The dashed lines indicate a Voronoi tesselation: Each cell contains all points that are closest to the canonical vowel within that cell. Thus, the vowel stimuli (numbered and indicated by black dots) were classified according to their nearest phoneme neighbor in Euclidean space


Model analyses

We used linear models to analyze participants’ responses in each dimension of the CIELuv space, adopting a mixed-effect approach to account for repeated measures, with random effects for participants and trial. Where individual fixed effects or interactions did not significantly improve the model fit, they were dropped, using the step() function from the lmerTest package (Kuznetsova, Brockho, & Christensen, 2018). All of the models reported below have significantly improved fit over a null model or simpler alternatives. The models were analyzed in R using the lme4 package (Bates, Maechler, Bolker, & Walker, 2013); F and p values were estimated using Straitherwaite approximations and the anova() function in the lmerTest package (Kuznetsova et al., 2018).

We examined the continuous predictors of F1 and F2 vowel formants, as well as the categorical predictors of vowel phoneme category and grapheme category, summarized in Table 2. We compared models with different predictors using the sem.model.fits function in the piecewiseSEM package (Lefcheck, 2016), which compares Akaike information criterion (AIC) values across models and estimates the marginal (fixed effect) and conditional (fixed + random effect) R2 values for each model. Although phonemic, graphemic, and acoustic factors all play roles, we found that vowel phoneme category is the best overall predictor of color choice, and that grapheme category accounts for more variation than acoustic factors. Details of the grapheme models are included in Appendix 2, but they will not be discussed at length here, since vowel category is a better predictor of responses. In the following analyses, we focus first on acoustic factors for a comparison with earlier work, followed by further analyses of vowel phoneme category.

Table 2 Marginal and conditional R2 values for each predictor in each dimension of color space

Acoustic factors

Figure 5 shows the results in color space in the same way as in Moos et al. (2014). As in Moos et al.’s study, we found that synesthetes chose a generally wider range of colors, but the general shape of the mapping between vowels and colors was shared between synesthetes and nonsynesthetes. Figure 6 shows more exact color values chosen in each dimension for each item, plotted in F1–F2 space.

Fig. 5
figure 5

Vowels plotted in color space for nonsynesthetes (left) and synesthetes (right), after Moos et al. (2014). Note that the colors represented here are slightly idealized, for comparison with the plots in Moos et al. (2014), and the lightness choices indicated are an approximate gradient interpolation between the lightest and darkest values chosen (i.e., the midrange values are not exact). For more precise color and lightness choices, see Fig. 6

Fig. 6
figure 6

Mean L (top left), u (top right), and v (bottom) color values chosen for vowel items, as plotted in F1–F2 space for synesthetes (squares) and nonsynesthetes (circles). Axis labels indicate vowel quality (vowel height on the F1 axis and front/back vowels on the F2 axis). For each color scale, the values shown reflect the mean for the item in the relevant dimension, with the other dimensions held at middle values (the middle values for L, u, and v, are 50, 0, and 0, respectively)

In the formant models described below, F1, F2, and synesthetic status were tested as fixed-effect predictors of L, u, and v. The models tested interactions with each acoustic predictor and synesthetic status (but not between acoustic predictors) as fixed effects, after Moos et al. (2014). Below we describe the results for L, u, and v, respectively.

For acoustic factors, more variance was accounted for in lightness choices (~ 15%, marginal R2 = .155) than in the u and v dimensions, where the fixed effects accounted for less than 10% of variation (u, red–green, marginal R2 = .017; v, blue–yellow, marginal R2 = .08).

Participants generally chose lighter colors for more front vowels and darker colors for more back vowels (Fig. 6, top left), reflected by a strong significant main effect of F2 in the model (Table 3). The model also showed that participants chose slightly lighter colors for lower vowels. Synesthetes were likely to choose slightly darker colors overall, but an interaction between F2 and synesthetic status showed that synesthetes tended to choose slightly lighter colors for front vowels than nonsynesthetes. On the other hand, the interaction between F1 and synesthetic status showed that synesthetes chose slightly darker colors for lower vowels (higher values of F1) than nonsynesthetes. Although the synesthete/nonsynesthete contrasts are strongly significant, they are difficult to detect in Fig. 6, since the color swatches reflect means of choices from a large, continuous color palette instead from a predefined set of 16 colors, as in Moos et al. (2014). For the specific estimates and p values of the fixed effects, see Table 3. For examples of the individual response patterns, see Fig. 10 below.

Table 3 Model: L ~ F1 + F2 + vSyn + F2 × vSyn + F1 × vSyn + (1 | Participant) + (1 | Trial)

In terms of the u (green–red) scale (Fig. 6, top right), acoustic factors accounted for the least amount of variation in the u (green–red) scale, with less than 2% of the variation in responses being predicted by variation in F1, F2, and synesthetic status. Participants generally chose redder colors for lower vowels (higher F1) and more back vowels (higher F2). There was no main effect of synesthetic status in this dimension; however, there were significant interactions with F1 and F2. Synesthetes chose colors that were greener for low and back vowels than did nonsynesthetes. The detailed estimates, along with F and p values, are outlined in Table 4.

Table 4 Model: u ~ F1 + F2 + vSyn + F2 × vSyn + F1 × vSyn + (1 | Participant) + (1 | Trial)

On the blue–yellow (v) dimension (Fig. 6, bottom), the interaction between F1 and synesthetic status was dropped, since it did not improve model fit. Participants preferred yellower colors for high (low F1) and front (high F2) vowels. Synesthetes preferred bluer colors overall; however, they chose significantly yellower colors for front vowels than did nonsynesthetic participants, reflected in the significant interaction between F2 and synesthetic status. The detailed estimates, along with F and p values, are outlined in Table 5.

Table 5 Model: v ~ F1 + F2 + vowelSyn + F2 × vowelSyn + (1 | Participant) + (1 | Trial)

Phonemes and graphemes

In the categorical predictor models described in this section, phoneme (and grapheme) category and synesthetic status were fixed-effect predictors of L, u, and v. As with acoustic predictors, these models also tested interactions between phoneme (and grapheme) category and synesthetic status as fixed effects. For grapheme category, grapheme synesthete status was used as the synesthetic predictor instead of vowel synesthete status. Since vowel category was a better predictor of responses, we will not detail the grapheme results here, but they are provided in Appendix 2. The detailed results of the vowel models are provided below.

For vowel category, we observed significant main effects of this variable in all three color dimensions (L, F = 1,493, p < .001; u, F = 166, p < .001; v, F = 660, p < .001), as well as a significant main effect of synesthetic status in the L and v dimensions (L, F = 10.73, p = .001; u, F = 0.75, p = .39; v, F = 660, p < .001). These results echo those found for acoustic factors, with synesthetes choosing lighter (estimate = 1.34, SE = 0.49) and yellower (estimate = 6.92, SE = 2.71) colors than nonsynesthetes.

To assess specific effects for vowels, we calculated contrasts between all vowel categories in each dimension, with Bonferroni adjustments for all reported p values, using the lsmeans package (Lenth, 2016). The differences between vowel categories were mostly highly significant for all color dimensions and are fully listed in Appendix 3, with estimated means, confidence intervals, t ratios, and p values. Since the contrasts were mostly significant, Fig. 7 shows which vowels were not different from one another, using black lines for uncorrected nonsignificant contrasts (p > .05), a dotted line for a contrast in which p < .05, and a dashed line for a contrast in which p < .01. In other words, the more solid a line between two vowels is, the more similarly participants responded to them. For all unmarked contrasts, p < .001. Vowels are plotted with their canonical Dutch phoneme values from Adank et al. (2004). Figure 7 shows the results in each dimension.

Fig. 7
figure 7

Mean color choices for and contrasts between vowel categories plotted in F1–F2 space, labeled with canonical Dutch phoneme values from Adank et al. (2004). The axis labels indicate vowel quality (vowel height on the F1 axis, and front/back vowels on the F2 axis). For each color scale, the values shown reflect the mean for the item in the relevant dimension with the other dimensions held at middle values (the middle values for L, u, and v, are 50, 0, and 0, respectively). Solid lines between vowels indicate nonsignificant contrasts, as per the legend. Thus, clusters of vowels connected by lines did not elicit significantly different color choices from participants, and can be interpreted as informal groups. Asterisks next to a phoneme indicate that synesthetes and nonsynesthetes chose significantly different colors for that phoneme (see the text for details)

In all dimensions, the vowels /e, , ε, ø/ tended not to be significantly different, forming a “mid- front” grouping for which participants generally chose similar colors that were lighter, greener, and yellower than those of low back vowels. In both the L and v spaces, the vowel /i/ stood apart from this group, being even lighter and yellower. The vowels /u/ and // were also informally grouped in the u and v dimensions. In the u dimension, these also converged with /a/, although /ɑ/ was set apart as the reddest of all the vowels.

Contrasts between synesthetes and nonsynesthetes were not calculated for the u dimension, due to the lack of a main effect of synesthetic status. For the other dimensions, significant contrasts between synesthetes and nonsynesthetes are marked in Fig. 7 as follows: ***p < .001, **p < .01, *p < .05. The vowels in the “mid-front” grouping described above were generally lighter for synesthetes than for nonsynesthetes, and this was also true for the highest front vowel /i/. In the v dimension, synesthetes’ choices were generally yellower than nonsynesthetes’ for the same “mid-front” grouping, and also yellower for the high front vowel /i/. They were also slightly yellower for the low vowels /a/ and /ɑ/.


In a large sample of Dutch speakers, we found evidence of shared vowel–color associations. As in earlier work, our data showed that the acoustic factors F1 and F2 were predictive of color choices: Higher values of F1 (i.e., lower vowels) are darker, redder, and bluer; higher values of F2 (i.e., more front vowels) are lighter, greener, and yellower. These results echo those found for English speakers by Moos et al. (2014), and for Korean–English bilinguals in Kim et al. (2018). Although the general shape of associations is shared across synesthetes and nonsynesthetes (Fig. 5), synesthetes show more extreme color and lightness choices, selecting lighter and yellower colors for high values of F2 especially. The differences between synesthetes and nonsynesthetes in our results are not as marked as those reported by Moos et al.; we address several potential reasons for this in the General Discussion.

However, we also found that an approximation of phoneme category is a better predictor of color choice than the acoustic measures, indicating that categorical perception can shape the structure of cross-modal associations. As would be expected, given the acoustic results, front vowels are lighter, greener, and yellower, whereas low and back vowels are darker, redder, and bluer. There were generally no significant differences between the “mid-front” group of /e/, //, /ø/, and /ε/. However, particularly for this group, synesthetes chose slightly lighter and yellower colors than did nonsynesthetes. They also chose lighter and yellower colors for the high front vowel /i/, and darker colors for the low back vowel /u/.

Although the values from the phoneme model are very close to those from the grapheme model, comparisons of the two predictors showed that the phoneme model is significantly superior in every color dimension. It may be that grapheme category is only a good predictor of color choices insofar as it is a fairly good, though imperfect, predictor of vowel category. Although a rough mid-front vowel grouping emerged in the phoneme category analyses, this does not correspond to a larger grapheme grouping, since the four relevant phonemes map onto three different graphemes (/e/ and /ε/ to e, // to i, and /ø/ to u) in Dutch orthography.

Mapping structure

So far, our analyses were largely concerned with the kind of questions asked in classic cross-modal association studies, linking specific colors or color dimensions to acoustic and phonemic features. We also looked at contrasts between synesthetes and nonsynesthetes, based largely on consistency across trials. Although the temporal consistency of mappings is rightly considered a benchmark of genuine synesthesia, and some earlier studies have considered how the mappings of synesthetes relate to those of nonsynesthetes, less consideration has been given to the internal structure of synesthetic and cross-modal mappings. The traditional approach tells us something about whether synesthetes choose colors for sounds that are different from or similar to the kinds chosen by nonsynesthetes, but it is less adept at detecting overall structure in cross-modal mappings or telling us whether the mappings of synesthetes are more internally structured than those of nonsynesthetes. Are there structural regularities in how we link one sensory domain to another? Does the shape of the vowel space map onto the color space more reliably for synesthetes than for nonsynesthetes?

We operationalized structure in this context by comparing paired distances across spaces, using a method borrowed from ecology (Mantel, 1967). This method has been used extensively in iterated artificial-language learning studies to detect structured mappings between form and meaning spaces (e.g., Kirby, Cornish, & Smith, 2008). The Mantel test in this context measures whether distances in form correlate with distances in meaning. To the extent that they do, we can say there is structure in the mappings between two spaces.

In the context of the present data, for example, a mapping would be structured when pairs of vowels that are similar in F1–F2 acoustic space map onto pairs of colors that are similar in three-dimensional color space, and when pairs of vowels that are dissimilar in F1–F2 acoustic space map onto pairs of colors that are dissimilar. Thus, structure implies a degree of isomorphism across (multidimensional) sensory spaces—in this case, acoustic and color spaces.

Whereas our earlier consistency measures had used CIELuv space to align with prior work (Rothen et al., 2013), for this measure we used the related CIELab space. In CIELab space, the L dimension is identical, whereas a corresponds to a green–red continuum (similar to u) and b corresponds to a blue–yellow continuum (similar to v). The benefit of CIElab space for the structure measure is that it allows us to use more perceptually realistic distances, specifically ∆E2000 (Sharma, Wencheng, & Dalal, 2005). The ∆E2000 distance takes into account that Euclidean distances have nonuniform perceptual effects, particularly at the edges of the color space. For example, as lightness increases to the white point, the perceptual differences between chroma shrink and eventually disappear, even though their plain Euclidean distance in the CIELab or CIELuv spaces would be identical to those of two perceptually distinct colors elsewhere in the space. Therefore, our structure measure relies on ∆E2000 distances in CIELab space. For vowel distances, we used Euclidean distance in F1–F2 space using the canonical phoneme values shown in Fig. 7.Footnote 4 Since Euclidean distance can be high-dimensional, this allowed us to use all three dimensions of a participants’ color response to an item at once.

Where there is structure, the pairwise distances within each space will be correlated with one another. Once we had pairwise distances for every mapping in each space from our participants’ data, we permuted the vowel–color mapping between the two spaces and then recalculated the pairwise distances in order to get a distribution of potential correlations between the two spaces. Using this distribution, we obtained a z score indicating where a given participant’s mappings were on the actual distribution, and a p value that indicated the likelihood that any random mapping of the vowel–color space would be more structured than the actual one.Footnote 5 In other words, p < .05 in this case means that fewer than 5% of mappings generated in the simulation were more structured than the real one.

Pairwise distances were calculated between all sounds and all colors chosen by a particular participant, and a veridical correlation was calculated between these distance matrices. To account for multiple responses to the same stimuli, the responses were first shuffled within a particular item, and then across items. These matrices were then shuffled into 10,000 random permutations, each with its own r, allowing us to calculate a z score as described above. Python code for performing the Mantel structure analyses is available online at Figure 8 shows a density plot of z scores of synesthetes and nonsynesthetes in the vowel–color association task.

Fig. 8
figure 8

Distributions of structure scores among synesthetes and nonsynesthetes. Higher z-score values indicate more structured mappings. Values to the right of the dashed line indicate mappings that are significantly structured

Three findings stand out. First, the mappings of synesthetes tend to be more structured than those of nonsynesthetes (t = – 7.09, df = 660, p < .001). Second, the majority of participants’ mappings are more structured than would be expected by chance: All participants to the right of the vertical dotted line had correlations between the vowel and color spaces greater than 95% of random permutations generated by the Mantel test. Third, there is correlation between the structure score and CIELuv consistency scores: Participants with more consistent associations across trials (i.e., lower consistency scores) tended to have more structured mappings across the vowel and color spaces (r = – .313, t = – 11.07, p < .001; Fig. 9).

Fig. 9
figure 9

Relationship between structure score and consistency score, showing that participants with more consistency (i.e., lower CIELuv color distance) across trials tend to have more structured mappings (i.e., higher z scores)

This measure provides a new way to quantify the structure of cross-modal mappings and is a valuable quantitative complement to traditional unimodal consistency scores. Figure 10 shows individual participants that fall in specific parts of the consistency-structure space. The participant in panel a, who was classified as a synesthete according to consistency, appears to have achieved this by having high consistency across items (i.e., choosing the same color regardless of stimulus or trial), rather than by having structured, temporally consistent associations. This indicates that participants with high consistency and low structure are less likely to be genuine synesthetes, perhaps explaining the slight peak of unstructured synesthetes in Fig. 8. The participant in panel b has both low structure and low consistency, having chosen idiosyncratic colors on each trial and across the space, and sometimes mapping distant vowels (e.g., low-central and high-back/high-front vowels) with similar colors.

Fig. 10
figure 10

Mappings of individual participants showing, clockwise from bottom left, (a) a participant with very low structure yet high consistency across trials and items, probably indicating a false positive synesthete, (b) a typical nonsynesthete with inconsistent and unstructured mappings, (c) a middling participant with significant structure but inconsistent choices across trials, and (d) a highly structured but inconsistent participant, and (e) a typical vowel–color synesthete, with highly structured, consistent and categorical mappings

A nonsynesthete participant with middling consistency and significant but not especially high structure is shown in panel c. This participant shows structured mappings for some parts of the space—for instance, showing similar yellow/green choices for the cluster of high-front vowels, and brownish choices for the cluster of high back vowels, resulting in significant structure. However, this participant was inconsistent across trials for the same item, distinguishing the participant from highly structured synesthetes like the participant in panel e, showing categorical, structured associations that are highly consistent across trials. Finally, panel d shows a participant with high structure but low consistency: This participant made structured mappings across the space, but seems to have done so differently on each trial, as indicated by the inversions of green/blue in mid-front vowels and red/blue in back vowels across trials.


We examined vowel–color associations in a large number of participants in an online study. Four key findings emerge.

First, acoustic factors (F1 and F2) predict some of the variance in color responses, replicating a result by Moos et al. (2014), but in a different and larger population. Participants chose lighter colors for more front vowels (i.e., vowels with higher F2 values), redder colors for lower vowels (i.e., vowels with higher F1 values), and greener colors for high and front vowels (i.e., lower F1 and higher F2 values, respectively). Synesthetes showed the same patterns but were slightly more extreme, choosing even lighter and yellower colors for high vowels in particular, an effect also found by Moos et al. Overall, the locations of associations in the color space were not identical to those found by Moos et al., but the general differences in shape were comparable: Synesthetes’ choices were generally more extreme than those of nonsynesthetes.

Our findings resemble those of previous studies of vowel–color mappings in smaller samples of (non)synesthetes, with lighter colors (yellow, green) being associated with more front vowels (higher F2; e.g., /i/, /e/), darker colors (e.g., red, brown, blue) being associated with back vowels (/o/, /u/), and redder colors being associated with low front vowels (e.g., /a/) (Marks, 1975; Wrembel, 2009; see Guillamon, 2014, for a cross-linguistic overview, particularly their Table 5, p. 44). The association of front vowels with high F2 to lighter colors could be related to more widespread cross-modal phenomena such as pitch–lightness associations (e.g., Ward et al., 2006), although this potential relationship requires further study, since vowel quality and pitch are independent (see Ohala, 1994, for an account that integrates these). Size–sound symbolism may also play a role: The space in the oral cavity is smaller for higher and more front vowels, which may in turn be associated with lightness and brightness (Cuskley & Kirby, 2013).

Second, color associations are predicted better by phonemes than by acoustic features, showing that cross-modal associations between vowel sounds and colors are modulated to an important degree by categorical perception (Goldstone & Hendrickson, 2010). Prior work on sound–color associations has mostly focused on how gradual changes in chromaticity are associated with low-level acoustic factors (Moos et al., 2014; Ward et al., 2006). Here we have shown that shifts in color associations correspond to category boundaries in participants’ vowel systems. The importance of categorical structure in vowel–color associations has implications for the underlying nature of synesthetic associations, pointing to the important role of learning (Mroczko-Wąsowicz & Nikolić, 2014; Simner, 2007).

For example, recent studies have shown that learning can play an influential role in synesthesia: Specifically, the presence of colored fridge magnets in childhood was formative for grapheme–color synesthesia (Witthoft & Winawer, 2013; Witthoft, Winawer, & Eagleman, 2015). However, the learning implied in the categorical effects that we observed is perhaps qualitatively different: Rather than being based on highly specific early perceptual experiences, our effects arise from a categorical warping of the vowel space that is a crucial part of spoken language acquisition.

Although our results indicate that acquired categories play a key role in vowel–color associations, there remains an important role for lower-level acoustic perceptual cues. For example, in the u dimension, we observed synesthetic effects for acoustic factors but not for phoneme category. This may point to the relevance of lower-level acoustic properties for some synesthetes specifically. It is possible that synesthetes react to acoustic factors in a way that nonsynesthetes do not, making acoustic factors a stronger predictor of color choices for synesthetes than for nonsynesthetes. Further targeted studies will be required in order to address this question.

Third, we have introduced a novel measure of structure in cross-modal associations. Most participants showed a significant degree of structure, implying that vowel–color associations rely at least in part on establishing structural isomorphism across perceptual domains. We found that this measure correlates with consistency across trials: Participants with more temporally consistent associations tend to have more structured mappings, with synesthetes being at the most consistent and structured extreme. Although synesthetic associations are idiosyncratic to some degree, the prevalence of structure shows that these associations share similarities with cross-modal associations in the general population (Simner et al., 2005). This implies that private, involuntary synesthetic associations, as well as overt, elicited cross-modal associations, may be underpinned to a significant degree by common principles of isomorphic mapping across sensory modalities. Our structure measure makes it possible to probe cross-modal and synesthetic associations in a way that is at least partly independent of consistency. This has the potential to provide a valuable complementary measure of genuineness in synesthesia (Simner, 2012). The structure measure can be applied in any domain in which the perceptual features of stimuli and responses (or of inducers and concurrents) are quantifiable in terms of some distance measure.

Fourth, we have shown that an online task can be used to learn about cross-modal associations and to infer synesthesia for a subset of participants along the way, providing a scalable method for identifying synesthetes and for studying the relation between cross-modal associations and synesthetic mappings from a population perspective. Although online tasks have become a fixture of synesthesia studies at least since Eagleman et al. (2007), they have mostly been used to test preselected sets of participants. Here we have shown that a widely advertised survey can succeed in capturing a broad and diverse sample of the population, including synesthetes. To foster more of this work, the code underlying our web application is openly available.

Limitations and future work

Although acoustic and vowel phoneme predictors accounted for significant amounts of variation in color choices, as shown by the analyses above, the conditional R2 values in Table 2 indicate that individual variation accounted for at least as much variation in responses (the marginal-conditional R2 quantifies, roughly, the variance accounted for by both fixed and random effects). This indicates that, although there were obvious trends and although our structure measures point to a shared capacity for structure mappings, the specific identity of mappings may still vary considerably across participants. However, our structure results show that the capacity to structure mappings is strong and is shared across much of the population.

Further targeted studies should be done to tease apart the relative contributions of vowel category and grapheme category, especially the extent to which these may interact with how stimuli are presented. The data analyzed here were responses to auditory stimuli. Although there is evidence that auditory stimuli automatically activate graphemic representations to some extent (Ziegler & Ferrand, 1998), auditory presentation privileges acoustic features over graphemes (Cuskley et al., 2017), and graphemic representations may be less likely to be activated when auditory stimuli are presented devoid of a word-like context (e.g., within a nonword). Synesthetes tend to associate similar colors to similarly shaped graphemes (Brang, Rouw, Ramachandran, & Coulson, 2011; Eagleman, 2010; Jürgens, Mausfeld, & Nikolic, 2010; Watson, Akins, & Enns, 2012), implying an important role for visual grapheme shape in mediating the associations. However, other factors, such as ordinality in the alphabet and frequency in language, also play a role (e.g., van Leeuwen, Dingemanse, Todil, Agameya, & Majid, 2016; Watson et al., 2012).

Although most of the broad patterns found in Moos et al. (2014) were replicated in our data, the differences between the associations of synesthetes and nonsynesthetes was less pronounced than in their study (Fig. 5, and cf. Fig. 2 in Moos et al., 2014, p. 136). Two important differences between the studies are sheer sample size and online presentation, both of which added more variability to the responses in our study. Another possibility is that the confined set of 16 color swatches used by Moos et al. constrained responses in such a way that the divisions between synesthetes and nonsynesthetes became clearer. It may also be that Dutch speakers differ from English speakers in the finer details of how they map vowels onto color space. However, despite the differences, our results show quantitative differences between synesthetes and nonsynesthetes, as well as an important degree of shared mappings.

Although we expect the general role of categorical perception to be replicated across languages (mediated by known acoustic factors), we do not expect the precise groupings of vowel sounds to be replicated, because this is a function of how acoustic space is carved into a language-specific phoneme inventory. This opens up the possibility of a degree of linguistic relativity in cross-modal associations and synesthetic experience. Just as lexicalization patterns in the domain of color can shape low-level processes of color perception (Roberson, Pak, & Hanley, 2008), so phonemic structure may shape cross-modal associations. This is one place where linguistic diversity in phonetics, phonology, and orthography can be used to learn more about the mechanisms underlying vowel–color associations and to tease apart the roles of acoustic, phonemic, and graphemic features in cross-modal associations (Root et al., 2018; van Leeuwen et al., 2016).

Finally, further work will be needed to combine consistency scores with structure scores to create reliable behavioral indicators of genuine synesthesia, especially for use with large-scale online methodologies. Earlier reports of consistency scores had used them primarily to confirm self-identified synesthetes, rather than to detect synesthesia in a random sample. Figure 10a shows that our consistency cutoff to identify synesthetes (taken from Rothen et al., 2013) likely resulted in some false positives. A few potential issues may have contributed to this problem. First, Rothen et al. measured consistency in a grapheme–color task, which had 36 items. It may be that with fewer items (16 vowels), the consistency threshold was easier to pass. More importantly, consistency across items also needs to be taken into account: If a participant is consistent across trials and items, this is a likely flag for a false positive synesthete. Our structure measure captured this well, since cross-item distances form the core of the measure. Further work with confirmed synesthetes and nonsynesthetes will be needed in order to fully understand how best to combine structure and consistency scores to reliably detect synesthetes.


In the first half of the 20th century, a common term for synesthesia was audition colorée, or “colored hearing”, after one of the most commonly reported forms of synesthesia: a connection between vowels and colors. Colored hearing was famously described by Nabokov (1989) and studied by such noted linguists as Roman Jakobson (Reichard, Jakobson, & Werth, 1949). This early focus on sound–color associations probably was one of the reasons for a fruitful period of experimental work on cross-modal correspondences between sound and color (e.g., Marks, 1978). In contrast, most modern work on synesthesia has tested graphemes in alphabetic writing systems (written representations of speech sounds), and indeed, grapheme–color synesthesia is likely by far the most studied variant of synesthesia to date.

Here we have brought these traditions together in pursuit of fundamental questions into the nature of synesthesia and cross-modal correspondences. We studied associations between vowel sounds and colors in over 1,100 people, including hundreds of synesthetes. We replicated earlier findings on the relation between acoustic cues and color choices, but additionally showed an important role for categorical perception over and above such cues. Our findings underline the roles of learned categories and structural isomorphisms in the cross-modal associations made by synesthetes and nonsynesthetes alike. The measure of structural isomorphism we have introduced can help create more nuanced diagnostic tools for synesthesia. As work on synesthesia and cross-modal associations grows to accommodate larger samples and more varied measures, it will provide fundamental insights into how mappings across sensory modalities are made and maintained.