We addressed how the speech production and perception systems interact during online processing of vowels by using an experimental task which requires concurrent use of both systems. During the cue–distractor task, participants repeatedly produce responses prompted by a visual cue. Shortly after presentation of the cue but before any response is given, participants hear a distractor syllable via headphones. By systematically manipulating the relation between the response and the distractor, this task has been used extensively to demonstrate perceptuomotor integration effects (that is, perception effects on production), mainly with consonants: Response times speed up when the distractor syllable begins with a consonant that shares properties (such as articulator or voicing) with the initial consonant of the response, compared with conditions in which those properties are not shared. Here we demonstrate that perceptuomotor integration is not limited to consonants and that, for vowels too, congruency effects go beyond phonemic (non-)identity.
Our first hypothesis was that phonemically congruent response–distractor pairs of vowels (e.g., /e/–/e/ or /u/–/u/) would result in faster RTs than phonemically incongruent pairs (e.g., /e/–/u/ or /u/–/i/), as has previously been reported for consonants (Galantucci et al., 2009; Kerzel & Bekkering, 2000) and, in one recent study, also for vowels (Adank et al., 2018). This hypothesis was confirmed. Distractor vowels that were phonemically different from the response vowel resulted in RTs that were on average 6.94 ms slower than distractor vowels that were phonemically identical. This first assessment of perceptuomotor effects for vowels does not consider the subphonemic properties of the vowels in the response–distractor pairs, because all distractor vowels phonemically different from the response were collapsed into a single group per spoken response. Our second hypothesis was that distractor vowels which are subphonemically more like the response vowel would speed up RTs compared with distractor vowels which are subphonemically less like the response. We defined subphonemic similarity in vowel pairs by referring to featural distance (i.e., the number of phonological features that differ between the two vowels in each pair). Extending earlier findings for consonants (Roon & Gafos, 2015) to the case of vowels, we observed that greater featural distance between the response and distractor resulted in slower RTs (by approximately 3.32 ms per unit of featural distance) compared with when the response and distractor were phonemically the same. Table 6 summarizes mean RT delays for each incongruent response–distractor vowel pair relative to the corresponding congruent response–distractor pair, as estimated by our models.
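To make the featural distance metric concrete, the following minimal sketch counts the number of differing features for response–distractor pairs. It is illustrative only: the feature values are one conventional fully specified description of the four vowels and need not coincide exactly with the feature set used in our analyses.

```python
# Illustrative sketch (not our analysis code): featural distance as the number of
# phonological features on which two vowels differ. The feature values below are a
# conventional fully specified description, chosen only for illustration.
FEATURES = {
    "i": {"high": True,  "back": False, "round": False},
    "e": {"high": False, "back": False, "round": False},
    "o": {"high": False, "back": True,  "round": True},
    "u": {"high": True,  "back": True,  "round": True},
}

def featural_distance(v1: str, v2: str) -> int:
    """Count the features whose values differ between the two vowels."""
    return sum(FEATURES[v1][f] != FEATURES[v2][f] for f in FEATURES[v1])

# Example: distance from the response /u/ to each possible distractor.
for distractor in ["u", "o", "i", "e"]:
    print(f"/u/-/{distractor}/: {featural_distance('u', distractor)}")
# With these illustrative values: /u/-/u/ = 0, /u/-/o/ = 1, /u/-/i/ = 2, /u/-/e/ = 3.
```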
Table 6 Perceptuomotor interference effects for each response–distractor vowel pair

Comparing the present results to previous studies using the cue–distractor task, it appears that the interference effect (the temporal delay in giving a spoken response) is typically larger for consonants. For example, Galantucci et al. (2009) report an average effect of 28 ms. Our results uncovered an average interference effect of about 7 ms for phonemic incongruency (see Table 1). This is in line with the results reported by Adank et al. (2018) for vowels embedded in a consonant frame (13 ms for their Experiment 1 and 7 ms for their Experiment 2; note that these values include audio, visual, and audiovisual modalities).
The generally smaller delays in RTs observed for vowels compared with consonants may be due to differences in the granularity of the representations activated during speech perception for contrasts among vowels versus contrasts among consonants. Across Galantucci et al.’s (2009) three experiments on consonants, responses and distractors always shared the same main articulator on congruent trials and differed on incongruent trials (e.g., the lips for /pa/, /ba/, and /ma/ vs. the tongue tip for /ta/, /da/, and /na/). In another cue–distractor study, Klein et al. (2015) report RTs from trials in which the responses and distractors shared (e.g., /da/–/ta/) or differed in the main articulator (e.g., /ka/–/ta/); additionally, the distractor’s voice onset time (VOT) was either shorter or longer than the participant’s own VOT. While RTs were slower on trials in which the articulator of the response differed from that of the distractor (compared with trials sharing the main articulator), RTs did not differ between trials in which the distractor’s VOT was longer versus shorter than the participant’s own mean VOT. Likewise, Roon and Gafos (2015) found a small (4 ms) but nonsignificant effect when responses and distractors differed in voicing compared with when they did not (e.g., /ta/–/ba/ vs. /da/–/ba/). Thus, while previous results show clear effects on RTs due to the distractor consonant’s articulator (Galantucci et al., 2009), this is not necessarily the case for VOT or voicing differences (Klein et al., 2015; Roon & Gafos, 2015). A plausible interpretation of these divergent effects of VOT/voicing versus main articulator in consonants, which bears on the case of vowels, is that the somatomotor representations activated during speech perception differ for these two kinds of contrast. Suppose the response is /ba/ and the distractor is /da/. Planning to say /ba/ activates motor codes for the lips, whereas hearing /da/ activates motor codes for a different main articulator (namely, the tongue tip). By contrast, planning to say /ka/ with a long VOT activates motor codes for the tongue dorsum, and hearing /ka/ with a shorter VOT activates motor codes for exactly the same main articulator. In the case of vowels, differences between the response and distractor always involve the same articulators (lips and tongue). However, these articulators are used in different ways across different vowels. For example, planning to say /u/ activates motor codes for the lips and tongue, and hearing /i/ also activates motor codes for the lips and tongue but, importantly, with different configurations. It is conceivable that the smaller delays in RTs for vowels (compared with stop consonant pairs like /ba/–/da/) arise because similar but not identical motor codes (rather than motor codes for distinct articulators) are involved across the response and distractor on incongruent trials. Adank et al. (2018) suggest that future studies should investigate this issue by comparing the articulatory complexity of different sounds, exploring the somatotopy of perceived sounds with transcranial magnetic stimulation (TMS) and measuring motor evoked potentials (MEPs) from lip and tongue muscles.
Our results raise new questions about how the substance underlying phonemic categories is to be described, because perceptuomotor effects were evidently not uniform across the two responses. Recall that our second model revealed a significant interaction of featural distance and response. Subsequent analyses showed that, whereas RT modulations for the response /u/ depended on the identity of the distractor, RTs for the response /e/ were not reliably affected by the exact distractor vowel (/i/, /o/ or /u/). Thus, there are different patterns of perceptuomotor effects for the /e/ and /u/ responses.
Why might perceptuomotor effects occur for the response /u/, but not for /e/? To define featural distance, we followed convention by using standard phonological representations for vowels, which assume that all features of every vowel are specified. An alternative to this approach to featural representations is Lahiri and Reetz’s (2010) featurally underspecified lexicon model (FUL). In this model, representations may be underspecified for certain features. For instance, the feature [coronal], originally used to represent sounds produced with a raised tongue tip or blade (Chomsky & Halle, 1968, p. 304) but adopted in Lahiri and Reetz’s (2010) set of features to also represent front vowels, is claimed not to be specified in mental representations. Thus, while the representation for /e/ lacks feature values for coronality and height, the representation for /u/ is fully specified ([dorsal], [labial], and [high]). According to FUL (Lahiri & Reetz, 2010), in perceiving speech, features are extracted from the incoming signal, and these ‘surface’ features may or may not mismatch with the (possibly underspecified) mental representation of another phoneme. In the present study, the features in the mental representations of the response vowels may or may not mismatch with the surface features extracted from perceiving a distractor vowel. For instance, when the response is /e/ and the distractor is /u/, the features of the response vowel /e/ will not mismatch with any of the surface features extracted from perceiving the incoming distractor /u/ (namely, [dorsal], [labial], [high]), because the mental representation of /e/ has no specification for any of these features. On the other hand, when the response is /u/ and the distractor is /e/, there is a mismatch between the [dorsal] feature in /u/’s mental representation and the [coronal] feature extracted from the incoming distractor /e/. Table 7 shows a schematic of the relationships between the features in the mental representations of the two response vowels /e/ and /u/ and the features extracted from perceiving the different distractor vowels.
Table 7 Phonological features extracted from the perceived incoming signal (surface form) and those in the mental representation (for production of the response) based on an alternative model of representations admitting underspecification (Lahiri & Reetz, 2010)

Our results are compatible with the predictions of the underspecified representations model: whereas we found a modulation of RTs depending on the distractor for the response /u/, no such modulation was found for any of the distractors /i/, /o/, or /u/ when the response was /e/. The absence of such RT modulations for the response /e/ may be seen as a consequence of the lack of mismatch between the response /e/ and any of the incoming distractor vowels, as shown in Table 7. In contrast, the mental representation of the response /u/ is specified for [dorsal], [labial], and [high]. Consequently, hearing the distractors /e/ or /i/ while planning to produce the response /u/ results in a mismatch between /u/’s [dorsal] specification and the surface [coronal] feature of the distractors /e/ or /i/. This is in line with the results indicating significant effects (delays in RTs) for the distractors /e/ and /i/ when the response was /u/. Additionally, FUL predicts no mismatch between the underlying features of the response /u/ and the surface features extracted from the distractor /o/, because the distractor /o/ surfaces with [dorsal] and [labial], which are also present in the mental representation of the response /u/. In line with this, the results showed no significant difference in RTs when the response was /u/ and the distractor was /o/ compared with when the response and distractor were both /u/.
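A minimal sketch of this mismatch logic is given below. The underlying and surface feature sets follow the description in the text and Table 7 (the [high] surface feature for the distractor /i/ is an assumption that does not affect the outcome), the only conflict encoded is that between surface [coronal] and underlying [dorsal], and the code is illustrative rather than an implementation of FUL.

```python
# Illustrative sketch of the FUL-style mismatch logic described above (not an
# implementation of the full model). Underlying (mental) representations may be
# underspecified; surface features are extracted from the perceived distractor.
# A mismatch arises when a surface feature conflicts with a specified,
# incompatible feature in the underlying representation of the response.

# Underlying representations of the two response vowels, following the text:
# /e/ is underspecified for place and height; /u/ is fully specified.
UNDERLYING = {
    "e": set(),
    "u": {"dorsal", "labial", "high"},
}

# Surface features assumed to be extracted from each distractor vowel.
SURFACE = {
    "i": {"coronal", "high"},
    "e": {"coronal"},
    "o": {"dorsal", "labial"},
    "u": {"dorsal", "labial", "high"},
}

# Feature pairs treated as conflicting (here, only the place conflict matters).
CONFLICTS = {frozenset({"coronal", "dorsal"})}

def mismatch(response: str, distractor: str) -> bool:
    """True if any surface feature of the distractor conflicts with a specified
    underlying feature of the response vowel."""
    return any(
        frozenset({u, s}) in CONFLICTS
        for u in UNDERLYING[response]
        for s in SURFACE[distractor]
    )

for resp in ("e", "u"):
    for dist in ("i", "e", "o", "u"):
        print(f"response /{resp}/, distractor /{dist}/: mismatch = {mismatch(resp, dist)}")
# Expected pattern: no mismatches for the response /e/; mismatches for the response /u/
# only with the front distractors /i/ and /e/, mirroring the RT delays reported above.
```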
While our results show that previous findings for consonants about phonemic and subphonemic properties influencing perceptuomotor effects also apply in the case of vowels, the design of the present study differs in one notable way from previous cue–distractor studies. In this study, the required responses and the distractors were isolated vowels (e.g., /e/ or /u/) instead of more linguistically typical syllables containing consonant–vowel (CV) or consonant–vowel–consonant (CVC) phoneme sequences (e.g., /ba/ or heed). Note that stop consonants like [b] and [d] cannot form (acoustic) stimuli in isolation (due to the lack of acoustic output from a fully constricted vocal tract), whereas vowels can. In some sense, our design decision of using isolated vowels provides the simplest test bed for assessing perceptuomotor effects, and such a test bed is not available for consonants; all previous perceptuomotor studies of consonants use the simplest possible form, namely, a consonant–vowel syllable in which the vowel is kept the same across stimuli that differ with respect to the consonant. However, our design decision for vowels may not be optimal in a different sense. In vowel perception, identification accuracy is substantially reduced for isolated vowels which have been excised from coarticulated CVC syllables compared with vowels in intact CVC syllables (for a review, see Strange & Jenkins, 2013). Strange and Jenkins (2013) hypothesize that the dynamic, time-varying spectral structure found at the edges of vowels in coarticulated syllables is rich in information for perceiving phonemic identity in vowels. Recall that our distractor stimuli were isolated vowels excised from CVC syllables, which excluded much of the spectrotemporal structure around the edges of the vowels. It is thus possible that, if the distractor vowels had been presented in the original CVC syllables (and the required responses had also been CVC syllables), modulations in RTs due to the (in)congruency of response–distractor pairs would have been more pronounced than those observed, because the spectrotemporal cues relevant for perceiving phonemic identity are more fully available in CVC syllables. In future studies, we thus plan to examine RT modulations with vowels in CVC syllables. Additionally, while the findings of this study seem compatible with featural representations, our study was not designed to address whether a metric of similarity based on acoustic parameters or one based on phonological features best accounts for the observed RT modulations. A major challenge in separating feature-based from acoustic-based notions of similarity is that featural congruency often correlates with acoustic similarity: in the general case, the more features that differ between two vowels, the more different the vowels are acoustically. Teasing apart the two notions of similarity requires comparing perceptuomotor effects for acoustically similar but featurally different vowel pairs with those for featurally similar but acoustically different pairs. In future studies, we plan to undertake such comparisons by employing languages with vowel inventories that include such contrasting pairs.
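As a rough illustration of the contrast between the two metrics, the sketch below computes a simple formant-based acoustic distance for the response /u/ and its distractors. The F1/F2 values are hypothetical, textbook-style values rather than measurements from our stimuli, and a real comparison would use a perceptual frequency scale and richer acoustic parameters.

```python
import math

# Illustrative sketch only: hypothetical F1/F2 values, not measurements from our
# stimuli. A real analysis would likely use a perceptual frequency scale (e.g., Bark)
# and additional acoustic cues beyond the first two formants.
FORMANTS_HZ = {"i": (280, 2250), "e": (450, 1950), "o": (450, 900), "u": (310, 850)}

def acoustic_distance(v1: str, v2: str) -> float:
    """Euclidean distance in the F1-F2 plane (Hz)."""
    (a1, a2), (b1, b2) = FORMANTS_HZ[v1], FORMANTS_HZ[v2]
    return math.hypot(a1 - b1, a2 - b2)

# Featural distances for these pairs, from the earlier sketch:
# /u/-/o/ = 1, /u/-/i/ = 2, /u/-/e/ = 3.
for v1, v2 in [("u", "o"), ("u", "i"), ("u", "e")]:
    print(f"/{v1}/-/{v2}/: {acoustic_distance(v1, v2):.0f} Hz")
# With these values the two metrics are correlated but not perfectly aligned
# (/u/-/i/ comes out acoustically farther than /u/-/e/ despite a smaller featural
# distance), which is why stimulus pairs that dissociate the two notions are needed.
```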
Let us next address the implications of our results for models of speech perception and production. The findings of our study necessitate a formulation of the perception–production link in which motor codes activated in the process of planning a response vowel are also activated automatically by a perceived distractor vowel. According to the motor theory of speech perception (Liberman & Mattingly, 1985), these motor codes are the sole object of perception, and our results are therefore fully compatible with that theory. However, our results do not entail that the codes activated in perception must exclusively be motor codes, and our study was not designed to address whether nonmotor codes are also activated in perception (cf. Diehl, Lotto, & Holt, 2004; Galantucci, Fowler, & Turvey, 2006). Thus, our results are compatible with theories in which nonmotor codes are also activated during perception (e.g., Fowler, 1986; Mitterer & Ernestus, 2008; Ohala, 1996), provided some link between the nonmotor and motor codes can be assumed (Viviani, 2002). In terms of speech production models, our findings of RT modulations due to subphonemic properties are more in line with models that assign a role to subphonemic parameters in the planning of articulation (e.g., Dell et al., 1993) than with theories that do not assign an important role to subphonemic features in that process (e.g., Roelofs, 1997).
In summary, using an experimental paradigm which requires concurrent use of the perception and production systems, we studied perceptuomotor integration effects, that is, how perception of an auditory vowel stimulus (the distractor) affects production of a cued vowel response. Our results contribute to and extend previous perceptuomotor studies focusing on consonants (Galantucci et al., 2009; Kerzel & Bekkering, 2000; Roon & Gafos, 2015). In line with previous studies on consonants using the cue–distractor task, we found that RTs for producing a required vowel response speed up when the distractor vowel is phonemically identical to the cued spoken response compared with when the distractor is phonemically different. We also found evidence that subphonemic properties, below the level of the phonemic category, modulate RTs. The fact that participants in our experiments were told to ignore the distractor and yet reliable effects of distractors on responses were observed, both in our current study and in previous studies using the same paradigm (e.g., Adank et al., 2018; Galantucci et al., 2009; Roon & Gafos, 2015), attests to the automaticity of these perceptuomotor effects and to the promise of this paradigm for elucidating the perception–production link for vowels as well as for consonants. In future studies, we plan to employ more speech-typical utterances with vowels embedded in syllables (as opposed to isolated vowels), as well as to assess the extent to which similarity metrics based on acoustic dimensions (as opposed to phonological features) provide a better basis for the congruency relations giving rise to perceptuomotor integration effects.