Gradient and categorical patterns of spoken-word recognition and processing of phonetic details

Desmeules-Trudel, Félix; Zamuner, Tania S.

doi:10.3758/s13414-019-01693-9

Published: 27 February 2019

Volume 81, pages 1654–1672, (2019)

Attention, Perception, & Psychophysics


A Correction to this article was published on 29 March 2019

This article has been updated

Abstract

The speech signal is inherently rich, and this reflects complexities of speech articulation. During spoken-word recognition, listeners must process time-dependent perceptual cues, and the role that these cues play varies depending on the phonological status of the sounds across languages. For example, Canadian French has both phonologically nasal vowels (i.e., contrastive) and coarticulatorily nasalized vowels, as opposed to English, which only has coarticulatorily nasalized vowels. We investigated how vowel nasalization duration, a time-dependent phonetic cue to the French nasal contrast, affects spoken-word recognition. Using eye tracking in two visual world paradigm experiments, we show that fine-grained phonetic information is important for lexical recognition, and that lexical access is dependent on small variations in the signal. The results also show gradient interpretation of ambiguous vowel nasalization despite the phonemic distinction between phonological nasal vowels and coarticulatorily nasalized vowels in Canadian French. Gradience was found when words were ambiguous, and interpretation was more categorical when words were unambiguous. These results support the hypothesis of gradient interpretation of phonetic cues for ambiguously produced stimuli and the storage of coarticulatory information in phono-lexical representations for a language that has a phonological contrast for nasality (i.e., French).


During speech processing, listeners are presented with time-dependent and variable, fine-grained acoustic cues for phoneme and word recognition. These cues include within-category variability and coarticulation, which is the result of the overlap of adjacent sounds’ articulatory movements (Fowler, 1980). These cues were traditionally considered redundant for formal theories of phonetic, phonological, and lexical representations (Keating, 1988). In these frameworks (Archangeli, 1988; Keating, 1988; Steriade, 1995), within-category variability is idiosyncratic, and coarticulatory effects derive from rules (Cohn, 1990), and are not specified in word representations. Consequently, in perception, listeners are not expected to use these fine-grained phonetic details for word recognition. However, experimental evidence in favor of the importance of such fine-grained details has been repeatedly found in psycholinguistic studies (Beddor, McGowan, Boland, Coetzee, & Brasher, 2013; Cross & Joanisse, 2018; Dahan, Magnuson, Tanenhaus, & Hogan, 2001; Desmeules-Trudel, Moore, & Zamuner, 2019; Gow, 2003; McMurray, Clayards, Tanenhaus, & Aslin, 2008; McMurray, Tanenhaus, & Aslin, 2002; Paquette-Smith, Fecher, & Johnson, 2016; Zamuner, Moore, & Desmeules-Trudel, 2016).

For example, in English, speakers begin lowering their velum (an articulatory movement associated with nasalization) early during the production of vowels that are followed by a nasal consonant. This velum-lowering movement has an influence on the acoustic output of vowels that precede nasal consonants, yielding a partially or entirely nasalized vowel in production (Beddor, 2009). However, coarticulatory vowel nasalization has traditionally not been included as a part of phonological or lexical representations in English (Lahiri & Marslen-Wilson, 1991), even though listeners use nasal coarticulation for word recognition (Beddor et al., 2013).

In addition to the actual use of fine-grained phonetic information, researchers have shown that coarticulation and within-category variability are not necessarily perceived categorically (McMurray et al., 2002), as would be expected based on seminal work on voice onset time (VOT) perception (Liberman, Harris, Hoffman, & Griffith, 1957). For example, it has been shown that English adults gradiently pay attention to vowel nasalization (Beddor et al., 2013) when processing spoken words. Specifically, English listeners are faster at recognizing words that contain a nasal consonant (e.g., scent) when the preceding vowel is nasalized early than when the vowel is not (or nasalized later; Beddor et al., 2013). However, while there has been considerable research on listeners’ sensitivity to coarticulatory cues (see references above), most of these studies have focused on English. As a result, data from other language systems and from other variable phonetic cues are underrepresented, which limits our understanding of the general word recognition capacity. Investigating a variety of languages is thus crucial to a better understanding of word processing.

Furthermore, the influence of variability in the realization of phonological contrasts on word recognition has not yet been thoroughly investigated, even less for vowel contrasts (for reports that include consonant contrasts, see Dahan et al., 2001; McMurray et al., 2008; McMurray et al., 2002). It is thus important to investigate participants from a variety of language backgrounds and on a variety of phonetic-detail types to gain a better understanding of the general process of word recognition, and of the influence of fine-grained phonetic cues on higher order processing. Mainly, this is because different languages make use of phonetic cues in different manners, and thus listeners can be expected to use those cues differently.

In this paper, we examine gradient versus categorical processing of vowel nasalization in Canadian French, a language that has contrastive nasal vowels (as in pain [pɛ̃] “bread”) and coarticulatorily nasalized vowels (Desmeules-Trudel & Brunelle, 2018; Léon, 1983)—that is, when oral vowels are followed by a nasal consonant (as in peigne realized [pɛɲ] or [pɛ̃ɲ] “comb,” which can have a partially and optionally nasalized vowel). Nasal vowels can be followed by nasal consonants in French as well (e.g., grand-mère [ɡʁãmɛʁ] “grandmother,” as opposed to grammaire [ɡʁamɛʁ] “grammar”), which clearly demonstrates their contrastive status. Examining a single cue that varies in phonological status (contrastive vs. coarticulatory) within Canadian French enables us to formulate a new set of predictions on how listeners use fine-grained phonetic details during spoken-word recognition, as compared with coarticulatory (i.e., noncontrastive) processing in English. Note that another account of the influence of vowel nasalization on word recognition focused on Bengali, a language that has both phonologically and coarticulatorily nasalized vowels (Lahiri & Marslen-Wilson, 1991), and found evidence for the early use of coarticulatory nasalization in this language. However, Lahiri and Marslen-Wilson (1991) did not directly manipulate duration of nasalization on the vowels, which is crucial to investigate the actual impact of fine-grained variations in nasalization duration on word recognition and if phonological contrasts are categorically or gradiently perceived.

For the current study, we manipulated duration of nasalization, similarly to Beddor et al. (2013). This manipulation enables us to formulate different predictions regarding perceptual patterns (i.e., gradient or categorical) of vowel nasalization in Canadian French compared with English, which is gradient. For example, categorical patterns of perception could be expected in French because listeners have to phonologically discriminate coarticulatorily nasalized vowels (e.g., peigne realized [pɛ̃ɲ]) and contrastive nasal vowels (e.g., pain realized [pɛ̃]). On the other hand, it is possible that French listeners will still use variations in vowel nasalization in a gradient manner similarly to Beddor et al.’s (2013) English participants, since evidence for gradient perception has been found using phonological consonant contrasts (e.g., voicing in English; McMurray et al., 2002), in addition to evidence that listeners are able to pay attention to fine-grained phonetic details (see references above). Our main aim is thus to determine if variation in fine-grained phonetic details pertaining to a vocalic contrast is perceived in a gradient or categorical manner.

Phonetic integration and gradient perception

As mentioned above, one challenge faced by the word-recognition system is the time-dependent aspect of spoken-word recognition, since phonetic cue duration can be the main indicator of a sound category. For example, voice onset time (VOT), a cue that varies in duration, is linked to word-initial consonant voicing in English (Liberman et al., 1957). In English, long-lag VOT duration is associated with voiceless stop consonants, and short-lag VOT duration is associated with voiced consonants. Consequently, in perception, listeners must consider these dynamics before making a decision on the phonological identity of a stop consonant.

Previous research has found support for gradient sensitivity to VOT variations during word recognition (McMurray et al., 2008; McMurray et al., 2002), despite the categorical character of VOT in production (long-lag vs. short-lag). For example, McMurray et al. (2002) showed that native English listeners were sensitive to small variations in VOT values on a continuum between voiced (e.g., /b/, VOT = 0 ms) and voiceless (e.g., /p/, VOT = 40 ms) consonants. In their eye-tracking experiment, listeners were instructed to listen to auditory stimuli composed of minimal pairs (e.g., bear–pear) that had been modified to form nine-step continua between 0 ms and 40 ms in VOT in 5-ms increments. Participants were asked to click on an image corresponding to the word they heard while their eye movements were recorded. McMurray et al. (2002) found that stimuli at the end points of the continuum (0 ms and 40 ms) led to high proportions of fixations to the correct target (high proportions of fixations to the beach for 0 ms VOT, and high proportions of fixations to the peach for 40 ms VOT), but that listeners displayed intermediate proportions of fixations for VOT values nearer to the category boundary. This is an example of gradient sensitivity to a contrast (see also McMurray et al., 2008).
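The continuum described above can be written out explicitly. The following is a minimal sketch in Python; the constant and variable names are illustrative and are not taken from McMurray et al. (2002).

```python
# Nine-step VOT continuum from 0 ms to 40 ms in 5-ms increments,
# as described for McMurray et al. (2002). Names are illustrative.
START_MS, END_MS, STEP_MS = 0, 40, 5

vot_continuum = list(range(START_MS, END_MS + STEP_MS, STEP_MS))

for step, vot in enumerate(vot_continuum, start=1):
    print(f"step {step}: VOT = {vot} ms")
```

The two end points (0 ms and 40 ms) correspond to the unambiguous voiced and voiceless tokens, and the intermediate steps are the ones for which intermediate proportions of fixations were reported.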

Vowel nasalization in (Canadian) French

As mentioned above, Canadian French vowel nasality is a phonologically contrastive property, meaning that words such as pain (/pɛ̃/ “bread”) and paix (/pɛ/ “peace”) constitute minimally different lexical items that are phonologically distinguished based on the nasality of the vowel. There are four contrastive nasal vowels in Canadian French: /ɛ̃, ã, ɔ̃, œ̃/ (Côté, 2012; Martin, 2002; Martin, Beaudoin-Bégin, Goulet, & Roy, 2001), which can be variably realized depending on a number of factors (Carignan, 2013; Delvaux, 2006; Desmeules-Trudel & Brunelle, 2018; Léon, 1983; Martin et al., 2001). For example, Delvaux (2006) and Desmeules-Trudel and Brunelle (2018) found that nasalization of phonological nasal vowels can start at vowel onset or can be delayed (up to 50% of its duration) in Canadian French. In other words, phonological nasal vowels can be nasalized for their entire duration, or they can be only partly nasalized; this varies based on syllable structure, prosodic context, and individual speakers, among other factors. To our knowledge, however, no study has directly investigated how common vowels that are nasalized for the entirety of their duration are, but research suggests that they are variable overall and can be fully nasalized (Desmeules-Trudel & Brunelle, 2018; Léon, 1983). Furthermore, oral vowels followed by a nasal consonant can be (optionally) coarticulatorily nasalized in Canadian French (Desmeules-Trudel & Brunelle, 2018; Léon, 1983).

In addition, Carignan (2013) and Desmeules-Trudel and Brunelle (2018) found that short excrescent nasal consonants, also referred to as nasal appendices, are often present in the realization of phonological nasal vowels. This can lead to the ambiguous realization of a phonological nasal or coarticulatory nasalized vowel, when controlling for other differences in their realization (e.g., oral resonances; Carignan, 2013, 2014). For example, if the vowel in pain [pɛ̃] “bread” is produced with an excrescent nasal appendix (i.e., [pɛ̃ɲ]), it could be confused with the word peigne ([pɛɲ] “comb”), especially given that the vowel in peigne can be slightly nasalized as well. Consequently, it is likely that listeners consider the nasalization timing information (i.e., the moment at which nasalization starts within the vowel) when interpreting nasal vowels in Canadian French.

In the experiments below, we varied the duration of nasalization on vowels (i.e., timing information) in real French words, analogously to McMurray et al.’s (2002) VOT continuum, and we varied the presence or absence of nasal appendices, to investigate how Canadian French listeners interpret words that contain vowels that are early-nasalized (i.e., vowels that are nasalized from onset or shortly after onset, and onwards) and late-nasalized (i.e., vowels that are nasalized late in their duration, until vowel offset). By cross-splicing portions of phonological nasal vowels (i.e., a part or the entirety of the vowel in pain [pɛ̃] “bread”) onto vowels that can be coarticulatorily nasalized (i.e., peigne [pɛɲ] “comb”), we were able to control for small variations in the duration of nasalization. To our knowledge, the possibility of gradient versus categorical perception has never been directly assessed in vowels during word recognition, even though it has been shown that consonants are generally perceived more categorically than vowels in syllables (Fry, Abramson, Eimas, & Liberman, 1962).

Study predictions

Two eye-tracking experiments were designed to test how variations in fine-grained phonetic information on vowels are interpreted during spoken-word recognition (i.e., categorically vs. gradiently). Listeners had to identify spoken words in a Visual World Paradigm (Huettig, Rommers, & Meyer, 2011). Similarly to Beddor et al. (2013) and given the realization of the nasality contrast in Canadian French (Desmeules-Trudel & Brunelle, 2018), it was expected that variation in the duration of nasalization would have an influence on the recognition of spoken words containing oral vowels followed by nasal consonants (CVN) and phonological nasal vowels (CṼ). For example, vowels that are nasalized for 50% or more of their duration are expected to be recognized mostly as phonological nasal vowels, especially when not followed by a nasal appendix. On the other hand, vowels followed by a nasal consonant (VN) in a closed syllable can be nasalized (i.e., coarticulated) for approximately 20% to 25% of their duration towards vowel offset (Desmeules-Trudel & Brunelle, 2018). Therefore, vowels that are nasalized for 20% or less of their duration are expected to be mostly recognized as oral vowels (followed by nasal consonants).

The main goal of this paper was to verify if phonetic information related to vowel nasalization is perceived categorically or gradiently by listeners of Canadian French. On the one hand, it is expected that listeners will display categorical patterns of nasalization perception since this property is phonological in Canadian French. Listeners might not pay attention to within-category variability, as was shown in early experiments on categorical perception with VOT (Liberman et al., 1957), and in studies on perceptual compensation (Beddor & Krakow, 1999; Fowler, 2006). For instance, Beddor and Krakow (1999) showed that in nasal contexts, English and Thai native listeners (partially) compensated for vowel nasalization and attributed the coarticulatory vocalic phonetic cues to a following nasal consonant. In the current paper, if listeners perceive a nasal consonant (see stimuli below) and compensate for vowel nasalization, categorical patterns are expected to emerge.

On the other hand, analogously to Beddor et al.’s (2013) study, Canadian French listeners may look earlier at words that contain a contrastive nasal vowel when the stimulus is nasalized early in its duration than when it is nasalized late, and gradiently fixate more on the CVN word as vowel nasalization starts later on the vowel. This pattern can be expected based on other gradient perception studies (McMurray et al., 2002): listeners dynamically integrate phonetic cues as they unfold in the speech signal and “update” their word choice online (McClelland & Elman, 1986). Earlier nasalization on the vowel could enable listeners to anticipate the identity of the (upcoming) nasal vowel, yielding gradient/continuous patterns of recognition. We used eye tracking because the investigated questions involve the timing of nasalization onset, and eye movements have been shown to be sensitive to such timing information (Beddor et al., 2013; McMurray et al., 2008; McMurray et al., 2002; Salverda, Kleinschmidt, & Tanenhaus, 2014).

Experiment 1

The stimuli in Experiment 1 contained a nasalized vowel and a following nasal excrescent consonant, that is, a short consonant-like nasal sound, in coda position of the word. The choice was made to include a short consonant after the vowel to reflect the phonetic realization of phonological nasal vowels (Ṽ; Desmeules-Trudel & Brunelle, 2018) in addition to variations in nasalization duration. However, phonologically, nasal vowels followed by a full nasal consonant in coda position are banned in French. Consequently, if listeners interpret the excrescent consonant as a full nasal consonant, stimuli can be considered to contain conflicting cues, that is, cues that pertain to both a phonological nasal vowel and an oral vowel followed by a consonant.

Method

Participants

Twenty-three native speakers of Canadian French (17 female, six male), between 18 and 36 years of age (X̅ = 23.3 years, SD = 5.3), were paid or received partial course credit for their participation in the experiment. Thirteen participants were from Ontario, nine from Québec, and one from New Brunswick. All listeners reported having normal hearing, normal or corrected-to-normal vision, and did not report any type or history of language, hearing, or speech impairment. All participants completed a background language questionnaire, and self-reported knowing English as a second language at a moderate to high level of proficiency, as well as other languages. Most speakers were late bilinguals in English (N = 16), and some were early bilinguals in English (N = 7). The early versus late bilingual status was determined based on a criterion of participants having on average 30% or more exposure to English as a second language in one speaking context before 5 years of age (e.g., in the family, in daycare, with peers). Since all participants were bilingual, we do not expect differences to emerge regarding the listeners’ language backgrounds, especially given that nasalization is only coarticulatory in English (or other languages known to the listeners). No speaker had been exposed to another language containing contrastive nasal vowels in its phonological inventory.

Stimuli and experimental conditions

Stimuli were nine triads of monosyllabic, picturable French words containing either a contrastive nasal vowel (CṼ), an oral vowel in a nasalization context (i.e., followed by a nasal consonant; CVN), or an oral vowel followed by an oral consonant (CVC; see Table 1). Note that the place of articulation of the oral consonant in CVC words did not always match the place of articulation of the nasal consonant in CVN words. Three triads contained midfront [ɛ–ɛ̃], three contained low [a–ã], and three contained midback [ɔ–ɔ̃]. Note that the vowels [œ–œ̃] were not included as the functional load of the nasal vowel is very low in the French lexicon (Martin et al., 2001). Furthermore, frequency of the nasal and coarticulated words could not be controlled due to constraints on word choice, and was therefore not analyzed in the current paper. The word list was recorded by five native speakers of Canadian French (two female, three male, between the ages of 23 and 27 years) in order to avoid listener habituation to one speaker during the test phase. The speakers were from different regions in Canada to ensure representation of a variety of Canadian French dialects. The words were embedded into different carrier sentences that matched the target word meaning in order to avoid focus effects and were placed at the end of an intonational phrase in the carrier sentences. Speakers were instructed to read the sentences twice in a natural, but careful way.

Table 1. List of minimal pairs (and fully oral words) that were used as auditory and visual referents

Each word was hand-segmented in Praat (Boersma & Weenink, 2015) and normalized for amplitude at 70 dB. The CṼ and CVN tokens were compared for each speaker, and the most similar tokens were selected. Observation of the acoustic spectrum (absence of nasal antiformants and nasal peaks in the 800–1500 Hz region, though these cues can be variable across speakers for the production of nasalization) and auditory confirmation enabled us to ensure that the first 80% of the CVN token vowels’ duration was not nasalized. This was important since nasalization of this portion of the CVN vowels could compromise the 20N %NasDur condition (see below). Details about the duration of oral and nasalized portions of the stimuli vowels are found in Table 3 of the Appendix.

Each speaker contributed one or two word pairs. For each experimental stimulus, the consonant frame of a CVN token was kept by removing a part or the entirety of the oral vowel. This yielded five experimental conditions, defined by the proportion of the vowel's duration that is nasalized (%NasDur): fully oral vowels (0N), partially nasalized vowels (20N, 50N, 80N), and fully nasal vowels (100N). The %NasDur values are illustrated in Table 2. For 0N, the vowel from the CVN matrix token was removed entirely and replaced by an oral vowel of a CVC token. However, the 0N %NasDur condition was not included in the current analyses, since place of articulation of the consonant following the vowel in the CVC word was not consistently the same as the place of articulation of the CVN word. This is further discussed in the Discussion. For 20N, 50N, and 80N, the duration of the oral vowel was calculated, and 20%, 50%, or 80% of the vowel duration was cut from the original vowel (CVN). The portion of the vowel that was removed from the original token was replaced by a part of the nasal vowel of the matching CṼ token, with splices made at zero crossings in order to avoid unwanted acoustic artifacts such as clicks or noise in the signal (see Fig. 1). Amplitude peak reduplication was performed when necessary to adjust the duration of the vowel. For 100N, the oral vowel of the original token was entirely replaced by the full vowel of the matching CṼ token. Note that the 50N, 80N, and 100N stimuli contain potentially conflicting cues, as they are nasalized for a significant part of their duration (similar to phonological nasal vowels) and also contain an excrescent nasal coda. Final adjustments on fundamental frequency and amplitude were made, and stimulus quality was verified by the first author (a native speaker of Canadian French). Participants from both experiments reported that the stimuli sounded natural, but sometimes ambiguous regarding their meaning.
Fifty ms of the final nasal consonant following the vowel in the CVN word were kept in the stimuli in order to mimic the natural realization of nasal vowels’ appendices in Canadian French connected speech (Desmeules-Trudel, 2015; Desmeules-Trudel & Brunelle, 2018). A total of 45 experimental stimuli were used in the experiment (i.e., five stimuli for each word pair), and data from 36 items were analyzed for the current paper.
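The cross-splicing procedure can be illustrated with a minimal Python sketch. It is not the authors' Praat workflow: the function names and the toy integer waveforms are hypothetical, and the sketch assumes that the nasal portion replaces the end of the oral vowel (consistent with late nasalization running until vowel offset). Duration adjustment by amplitude peak reduplication and the final fundamental frequency and amplitude adjustments are left out.

```python
def zero_crossings(samples):
    """Return indices i where the sign differs between samples[i-1] and samples[i]."""
    return [
        i for i in range(1, len(samples))
        if (samples[i - 1] < 0) != (samples[i] < 0)
    ]


def nearest_zero_crossing(samples, target):
    """Return the zero crossing closest to `target`, or `target` if there is none."""
    crossings = zero_crossings(samples)
    if not crossings:
        return target
    return min(crossings, key=lambda i: abs(i - target))


def cross_splice(oral_vowel, nasal_vowel, nas_prop):
    """Keep the initial oral portion and append the final `nas_prop` of the nasal vowel.

    Assumption: the nasal portion replaces the END of the oral vowel.
    Simplification: nas_prop == 0 returns the oral vowel unchanged, whereas
    the 0N stimuli in the study used the vowel of a CVC token instead.
    """
    if not 0.0 <= nas_prop <= 1.0:
        raise ValueError("nas_prop must be between 0 and 1")
    if nas_prop == 0.0:
        return list(oral_vowel)
    if nas_prop == 1.0:
        return list(nasal_vowel)
    # Snap both cut points to zero crossings to avoid clicks at the splice.
    oral_cut = nearest_zero_crossing(
        oral_vowel, round(len(oral_vowel) * (1 - nas_prop)))
    nasal_cut = nearest_zero_crossing(
        nasal_vowel, round(len(nasal_vowel) * (1 - nas_prop)))
    return list(oral_vowel[:oral_cut]) + list(nasal_vowel[nasal_cut:])


# Toy integer waveforms (hypothetical data, not real recordings).
oral = [1, 2, 1, -1, -2, -1] * 10
nasal = [3, 6, 3, -3, -6, -3] * 10

for label, prop in [("20N", 0.2), ("50N", 0.5), ("80N", 0.8)]:
    stimulus = cross_splice(oral, nasal, prop)
    print(label, len(stimulus))
```

Snapping each cut to the nearest zero crossing means the realized proportion of nasalization can differ slightly from the nominal 20%, 50%, or 80%, which is one reason to report measured durations as in Table 3 of the Appendix.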

Table 2. Summary of the analyzed experimental conditions (%NasDur)—Experiment 1

Visual stimuli corresponded to the words (Table 1) and were taken from the International Picture Naming Project (IPNP) database (Székely et al., 2004), which contains black-and-white pictures of a variety of (English) nouns. Pictures of words that could not be found in the database (12 critical CṼ or CVN items, seven noncritical CVC items, N = 19) were hand drawn by a professional visual artist in the same style as the IPNP images, scanned and saved as JPEG files. All visual stimuli were 5.6 cm × 5.6 cm (220 × 220 pixels), and images corresponding to the triads with one member containing a phonological nasal vowel (CṼ), a coarticulated nasalized vowel (CVN), and an oral vowel (CVC), as well as a distractor image were arranged together on a display. Words corresponding to the distractor image had the same initial consonant as the other stimuli of the triads. Images were embedded within approximately 6 cm × 6 cm (250 × 250 pixels) interest areas for data collection. For example, CṼ pain (“bread” [pɛ̃]), CVN peigne (“comb” [pɛɲ]), CVC pêche (“peach” [pɛʃ]) and filler pluie (“rain” [plɥi]; see Fig. 2) were simultaneously presented on the screen.

Procedure

The experiment was programmed and presented in Experiment Builder (SR Research; Version 1.10.63) and presented with an EyeLink 1000 (SR Research), using a chin rest, monocular recording, and sampling at 500 Hz. The experiment started with a 5-point calibration then validation, keeping the maximum and average errors below 1° of visual angle for all participants. A familiarization phase followed, during which participants saw an image on the screen with the corresponding unambiguous auditory word. Unspliced tokens of the critical CṼ (N = 9) and CVN words (N = 9), CVC words (N = 9), and monosyllabic filler words (N = 43) were presented (total items in familiarization phase, N = 70). The odd number of filler words is due to the fact that some fillers were used simultaneously with the experimental triads (i.e., experimental trials), while other fillers were used on displays containing four filler images (i.e., filler trials). After familiarization, participants were tested on the word-picture association task. Drift correction was performed before each trial, which was experimenter controlled. On each trial, participants saw four images corresponding to the CṼ, CVN, CVC, and filler words for 500 ms, followed by the presentation of an audio token corresponding to an experimental (see Table 2) or filler stimulus. Participants were asked to click on the image that corresponded to the word they heard. This ended the trial and changed the display to the next drift correction. No feedback was provided on the accuracy of participants’ responses, as some stimuli were inherently ambiguous. Image position was pre-randomized across trials. There were 90 critical trials (i.e., each critical stimulus was presented twice), of which 72 critical trials were analyzed in the current paper since the 0N %NasDur condition was removed from the analyses (see Discussion), and 70 filler trials per participant. The experiment lasted approximately 25 to 30 minutes. 
Participants were randomly assigned to one of four experimental lists to counterbalance block order presentation.
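The stimulus and trial counts reported above follow from the design. The short Python check below makes the arithmetic explicit; the constant names are illustrative.

```python
# Design figures taken from the text; names are illustrative.
N_WORD_PAIRS = 9
CONDITIONS = ["0N", "20N", "50N", "80N", "100N"]
ANALYZED_CONDITIONS = [c for c in CONDITIONS if c != "0N"]  # 0N was excluded
REPETITIONS = 2  # each critical stimulus was presented twice

stimuli_total = N_WORD_PAIRS * len(CONDITIONS)
stimuli_analyzed = N_WORD_PAIRS * len(ANALYZED_CONDITIONS)
critical_trials = stimuli_total * REPETITIONS
analyzed_trials = stimuli_analyzed * REPETITIONS

# Familiarization items: CV-nasal, CVN, CVC, and filler words.
familiarization_items = 9 + 9 + 9 + 43

print(stimuli_total, stimuli_analyzed, critical_trials,
      analyzed_trials, familiarization_items)
```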

The collected data included the images that participants clicked on and the fixations to the various images on the display. Due to an error in programming, responses corresponding to the coarticulated image (CVN; e.g., peigne), the oral word (CVC; e.g., pêche), and the unrelated image (e.g., pluie) were all collapsed into the coarticulated (CVN) response category. To overcome this problem, we inferred the chosen image from the eye-tracking record: the chosen image measure presented below is the image that was fixated the longest in the last 1,000 ms of each trial (see Footnote 1). Proportions of fixations to each image (i.e., fixations within the interest areas around the images) were calculated in 50 ms time bins using a Python script provided by SR Research. Analysis will focus on the proportions of fixations to the coarticulated-vowel image (CVN) when listeners chose the CVN item (969 trials), and on the proportions of fixations to the contrastive-nasal-vowel image (CṼ) when listeners chose the CṼ word (526 trials) to uncover categorical or gradient patterns of perception depending on proportion of nasalization conditions.
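The chosen image measure can be sketched as follows. This is not the SR Research script mentioned above; the function name, the tuple layout of the fixation records, and the example data are hypothetical. The sketch also assumes that "fixated the longest" means the largest summed dwell time within the final 1,000 ms, which the text does not spell out.

```python
from collections import defaultdict

WINDOW_MS = 1000  # final portion of the trial used to infer the choice


def infer_chosen_image(fixations, trial_end_ms, window_ms=WINDOW_MS):
    """Return the interest area with the most dwell time in the final window.

    `fixations` is a list of (start_ms, end_ms, area_label) tuples, with
    area_label set to None for fixations outside every interest area.
    Returns None when no interest area was fixated in the window.
    """
    window_start = trial_end_ms - window_ms
    dwell = defaultdict(float)
    for start, end, area in fixations:
        if area is None:
            continue
        # Count only the part of the fixation that falls inside the window.
        overlap = min(end, trial_end_ms) - max(start, window_start)
        if overlap > 0:
            dwell[area] += overlap
    if not dwell:
        return None
    # Ties go to the area that entered the window first.
    return max(dwell, key=dwell.get)


# Hypothetical trial lasting 2,000 ms.
example_trial = [
    (0, 400, "CVC"),
    (400, 1300, "CVN"),
    (1300, 1900, "CVnasal"),
    (1900, 2000, None),
]
print(infer_chosen_image(example_trial, trial_end_ms=2000))
```

In the example, only 300 ms of the CVN fixation falls inside the final window, against 600 ms for the contrastive-nasal image, so the latter is returned as the inferred choice.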

Analysis and variables

Both the chosen images and proportions of fixations were modeled using (separate) generalized additive mixed models (GAMMs; Wood, 2017). This statistical technique allows one to analyze factorial and/or gradient predictors on potentially nonlinear data, which is often the case for eye movements. Random effects structure (intercepts, slopes, and nonlinear smooth terms) can be added to the model. Also, autocorrelation of time series data is considered (Baayen, van Rij, de Cat, & Wood, 2018), which is necessary for time-dependent data such as eye tracking because one data point is correlated in time to the preceding point. Finally, GAMMs can handle unbalanced data (i.e., missing data points), which is common in eye tracking as participants are free to fixate outside of the predetermined interest areas during a trial. GAMMs have successfully been used in past research to analyze visual world eye movements (Porretta, Kyröläinen, van Rij, & Järvikivi, 2017; Porretta, Tucker, & Järvikivi, 2016; van Rij, Hollebrandse, & Hendriks, 2016). Also note that GAMMs require visual inspection of model estimates to interpret (non-)significance of factors, especially since “nonlinear trends are difficult to capture with a single parameter” (Porretta et al., 2017, p. 270). The mgcv (Version 1.8-16; Wood, 2017), itsadug (Version 2.2; van Rij, Wieling, Baayen, & van Rijn, 2016), and ggplot2 (Version 2.2.1; Wickham, 2009) packages were used for analysis and visualization in R (R Core Team, 2017; Version 3.4.2).

For the GAMM analysis of chosen images, models included %NasDur as the main factor of interest to assess if participants’ responses were gradient or categorical, outputting estimates of the probability (in log odds) of giving a CṼ response for each condition. Random intercepts by participant and item were also included. The p values of parametric coefficients represent a significant difference of one level to the baseline. By using the 50N %NasDur condition as a baseline, we were able to detect differences between this latter %NasDur value and the other ones—significant differences between 50N and multiple other levels of the same factor (i.e., significant differences between 20N–50N, 50N–80N and 50N–100N) suggest gradient use of vowel nasalization variations, while nonsignificant differences between 50N–80N and 50N–100N would suggest more categorical patterns of perception. For example, if the probability of giving a nasal (CṼ) response is significantly lower in the 20N %NasDur condition than in the 50N %NasDur condition, and significantly higher in the 80N than the 50N %NasDur condition, we can conclude that listeners gradiently interpreted nasalization timing variations on vowels. On the other hand, if one or more levels are not significantly different from the 50N %NasDur baseline condition, this would suggest more categorical patterns of interpretation overall. 50N %NasDur was chosen as a baseline since the boundary between contrastive nasal vowels and coarticulatorily nasalized vowels in production is between 20% and 50% of the vowel duration (see airflow evidence in Desmeules-Trudel & Brunelle, 2018). 
Additional factors of interest (vowel quality, trial number), which do not directly pertain to processing of fine-grained phonetic detail, were also assessed by adding each factor to the baseline model (which only contained %NasDur and random intercepts), and performing chi-square tests between the baseline and the more complex model using the compareML() function from the itsadug package. This procedure provided chi-square scores and p values, which enabled us to assess the contribution of each factor to model fit. The results of this second part of the analysis are presented in the online Supplementary Materials associated with the current paper.

For the analysis of proportions of fixations to the CṼ image (when participants chose the CṼ image) and to the CVN image (when participants chose the CVN image), empirical logits (Barr, 2008) were used as the input variable to the model. We decided to analyze the fixations to the CṼ and CVN images separately since our stimuli were inherently ambiguous and a constant target could not be chosen across the entire experiment. Thus, we split the data into two data frames: one for trials in which participants chose the CṼ image and one for trials in which they chose the CVN image. This provided a proper “target” for analysis, and therefore enabled us to carefully assess the gradient versus categorical patterns of fixations to the chosen image.
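As an illustration of the empirical logit transformation (Barr, 2008), the following minimal Python sketch computes the empirical logit and its associated variance for a single time bin. The bin size (10 samples) and the counts are hypothetical and are not taken from the present data set; the function names are ours.

```python
import math

def empirical_logit(y, n):
    """Empirical logit of y fixation samples on the image out of n samples in a bin."""
    if not 0 <= y <= n:
        raise ValueError("y must satisfy 0 <= y <= n")
    # Adding 0.5 to both counts keeps the value finite when y == 0 or y == n
    return math.log((y + 0.5) / (n - y + 0.5))

def empirical_logit_variance(y, n):
    """Approximate variance of the empirical logit (its inverse can serve as a weight)."""
    return 1.0 / (y + 0.5) + 1.0 / (n - y + 0.5)

# Hypothetical bins of 10 samples with 0, 5, and 10 samples on the chosen image
for y in (0, 5, 10):
    print(y, round(empirical_logit(y, 10), 3), round(empirical_logit_variance(y, 10), 3))
```

Unlike a raw log odds, the empirical logit remains defined for bins in which the image was fixated during all or none of the samples.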

Similar to the analysis of chosen images, the fixation analysis was twofold. First, to assess the gradient versus categorical perception of vowel nasalization timing variability, a “simple” model was fitted to the eye-tracking data, which included parametric %NasDur condition, time window of analysis (1,000 ms, starting 200 ms after vowel onset to account for delay in eye-movement planning; Fischer, 1992), an interaction between %NasDur and time window, and random intercepts for each individual trial. Also note that the time window of analysis was selected semi-arbitrarily based on the observation of raw data across experiments: participants seem to have made their choice on the word identity at around 1,200 ms, as suggested by the higher proportions of fixations at that point in time. The p values of parametric coefficients indicate whether a given level differs significantly from the baseline, and the p values of smooth terms indicate whether the smooth is significantly different from zero. In the body of the text, we report the model estimates as well as a graphical representation of the parametric estimates of this model.
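The selection of the analysis window can be sketched as follows in Python. This is only an illustration under stated assumptions: the data layout (pairs of time and fixation value), the function name, and the treatment of the window as inclusive at its start and exclusive at its end are ours and are not specified in the text.

```python
def select_analysis_window(samples, start_ms=200, length_ms=1000):
    """Keep samples inside the analysis window and re-zero their time stamps.

    samples: list of (time_ms, value) pairs, with time relative to vowel onset.
    Returns pairs whose time is relative to the start of the window.
    """
    end_ms = start_ms + length_ms
    # Assumption: the window is [start_ms, end_ms), i.e., 200 ms up to (not including) 1,200 ms
    return [(t - start_ms, v) for t, v in samples if start_ms <= t < end_ms]

# Hypothetical trial: (ms since vowel onset, 1 if fixating the chosen image, else 0)
trial = [(0, 0), (100, 0), (200, 0), (700, 1), (1199, 1), (1200, 1)]
print(select_analysis_window(trial))
```

After this step, 0 ms corresponds to the beginning of the analysis window, that is, 200 ms after vowel onset.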

Second, to assess the impact of additional factors on the participants’ eye movements, each additional factor of interest (vowel quality, trial number, and interactions through time) was tested against the baseline model one at a time using the compareML() function from the itsadug package, similarly to the response analysis. Note that for testing the interactions (e.g., time × trial), the single factor (e.g., trial) was also included in the model to offer a baseline for the interaction term. In the Supplementary Materials, we report chi-square scores, the difference in chi-square scores between the baseline and tested models, p values that indicate significance, and Akaike information criterion difference values, an indication of quality of model fit.

Results

Chosen images

Figure 3 depicts the model predictions, outputting the probability (in log odds) of choosing the CṼ image in Experiment 1, with 50N %NasDur (134/380 CṼ image choices in this %NasDur condition, 35.3%) as a baseline against which the other levels of the factor were tested. The actual output of the GAMM model, as generated by the summary() function in the mgcv (Wood, 2017) package, is presented in Table 4 of the Appendix. In the figure, we observe that 20N %NasDur (61/371 CṼ image choices in this %NasDur condition, 16.4%) yielded significantly lower probability of choosing the CṼ image than 50N. On the other hand, both the 80N (152/368 CṼ image choices in this %NasDur condition, 41.3%) and 100N %NasDur (180/376 CṼ image choices in this %NasDur condition, 47.9%) values yielded significantly higher probability of choosing the CṼ image than in the 50N %NasDur condition. This is consistent with the gradient perception hypothesis, which predicts that listeners will display gradiently increasing response patterns as the phonetic properties of the segment get closer to the canonical realization of the phoneme. For instance, contrastive nasal vowels in Canadian French are expected to be nasalized for a longer period (Desmeules-Trudel & Brunelle, 2018); therefore, listeners are expected to give more CṼ responses if the vowel is actually nasalized for a longer period. This is also consistent with Beddor et al.’s (2013) results in English, which also support gradient perception of (coarticulatory) nasalization.
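For reference, the raw proportions reported above can be converted to the log-odds scale with a few lines of Python. This sketch uses only the counts given in the text; the resulting raw log odds ignore the random intercepts by participant and item and are therefore not the GAMM estimates shown in Figure 3 or Table 4.

```python
import math

# CṼ image choices and total responses per %NasDur condition, as reported in the text
counts = {"20N": (61, 371), "50N": (134, 380), "80N": (152, 368), "100N": (180, 376)}

def log_odds(chosen, total):
    """Raw log odds of a CṼ choice, ln(p / (1 - p)), computed from counts."""
    return math.log(chosen / (total - chosen))

baseline = log_odds(*counts["50N"])  # 50N is the reference level
summary = {}
for condition, (chosen, total) in counts.items():
    summary[condition] = {
        "percent": round(100 * chosen / total, 1),
        "log_odds": log_odds(chosen, total),
        "diff_from_50N": log_odds(chosen, total) - baseline,
    }
    print(condition, summary[condition])
```

A negative difference from 50N (as for 20N) and positive differences (as for 80N and 100N) correspond to the direction of the effects described above.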

Eye movements

This subsection presents the eye-tracking results of Experiment 1, with the proportions of fixations to the nasal (CṼ) image when listeners chose the nasal image, and proportions of fixations to the coarticulated (CVN) image when they chose the coarticulated image between 200 (i.e., delay for programming an eye movement; Fischer, 1992) and 1,200 ms after vowel onset (between the dashed lines in Fig. 4a–b).

Observing the proportions of fixations in Fig. 4a–b, both fixations to the CṼ image for CṼ image choices and fixations to CVN for CVN image choices are higher than chance (dotted line at 25%; four images on the screen) shortly after the eye-movement programming delay. This suggests that listeners chose which image the auditory stimulus corresponded to shortly after the stimulus vowel onset. In Fig. 4a, although the error bars for the various conditions overlap across the whole window of analysis (and thereafter), the 20N %NasDur condition yields slightly lower proportions of fixations than the other conditions between 550 ms and 800 ms (see Fig. 4c), similarly to the 100N condition in Fig. 4b between 500 ms and 1,000 ms (see Fig. 4d). This suggests that when the vowel was nasalized for a short period of time (20N %NasDur), participants fixated slightly less on the nasal image during a short period of time even though they chose the CṼ image as their final choice, similar to the 100N %NasDur–CVN image combination. On the other hand, in Fig. 4a, we see that the 100N %NasDur value yielded higher proportions of fixations to the CṼ image early on (200 ms to 650 ms) when participants chose the CṼ image. This is expected, since a vowel that is nasalized early corresponds to this image choice (except for when listeners interpreted the appendix as a realized nasal consonant).

Results of the GAMM analysis of fixations to the CṼ image (CṼ image choices) and to the CVN image (CVN image choices) are presented in Fig. 5a and b, respectively (parametric %NasDur factor), and Table 5 and Table 6, respectively, in the Appendix. Let us recall that, like the image choice analysis, the first part of the analysis focused on the parametric %NasDur factor to assess gradient or categorical patterns of fixations, presented here in Fig. 5. The p values of parametric factors lower than .05 suggest a significant difference from the (50N %NasDur) baseline. The p values of smooth factors lower than .05 in the tables suggest that the smooth term is significantly different from zero, suggesting statistical difference between two or more time bins through time (e.g., if proportions of fixations are significantly lower at 200 ms than at 1,000 ms, the term will be significant). The time factor only includes the window of analysis, taking into account the eye-movement planning delay (0 ms represents the beginning of the analysis window, 200 ms after vowel onset), and the autocorrelation AR1 value was set to 0.745 based on the data. This value corresponds to the average correlation between one data point and the preceding one across the data on the time dimension.
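To make the notion of a lag-1 autocorrelation concrete, the following Python sketch computes the correlation between each point of a time series and the point that precedes it. It is a generic illustration only: the series are hypothetical, and the exact procedure used to obtain the 0.745 value is not detailed here, so this should not be read as a reproduction of that computation.

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: covariance of x[t] with x[t-1], scaled by the total variance."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    return num / denom

# Hypothetical series: a steadily rising one and a strictly alternating one
print(lag1_autocorrelation([1, 2, 3, 4, 5]))
print(lag1_autocorrelation([1, -1, 1, -1]))
```

Values close to 1 indicate that consecutive time bins carry largely redundant information, which is why the dependency has to be modeled rather than ignored.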

Figure 5a shows that no level of the %NasDur factor significantly differed from the 50N baseline overall (i.e., without considering the time factor), except for a trend towards significance for the 50N–100N %NasDur (p = .0764), for the fixations to the nasal (CṼ) image when participants chose the nasal (CṼ) image. In this case, fixations to the nasal (CṼ) image are slightly higher in the 100N than in the 50N %NasDur. This result is expected, but does not support gradience in fixations to the nasal image for nasal image choices. However, this result does not convincingly support categorical use of variations in %NasDur for the recognition of nasal (CṼ) words either. Rather, it seems that when listeners chose the nasal (CṼ) image, they did not pay close attention to fine-grained variations in %NasDur.

On the other hand, in Fig. 5b, overall fixations to the coarticulated (CVN) image (CVN image choices) in the (baseline) 50N %NasDur condition are significantly different from overall fixations in the 100N %NasDur condition. Overall fixations in the 100N %NasDur condition are lower than in the 50N %NasDur condition, which is expected based on the realization of vowels in CVN words in production (Desmeules-Trudel & Brunelle, 2018). The other %NasDur conditions (20N and 80N) did not yield significant differences with the baseline 50N condition. In the case of fixations to CVN words for CVN image choices, listeners categorically interpreted the words, as expected based on the predictions of categorical perception.

Discussion

Our results indicate that variations in the duration of nasalization (%NasDur) have a significant influence on the recognition of phonological nasal (CṼ) and coarticulated nasalized (CVN) vowels by listeners of Canadian French. This is consistent with Beddor et al.’s (2013) finding that (English) listeners are sensitive to the timing of nasalization onset on a vowel. For example, they found that listeners are slower at recognizing words that contain a nasal consonant (e.g., scent) when vowels are nasalized late in their duration than when they are nasalized early. This suggests that they are highly sensitive to nasalization timing information on the vowel. Here, participants’ probability of choosing the nasal (CṼ) image significantly varied across %NasDur values, which was expected based on acoustic analyses of the production of these vowels (Desmeules-Trudel & Brunelle, 2018). Importantly, the significant differences between the (baseline) vowels that were nasalized for 50% of their durations and all the other levels of the %NasDur factor suggest that listeners gradiently used variations in nasalization duration to identify the spoken words—both the observation of the patterns (i.e., constant increase in probability of choosing the CṼ image as %NasDur values increase) and results of the statistical analysis support this gradience hypothesis. This pattern was found for vowels that were nasalized for a relatively long portion of their duration (i.e., 50% or more) and contained an excrescent (short) nasal coda, two cues that can be considered conflicting within the same word.

On the other hand, the analysis of proportions of fixations did not provide firm evidence in favor of either gradient or categorical patterns of perception. In the analysis of fixations to the CṼ image, we found a trend towards significance between the baseline 50N and 100N %NasDur. Fixations to the nasal CṼ image are higher in the 100N condition than in the 50N condition. However, when participants chose the CVN image, they fixated on the coarticulated image significantly less in the 100N condition than in the 50N condition. This suggests that, depending on the interpretation of the spoken word, the actual duration of nasalization yielded different patterns of fixations. Taken individually, results of the individual statistical models support more categorical patterns of perception. However, more generally, the proportions of fixations to each image depend on fine-grained phonetic details. Specifically, when the duration of nasalization canonically corresponds to a nasal vowel (i.e., 100N %NasDur) and participants interpret it as such, they fixate on the target more than when the vowel is more ambiguous (i.e., partly nasalized). However, when they interpret the same fully nasalized (100N) vowel as a CVN token, they fixate on the target significantly less than when the vowel was only partly nasalized. Note that in the latter 100N–CVN pairing, the stimulus contains “mismatching” cues (i.e., a vowel that corresponds to a contrastive nasal vowel and a short nasal consonant in an isolated word).

In summary, image choice data provide clear support in favor of gradient interpretation of spoken words based on our %NasDur continuum for stimuli that contain conflicting phonetic cues (i.e., long nasalization on the vowel and excrescent nasal coda). On the other hand, proportions of fixations did not reveal a clear pattern. Taken individually, nasal (CṼ) image choices and coarticulated (CVN) image choices suggested more categorical patterns of interpretation. However, the 100N %NasDur condition behaved differently depending on the image choice, which in turn suggests variability in how phonetic detail is interpreted. Although still not a decisive gradient pattern, the data do not support a strict categorical perceptual pattern either. The key to the current finding, however, could be that listeners had difficulties interpreting the stimuli that were (sometimes) ambiguous (see below). Further discussion of the link between word ambiguity and gradience will be provided in the General Discussion.

It is also important to recall that (oral) vowels followed by a nasal consonant are only optionally nasalized in Canadian French. The stimuli that were analyzed here were all nasalized, therefore not reflecting the entire spectrum of possible phonetic realizations of these vowels. Participants in the experiment were also presented with vowels that were not nasalized at all (i.e., 0N %NasDur condition), but these stimuli were excluded from the analysis for two main reasons. Firstly, all these stimuli were expected to be categorized as coarticulated (CVN) words, which could create a ceiling effect and “artificially” increase the number of CVN responses. This would not be a problem per se, but would also not contribute to the analysis of nasal (CṼ) image choices and to the influence of vowel nasalization on coarticulatorily nasalized and contrastive nasal vowels. Secondly, the splicing procedure implied pasting the vowel from a CVC word (i.e., a word that did not contain a nasal consonant) into a CVN word matrix. However, due to limitations for word choice in the French lexicon, the final consonants of the CVC words did not all have the same place of articulation as the CVN words. Including these stimuli in the analysis could create some additional interference in the participants’ responses and eye movements. Exposure to these 0N stimuli could also lead to a “learning” problem, meaning that listeners’ responses could be influenced by the presence of misleading, and potentially unreliable, phonetic cues for responding to the analyzed stimuli over the course of the experiment, and that this influence could extend to all heard stimuli. However, results for the unambiguous 20N %NasDur condition (a short-nasalized vowel and a short nasal consonant/appendix are not conflicting cues within a word) are already at floor (i.e., 16.4% CṼ image choices in this 20N %NasDur condition). This suggests that listeners did interpret 20N stimuli as CVN words, and that the presence of other ambiguous stimuli did not impact their performance in this specific %NasDur condition. Furthermore, a subanalysis of the first four (out of 10) blocks of the experiment is presented in the Supplementary Materials, evaluating the image choice patterns and the effect of %NasDur early in the experiment, before learning could have occurred. A discussion of the potential influence of stimulus habituation over the course of the experiment and its impact on recognition is also found in the Supplementary Materials. Based on the results in the 20N %NasDur condition and the early emergence of the %NasDur effect during the experiment, we can thus reject the idea that listeners learned not to pay attention to the variations in nasalization duration over the course of the experiment, and reiterate the support for the gradient perception hypothesis when stimuli contained conflicting phonetic cues.

Finally, as mentioned above, the stimuli that were used in Experiment 1 were somewhat ambiguous, as some of the stimuli (e.g., 80N and 100N) had contradicting cues: a long-nasalized vowel and a short nasal consonant, which are not a possible combination for isolated words in Canadian French if the consonant is interpreted as a full segment, even though the nasal appendix is pervasive in connected speech (Desmeules-Trudel, 2015; Desmeules-Trudel & Brunelle, 2018). However, in general, isolated words do not vary as much (Farnetani & Recasens, 2010), and it has never been shown, to our knowledge, that word-final nasal vowels in (Canadian) French have a nasal appendix. In order to investigate if the nasal appendix had a significant influence on perceptual patterns in addition to variations in nasalization duration, a second experiment was conducted with another group of L1 speakers of Canadian French. We used different stimuli to verify if the effects that were found in Experiment 1 also apply when stimuli do not have a word-final nasal appendix.

Experiment 2

In Experiment 1, the words were presented in isolation for the recognition task. However, it is expected that nasal (CṼ) and coarticulated (CVN) vowels are realized differently and more consistently in isolation than when they are produced in connected speech (Farnetani & Recasens, 2010). For instance, in Experiment 1, it is likely that listeners expected phonological nasal (Ṽ) vowels not to be followed by a nasal appendix in more careful speech. Stimuli from Experiment 1 were thus modified in Experiment 2 by removing the final nasal consonantal appendix. This allowed us to test whether “unambiguous” words (i.e., that do not have a nasal appendix) are processed differently than “ambiguous” ones (i.e., that contain both a nasalized vowel and a final nasal consonant). It also gave us a more thorough idea of how phonological nasal (Ṽ) and coarticulated nasalized (VN) vowels are processed and eventually recognized by listeners of Canadian French, and of how phonetic information is interpreted by the spoken-word recognition system.