Production study
Acoustic (MK 224 microphone and MV 181A preamplifier, Cirrus Research, Hunmanby, UK) and electroglottographic (EGG; EG2-PCX, Glottal Enterprises, Syracuse, NY, USA) signals of EP and Italian C1V1C2V2 clusters in a carrier sentence (EP, Diga C1V1C2V2 outra vez/'Say C1V1C2V2 again'; Italian, Dite C1V1C2V2 ogni dì/'Say C1V1C2V2 once more') were recorded (PMD 671 Solid State Recorder, Marantz, Eindhoven, The Netherlands; 16 bit, 48 kHz) in a soundproof environment. As the carrier sentences show, the vowels preceding and following the target C1V1C2V2 items were very similar in the two languages, excluding influences from differing vowel or consonant contexts. Both the velar stops /k ɡ/ and the four vowel contexts /i e o a/ were pairwise identical within each target item (e.g. 'kaka' or 'gigi'), sentence stress was placed on the C1V1C2V2 pseudoword, and lexical stress fell on its first syllable. We thus obtained two different consonant positions (intervocalic word-initial and medial). For both languages, each item was repeated nine times in randomised order by six native speakers with university education. The EP speakers were from Central Portugal (mean age 25 years; recorded at the Speech, Language and Hearing Laboratory (SLHlab), University of Aveiro, Portugal), and the Italian speakers were from the Veneto region in Northern Italy (mean age 40 years; recorded at the Istituto di Scienze e Tecnologie della Cognizione (ISTC), Padova, Italy). Speech rate was held constant during all recordings, and a trained phonetician supervised the correct realisation of all items in both languages. One trained phonetician manually labelled the following landmarks:
- Onset and offset of the neutral lead-in vowel (preceding the target word)
- Onset of the first stop (C1) burst, if present
- Onset and offset of the first target vowel (V1)
- Onset of the second stop (C2) burst, if present
- Onset and offset of the second target vowel (V2)
Similar to [12], we computed the time-dependent voicing status of all stops, sampled at 10 equidistant points throughout the complete stop duration. The first point was set to the beginning of the stop closure (i.e. the offset of the preceding vowel), and the 10th point to the onset of the following vowel. The voicing status at each point was determined with the automatic voicing detection of Praat 5.2 [17] (AC pitch extraction algorithm with the settings voiceless decision = 0.55 and silence threshold = 0.1) and then manually checked against the synchronised EGG signals.
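The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the function and variable names are invented here, and `voiced_at` is a stand-in for the Praat-based frame-level voicing decision.

```python
# Sketch: sample the voicing status at 10 equidistant points across a stop
# closure, from the preceding vowel offset (point 1) to the following vowel
# onset (point 10). In the study the per-frame decision came from Praat's
# AC pitch tracker; here `voiced_at` is an arbitrary stand-in predicate.

def closure_sample_times(closure_onset, closure_offset, n_points=10):
    """Return n_points equidistant times spanning the closure, inclusive."""
    step = (closure_offset - closure_onset) / (n_points - 1)
    return [closure_onset + i * step for i in range(n_points)]

def voicing_profile(closure_onset, closure_offset, voiced_at, n_points=10):
    """Binary voicing status (1 = voiced) at each of the n_points landmarks."""
    return [int(voiced_at(t))
            for t in closure_sample_times(closure_onset, closure_offset, n_points)]

# Toy example: voicing dies away 40 ms into a 100-ms closure
# (closure from 250 ms to 350 ms, voicing ceases at 290 ms).
profile = voicing_profile(0.250, 0.350, lambda t: t < 0.290)
```

With these toy values, the first four landmarks fall before 290 ms and are classed as voiced, the remaining six as devoiced.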
Perception study
Thirty-two native EP participants from Central Portugal and 10 native Italian participants from the Veneto region listened to the stimuli in a soundproof room (EP, SLHlab Aveiro; Italian, ISTC Padova). They received neither course credit nor financial compensation for their participation, and none reported speech or hearing problems. Open headphones with a linear frequency response (Sennheiser HD 600, Wedemark Wennebostel, Germany) were connected to the internal headphone output of a notebook computer (no other processes running and all networking interfaces disabled). Listeners' responses were collected via mouse clicks on two on-screen buttons. Stimulus loudness was held constant across all listeners at a comfortable level. The stimuli were presented at a sampling frequency of 48,000 Hz at 16 bits.
The speech material generated for the perceptual experiments (extensively described in [18]) consisted of biomechanically modelled VCV stimuli [19] acoustically synthesised frame by frame with a parametric model of the vocal tract [20] driven by a three-mass vocal fold model [21]. Biomechanical modelling was used rather than, for example, a Klatt synthesiser because biomechanical models generate physically realistic trajectories between consecutive phonemes: articulatory trajectories are not linearly interpolated, as is normally the case with other synthesis approaches. Research on trajectories has shown that the characteristics of curved paths are explained by anatomical factors and muscle mechanics, for arm movements [22–24] as well as for speech movements [19, 25, 26]. The main advantage of the biomechanical model is therefore that all obtained tongue movements, trajectories and phoneme targets are comparable to natural speech (see, for example, the modelling of articulatory loops), while relevant temporal and glottal source parameters can still be manipulated independently without sacrificing articulatory realism. Biomechanical modelling is thus the best compromise for generating highly realistic perceptual stimuli with independent control over parameters such as duration, transitions and targets, and without the risk of hidden, uncontrollable perceptual cues that manipulated natural speech would carry.
Figure 1 shows the comparison of synthesised waveforms and spectrograms of the generated /aɡa/ stimulus (top) with the /aɡa/ item as produced by an EP speaker (bottom).
Three factors known to influence the perception of stop voicing were examined: stop duration, contextual vowel duration and voicing maintenance during stop closure. Each factor was laid out as a continuum with several levels and was combined with all levels of the other factors (i.e. a fully crossed, non-adaptive design). The endpoints of the continua were determined from the values of the speech production study described above:
1. Stop duration: the mean durations (rounded to the nearest 10 ms) of the voiceless and phonologically voiced velar stops /k ɡ/ in the vowel contexts /a o/ served as the limits of the stop duration continuum, i.e. 100 ms (mean of the voiced stop) and 150 ms (mean of the voiceless stop). One intermediate value (125 ms) was added.
2. Vowel duration: the mean durations (rounded to the nearest 10 ms) of the vowels /a o/ preceding the voiceless and voiced velar stops /k ɡ/ served as the limits of the vowel duration continuum, i.e. 70 ms (mean vowel duration before the voiceless velar stop) and 130 ms (mean vowel duration before the voiced velar stop). One intermediate value (100 ms) was added.
3. Voicing maintenance: the voicing maintenance continuum was bounded by the endpoints fully voiced and fully devoiced/voiceless. Five intermediate conditions (12.5%, 25%, 37.5%, 50% and 75%) defined the point in the closure at which stop voicing ceases (and remains absent until closure offset). The unequal step sizes follow from the hypothesis that perceptual differences would be smaller towards higher voicing percentages; smaller steps in that region were therefore omitted to reduce the total number of stimuli. Note that the fully devoiced condition involves different underlying control mechanisms than the voiceless condition [27, 28], although the acoustic result, i.e. the absence of voicing during closure, is identical in both.
A three-factor design with 3 × 3 × 7 levels of the corresponding continua was thus used in the perception experiment (see Table 1): all combinations of contextual vowel duration (70, 100 and 130 ms), stop duration (100, 125 and 150 ms) and voicing maintenance (0%, 12.5%, 25%, 37.5%, 50%, 75% and 100%) were tested. The experiment was run in two vowel conditions (/a/ and /o/). Each listener heard five repetitions of the complete stimulus set in randomised order, giving a total of 630 stimuli (three vowel durations × three closure durations × seven voicing maintenance conditions × two vowel identities × five repetitions). The task took on average 20 min and was preceded by a practice session of 25 stimuli. Listeners were informed that they would hear synthetic VCV items; their task was to identify the consonant as /ɡ/ or /k/ (forced choice). Listeners were asked to respond as quickly and accurately as possible; stimulus repetition was not possible. The open-source software Alvin v1.27 [29] was used for stimulus and visual presentation. The screen for the identification task showed two response buttons (labelled g and k) at identical distances from a next button at the screen centre. After selecting a response, listeners had to click the next button to proceed, which placed the cursor at the exact centre of the screen before the next stimulus presentation (guaranteeing identical distances to the two response options). All button labels and accompanying text were in Portuguese for the EP listeners and in Italian for the Italian listeners, so as not to interfere with the listeners' internal language representation.
The placement of all buttons was rotated by 180° for half of the participants, counterbalancing biases of horizontal movement and listener preference.
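The fully crossed, non-adaptive design can be enumerated as in the following sketch. Variable names are illustrative; the factor levels are those given in the text.

```python
import itertools
import random

# Sketch: enumerate every cell of the fully crossed design and shuffle
# the resulting playlist for randomised presentation.

vowel_durations = [70, 100, 130]                        # ms
stop_durations = [100, 125, 150]                        # ms
voicing_maintenance = [0, 12.5, 25, 37.5, 50, 75, 100]  # % of closure voiced
vowels = ["a", "o"]
repetitions = range(5)

playlist = list(itertools.product(vowel_durations, stop_durations,
                                  voicing_maintenance, vowels, repetitions))
random.shuffle(playlist)  # randomised presentation order per listener
```

This yields 126 distinct stimuli (3 × 3 × 7 × 2) and, with five repetitions, the 630 presentations per listener reported above.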
Table 1
Three fully crossed factors for the perceptual study: vowel duration, stop duration and percentage of stop voicing maintenance
Statistical analysis
Production study
For the statistical validation of the (production) voicing patterns, we examined the three central landmarks (points 5, 6 and 7) as the dependent variables. The obstruent onset (point 1) and offset (point 10) are voiced by definition (vowel formant offset and onset, respectively), whereas the more central points are valid representatives for examining significant voicing differences during stop closure. To analyse the devoicing behaviour at these landmarks statistically, a series of mixed-effects logit models (function lmer [30]) was run in the R environment [31]. The logit models are based on binomial distributions (z-scores; generalised linear mixed model, GLMM), which allows the modelling of binary decisions [32, 33], since a binary voicing decision (voiced vs voiceless/devoiced) is obtained at each of the 10 consecutive landmarks (points 1 to 10). The devoicing occurrences at the three central points of the phonologically voiced velar stop /ɡ/ (dependent variables) were tested at a significance threshold of p < 0.05 for effects of the factors language (EP, Italian), consonant position (initial, medial) and vowel context (/i e o a/), and their interactions. All numerical fixed factors were centred (z-transformation). Speaker was included as a random factor.
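The z-transformation applied to the numerical fixed factors can be sketched as follows. This is a stand-in for R's scale(); the GLMM fit itself (lme4) is not reproduced here, and the example values are invented.

```python
import math

# Sketch: centre a numerical predictor to mean 0 and scale it to unit
# (sample) standard deviation, as done for the numerical fixed factors
# before model fitting.

def z_transform(values):
    """Return the values centred (mean 0) and scaled to unit sample SD."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Toy predictor: stop durations across six hypothetical observations.
z = z_transform([100, 125, 150, 100, 125, 150])
```

Centring the predictors this way puts the model's intercept at the grand mean of each continuum and makes the fixed-effect coefficients comparable across factors with different units.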
Perception study
A series of mixed-effects logit models was run to analyse the listeners' response patterns statistically. Again, logit models based on binomial distributions (z-scores; generalised linear mixed model) were used to model the binary decisions [33] in the listener responses (i.e. the listeners' /ɡ/ or /k/ response). The dependent variable was the listener response; fixed factors were language (EP, Italian), stop duration (100, 125 and 150 ms), contextual vowel duration (70, 100 and 130 ms) and voicing maintenance percentage during stop closure (0%, 12.5%, 25%, 37.5%, 50%, 75%, 100%), and their interactions. All numerical fixed factors were centred (z-transformation). Listener was included as a random factor.