Assessing segmentation processes by click detection: online measure of statistical learning, or simple interference?

Franco, Ana; Gaillard, Vinciane; Cleeremans, Axel; Destrebecqz, Arnaud

doi:10.3758/s13428-014-0548-x

Published: 17 December 2014

Volume 47, pages 1393–1403 (2015)

Behavior Research Methods

Abstract

Statistical learning can be used to extract the words from continuous speech. Gómez, Bion, and Mehler (Language and Cognitive Processes, 26, 212–223, 2011) proposed an online measure of statistical learning: They superimposed auditory clicks on a continuous artificial speech stream made up of a random succession of trisyllabic nonwords. Participants were instructed to detect these clicks, which could be located either within or between words. The results showed that, over the length of exposure, reaction times (RTs) increased more for within-word than for between-word clicks. This result has been accounted for by means of statistical learning of the between-word boundaries. However, even though statistical learning occurs without an intention to learn, it nevertheless requires attentional resources. Therefore, this process could be affected by a concurrent task such as click detection. In the present study, we evaluated the extent to which the click detection task indeed reflects successful statistical learning. Our results suggest that the emergence of RT differences between within- and between-word click detection is neither systematic nor related to the successful segmentation of the artificial language. Therefore, instead of being an online measure of learning, the click detection task seems to interfere with the extraction of statistical regularities.

Over the past 15 years, statistical-learning research has shown that adults, children, and infants are able to track statistical patterns in their environment (Aslin, Saffran, & Newport, 1998; Fiser & Aslin, 2002; Saffran, Aslin, & Newport, 1996). This ability seems to involve incidental (Saffran, Newport, Aslin, Tunick, & Barrueco, 1997), automatic (Turk-Browne, Jungé, & Scholl, 2005), and domain-general (Kirkham, Slemmer, & Johnson, 2002) learning mechanisms. Notably, statistical learning may play an important role in the early stages of language acquisition, such as speech segmentation (see Romberg & Saffran, 2010, for a review). Indeed, within a language sample, pairs of syllables or sounds are in general more correlated with each other within a word than when they occur at the boundary between two consecutive words. Accordingly, a series of studies have confirmed that the transitional probabilities (TPs) between sounds can be tracked in order to discover word boundaries (e.g., Aslin et al., 1998; Saffran, Aslin, & Newport, 1996; Saffran, Newport, & Aslin, 1996; Thiessen & Saffran, 2003).

Traditionally, statistical learning of artificial languages is measured through offline measures. After exposure to a continuous speech stream, participants perform a two-alternative forced choice (2AFC) task or a recognition task in which they have to distinguish between the words and novel arrangements of the same syllables (Abla, Katahira, & Okanoya, 2008; Saffran et al., 1997). Other techniques have been developed in order to measure statistical learning in infants. Most studies have used the head turn preference procedure (Kemler Nelson et al., 1995; see Gerken & Aslin, 2005, for a review). In this paradigm, infants are seated between two speakers with mounted lights and are free to turn their heads. After exposure, in a test phase, either the right or the left light flashes as an auditory test stimulus is emitted from the corresponding speaker. The time during which the infant turns her head toward the emitting speaker is used as a preference measure for the corresponding stimulus. Noise detection (Morgan, 1994; see also Morgan & Saffran, 1995) is another technique, which consists in training 9-month-old infants to turn their heads in response to short buzzes. In a first phase, these buzzes are presented during the time intervals between multisyllabic strings. In a second phase, buzzes are presented within these strings, between any two syllable pairs. The differences in response latencies between the two phases have been interpreted as an indication of infants’ ability to perceptually organize the stimulus strings.

Although these tasks inform us about the nature and amount of the acquired knowledge, they say little about the temporal dynamics of statistical learning. To address this issue, Gómez, Bion, and Mehler (2011) used a click detection task to study statistical learning online. This procedure was inspired by an experimental paradigm developed in the ’60s to study online syntactic processing (Bever, Lackner, & Stolz, 1969; Fodor & Bever, 1965; Fodor, Bever, & Garrett, 1974). Fodor and Bever’s original click location task consisted in the presentation of auditory sentences in which clicks were inserted at specific spots. Participants’ instructions were to locate the clicks by pressing a key as fast as possible. Back then, the results showed that participants made more errors when the clicks occurred at the boundaries between clauses. When the click co-occurred with the first word of a highly redundant two-word sequence—that is, a sequence with a high TP—participants subjectively perceived that the click occurred later, in the middle of the sequence. When the click co-occurred with the second word of a low-TP two-word sequence, participants subjectively perceived that the click occurred after the sequence. The authors concluded that the TPs between words modulate click localization.

On the basis of Fodor and collaborators’ findings, Gómez et al. (2011) proposed that the TPs between syllables in continuous speech could likewise modulate the detection of clicks located between or within the words. They presented adult participants with a continuous speech stream consisting of the randomized repetition of four trisyllabic nonsense words. The speech stream was produced by a speech synthesizer, so that no segmentation cues other than the TPs were present. Clicks were superimposed on the stream, and could occur at two different positions: either at the boundaries of two words (between words) or between the first and second syllables of a word (within words). Participants were instructed to listen to the speech stream and to press a key as fast as possible each time they detected a click. After 2 min of exposure, participants were faster in detecting the clicks located between than within words. The authors proposed that the evolution of reaction time (RT) differences between the two types of click locations reflects the extraction of TPs and the emergence of word candidates. In their view, participants progressively built different expectations about future events when processing the stream of syllables. Specifically, the stream was built in such a way that the TP between two syllables was 33 % across word boundaries and 100 % within words. The rationale was that during learning, participants form representations of word candidates and develop stronger expectations about the next syllable when it is part of a within- rather than a between-word transition. These expectations would in turn modulate participants’ tolerance for disruption in the syllable stream, making them less likely to integrate extraneous elements within a word, so that clicks would tend to be perceived at word boundaries rather than within words.

Gómez et al. (2011) concluded that the click detection task could provide an online measure of word segmentation based on statistical learning. However, as with offline measures of statistical learning such as 2AFC tasks, their results could also be explained by an emerging sensitivity to the regularities that was not necessarily accompanied by the extraction and memorization of the words. In other words (no pun intended), although click detection makes it possible to explore how we learn, it does not provide a measure of what, or how much, we learn. Gómez and colleagues themselves acknowledged that the click detection method and a classical offline measure do not necessarily correlate. In fact, since their study did not include such an offline, direct test of word knowledge, the question remains open as to whether participants correctly segmented the words. It might be the case that participants only focused on detecting the clicks, and did not attend (or only partially attended) to the speech stream. In that case, participants would have prioritized one of the two concurrent tasks, instead of equally sharing their attentional resources between the tasks. As a consequence, in dual-task situations such as the click detection task, performance should be monitored in both tasks, in order to ensure a full understanding of the mechanisms that underlie performance. A similar critique has been raised against Saffran and colleagues’ (1997) study, which reported successful statistical learning when participants were exposed to a speech stream during a simple concurrent task (free drawing). Toro, Sinnett, and Soto-Faraco (2005) argued that there was “no actual guarantee that participants did not occasionally direct their attention to the irrelevant speech stream while they were performing the free drawing task” (p. B26). Indeed, there is evidence that dividing attention during statistical learning leads to poor performance in offline tests (Turk-Browne et al., 2005). There is even an additional cost when the stream of information in the statistical-learning task and the concurrent task share the same modality (Toro et al., 2005).

In summary, it is a safe bet that online click detection will have a negative impact on performance at the offline task, because the click detection can be considered a secondary task recruiting the same sensory modality as the primary learning material. The aim of our study was to investigate the impact of a click detection task on statistical learning. In Experiment 1, we replicated the method used by Gómez et al. (2011) and added an offline measure in order to evaluate the link between these two measures. In Experiment 2, we measured the impact of the click detection task by isolating two factors: the mere presence of clicks in the continuous speech stream and the need for participants to process those clicks or not. Finally, in Experiment 3, we examined the impact of the clicks’ location on statistical learning. Overall, our results showed that the clicks act as extraneous auditory elements interfering with word segmentation and should, therefore, be used with caution.

Experiment 1

Method

Participants

Twenty-eight French-speaking undergraduate psychology students (18 women, 10 men) were included in this study and received course credits for their participation. None of the participants had previous experience with the artificial languages presented in this experiment. All reported no hearing problems.

Material

Two artificial speech streams were generated using the MBROLA speech synthesiser (Dutoit, 1997) with the French male diphone database fr1 with a sampling frequency of 16 kHz. The streams consisted of the continuous presentation of four nonsense trisyllabic words (bamoti, bikochu, lumake, telicha) without pauses. Each syllable lasted 200 ms. Each word was presented 90 to 100 times. Each speech stream lasted for 4 min 10 s, with the words being presented in a pseudorandom order: The same word never occurred twice in succession. Thus, both speech streams were identical and only differed in word order presentation. A set of clicks was inserted into the speech stream with the software Praat (Boersma, 2001). We used the same procedure that was presented in the Gómez et al. (2011) study: The clicks corresponded to five consecutive samples of the audio waveform clipped together. Each click could occur either between two words or within words, between the first and second syllables (within1_2). Figure 1 illustrates the placement of the clicks in the speech stream. Each minute, eight between- and eight within-word clicks were inserted, resulting in a total of 64 clicks, with an average interval of 3.8 s between any two consecutive clicks. During the 4-min exposure, the 32 within-word clicks were equally distributed between the four words (eight clicks per word), and the 32 between-word clicks were evenly distributed, in order to control the probability of occurrence of a click across the different between-word transitions. The probability of one syllable being preceded by a click was 8.4 % for both the first and the second syllable of each word (between- and within-word clicks, respectively), and 0 % for the third syllable of each word. The click positions were similarly distributed in the two different versions of the speech stream. Participants were randomly assigned to one of the two streams (14 per condition).
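The logic of the stream construction can be illustrated with a minimal Python sketch. It assumes only what is stated above (four trisyllabic words, no immediate repetition of a word) and estimates the TPs from the resulting syllable sequence. The function names, the fixed seed, and the length of 400 words (4 min at 200 ms per syllable) are illustrative choices, not the MBROLA/Praat procedure used to build the actual stimuli.

```python
import random
from collections import Counter

# The four nonsense words of Experiment 1, split into syllables
# (the syllable boundaries are our reading of the spellings).
WORDS = {
    "bamoti": ["ba", "mo", "ti"],
    "bikochu": ["bi", "ko", "chu"],
    "lumake": ["lu", "ma", "ke"],
    "telicha": ["te", "li", "cha"],
}


def make_word_sequence(n_words, rng):
    """Draw words pseudorandomly so that no word occurs twice in succession."""
    sequence = []
    previous = None
    for _ in range(n_words):
        candidates = [w for w in WORDS if w != previous]
        previous = rng.choice(candidates)
        sequence.append(previous)
    return sequence


def transitional_probabilities(syllables):
    """Estimate TP(next | current) = count(current, next) / count(current)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


rng = random.Random(0)  # fixed seed, assumed here only for reproducibility
word_sequence = make_word_sequence(400, rng)  # illustrative length
syllable_stream = [s for w in word_sequence for s in WORDS[w]]
tps = transitional_probabilities(syllable_stream)

print(tps[("ba", "mo")])  # a within-word transition
print(tps[("ti", "bi")])  # a between-word transition
```

In such a stream, every within-word transition has a TP of 1, whereas each between-word transition is estimated at roughly one third, because any of the three other words can follow a given word.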

Procedure

Participants were tested individually in a quiet room. They were instructed to pay attention to the speech stream spoken in an “unknown language,” to try to extract the words from the speech, and to press a key as fast as possible each time they heard a click. The speech stream was presented binaurally through headphones. Immediately after the 4-min exposure, participants performed a 2AFC task in which they were presented with two trisyllabic sequences on each trial. One sequence was a word of the artificial language, and the other sequence was a nonword, made up of the same syllables but with null TPs between them (i.e., any two successive syllables had never been presented in succession in the exposure phase). Participants were instructed to decide which sequence of each pair sounded more like the unknown language that they had just heard. Four nonwords were used (baluti, chubima, liteko, and mokecha). Each word was paired with each nonword twice—either as the first or as the second element of the pair—resulting in a total of 32 trials. The experiment was run on a Mac Mini 1.33-GHz PowerPC G4 using Psyscope X B53 (Cohen, MacWhinney, Flatt, & Provost, 1993) and a Psyscope USB button box to record the RTs of each click response.
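The structure of the 2AFC test can be sketched in a few lines of Python. The sketch below pairs each word with each nonword in both presentation orders and checks that no syllable transition inside a nonword could have occurred during exposure. The trial dictionary layout and the helper names are hypothetical; the actual experiment was run in Psyscope.

```python
from itertools import product

# Items from Experiment 1; the syllable boundaries are our reading of the spellings.
WORDS = [("ba", "mo", "ti"), ("bi", "ko", "chu"), ("lu", "ma", "ke"), ("te", "li", "cha")]
NONWORDS = [("ba", "lu", "ti"), ("chu", "bi", "ma"), ("li", "te", "ko"), ("mo", "ke", "cha")]


def attested_transitions(words):
    """Syllable pairs that can occur in a stream with no immediate word repetition."""
    pairs = set()
    for word in words:
        pairs.update(zip(word, word[1:]))       # within-word transitions
    for first, second in product(words, repeat=2):
        if first != second:                      # a word never follows itself
            pairs.add((first[-1], second[0]))    # between-word transitions
    return pairs


def has_null_tps(item, attested):
    """True if no adjacent syllable pair of the item occurs in the stream."""
    return all(pair not in attested for pair in zip(item, item[1:]))


def build_2afc_trials(words, nonwords):
    """Pair every word with every nonword, once in each presentation order."""
    trials = []
    for word, nonword in product(words, nonwords):
        w, n = "".join(word), "".join(nonword)
        trials.append({"first": w, "second": n, "correct": "first"})
        trials.append({"first": n, "second": w, "correct": "second"})
    return trials


attested = attested_transitions(WORDS)
trials = build_2afc_trials(WORDS, NONWORDS)
print(len(trials), all(has_null_tps(nw, attested) for nw in NONWORDS))
```

Four words crossed with four nonwords in two orders yields the 32 trials described above. In practice the trial order would also be randomized, which the sketch leaves out.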

Results and discussion

Time course of RTs for both types of clicks

As in the original study (Gómez et al., 2011), nonresponded clicks and RTs longer than 1,000 ms or shorter than 100 ms were excluded from all of the analyses. These criteria resulted in excluding 3.8 % of the trials from the analysis (on average, 2.5 clicks out of 64). Among these, 0.7 % were missed clicks, 0.1 % were RTs shorter than 100 ms, and 2.9 % were RTs longer than 1,000 ms. The overall mean RT was 318 ms (SD = 48). We computed the mean RTs by minute (one mean RT by minute) and by location (within or between words) for each participant, resulting in a total of eight mean RTs for each participant. A repeated measures analysis of variance (ANOVA) with Minute (four levels) and Location (two levels) as within-subjects factors showed a significant effect of minute, F(3, 81) = 6.466, p = .001, η_p² = .193. However, we failed to find a significant effect of location, F(1, 27) = 2.009, p > .1. The Minute × Location interaction was also nonsignificant, F < 1 (Fig. 2a).
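The exclusion criteria and the per-cell averaging described above can be expressed as a short Python sketch. The trial format (dictionaries with `minute`, `location`, and `rt_ms` keys) is an assumption made for illustration, and the example trials are made up; they are not data from the study.

```python
from collections import defaultdict
from statistics import mean

RT_MIN_MS = 100    # responses faster than this are excluded
RT_MAX_MS = 1000   # responses slower than this are excluded


def mean_rts_by_cell(trials):
    """Average valid RTs per (minute, location) cell for one participant.

    `trials` is a list of dicts with hypothetical keys: "minute" (1-4),
    "location" ("within" or "between"), and "rt_ms" (None for a missed click).
    """
    cells = defaultdict(list)
    for trial in trials:
        rt = trial["rt_ms"]
        if rt is None or rt < RT_MIN_MS or rt > RT_MAX_MS:
            continue  # missed click or out-of-range RT
        cells[(trial["minute"], trial["location"])].append(rt)
    return {cell: mean(rts) for cell, rts in cells.items()}


# Made-up trials for illustration only.
example_trials = [
    {"minute": 1, "location": "within", "rt_ms": 300},
    {"minute": 1, "location": "within", "rt_ms": 340},
    {"minute": 1, "location": "between", "rt_ms": 310},
    {"minute": 1, "location": "between", "rt_ms": None},   # missed click
    {"minute": 2, "location": "within", "rt_ms": 1200},    # too slow, excluded
    {"minute": 2, "location": "within", "rt_ms": 350},
    {"minute": 2, "location": "between", "rt_ms": 90},     # too fast, excluded
    {"minute": 2, "location": "between", "rt_ms": 330},
]
cell_means = mean_rts_by_cell(example_trials)
print(cell_means)
```

With four minutes and two locations, a full data set yields the eight mean RTs per participant that enter the ANOVA.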

In Gómez et al. (2011), 24 out of the 28 participants showed a positive difference between within- and between-word RTs as a function of time. Four of them showed the reverse interaction: They responded faster to clicks emitted within than between words. One possible explanation for our results could be that in our study, a larger number of participants showed the reverse interaction, thus resulting in null differences when considering all participants together. To ascertain whether this was the case, we computed, for each participant, the mean RT difference between within- and between-word clicks at each minute. Next, we computed the number of participants for whom the mean difference was positive (i.e., RTs for within-word clicks exceeded those for between-word clicks). Following the analysis conducted by Gómez et al., we considered chance level to be 50 %—that is, 14 out of 28 participants. As is shown in Fig. 2b, at Minutes 1 and 2, 12 out of 28 participants showed a positive difference between mean RTs. At Minute 3, however, only six participants showed a positive difference. Finally, at Minute 4, ten showed a positive difference, and 18 participants showed a negative difference. Thus, in contrast with Gómez et al.’s results, participants’ tendency to respond faster to between-word clicks when the word candidates emerged seems to have been less marked in our study.
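The participant count described above amounts to a sign count on the within-word minus between-word difference. A minimal Python sketch follows, assuming one dictionary of mean RTs per participant keyed by (minute, location); this layout and the example values are hypothetical and are not data from the study.

```python
def count_positive_differences(participants, minute):
    """Count participants whose within-word mean RT exceeds the between-word mean RT.

    `participants` is a list of dicts mapping hypothetical (minute, location)
    keys to mean RTs in ms, one dict per participant.
    """
    positive = 0
    for cell_means in participants:
        difference = cell_means[(minute, "within")] - cell_means[(minute, "between")]
        if difference > 0:
            positive += 1
    return positive


# Made-up mean RTs for four participants at Minute 4.
example_participants = [
    {(4, "within"): 340.0, (4, "between"): 320.0},  # positive difference
    {(4, "within"): 310.0, (4, "between"): 335.0},  # negative difference
    {(4, "within"): 355.0, (4, "between"): 350.0},  # positive difference
    {(4, "within"): 300.0, (4, "between"): 300.0},  # no difference
]
n_positive = count_positive_differences(example_participants, minute=4)
chance_count = len(example_participants) / 2  # 50 % chance level, as in the text
print(n_positive, chance_count)
```

The count is then compared with the chance count, which is 14 for the 28 participants of Experiment 1.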

To sum up, as in the original study, we observed a gradual increase of the mean RTs during exposure. With respect to the time course of RTs for the different click locations, however, our results differed from those of Gómez et al. (2011). Although Gómez et al. reported an increase of RTs after the second minute (mostly for clicks located within words), our results suggest that this click-dependent RT difference may not be systematically observed. Indeed, in their study, Gómez and collaborators found that 24 out of 28 participants showed this tendency, whereas four participants showed the reverse effect. In our study, we found a larger number of participants showing the opposite trend (18 out of 28).

2AFC task

The overall forced choice task performance was 56.81 %, which was above chance level (50 %), t(27) = 2.123, p < .05, two-tailed. Although on average the participants performed above chance, the mean performance was quite low, and the performance of half of the participants (N = 14) was at chance. The raw distribution of participants’ scores is presented in Fig. 2c.

Are the RT time course and the mean performance at the 2AFC task related?

The rationale of Gómez et al.’s (2011) study was that the extraction of the TPs present in the stream (high within words, low between words) would produce stronger expectations of the next syllable within words, making participants less likely to expect extraneous elements such as a click. This would be reflected by slower responses to clicks located within than between words. If this was the case, one should expect a positive within-word minus between-word RT difference at the fourth minute in the click detection task. Participants should also have been successful in the 2AFC task. By contrast, a null (or negative) difference should be associated with a low performance level in the offline test. We performed a regression analysis to test the hypothesis that there was a relationship between the sign of the RT difference and accuracy in the 2AFC task. As is shown in Fig. 2d, the RT difference at the fourth minute was not a significant predictor of accuracy in the 2AFC task, β = –.242, p > .1. We did not find a reliable association between RT and forced choice performance in this study.

Finally, we asked whether a specific pattern of RT evolution was associated with successful performance in the 2AFC task. In other words, did those participants who successfully completed the 2AFC task show a specific RT evolution in the click detection task? To address this question, we divided participants into two groups based on their performance in the 2AFC task. A binomial test showed that scores exceeding 22 word or nonword identifications would constitute a statistically significant deviation from 50 % at a .05 alpha level. According to that criterion, 23 participants performed at chance (mean = 50.55 %), and only five of them performed above the chance level (mean = 85.62 %). We analyzed the RT patterns in the latter group. A Wilcoxon matched-pairs signed rank test showed that the mean RTs for between- and within-word clicks during the fourth minute (369 and 328 ms, respectively) did not statistically differ from each other (Z = –1.753, p = .080). This result showed that even those participants who performed above chance level in the 2AFC task did not exhibit the positive RT difference that would be expected to reflect word extraction from the speech stream.
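The binomial criterion used to split the participants can be checked with a short Python sketch. It assumes 32 trials, a chance probability of .5, and a two-tailed test computed as twice the upper tail; the last point is an assumption about how the test was run, and the function names are illustrative.

```python
from math import comb


def upper_tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def smallest_significant_score(n_trials, alpha=0.05):
    """Smallest number of identifications whose two-tailed p value is below alpha.

    With p = 0.5 the distribution is symmetric, so the two-tailed p value
    is taken as twice the upper tail (an assumption of this sketch).
    """
    for k in range(n_trials // 2, n_trials + 1):
        if 2 * upper_tail(k, n_trials) < alpha:
            return k
    return None


criterion = smallest_significant_score(32)
print(criterion)
```

Under these assumptions the function returns 23 for 32 trials: a score of 22 gives a two-tailed p value just above .05, so only scores exceeding 22 count as above chance, in line with the criterion stated above.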

Taken together, these results raise two additional questions: (1) Does the click detection task negatively influence correct word extraction, and, as a consequence, does it result in poor performance in the 2AFC task? and (2) Does the RT evolution in the click detection task only predict the success or failure of speech segmentation, or is it also “contaminated” by the click detection task? In the latter case, even if the RT pattern reflects statistical learning, its interpretation becomes challenging. In Experiment 2, we examined the impact of the click detection task on the extraction of the words embedded in the speech stream by means of two distinct conditions: one in which participants were exposed to the exact same speech stream superimposed with clicks, as in Experiment 1, but were not instructed to respond to the clicks, and another in which they were exposed to the same speech stream without the clicks.