Splitting the variance of statistical learning performance: A parametric investigation of exposure duration and transitional probabilities
What determines individuals’ efficacy in detecting regularities in visual statistical learning? Our theoretical starting point assumes that the variance in statistical learning (SL) performance can be split into variance related to efficiency in encoding representations within a modality and variance related to the relative computational efficiency of detecting the distributional properties of the encoded representations. Using a novel methodology, we dissociated encoding from higher-order learning factors by independently manipulating exposure duration and transitional probabilities in a stream of visual shapes. Our results show that the encoding of shapes and the retrieval of their transitional probabilities are not independent and additive processes, but interact to jointly determine SL performance. The theoretical implications of these findings for a mechanistic explanation of SL are discussed.
Keywords: Visual statistical learning; Sequence learning; Individual differences
Statistical learning (SL), or learning of the distributional properties of sensory input across time and space, is the mechanism by which cognitive systems discover the underlying regularities in the environment. As such, SL plays a key role in the segmentation, discrimination, and categorization of input, shaping the basic representations for a wide range of sensory, motor, and cognitive abilities (see Frost, Armstrong, Siegelman, & Christiansen, 2015, for a discussion). The term “SL” thus refers to the ability to learn and assimilate an array of possible statistical properties of sensory events. These include their aggregated relative frequency, their variance, and mostly, the extent of their co-occurrence (see Thiessen, Kronstein, & Hufnagle, 2013, for a review). The present article is concerned with the latter form of computation.
Starting from Saffran’s original work (Saffran, Aslin, & Newport, 1996), which revealed that infants are able to segment speech on the basis of transitional probabilities, a large number of studies have demonstrated that people often display remarkable sensitivity to the co-occurrence of items embedded in a continuous stream. This has been shown across ages, from adults to newborns (e.g., Bulf, Johnson, & Valenza, 2011), across sensory modalities (visual: e.g., Fiser & Aslin, 2001; Kirkham, Slemmer, & Johnson, 2002; Turk-Browne, Jungé, & Scholl, 2005; auditory: e.g., Gebhart, Newport, & Aslin, 2009; Saffran, Newport, Aslin, Tunick, & Barrueco, 1997; tactile: e.g., Conway & Christiansen, 2005), and with both adjacent (e.g., Endress & Mehler, 2009) and nonadjacent contingencies (e.g., Gomez, 2002; Newport & Aslin, 2004). Interestingly, although in all of these studies the tested sample as a group showed clear evidence of learning, not all individuals performed better than chance (see Siegelman & Frost, 2015, for a discussion). What determines the efficacy of detecting the co-occurrence of events in a stream? Why do some individuals show clear evidence of learning in typical SL tasks, whereas others seem to perform at chance? What are the cognitive operations underlying this capacity? These questions hold the promise of revealing critical insights regarding the mechanisms driving SL, leading to a deeper understanding of what SL abilities can predict, and why.
In a recent theoretical discussion of the factors influencing the variance of SL performance, Frost and his colleagues (2015) suggested that this variance should be split into two main sources: (1) variance related to the efficiency in encoding the individual elements in the stream within the modality of their presentation—that is, the ability to create internal representations of each element of the continuous perceptual input—and (2) variance related to the relative computational efficiency of detecting the distributional properties of the encoded representations, registering their transitional probabilities. Whereas the efficacy of creating detailed and reliable internal representations of individual elements appearing in a fast sequential input could be traced to the neuronal mechanisms that determine the effective resolution of one’s sensory system, the computational efficiency of detecting the transitional probabilities of these elements could be traced to capacities for binding temporal and spatial contingencies by the medial-temporal lobe memory system (Karuza et al., 2013; Kim, Lewis-Peacock, Norman, & Turk-Browne, 2014; Schapiro, Gregory, Landau, McCloskey, & Turk-Browne, 2014). This view suggests that both encoding and binding abilities constrain the learning of regularities, and they jointly determine the actual performance of an individual in a given SL task. Moreover, it presupposes some form of temporal processing modularity, in which the internal representations computed from the inputs are subject to higher-level computations that bind them to register their distributional properties. Here we explored for the first time the possible predictions of this theoretical framework. We orthogonally manipulated factors related to encoding and binding constraints, and measured their relative contributions to SL performance on the group and individual levels.
In the study we report here, we focused on performance in the extensively used visual statistical-learning (VSL) task (Arciuli & Simpson, 2011, 2012; Frost, Siegelman, Narkiss, & Afek, 2013; Siegelman & Frost, 2015; Turk-Browne et al., 2005). In the VSL task, participants are presented with a stream of complex visual shapes, organized in pairs or triplets whose constituent shapes follow each other in a predictable sequence (typically, transitional probability = 1). Following a familiarization phase, participants are tested to assess their ability to report which shapes appeared in the stream in the original order. The VSL task affords a unique opportunity to experimentally address our theoretical question by disentangling the (1) encoding and (2) learning components of statistical dependencies in SL. In any continuous stream of shapes, experimenters can independently manipulate (1) shape exposure duration (ED)—that is, the amount of time that the stimulus is physically available for processing—and (2) the transitional probabilities (TPs) between the shapes. Whereas ED is a parameter affecting the efficacy of processing the visual stimuli for encoding them into internal representations (e.g., Loftus & Kallman, 1979; Potter & Levy, 1969), TP is a parameter related to the efficiency of registering their distributional properties. Jointly manipulating these parameters within subjects can thus provide important information regarding individual susceptibility to encoding constraints versus individual sensitivity to correlational transparency (see Frost et al., 2015, for discussion).
In the present study, we did exactly this. Our participants, university students, participated in a series of VSL tasks, in all of which they watched evenly paced streams of complex visual shapes. However, rather than using a fixed ED or a fixed TP, as in most current SL studies (but see Hunt & Aslin, 2001), in each session we manipulated the EDs and TPs of shapes in the stream in a within-subjects factorial design. The ED was set to 200, 600, or 1,000 ms per shape, and the TP between shapes could be quasi-regular (.6, .8) or fully regular (1.0). Following the familiarization phase, participants were tested to assess how well they had learned the statistical contingencies of the shapes in each of the streams, given the different presentation constraints. Thus, by looking at the change in performance across the tasks, we could examine the independent influences of the EDs and TPs within the stream on SL performance, as well as their possible interaction.
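The article does not specify exactly how the quasi-regular streams were generated; the sketch below illustrates one plausible scheme under stated assumptions (the shape labels and the substitution rule are ours, not the study's materials): each pair is presented intact with probability TP, and otherwise its second shape is replaced by another pair's second shape, so that within-pair transitions approximate the target TP. The no-immediate-repeat constraint on pair order is omitted here for brevity.

```python
import random

def build_stream(pairs, reps=24, tp=0.8, seed=0):
    """Build a familiarization stream from ordered pairs.

    With probability `tp` a pair is shown intact; otherwise its second
    shape is swapped for another pair's second shape, so the within-pair
    transitional probability approximates `tp`.
    """
    rng = random.Random(seed)
    seconds = [b for _, b in pairs]
    order = pairs * reps
    rng.shuffle(order)  # random pair order (no-repeat constraint omitted)
    stream = []
    for a, b in order:
        if rng.random() < tp:
            stream.extend([a, b])
        else:
            stream.extend([a, rng.choice([s for s in seconds if s != b])])
    return stream

# Hypothetical shapes: eight ordered pairs, as in each condition.
pairs = [(f"A{i}", f"B{i}") for i in range(8)]
stream = build_stream(pairs, tp=0.8)

# Empirical within-pair TP: how often A_i is followed by its partner B_i.
hits = sum(stream[j + 1] == "B" + stream[j][1:]
           for j in range(0, len(stream), 2))
print(hits / (len(stream) // 2))  # close to the target TP of .8
```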
This design allowed us to address, in parallel, critical theoretical questions that have not been addressed so far: What are the impacts of incremental changes in ED and TP on SL? Do they impact SL independently of (and additively with) each other, as would be predicted by a temporal-processing modularity assumption, or do they interact substantially? If they do interact, what is the nature of this interaction? Finally, what does the distribution of individual sensitivities to both factors look like in the population? More generally, we asked how human performance in extracting regularities from the input is affected (linearly? logarithmically? in an inverted-U shape?) when constraints are imposed on the time allocated for encoding events and when the extent of event predictability is varied.
Fifty adults (38 females, 12 males), all students at the Hebrew University, participated in the study for course credit or payment. Their ages ranged from 21 to 27 years (mean = 23.4). The participants were all native Hebrew speakers.
Design and materials
The experiment required each participant to perform nine VSL tasks. The VSL tasks included 22 complex visual shapes (Turk-Browne et al., 2005). In each condition and for each participant, 16 of the 22 shapes were randomly chosen and randomly organized to create eight ordered pairs (the remaining six shapes were used for the screening items; see below). The eight pairs were presented continuously, one after the other, in a random order, to create a familiarization stream in which each pair appeared 24 times, with the constraint that the same pair could not be repeated twice in a row.
Depending on the ED used, with the number of repetitions and the interstimulus interval kept constant, the familiarization phase of each task lasted from 2 to 7 min. Participants were asked to attend to the stream and were not told that it was constructed of pairs. Following the familiarization stream, participants were instructed that they would now see two pairs of shapes on the screen (see Fig. 1) and that their task would be to report which pair was more familiar to them. They were then tested with 38 two-alternative forced choice trials. Thirty-two of the test trials contrasted (1) “true pairs”—two shapes that had appeared as a pair during the familiarization phase (the TP between the shapes being .6, .8, or 1.0, depending on the condition)—and (2) “foils”—two shapes that had not appeared as a pair during familiarization. Foils were constructed without violating the position of the shapes within the original pairs (e.g., for two true pairs AB and CD, the possible foils could be AD or CB, but not AC or DB). Scores in the SL task thus ranged from 0 to 32, calculated as the number of correct identifications of pairs during the test phase. The remaining six test trials were aimed at identifying and screening out participants who did not attend to the familiarization stream. These trials contrasted “true pairs” with pairs containing a novel shape that had not appeared at all during familiarization (see Romberg & Saffran, 2013, for a similar procedure). Participants who missed 18 or more of the 54 screening items (six in each of the nine tasks) were excluded from the analyses; eight participants were excluded on this basis. Due to a technical problem, the data of two participants in the ED = 200, TP = 1 condition were not saved. All subsequent analyses are based on the remaining 42 participants.
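The position-preserving foil rule can be made concrete with a small sketch (the pair labels here are hypothetical): a foil takes its first shape from one true pair's first position and its second shape from a different true pair's second position, so it is never itself a true pair.

```python
import random

def make_foil(true_pairs, rng=None):
    """Recombine two distinct true pairs into a position-preserving foil.

    For true pairs AB and CD, a valid foil is AD (or CB): each shape
    keeps the within-pair position it held during familiarization.
    """
    rng = rng or random.Random(0)
    (a, _), (_, d) = rng.sample(true_pairs, 2)
    return (a, d)

true_pairs = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]
foil = make_foil(true_pairs)
print(foil)  # a recombined pair; never identical to a true pair
```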
The nine SL subtests were initiated by the participants from home, through an online platform. All nine subtests had to be completed in a period of 30 days, with no less than 24 h between sessions. The mean time interval between sessions was 2.3 days (SD = 1.2). Participants were instructed to do the task alone in a quiet room and to avoid external distractions (i.e., to turn off their cell phone and music), and they were asked to have only the experiment window open. The order of the tasks was random.
Mean performance rate in each of the nine tasks (standard deviations in parentheses):

                   TP = .6         TP = .8         TP = 1
  ED = 200 ms      54.8 % (11 %)   56.3 % (14 %)   58.4 % (14 %)
  ED = 600 ms      59.5 % (13 %)   59.7 % (15 %)   67.6 % (17 %)
  ED = 1,000 ms    61.8 % (12 %)   65.6 % (17 %)   72.8 % (18 %)
In order to examine the influences of TP, ED, and their interaction on SL performance, we conducted a logistic mixed-effects analysis using the lme4 package in R (Bates, Maechler, Bolker, & Walker, 2015). The dependent variable was accuracy in the forced choice test (excluding the screening items). The model included the fixed effects of TP, ED, their interaction, the position of the target pair within the forced choice question (i.e., whether the target was first or second), and trial number in the test.1 The random-effects structure included a by-subjects random intercept and random slopes for TP, ED, and their interaction. The predictors TP and ED were both centered and standardized, trial number was centered, and the target position variable was dummy-coded (target in first position = 0; second position = 1). The model included N = 12,032 observations and had a log-likelihood of −7,664.7.
In contrast, the effect of TP seemed to deviate from linearity, with a small difference of 1.9 % between the two quasi-regular levels (TP = .6 vs. .8) and a larger difference of 5.9 % between TP = .8 and full regularity (TP = 1). Indeed, a paired t test revealed a marginally significant difference between the .6-to-.8 and the .8-to-1 differences [t(41) = 1.78, p = .08], suggesting a possible nonlinearity in the influence of TP on SL performance, and hence a possibly qualitative difference between full regularity and quasi-regularity of shapes in the stream.
In the present study, we independently manipulated TP and ED in a visual SL task to dissociate factors related to the encoding of visual shapes and the higher-order process of learning their distributional properties. We asked how each type of constraint affects SL and how their interaction determines performance in the task. Our results provide a set of critical findings. First, we found that, at least within the range of 200 to 1,000 ms, ED impacts SL performance in a linear way, so that longer exposure of shapes results in better learning of their conditional probabilities. This converges with the earlier evidence provided by Turk-Browne et al. (2005) and Arciuli and Simpson (2011), who manipulated ED between subjects and reported improved SL performance at slower presentation rates (see also Emberson, Conway, & Christiansen, 2011). Second, we found that introducing quasi-regularity in the stream impacts learning along a trajectory that seems to deviate from linearity. Although this deviation was only marginally significant, our results showed relatively small changes in SL performance when TP increased from .6 to .8, but a substantially larger improvement when the TP reached full regularity. This pattern of performance suggests that, at least at the group level, full regularity of shapes in the stream may be qualitatively different from any quasi-regularity in terms of improving SL performance.
However, importantly, our experimental design allowed us to go beyond the independent influences of ED and TP on SL performance and examine their interaction. The striking finding of our study was the interplay between encoding constraints and the extent of regularity in determining learning outcomes. Overall, our results suggest that sensitivity to the extent of TP in the stream was modulated by ED, and vice versa. Although even very short EDs (200 ms) were sufficient to encode the visual shapes, resulting in above-chance learning, the extent of regularity in the stream had relatively little impact at this duration, so that SL performance remained relatively low even with full regularity. With additional exposure (ED of 600 ms), a large difference in performance emerged between full regularity (TP = 1) and quasi-regularity (TP = .6, .8), but no difference between the two levels of quasi-regularity. Sensitivity to the full range of TPs was found only with the longest exposure duration. This pattern of findings suggests that encoding shapes and retrieving their TPs are not independent and additive processes. Rather, the distributional properties of shapes in the stream and their predictability may serve to facilitate their encoding in the case of suboptimal, shorter EDs; conversely, an increase in exposure time enhances sensitivity to fine differences in TP. These findings have implications for a mechanistic description of the cognitive events occurring in the typical VSL task. Rather than reflecting temporal-processing modularity, in which the encoding of shapes into internal representations feeds into a subsequent phase of extracting their distributional properties, the encoding of shapes and the extraction of TPs seem to form a two-way street, with each process affecting the other. Whether this bidirectional dependency is causal in one direction or the other requires further investigation.
The present findings are relevant to current debates regarding the extent of modality specificity in SL (see Frost et al., 2015, for a review and discussion) and the relations between the subprocesses involved in SL (e.g., Thiessen et al., 2013). In the context of visual shapes, recent imaging studies have implicated, on the one hand, higher-level visual networks (Nastase, Iacovella, & Hasson, 2014), and on the other, the domain-general hippocampus and medial-temporal lobe memory system (Schapiro et al., 2014; Turk-Browne, Scholl, Chun, & Johnson, 2009). Our findings thus offer constraints for understanding how modality-specific computations (encoding of visual shapes) and modality-general computations (extracting distributional properties) jointly determine the extent of learning of regularities in the visual modality. These processes do not seem to be independent and sequential, such that the completion of one would launch the other.
Our discussion so far has focused on group-level performance, yet from an individual-differences perspective, another striking result is the high correlation between the ED and TP trajectories within participants. This correlation indicates that individuals who showed greater sensitivity to changes in ED tended also to show greater sensitivity to changes in TP, and vice versa (note that this correlation is based on a relatively large number of participants and held even after we removed individuals who did not exhibit significant learning). This is an intriguing finding, since it suggests that individual abilities to overcome encoding constraints (here operationalized as limitations on event duration) and learning constraints (here operationalized as noise related to event regularity) are interrelated. An alternative interpretation is that the high correlation between sensitivities to ED and TP was driven by peripheral factors, such as a general state of attentiveness to the task. To investigate this hypothesis, we calculated the partial correlations between the individual ED and TP slopes, controlling for average performance on the screening items as a proxy for attentiveness. The partial correlation between slopes within participants remained large and significant (full sample, rpartial = .43, p < .01; after removing outliers, rpartial = .68, p < .001; with only above-chance participants, rpartial = .40, p < .05).
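A residual-based partial correlation of the kind used here can be sketched as follows, with simulated data; the variable names merely stand in for the study's individual slopes and screening scores, and are not the study's data:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after regressing each on covariate z."""
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Simulated data: two "slopes" that covary both with each other and
# with a shared attentiveness proxy.
rng = np.random.default_rng(42)
attentiveness = rng.normal(size=200)
ed_slope = attentiveness + rng.normal(size=200)
tp_slope = 0.5 * ed_slope + attentiveness + rng.normal(size=200)
print(round(partial_corr(ed_slope, tp_slope, attentiveness), 3))
```

For a single covariate, correlating the two sets of regression residuals is equivalent to the standard partial-correlation formula computed from the three pairwise correlations.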
A point that deserves some attention is the presence of a number of negative individual slopes (see Fig. 4). Whereas a negative slope for the ED manipulation makes intuitive sense (some fast encoders might fail to sustain attention when shapes in the stream are presented at a slow rate), the negative slopes for TP are harder to explain. Although it is possible that they represent simple noise, which is inevitable in such an experimental design, possible insight into this phenomenon can be drawn from recent studies suggesting that different populations of neurons encode full regularity and quasi-regularity (Nastase et al., 2014). From the perspective of information theory, quasi-regularity is more informative than full regularity. Indeed, Kidd, Piantadosi, and Aslin (2012) have shown that infants maximally attend to stimuli that are neither too predictable nor too unpredictable. The negative slopes of some of the participants in our sample may thus hint at individual differences in the optimal degree of regularity for learning. This, however, will require further investigation, aiming to establish whether individual slopes for ED and TP are indeed stable characteristics of an individual (see Siegelman & Frost, 2015, for measures of reliability in SL tasks).
In conclusion, the present study suggests that manipulating task parameters in a within-subjects parametric design provides considerable insight into the cognitive operations underlying visual SL. Research using a similar methodology holds promise for establishing how encoding and higher-order learning factors account for the variance in performance in other modalities, leading to a better understanding of the mechanisms of SL.
Trial number was not a significant predictor of performance (β = −.002, SE = 0.002, p = .18), suggesting that repeating the same target pairs and foils in the test phase did not alter performance.
Because the interstimulus interval between shapes remained constant in our procedure (see the Design and Materials section), the possibility that our results reflect the rate (or length) of presentation, rather than ED per se, cannot be ruled out and should be acknowledged.
The exclusion criterion for this analysis was set at 159 correct trials out of the 288 across the nine tasks (i.e., a mean success rate of 55.2 %). According to the binomial distribution, this is the minimal number of correct trials needed to demonstrate significantly above-chance learning at the individual level.
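This criterion can be checked directly from the exact binomial tail (chance = .5, one-tailed α = .05); the helper below is only an illustration of the computation:

```python
from math import comb

def p_at_least(k, n=288, p=0.5):
    """Exact one-tailed probability of observing k or more successes."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(p_at_least(159) < 0.05)     # 159 correct exceeds the chance threshold
print(p_at_least(158) < 0.05)     # 158 correct does not
print(round(159 / 288 * 100, 1))  # the 55.2 % criterion
```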
This article was supported by the Israel Science Foundation (Grant No. 217/14, awarded to R.F.), and by the National Institute of Child Health and Human Development (Grant Nos. RO1 HD 067364, awarded to Ken Pugh and R.F., and PO1-HD 01994, awarded to Haskins Laboratories). L.B. is a research fellow of the Fyssen Foundation. We are indebted to Steve Frost for his valuable comments.
- Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). lme4: Linear mixed-effects models using Eigen and S4 (R package version 1.1-8). Retrieved from http://cran.r-project.org/package=lme4
- Loftus, G. R., & Kallman, H. J. (1979). Encoding and use of detail information in picture recognition. Journal of Experimental Psychology: Human Learning and Memory, 5, 197–211.