Visual statistical learning of temporal structures at different hierarchical levels

Jun, Jihyang; Chong, Sang Chul

doi:10.3758/s13414-016-1104-9

Published: 11 April 2016

Volume 78, pages 1308–1323, (2016)

Attention, Perception, & Psychophysics

Jihyang Jun¹ & Sang Chul Chong¹,²

Abstract

Visual environments are complex. To process the complex information they provide, the visual system adopts strategies to reduce that complexity. One strategy, called visual statistical learning (VSL), is to extract statistical regularities from the environment. Another strategy is to use the hierarchical structure of a scene (e.g., the co-occurrence between local and global information). Through a series of experiments, this study investigated whether the utilization of statistical regularities and of hierarchical structure could work together to reduce the complexity of a scene. In the familiarization phase, the participants were asked to passively view a stream of hierarchical scenes in which shapes were concurrently presented at the local and global levels. At each of the two levels there were temporal regularities among three shapes, which always appeared in the same order. In the test phase, the participants judged the relative familiarity of two triplets, whose temporal regularities were either preserved or not. We found that the participants extracted the temporal regularities at each of the local and global levels (Experiment 1). The hierarchical structure influenced the ability to extract the temporal regularities (Experiment 2). Specifically, VSL was either enhanced or impaired depending on whether the hierarchical structure was informative or not. In summary, in order to process a complex scene, the visual system flexibly uses statistical regularities and the hierarchical structure of the scene.

The visual system has to cope with complex and ever-changing inputs in daily life. For example, most objects differ from one another in their surface features (e.g., color, brightness, orientation, texture), internal configurations, locations in space, and so on (Biederman, 1987). In addition, due to the continuous movements of objects and ourselves, our visual experience is dynamic rather than stationary (Freyd, 1987; Gibson, 1979; Pylyshyn & Storm, 1988). Despite the complexity of visual information, we can still perceive the world vividly (Brady, Konkle, Alvarez, & Oliva, 2008) and recognize visual environments without considerable effort (Humphreys & Riddoch, 2001). How can our visual system process complex information efficiently?

One possible way is to use statistical regularities in a scene (Chun, 2003; Turk-Browne, 2012). Our visual environments are not random but contain statistical regularities such as repeated contexts (Chun, 2000, 2003; Chun & Jiang, 1999; Oliva & Torralba, 2007), spatial configurations (Fiser & Aslin, 2001), or regular sequences of objects (Fiser & Aslin, 2002a). For example, sequences of objects (e.g., the parking lot, elevator, and office room during one’s daily commute) constantly appear in the same order. Using a process called visual statistical learning, or VSL, the visual system can both learn and use these statistical regularities (Chun & Jiang, 1998; Fiser & Aslin, 2001; Orbán, Fiser, Aslin, & Lengyel, 2008). When statistical regularities are present in a scene or a sequence, observers can learn these regularities (Brady & Oliva, 2008; Chun & Jiang, 1998; Fiser, 2009; Fiser & Aslin, 2001, 2002a, 2002b, 2005; Fiser, Scholl, & Aslin, 2007; Kirkham, Slemmer, & Johnson, 2002; Otsuka, Nishiyama, Nakahara, & Kawaguchi, 2013; Turk-Browne, Scholl, Chun, & Johnson, 2009).

The learning of statistical regularities can compensate for the limited capacity of the visual system (Palmer, 1975). For instance, Brady, Konkle, and Alvarez (2009) showed that when a display contained statistical regularities within pairs of colored items, the participants could hold almost five colors at once, exceeding their capacity limit of three or four. In another study by Umemoto, Scolari, Vogel, and Awh (2010), the participants could detect changes more rapidly and accurately if these changes repeatedly took place in a given quadrant. Similarly, Zhao, Al-Aidroos, and Turk-Browne (2013) found that the participants searched for targets faster in a location that featured statistical regularities. In this location, three shapes always appeared in the same order, whereas in other locations every shape appeared in random order.

Another way to cope with the complexity of a scene is to use the hierarchical structure of our environments (Brady, Konkle, & Alvarez, 2011; Im & Chong, 2014). Specifically, the visual information of a scene can be organized into local- or global-level information (Navon, 1977). Whether the visual information is local or global is decided by the relative level where the properties exist within the hierarchical structure of a scene (Kimchi, 2014). For example, when we perceive a building, we estimate the average height and overall shape of the building (e.g., the height of 25 or 30 stories with the shape of a skyscraper). On the other hand, we may notice features of several components that comprise the building (e.g., small gray bricks and white balconies on each floor). In this example, the former (i.e., the overall properties of the building) is at the top of the hierarchical structure compared to the latter (i.e., the features of individual items). Therefore, extracting the overall information is at a relatively more global level than encoding the features of the individual elements that comprise the building. This hierarchical structure of visual information is present in our memory representations, and the visual memory at the two different hierarchical levels can interact with one another (Brady et al., 2011).

The encoding of hierarchical structures helps observers to compensate for their limited capacity and to acquire detailed information (Brady et al., 2011). When the number of items in a display exceeds the limited capacity of working memory, observers experience an information overload (Cowan, 2001). However, Brady and Alvarez (2011) found that observers could accurately memorize the size of individual items when their estimations were close to ensemble statistics (e.g., the mean size of all items in the display or the mean size of a set of items sharing the same color). They suggested that the encoding of hierarchical structures helped observers to base their judgments on global-level aspects of the display (i.e., ensemble statistics) instead of guessing. Im and Chong (2014) also showed that observers could better estimate the mean sizes of groups of items when the items that shared the same color were spatially grouped, thereby allowing extraction of the hierarchical relationship between the items to facilitate the process.

In summary, these two mechanisms (i.e., the utilization of the statistical regularities and the hierarchical structure) can explain how the visual system is able to handle complex scenes despite its limited capacity. For example, people who commute every day can learn the statistical regularities on their route: there is a parking lot on the right side of the main entrance (i.e., spatial regularities), and they have to pass the parking lot and take an elevator before they reach their office (i.e., temporal regularities). On the other hand, they can represent the hierarchical structure of the inputs: Their car, next to a friend’s car (i.e., local information), is situated on the left side of the parking lot (i.e., global information). By taking advantage of these statistical regularities and hierarchical structure, they can efficiently represent the location of their cars, the entrance, the elevator, their desks, computers, and so on.

How are these two mechanisms related to one another? According to Fiser and Aslin (2005), when statistical regularities and hierarchical structure coexist in the same spatial domain, the two mechanisms seem to interfere with one another. Specifically, spatial statistical learning appears to be constrained by the hierarchical structure when spatial regularities are present at different hierarchical levels. In their experiments, four shapes always appeared in the same configuration and formed a quadruple, with the constraint that spatial regularities not only existed in the quadruple but also in a pair embedded in the quadruple. Therefore, the statistical regularities in the embedded pair could be considered local whereas those in the quadruple could be considered global. Spatial statistical learning only occurred at the global level of the quadruple but not for the local-level embedded pair. However, spatial statistical learning occurred for another type of pair that appeared adjacent to, but separate from, the quadruple. In other words, VSL occurred when a pair was not part of a hierarchical structure (e.g., the quadruple). These results indicate that the learning of statistics at a global level (i.e., in the quadruple) could constrain the extraction of regularities at the local level (i.e., in the embedded pair).

However, in our visual environments, hierarchical structures and statistical regularities do not always coexist in the same spatial domain. For instance, while the hierarchical structure of a scene may be represented in the spatial domain (e.g., a building consists of six floors), the statistical regularities of that scene can exist in the temporal domain (e.g., after passing the parking lot and the elevator, people can reach their office). When the hierarchical structure and statistical regularities do not occur in the same domain, how do the two mechanisms interact with one another? Based on several findings suggesting that spatial and temporal information can be processed independently (Bengtsson, Ehrsson, Forssberg, & Ullén, 2004; Karabanov & Ullén, 2008; Kornysheva, Sierk, & Diedrichsen, 2013; Ullén & Bengtsson, 2003), we assume that they can work together. In Ullén and Bengtsson’s (2003) sequential learning task, the participants learned and reproduced three types of sequences in different blocks: one temporal, one spatial, and one combined sequence. In the temporal sequence, each stimulus appeared for varying durations, but always in the same location, so that only temporal information was present in the sequence. In the spatial sequence, each stimulus appeared in certain locations, but always for the same duration, so that only spatial information was present in the sequence. The combined sequence was a combination of these temporal and spatial sequences so that both temporal and spatial information were present in the sequence. When the participants learned the combined sequence first, the ensuing learning of the temporal and spatial sequences was facilitated. In addition, when the participants separately learned the temporal and spatial sequences first, the learning of the combined sequence was also facilitated in both temporal and spatial domains. 
Based on these results, the authors suggested independent representations of temporal and spatial structures. Furthermore, a functional magnetic resonance imaging (fMRI) study (Bengtsson et al., 2004) showed a dissociation between the brain regions involved in the learning of temporal structures (i.e., the presupplementary motor area, right inferior frontal gyrus, precentral sulcus, and bilateral superior temporal gyri) and those involved in the learning of spatial structures (i.e., the lateral fronto-parietal areas, basal ganglia, and cerebellum).

In this study, we used a hierarchical structure and statistical regularities occurring in different domains. While the hierarchical structure was constructed in the spatial domain, the statistical regularities occurred in the temporal domain. We investigated whether the participants could extract the temporal regularities represented at different hierarchical levels simultaneously (Experiment 1A). We additionally tested whether the participants used the hierarchical structure, rather than two different features of a hierarchical stimulus (i.e., smaller and larger shapes), to extract the temporal regularities (Experiments 1B and 1C). Next, we examined whether the processing of the hierarchical structure could influence the degree of statistical learning (Experiment 2). In each experiment, a familiarization phase was followed by a test phase. In the familiarization phase, the participants passively viewed a sequence of Navon-like objects (see Fig. 1), where two different shapes formed a hierarchical structure at the local and global levels. Unbeknownst to the participants, the sequence featured temporal regularities among the three shapes (i.e., the triplet) at both the local and global levels, as these three shapes always appeared in the same order. The familiarization phase of the two experiments was identical. In the test phase, the participants performed a surprise two-alternative forced-choice (2AFC) familiarity judgment task. In each trial, the participants were asked to choose the more familiar of two triplet sequences, that is, a base triplet and a nonbase triplet. In the base triplet, the temporal regularities of the familiarization phase were preserved, whereas in the nonbase triplet they were changed.

In Experiment 1A, the temporal statistical learning was tested at each local and global level. Depending on the tested level (e.g., local level), each display contained only one shape (e.g., a shape at the local level) that had appeared in that level during the familiarization phase, so that the two levels were tested separately. To rule out an alternative possibility that the participants might have used two different shapes of the hierarchical stimulus, not the hierarchical structure per se, in Experiment 1B, we again tested the temporal VSL at the global level by eliminating the shape information and only providing the spatial hierarchy. In Experiment 1C, we conducted the same test conditions as in Experiment 1B, after equating the number of test trials at the local level with that at the global level.

To further examine the relationship between the two mechanisms, in Experiment 2 we tested whether the hierarchical structure could influence the degree of statistical learning. We preserved the hierarchical structure during the test phase; that is, the two shapes at both levels simultaneously appeared the same as in the familiarization phase. The temporal statistical learning was tested at both levels simultaneously, as well as at each of the two levels. In other words, depending on the tested level (e.g., local level), only the shapes that had occurred at that level (e.g., shape at the local level) during the familiarization phase changed the temporal regularities in the nonbase triplets. However, the shapes at the other level (e.g., shape at the global level) featured the same temporal regularities as in the base triplets or during the familiarization phase.

To provide a preview of our results, temporal statistical learning was possible at each of the different hierarchical levels (Experiment 1A). In addition, the degree of statistical learning did not differ between the two levels. We also found that the participants used the hierarchical structure when they extracted the temporal regularities (Experiments 1B and 1C). In Experiment 2, statistical learning was influenced by the hierarchical structure. When the hierarchical structure was helpful for familiarity judgments, it enhanced VSL; conversely, it impaired VSL when it was not informative.

Experiment 1A

Visual statistical learning at different hierarchical levels

In Experiment 1A, we investigated whether temporal statistical learning could occur at different hierarchical levels (i.e., local and global). Previously, Fiser and Aslin (2005) showed that VSL at the local level (i.e., for embedded pairs) did not occur when statistical regularities and hierarchical structures coexisted in the same spatial domain. In contrast, the present study used the temporal domain to present the statistical regularities and the spatial domain to present the hierarchical structure. Experiment 1A tested whether the participants could extract temporal regularities at both the local and global levels.

Method

Participants

Twenty naïve students from Yonsei University participated in Experiment 1A. They were paid 5,000 Won for their participation. All of the participants reported normal or corrected-to-normal vision. The protocol of the experiment was approved by the Yonsei University Institutional Review Board, and the participants were informed of their rights before signing written consent forms.

Apparatus and stimuli

We presented the stimuli using MATLAB and Psychophysics Toolbox (Brainard, 1997; Pelli, 1997). The display was a linearized Samsung 21-in. monitor with a resolution of 1,600 × 1,200 pixels and a refresh rate of 85 Hz. The experiment was conducted in a dark room. The participants’ heads were fixed on a chin-and-forehead rest at a viewing distance of 90 cm; one pixel subtended 0.016° at this distance.

We created a total of 12 novel shapes (see Fig. 1a). Half of the shapes were randomly chosen to be used as shapes at the local level and the other half at the global level. As in Navon (1977), the shape of the global-level structure was composed of the local-level shapes (see Fig. 1b). The two shapes at the two different levels were always different and presented in a 3 × 3 grid. Each local-level shape occupied a single cell of the grid. To form the global-level shape, this local shape repeatedly appeared in more than three (but fewer than seven) cells. The background of these cells was filled with gray in order to make the global shape evident. The color of the local shapes was black (0.10 cd/m²) and that of the global shapes was gray (18.8 cd/m²). The line color of the grid was black and the color of the background was white (91.4 cd/m²). The maximum width and height of the shapes were 2.29° at the local level and 12° at the global level. The side of a cell in the grid was 4°.
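As a rough check on the stimulus geometry, the reported angular sizes can be converted to pixels using the stated 0.016° per pixel. The sketch below assumes a simple linear (small-angle) conversion; the function name `deg_to_pixels` and the derived pixel pitch are illustrative and are not taken from the authors' MATLAB code.

```python
import math

DEG_PER_PIXEL = 0.016      # reported in the Apparatus and stimuli section
VIEWING_DISTANCE_CM = 90   # reported viewing distance


def deg_to_pixels(deg, deg_per_pixel=DEG_PER_PIXEL):
    """Convert a visual angle to pixels with a linear small-angle approximation."""
    return deg / deg_per_pixel


# Physical pixel pitch implied by the two reported values (not stated in the paper).
pixel_pitch_cm = VIEWING_DISTANCE_CM * math.tan(math.radians(DEG_PER_PIXEL))

sizes_deg = {"local shape (max)": 2.29, "grid cell": 4.0, "global shape (max)": 12.0}
for name, deg in sizes_deg.items():
    print(f"{name}: {deg} deg is about {deg_to_pixels(deg):.0f} px")
print(f"implied pixel pitch: {pixel_pitch_cm:.4f} cm")
```

Under this approximation, three grid cells of 4° each span the same 12° as the maximum global shape, which matches the 3 × 3 layout described above.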

Design and procedure

Experiment 1A comprised two phases: a familiarization phase and a test phase. The participants completed the familiarization phase (see Fig. 2) and then performed the test phase (see Fig. 3a and b).

Familiarization phase

The familiarization phase included a movie with a 2 × 2 design, that is, two levels of display (local and global) and two kinds of triplets in each level (Triplets 1 and 2). Unbeknownst to the participants, the 12 shapes were randomly allocated to one of the four triplets: two triplets were displayed at the local level and the other two were shown at the global level. The three shapes in each triplet presented temporal regularities, as they always appeared in the same order. For instance, if naming the six shapes of the local level as numbers (i.e., 1–6) and those of the global level as capital letters (i.e., A–F), the 12 shapes across the four triplets could appear in the following manner: 1-2-3 or 4-5-6 at the local level and A-B-C or D-E-F at the global level. Every display contained a hierarchical structure composed of the local and global shapes. When two different shapes occurred simultaneously at the local and global levels, each shape had its own temporal position (i.e., first, second, or third) within its triplet. In this way, the combination of the two triplets at the local level with the other two triplets at the global level created four possible kinds of triplet displays that were hierarchically structured (i.e., 1A-2B-3C, 1D-2E-3F, 4A-5B-6C, and 4D-5E-6F). As the four kinds of triplet displays were repeated 35 times, the familiarization movie could be segmented into 140 triplet displays or 140 triplets for each level; however, there was no segmentation cue between the triplet displays or the triplets. The same triplet display was not presented successively. As each of the 140 triplet displays consisted of three displays presented sequentially, 420 displays were presented to the participants one after another. The frequency of the 12 individual shapes was the same. At either the local or the global level, the joint probability of the three shapes in each triplet was 0.17.
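The construction of the familiarization stream can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the shape labels follow the 1–6 and A–F naming used above, the weighted greedy draw with restarts is one of several ways to avoid immediate repeats, and the joint probability is read as P(first shape) × P(second | first) × P(third | second).

```python
import random
from collections import Counter

LOCAL_TRIPLETS = [("1", "2", "3"), ("4", "5", "6")]
GLOBAL_TRIPLETS = [("A", "B", "C"), ("D", "E", "F")]
REPETITIONS = 35  # each of the four triplet displays is repeated 35 times


def build_stream(seed=0):
    """Return 140 (local_triplet, global_triplet) pairs with no immediate repeats."""
    rng = random.Random(seed)
    kinds = [(loc, glo) for loc in LOCAL_TRIPLETS for glo in GLOBAL_TRIPLETS]
    while True:  # restart if the greedy draw runs into a dead end
        remaining = Counter({k: REPETITIONS for k in kinds})
        stream = []
        while sum(remaining.values()) > 0:
            options = [k for k in kinds
                       if remaining[k] > 0 and (not stream or k != stream[-1])]
            if not options:
                break  # only the just-shown kind is left; try again
            weights = [remaining[k] for k in options]
            pick = rng.choices(options, weights=weights)[0]
            remaining[pick] -= 1
            stream.append(pick)
        if len(stream) == len(kinds) * REPETITIONS:
            return stream


def to_displays(stream):
    """Expand each triplet display into three hierarchical displays such as '1A'."""
    return [loc[i] + glo[i] for loc, glo in stream for i in range(3)]


stream = build_stream()
displays = to_displays(stream)
local_shapes = [d[0] for d in displays]

# Within a triplet the order is fixed, so both conditional probabilities are 1.
p_first = local_shapes.count("1") / len(local_shapes)
joint_probability = p_first * 1.0 * 1.0
print(len(stream), len(displays), round(joint_probability, 2))
```

With six equally frequent shapes per level, the first shape of a triplet has a probability of 1/6, which rounds to the reported 0.17.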

The experimental procedure adopted an observational learning paradigm (Fiser & Aslin, 2001). The participants passively viewed a 12-min movie without any instructions. The movie consisted of a series of 420 hierarchically structured displays, with blank intervals between each display. The duration of each display was 1 s, and the duration of the blank interval was 750 ms.
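The reported movie length follows from the display timing. The short calculation below assumes that a blank interval followed every display, including the last one; dropping the final blank changes the total by less than a second.

```python
N_DISPLAYS = 420   # hierarchically structured displays in the movie
DISPLAY_S = 1.0    # duration of each display, in seconds
BLANK_S = 0.75     # duration of each blank interval, in seconds

total_s = N_DISPLAYS * (DISPLAY_S + BLANK_S)
total_min = total_s / 60
print(f"{total_s:.0f} s = {total_min:.2f} min")
```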

Test phase

In Experiment 1A, we tested whether the participants could extract temporal statistics at the local (see Fig. 3a) and global level (see Fig. 3b), and whether the extent of the statistical learning was different between the two levels. To investigate this, we tested whether the participants could discriminate a base triplet from a nonbase triplet. The temporal order was preserved for the base triplets but changed for the nonbase triplets.

There were four within-subject variables. First, the extent of the temporal VSL was tested at each of the two different levels: local and global. Depending on the tested level, only individual shapes that had been present at the corresponding level during the familiarization phase were sequentially presented, whereas shapes that had been present at the other level during the familiarization phase were not presented. Second, there were two kinds of base triplets for each of the two levels: 1-2-3 and 4-5-6 for the local level, and A-B-C and D-E-F for the global level. Third, there were two kinds of nonbase triplets for each of the two levels: either 2-4-1 and 6-3-5 or 3-4-2 and 6-1-5 for the local level, and either B-D-A and F-C-E or C-D-B and F-A-E for the global level. We constructed the nonbase triplets by choosing three shapes from two different base triplets such that the joint probability of the three shapes in the nonbase triplet became 0. In contrast, the joint probability of the three shapes in the base triplets was 0.17, as in the familiarization. Fourth, there were two different orders of presentation: the base triplet first or the nonbase triplet first. The participants therefore performed 16 randomized trials (2 × 2 × 2 × 2). The frequency of the 12 individual shapes was equal for the base and nonbase triplets.
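The factorial structure of the test phase and the zero joint probability of the nonbase triplets can be checked with a short sketch. It assumes one of the two nonbase sets listed above (which set a given participant saw is not specified here), and the helper names are hypothetical. A triplet is treated as having zero joint probability when at least one of its two transitions could never occur during familiarization.

```python
from itertools import product

BASE = {"local": [("1", "2", "3"), ("4", "5", "6")],
        "global": [("A", "B", "C"), ("D", "E", "F")]}
# One of the two nonbase sets listed in the text (assumption: used for all trials).
NONBASE = {"local": [("2", "4", "1"), ("6", "3", "5")],
           "global": [("B", "D", "A"), ("F", "C", "E")]}
ORDERS = ["base_first", "nonbase_first"]


def allowed_transitions(base_triplets):
    """Shape pairs that can occur back to back at one level during familiarization."""
    within = {(t[i], t[i + 1]) for t in base_triplets for i in range(2)}
    # The last shape of any triplet can be followed by the first shape of any triplet.
    across = {(t[2], u[0]) for t in base_triplets for u in base_triplets}
    return within | across


def has_zero_joint_probability(triplet, base_triplets):
    """True if at least one transition in the triplet never occurred."""
    ok = allowed_transitions(base_triplets)
    return any((triplet[i], triplet[i + 1]) not in ok for i in range(2))


trials = [
    {"level": level, "base": base, "nonbase": nonbase, "order": order}
    for level in ("local", "global")
    for base, nonbase, order in product(BASE[level], NONBASE[level], ORDERS)
]
print(len(trials))
```

The same check also holds for the alternative nonbase set (3-4-2 and 6-1-5), because the transitions 4→2 and 1→5 never occur in the familiarization stream.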

The procedure adopted a two-alternative forced-choice (2AFC) task for familiarity judgments. Before either the base or nonbase triplets were presented, the words First and Second were shown on a blank screen for 1 s in order to distinguish the two triplets. The durations of each display and of the blank intervals between the displays were identical to those of the familiarization. After viewing the two sequences (i.e., the base and nonbase triplets), the participants were asked to decide which sequence was more familiar, based on their experience of the previous 12-min movie, by pressing 1 for the first sequence or 2 for the second sequence.

After the familiarity test, we also examined whether explicit awareness of temporal regularities influenced visual statistical learning, by using a binary confidence judgment task including two different statements (Bertels, Franco, & Destrebecqz, 2012). The first statement indicated that the test had been performed based on some kind of explicit knowledge (i.e., “I chose the answers based on some kind of knowledge that I learned during the familiarization phase.”). The second statement indicated that the test had been performed in an implicit manner (i.e., “I guessed the answers based on my intuition.”). Only the participants who had chosen the first statement were further asked what kind of knowledge they had by reporting whether they were aware of the existence of the temporal regularities and how many shapes regularly appeared in a sequence.

Results and discussion

In this and the following experiments, the extent of the temporal VSL was measured with the mean percentage of correct trials, that is, trials where the participants had selected the base triplets as being more familiar than the nonbase triplets. The results of Experiment 1A are shown in Fig. 4.

Overall, the mean percentage of correct responses was 66.25 %, which was significantly higher than 50 % (chance level), as assessed by a one-sample t test, t(19) = 3.73, p = .001. Specifically, the mean percentage for the local level was 67.50 %, which was significantly higher than chance, t(19) = 3.50, p = .002, while the mean percentage for the global level was 65.00 %, which also significantly differed from chance, t(19) = 3.21, p = .005. Therefore, the participants had learned the temporal regularities at both the local and global levels. However, the degree of learning did not differ significantly between the two levels, as revealed by a paired-samples t test, t(19) = .59, p = .560. In this and subsequent experiments, the participants could not rely on the frequency of the individual shapes, as every shape in the base and nonbase triplets had appeared equally frequently in the familiarization phase. The change of temporal regularities in the nonbase triplets for either of the two levels was the only source of information on which the participants could base their judgments.
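The reported p values can be reproduced from the t statistics alone. The sketch below uses only the standard library and integrates the Student's t density numerically with Simpson's rule; the function names are illustrative. It also derives the sample standard deviation implied by each reported mean and t value, which is a back-calculation from the figures above and not a number given in the paper.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # probability mass between 0 and |t|
    return 1 - 2 * central_area


def implied_sd(mean, chance, t, n):
    """Sample SD implied by t = (mean - chance) / (sd / sqrt(n))."""
    return (mean - chance) * math.sqrt(n) / t


# (mean percent correct, reported t) for each one-sample test against 50 %.
reported = {"overall": (66.25, 3.73), "local": (67.50, 3.50), "global": (65.00, 3.21)}
for label, (mean, t) in reported.items():
    print(label, round(two_tailed_p(t, 19), 4), round(implied_sd(mean, 50, t, 20), 1))
```

The same `two_tailed_p` function can be applied to the paired-samples comparison, t(19) = .59, to confirm that the difference between levels is far from significance.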

The presentation order did not influence the participants’ performance. A paired-samples t test showed no significant difference between the trials when the base triplets were presented first (test performance: 68.75 %) and those when they were presented second (test performance: 63.75 %), t(19) = .85, p = .408. We also tested the possibility that the participants might have used the test phase itself to learn parts of the base triplets. Comparing the performance of the first half (67.50 %) with that of the second half (65.00 %), we found no significant difference, t(19) = .44, p = .666, indicating that learning had not occurred during the test phase.

The awareness of the temporal regularities improved statistical learning. In the binary confidence judgment, 11 out of the 20 participants reported having used some kind of knowledge (test performance: 77.85 %). Among these 11 participants, eight specifically reported being aware of the temporal regularities (test performance: 85.15 %), whereas the other nine participants reported that they had responded by guessing (test performance: 52.08 %). The eight participants who were aware of the temporal regularities performed better than the other 12 who were not aware of the temporal regularities, as assessed with an independent-samples t test, t(18) = 5.91, p < .001.
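As a consistency check, the subgroup means reported above should recombine into the overall mean of 66.25 %. The arithmetic below uses only the reported figures; the mean for the 12 participants who were not aware of the regularities is not stated in the text and is derived here from the other values.

```python
# (number of participants, mean percent correct) from the confidence judgment.
groups = {"reported knowledge": (11, 77.85), "reported guessing": (9, 52.08)}

n_total = sum(n for n, _ in groups.values())
overall = sum(n * m for n, m in groups.values()) / n_total

# Mean of the 12 participants not aware of the regularities, implied by the
# overall mean (66.25 %) and the aware group's mean (8 participants, 85.15 %).
unaware_mean = (20 * 66.25 - 8 * 85.15) / 12

print(n_total, round(overall, 2), round(unaware_mean, 2))
```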

Experiment 1B

Visual statistical learning at the global level without shape information

Experiment 1A found that temporal statistical learning took place at both the local and global levels when the statistical regularities and the hierarchical structure were presented in separate domains (temporal and spatial, respectively). In addition, there was no difference between the extent of statistical learning at the local and global levels. However, one might argue that observers did not represent the hierarchical structure but, rather, used the two different shapes to perform the familiarity judgment task, such as the gray (global) and the black (local) shapes. To rule out this possibility, in Experiment 1B we again tested the temporal VSL at the global level without providing strong shape cues (i.e., the gray shading and the black shapes). In the global-level condition of Experiment 1B, new shapes replaced the local shapes that had appeared in the familiarization phase, thus preserving the hierarchy of the stimuli without the two shape cues. We hypothesized that the participants could still judge the familiarity correctly if they could use the spatial hierarchy of the local and global shapes that they had represented during the familiarization phase.