1 Introduction

As the mobile Internet has become increasingly accessible worldwide, online learning formats have diversified, with live video streaming (LVS) developing particularly rapidly. The COVID-19 pandemic further accelerated this process, as it necessitated a massive shift from traditional face-to-face classroom settings to online learning. In LVS, a synchronous online learning method, students learn from videos delivered through video conferencing technologies (e.g., Tencent conference software), which can transcend the time and space limitations of conventional classroom settings (Camilleri & Camilleri, 2022).

Unfortunately, the majority of the challenges of online learning strongly affect elementary students (children aged approximately 7 to 13 years) whose control function and sustained attention are still underdeveloped (Chen & Wang, 2017; Rice, et al., 2016). Due to the absence of monitoring and control in online learning, elementary students usually have difficulty keeping a sustained focus on learning when learning alone, resulting in worse learning performance and lower learning efficiency (Chen & Wang, 2017; Zhang, et al., 2019). However, the presence of a peer may compensate for the disadvantages or challenges faced when learning alone.

Co-viewing is a prevalent method of learning from videos. It refers to the learner viewing videos with another person (i.e., a co-viewer) who engages synchronously in the same learning activities (e.g., viewing the video together or interacting with each other; Azhari, et al., 2021). However, findings regarding the influence of co-viewing on students’ attention allocation, learning outcomes, and metacognition (i.e., the cognitive process of self-monitoring and self-control) are inconsistent. Some studies have claimed that co-viewing can facilitate learning performance (as indicated by retention and transfer) and metacognition in collaborative learning (e.g., Lytle, et al., 2018; Schneider and Pea, 2013), while others hold that a co-viewer may be a distraction, drawing the student’s attention and hindering their learning performance when the co-viewer appears in the chat window (e.g., Marlow, et al., 2017; Skuballa, et al., 2019). Meanwhile, several studies found no effects of co-viewing on learning from videos (e.g., Skouteris & Kelly, 2006; Tricoche, et al., 2020). Such inconsistent results could be attributed to the levels of learner-learner interaction in each study. Studies have shown that learner-learner interaction is an effective support in improving elementary students’ online learning outcomes (Liao, et al., 2021). However, little is known about how learner-learner interaction in the context of co-viewing affects elementary students’ attention allocation, learning outcomes, and metacognition in learning from LVS.

1.1 The effects of co-viewing on learning from videos

Co-viewing is a common practice when learning from videos. According to the drive theory of social facilitation, co-viewing promotes learning from videos (Zajonc, 1965) as the co-viewer is regarded as a source of arousal, whose presence stimulates the learner’s desire to perform well, and thus influences the learner’s visual attention, task performance, and metacognition (Karabenick, 1996; Zhu, et al., 2015). This has been evidenced by various empirical studies (Lytle, et al., 2018; Schneider & Pea, 2013). For example, Schneider and Pea (2013) tested the effects of co-viewing on real-time collaborative learning. They found that learners who could see their co-viewer’s gaze on the screen expended less effort, had better learning performance (as indicated by retention and transfer), and communicated more effectively than those in the no-gaze condition. Retention and transfer reflect different levels of learning performance, in that retention reflects the learner’s memory of new knowledge, while transfer reflects the learner’s application of new knowledge in novel situations.

Cognitive load theory, however, suggests that a learner’s working memory capacity is limited, and thus they are only able to focus on a limited amount of information at a time (Paas & Van Merriënboer, 1994). The distraction conflict theory also suggests that the social presence of a co-viewer can affect the attentional view, in that co-viewing may cause a conflict in the learner’s attention between the ongoing lesson and the co-viewer’s presence. This conflict might lead to cognitive overload, resulting in the learner expending more mental effort overall, leading to worse learning performance (Baron, et al., 1978; Marlow, et al., 2017; Skuballa, et al., 2019). For example, one study explored the influence of co-viewing on learners in online learning. It found that when the co-viewer was visible in the chat window, the learners’ attention was distracted, which hindered their understanding of the learning content, thus harming their learning performance as measured by a comprehension test (Marlow, et al., 2017).

Still, some empirical studies have shown no effects of co-viewing on learning from videos (Skouteris & Kelly, 2006; Tricoche, et al., 2020). Tricoche et al. (2020) assessed whether peer presence influenced learners’ attention and eye movements during different tasks (i.e., saccades, visual search, and continuous performance). Their results failed to reveal the difference between the peer presence and alone conditions on visual search and continuous performance tasks. Meanwhile, with regard to learning performance, Skouteris and Kelly (2006) also found no significant effects of co-viewing on vocabulary comprehension in children.

Taken together, the results of the above-mentioned studies suggest that co-viewing may not necessarily guarantee facilitated learning during video streaming. Skouteris and Kelly (2006) have suggested that the role of co-viewing is probably moderated by the interaction between the student and the co-viewer. Therefore, we speculated that the inconsistent results regarding the effect of co-viewing on video learning might be attributed to the level of a learner’s interaction with their co-viewer.

1.2 The effects of learner-learner interaction on learning

Learner-learner interaction is one of the three types of interactions that occur during learning (i.e., learner-content interaction, learner-instructor interaction, and learner-learner interaction; Moore, 1989). It has been defined as a conversation or interaction event between two or more learners that occurs across various learning environments (e.g., face-to-face classroom settings, synchronous/asynchronous online learning) through responses or feedback (Muirhead & Juwah, 2004). According to Piaget’s constructivist learning theory (1985), the child is an active participant in constructing knowledge. Learner-learner interaction encourages students to complete learning tasks through communication with their peers, during which they are able to question others freely and actively by participating in social discourse, argument, and learning. This interaction facilitates learning because contrasting viewpoints create sociocognitive conflict (Doise, et al., 1975).

Learner-learner interaction has been used widely in face-to-face classroom settings, where its positive effects on learning have been recognized and students are often encouraged to complete learning tasks through communication with their peers (Castellaro & Roselli, 2015; Tenenbaum, et al., 2020). Despite these advantages, the traditional classroom setting does have some limitations. For example, it requires all students to participate in learning activities in the same place and at the same time. In contrast, with the support of networks and technologies, online learning has the apparent advantage of flexibility in time and space (Shu & Gu, 2018).

Numerous studies have confirmed the benefits of learner-learner interaction for online learning (Sunar, et al., 2017; Tenenbaum, et al., 2020). From a social learning perspective, learner-learner interaction in online learning improves learners’ social presence and reduces the psychological distance between the learners, leading to improved course completion and knowledge construction (Castellanos-Reyes, 2021). Moreover, co-viewing may produce sociocognitive conflict, with the consensual process creating conditions for cognitive growth (Piaget, 1985). For example, Sunar et al. (2017) found that learners in massive open online courses (MOOCs) who interacted with their peers (e.g., by following them within the online course environment or engaging with them in discussion) were more likely to complete the course than those who lacked such interactions. Furthermore, a meta-analysis of 71 studies involving a total of 7,103 participants (aged 4 to 18 years) also found that students learned more when they completed a task that involved learner-learner interaction than when they learned alone or were in other comparison groups (Hedges’ g = 0.40, 95% confidence interval [CI]: [0.27, 0.54], p < .0001; Tenenbaum, et al., 2020).

Social interaction is also thought to enhance metacognition, causing students to reflect upon and compare their ideas against those of others (Frith, 2012). By interacting with others, students actively participate in social discourse and repeatedly think over their ideas, which is conducive to improving learning efficiency and metacognitive judgment (Castellaro & Roselli, 2015; Filius, et al., 2018). Judgment of learning (JOL) refers to students’ belief and confidence about completing learning tasks, reflecting their self-monitoring of the learning process (i.e., metacognition). Research on metacognition has shown that, between the ages of approximately 7 to 13 years, elementary students’ control function and sustained attention are still under development, making it essential to explore the effects of learner-learner interaction on this age group in particular (Rice, et al., 2016). In considering the existing literature, however, it seems reasonable to speculate that learner-learner interaction may influence the effects of co-viewing in elementary students.

To ensure learners fully interact with one another, questions or prompts are often set to act as scaffolds during learner-learner interactions to support cognition. These scaffolds can help learners focus the interaction content on a specific topic and guide learners’ attention to prevent them from wandering, especially for elementary students who generally have poor self-control (Giacumo & Savenye, 2020; Shu & Gu, 2018). Each scaffold is beneficial to help them achieve a specific teaching goal. For example, teachers will ask questions about a specific concept or ask for an explanation to promote learners’ memory and comprehension of scientific knowledge (Benedict-Chambers, et al., 2017). In the current study, the teaching goals were to have the learners remember, understand, and apply what they had learned. Therefore, several questions were provided to serve as scaffolding to facilitate the process of learner-learner interaction. In addition, taking into account the measurement of learning performance in previous studies as well as the teaching goals of this study, we developed retention and transfer tests to investigate learners’ memory, comprehension, and application of the knowledge.

1.3 The current study

The collective findings of existing studies have shown that co-viewing is likely to influence students’ visual attention, learning performance (indicated by retention and transfer), and metacognition (Lytle, et al., 2018; Zajonc, 1965), and that learner-learner interaction does have an impact on learning (Sunar, et al., 2017; Tenenbaum, et al., 2020). However, co-viewing does not always facilitate learning from videos (e.g., Tricoche, et al., 2020). It seems reasonable to assume that learner-learner interaction might therefore moderate the effects of co-viewing. Furthermore, while learning from LVS has become widespread due to the COVID-19 pandemic, it has been a great challenge for elementary students whose cognitive function is still under development (Camilleri & Camilleri, 2022; Dumontheil, et al., 2010). Therefore, the current study examined whether co-viewing and learner-learner interaction affected elementary students’ attention allocation, learning performance (i.e., retention and transfer), metacognition (i.e., judgment of learning), and perceptions of learning from LVS.

According to the drive theory of social facilitation (Zajonc, 1965), students’ engagement may be provoked by peer presence, motivating them to perform well in video lectures. Furthermore, constructivist learning theory suggests that learner-learner interaction facilitates both knowledge construction and learning performance. In contrast, according to cognitive load theory and distraction conflict theory (Baron, et al., 1978; Marlow, et al., 2017; Paas and Van Merriënboer, 1994; Skuballa, et al., 2019), peer presence in the context of co-viewing may be a distractor and trigger attention conflict between the ongoing task and the learner’s peer. Therefore, co-viewing may have both positive and negative effects on learning. Whether co-viewing facilitates a learner’s performance depends on the balance of social benefits and attentional losses, and learner-learner interaction might compensate for attentional losses in co-viewing LVS. Based on the aforementioned relevant theories (i.e., drive theory of social facilitation, cognitive load theory, distraction conflict theory, and constructivist learning theory) as well as the findings of previous research, we proposed the following hypotheses:

Hypothesis 1

Learners will pay the most attention to the co-viewer and the least attention to the video when co-viewing LVS with interaction, followed by when merely co-viewing, and finally when learning alone.

Hypothesis 2

Learners will demonstrate the best learning performance (as indicated by retention and knowledge transfer) when co-viewing LVS with interaction, followed by when merely co-viewing, and finally when learning alone.

Hypothesis 3

Learners will show the highest level of learning efficiency when co-viewing LVS with interaction, followed by when merely co-viewing, and finally when learning alone.

Hypothesis 4

Learners will report the best metacognition when co-viewing LVS with interaction, followed by when merely co-viewing, and finally when learning alone.

2 Methods

2.1 Participants and design

An a-priori power analysis conducted using G*Power (one-way ANOVA: f = 0.4, α = 0.05, power = 0.80, number of groups = 3) indicated that 83 participants would be sufficient for the planned analyses (Faul, et al., 2007; Jacob, et al., 2020). We randomly recruited 86 students in Grade 5 or 6 from a Chinese elementary school, all of whom had normal or corrected-to-normal vision and hearing. All participants were taking the basic classes in school (e.g., Chinese, mathematics, English, science, art, music, physical education, and morality and the rule of law), and all had the necessary levels of reading, comprehension, and expression ability to successfully complete the experiment. The study procedure had been approved by the local ethics committee, and all participants were advised of their right to withdraw at any point during the experiment. All student participants and their legal guardians gave their informed consent before beginning the experiment, and each participant received a gift after completing the experiment.

The experiment followed a one-way between-subjects design in which participants were randomly assigned to one of three groups: the learning alone group (LA; n = 29), the merely co-viewing group (CV; n = 28), or the co-viewing with interaction group (CV + I; n = 29). The demographic characteristics of all three groups are shown in Table 1. There were no significant differences in participant age (F(2, 83) = 1.33, p = .271, η2 = 0.03) or sex (χ2 = 0.32, p = .852) among the three groups.

Table 1 The demographic characteristics of the three groups

2.2 LVS and environment

The video lecture theme used for the LVS in the current study was “The Earth”, and the lesson consisted of four sections: the features and structure of the Earth (e.g., “The earth is very big, and approximately 40,000 km in diameter.”), geological hazards (e.g., “The plates constantly push against one another, and in some places they continuously rise or collapse, causing earthquakes.”), gravity (e.g., “The Earth’s gravity holds people on the outside of the planet to the ground.”), and living conditions on the Earth (e.g., “With the presence of water and an atmosphere, the Earth is the most suitable planet for human life that we know of, so far.”). The video lecture began with the appearance of a female teacher who introduced the student viewers to the video. After five seconds, her image disappeared from the screen. Each part of the video comprised two sections: the explanation section and the question section. The question section followed the explanation section, and presented the learners with a prepared question which was shown on-screen for 40 s. The video lecture was shown using Tencent meeting software, a computer-based synchronous learning environment, and the total video length was 6 min 20 s. The video played automatically, and participants were unable to pause or otherwise control the video playback. If the participant was in a co-viewing condition (i.e., either with or without interaction), their co-viewer was another, randomly assigned online student who watched the video synchronously from another room, with both students able to see each other’s images and hear any sounds their co-viewer made in real time. The Tencent meeting software screen was made up of the instructional video area and the co-viewer area (see Fig. 1).

Fig. 1 Two areas of the LVS

Participants in the LA condition watched the video alone, and the co-viewer area was black (Fig. 2a). Participants in the CV condition viewed the video with an online co-viewer present, whose image was shown in the co-viewer area, but no interactions between the two students were allowed (Fig. 2b). The CV + I condition was similar to the CV condition, except that in the CV + I condition the participants were asked to interact with their co-viewer during the question section (Fig. 2c).

Fig. 2 Screenshots of the LVS in LA (a), CV (b), and CV + I (c) conditions. The yellow words were shown in Chinese and have been translated; the red words were not shown on-screen

2.3 Instruments and measures

Before running the formal experiment, we interviewed eight randomly selected Grade 5 or 6 elementary school students to check whether they understood the items in the prior knowledge test and the learning performance test. Their responses indicated that all of them understood all the terms.

2.3.1 Prior knowledge test

Before being assigned to a condition, all students completed the prior knowledge test, which consisted of seven items that measured the students’ prior knowledge regarding “The Earth” (maximum score = 14; see Appendix A). The test included five multiple-choice items (one point for each correct answer, total of five points), three fill-in-the-blank items (one point for each blank, total of six points), and one open-answer item (total of three points). The open-answer item responses were rated by two trained raters, who had high inter-rater reliability (r = .95). The prior knowledge test had a moderate level of internal consistency (Cronbach’s α = 0.70).
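
For readers unfamiliar with the internal consistency index reported here, the following is a minimal sketch of how Cronbach's α can be computed from an item-by-participant score matrix. The function name and the toy data are illustrative assumptions; the study's raw item scores and the software the authors used are not given in the text.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-participant item-score lists.

    scores[p][i] is the score of participant p on item i (hypothetical layout).
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(scores[0])  # number of items
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two participants")
    # Variance of each item across participants
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of the participants' total scores
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy data (not from the study): three participants, three perfectly consistent items
toy_scores = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
alpha = cronbach_alpha(toy_scores)
```

With perfectly consistent items, as in the toy data, α reaches its upper bound of 1; weaker inter-item covariance pulls it toward 0.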

2.3.2 Learning performance test

To test the effect of co-viewing and learner-learner interaction on learning performance, we used retention and transfer tests to evaluate students’ recognition, comprehension, and application of the knowledge learned, a common procedure used in prior research on video lectures (Pi & Hong, 2016; Pi, et al., 2022; Schneider & Pea, 2013). The retention and transfer tests were developed specifically for the current study and can be seen in Appendix B. The retention test was used to measure participants’ memory of both the factual and conceptual information covered in the LVS (maximum score = 18), and was made up of one multiple-choice item (one point for correct answer), three fill-in-the-blank items (one point for each blank, total of five points), three true or false items (one point for each correct response, total of three points), and three open-answer items (three points for each item, total of nine points). Responses to the open-answer items were rated by two trained raters, who had high inter-rater reliability (respectively, r1 = 0.93, r2 = 0.97, r3 = 0.92; ps < 0.001). Reliability for the retention test was moderate, with an internal consistency of Cronbach’s α = 0.66. Concerning the transfer test, we developed one open-answer item to assess participants’ application of their new knowledge. Participants could score a maximum of four points on the item. The two trained raters rated responses with high inter-rater reliability (r = .87, p < .001).
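
The inter-rater reliabilities above are reported as correlation coefficients. As a hedged illustration, the sketch below computes a Pearson r between two raters' scores for the same set of responses. Whether the authors used Pearson's r specifically is an assumption, and the rater scores shown are made up for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length (>= 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores given by two raters to four open-answer responses (0 to 3 points)
rater_a = [0, 1, 2, 3]
rater_b = [0, 1, 3, 3]
r = pearson_r(rater_a, rater_b)
```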

2.3.3 Learning efficiency

To test the effects of co-viewing and learner-learner interaction on learning efficiency, we evaluated the degree to which participants improved their learning performance using the same amount of mental effort. Participants rated the mental effort they spent on learning during the experiment by answering the following question: “How much mental effort have you spent just now while learning about the Earth?” The students responded using a scale ranging from 1 (“very little”) to 9 (“a lot”; Paas and Van Merriënboer, 1994). The scores for mental effort and the learning performance test (i.e., the sum of the retention and transfer scores) were used to calculate students’ learning efficiency using the following formula (Van Gog & Paas, 2008):

$$\text{Learning efficiency} = \frac{Z_{\text{learning performance}} - Z_{\text{mental effort}}}{\sqrt{2}}$$

Learning efficiency is widely used in video learning research to weigh students’ learning performance against their invested mental effort. For instance, Yang et al. (2021b) calculated participants’ learning efficiency to explore the effects of generative learning strategies (i.e., imagination, drawing, and self-explanation) on JOL when learning from a scientific video. Participants were regarded as having high learning efficiency if they performed better on the learning performance test than would be expected based on their mental effort (i.e., efficiency score > 0). In contrast, their learning efficiency was considered to be low if they performed worse, and therefore had learned less than might be expected based on their invested mental effort (i.e., efficiency score < 0; Moning & Roelle, 2021).
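
The efficiency formula can be sketched in a few lines. The example below assumes that both scores are z-standardized across the whole sample using the sample standard deviation. The text does not state which standard deviation was used, and the function names and data are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev


def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def learning_efficiency(performance, mental_effort):
    """Efficiency = (Z_performance - Z_effort) / sqrt(2), one value per participant."""
    z_perf = z_scores(performance)
    z_effort = z_scores(mental_effort)
    return [(p - e) / sqrt(2) for p, e in zip(z_perf, z_effort)]


# Hypothetical data: performance = retention + transfer, effort on the 1 to 9 scale
performance = [10, 14, 18]
mental_effort = [7, 5, 3]
efficiency = learning_efficiency(performance, mental_effort)
```

In this toy sample, the third participant scores highest while reporting the least effort and therefore has a positive efficiency score, whereas the first participant's score is negative.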

2.3.4 Judgment of learning (JOL)

To test the effect of co-viewing and learner-learner interaction on JOL, we asked participants to rate their belief in their ability to perform three tasks related to the lesson (Eitel, 2016; Lindner, et al., 2021; Serra & Dunlosky, 2010). Participants rated each of the three items (i.e., recalling, answering, and explaining) on a scale ranging from 1 (“not confident”) to 100 points (“very confident”), and the mean of the three item scores was used to assess students’ metacognition (see Appendix C). This scale has been used widely to measure participants’ metacognition during video learning. For example, Pi et al. (2023) used the tool to test the influence of text or visual cues on students’ JOL in foreign language video learning. The reliability of the scale was high, with Cronbach’s α = 0.80.

2.3.5 Informal interviews

Short informal interviews were used to investigate the students’ perceptions of their LVS experience. Those in the LA condition were asked, (1) “What did you do in the question section?” and (2) “Would you prefer to have an online co-viewer when you learn using LVS?” Those in the two co-viewing conditions (i.e., CV and CV + I) were asked, (1) “How did you feel during the learning session?” and (2) “Did the presence of the co-viewer and the interactions you had with them influence your learning from the video?” All interviews were audio recorded and transcribed verbatim afterward. Two independent coders worked together to code the first 20% of the data, and had a high inter-coder reliability (r = .90, p < .001). After that, one person coded the remaining 80% of the data. An example of the coding would be, when a student said, “The presence of a co-viewer made me feel a lot more confident, and they seemed to encourage me while we were learning together through their eye contact and facial expressions,” this was coded as “confident” and “encouraged”.

2.4 Apparatus and eye movement data analysis

An aSee Glasses eye tracker (7 Invensun Technology, Beijing, China) with a sampling rate of 120 Hz was used to collect students’ eye movements as they watched the video lecture. The horizontal tracking angle was 80°, and the vertical angle was 50°. The LVS was displayed on a 14-inch laptop computer with a screen resolution of 1280 × 720, and the students sat in front of the computer wearing the eye tracker. Two areas of interest (AoI) were plotted, the video area and the co-viewer area (see Fig. 1). We tracked the students’ total fixation duration (i.e., the sum of the time the student focused on the screen), as well as their fixation duration on each AoI (i.e., fixation time attending to the video area, and to the co-viewer area) to analyze their attention allocation (Yang, et al., 2021a).
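
To make the attention measures concrete, the sketch below sums fixation durations over the whole screen and within each AoI. The AoI rectangles, the fixation record layout `(x, y, duration_ms)`, and all names are hypothetical; the actual AoI coordinates and the eye tracker's export format are not given in the text.

```python
from typing import NamedTuple


class AoI(NamedTuple):
    """Axis-aligned rectangle in screen pixels (hypothetical coordinates)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


# Assumed layout on the 1280 x 720 display: video on the left, co-viewer on the right
SCREEN = AoI("screen", 0, 0, 1280, 720)
AOIS = (AoI("video", 0, 0, 960, 720), AoI("co_viewer", 960, 0, 1280, 720))


def fixation_durations(fixations, aois=AOIS, screen=SCREEN):
    """Return total on-screen fixation time and fixation time per AoI (ms)."""
    totals = {"total": 0.0}
    totals.update({aoi.name: 0.0 for aoi in aois})
    for x, y, duration_ms in fixations:
        if not screen.contains(x, y):
            continue  # fixations off the screen are ignored
        totals["total"] += duration_ms
        for aoi in aois:
            if aoi.contains(x, y):
                totals[aoi.name] += duration_ms
                break
    return totals


# Hypothetical fixations: (x, y, duration in ms); the last one falls off-screen
sample = [(100, 100, 200.0), (1000, 300, 150.0), (500, 600, 250.0), (1500, 100, 99.0)]
durations = fixation_durations(sample)
```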

2.5 Procedure

One week before participating in the experiment, the students’ legal guardians gave their informed consent for their child to take part in the study. The experiment was conducted simultaneously in two classrooms, one in which the study participant worked, and the other room for the co-viewer. The full experiment lasted approximately 50 min (see Fig. 3). After giving their verbal informed consent and being informed about the experimental procedure, all participants first responded to the demographic questionnaire (e.g., age, gender), and completed the prior knowledge test (8 min). They then took a two-minute rest. Next, the participants were fitted with the portable eye tracker and were randomly assigned to one of the three conditions to watch the LVS (10 min). After watching the LVS, participants completed the mental effort scale, the JOL scale, and the learning performance tests (20 min). Finally, informal interviews were conducted with each participant to collect further data regarding the participant’s perceptions of the experience (5 min).

Fig. 3 Experimental procedure

3 Results

A series of analyses were conducted to examine the differences between the conditions in terms of attention allocation, behavioral, and self-reported variables. First, the total fixation duration, the fixation duration on the instructional video area, and the fixation duration on the co-viewer area were analyzed to assess the differences in students’ attention allocation across the three conditions. As the three variables were not normally distributed (ps < .05), Kruskal-Wallis H tests were conducted. All behavioral and self-reported variables (i.e., retention, transfer, learning efficiency, and JOL) were shown to have followed the normal distribution and homogeneity of variance (ps > .05), so a series of one-way analyses of variance (ANOVAs) were conducted to test for differences between the three conditions (η2 used as effect size; small: 0.01 ≤ η2 < 0.06; medium: 0.06 ≤ η2 < 0.14; large: η2 ≥ 0.14; Cohen, 1988). Finally, the interview results were coded to provide additional evidence for findings regarding participants’ eye movement and behavioral performance. The descriptive statistics results of all dependent variables are shown in Table 2.
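
For a one-way ANOVA, η2 can be recovered from a reported F statistic and its degrees of freedom, since η2 = (F × df_between) / (F × df_between + df_within). The sketch below applies this identity and the Cohen (1988) thresholds listed above. The function names and the "negligible" label for values below 0.01 are illustrative additions, not part of the original analysis.

```python
def eta_squared(f_value, df_between, df_within):
    """Eta squared for a one-way ANOVA, derived from F and its degrees of freedom."""
    return f_value * df_between / (f_value * df_between + df_within)


def effect_size_label(eta2):
    """Classify eta squared using the thresholds quoted from Cohen (1988)."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"  # label assumed here for values below 0.01


# Age comparison from Sect. 2.1: F(2, 83) = 1.33
age_eta2 = eta_squared(1.33, 2, 83)
age_label = effect_size_label(age_eta2)
```

For the age comparison reported in Sect. 2.1, this yields a value that rounds to the reported η2 = 0.03, which falls in the small range.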

Table 2 The means (M) and standard deviations (SD) of all dependent variables

3.1 Attention allocation

The Kruskal-Wallis H test showed no significant differences in the total fixation duration across the three groups (H = 5.28, p = .071); however, it did indicate a significant difference in the fixation duration on the video area (H = 13.63, p = .001). Further pairwise comparisons showed that the fixation duration on the video area was longer in the LA and CV groups than it was in the CV + I group (see Fig. 4a), but no significant differences were found between the LA and CV groups (p = .245).

The Kruskal-Wallis H test also showed a significant difference in the fixation duration on the co-viewer area (H = 18.96, p < .001). Specifically, more attention was allocated to the co-viewer area by the students in the CV and CV + I groups than by the LA group. Furthermore, more attention was allocated to the co-viewer area by the CV + I group than by the CV group. These results suggest that the combination of the presence of a co-viewer and interaction with that co-viewer will increase students’ fixation time on the co-viewer area (Fig. 4b). These findings regarding attention partially support Hypothesis 1.
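
As an illustration of the test used in this section, the following sketch computes the Kruskal-Wallis H statistic with the usual tie correction. With three groups, H is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability is exactly exp(-H/2). The function names and toy data are assumptions; the authors' statistical software is not named in the text.

```python
from collections import Counter
from math import exp


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H with tie correction for two or more groups of numbers."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Assign each distinct value the mean of the ranks it occupies
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    # Tie correction (undefined if every observation is identical)
    tie_counts = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_counts) / (n ** 3 - n)
    return h / correction


def p_value_three_groups(h):
    """Upper-tail chi-square probability for df = 2 (i.e., three groups)."""
    return exp(-h / 2)


# Toy data (not from the study): three fully separated groups
h_toy = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
p_toy = p_value_three_groups(h_toy)
```

Applied to the statistic reported for total fixation duration, `p_value_three_groups(5.28)` gives approximately .071, in line with the reported p value.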

Fig. 4 Fixation duration on the video area (a) and the co-viewer area (b) among the three groups

3.2 Learning performance

Two one-way ANOVA tests were conducted on the retention test and transfer test scores. Concerning the retention test, a significant difference was found between the scores of the three groups (F(2, 83) = 3.39, p = .039, η2 = 0.08). Specifically, students who were co-viewing with interaction had better retention performance scores than those who learned alone. No other significant differences were found (see Fig. 5a).

Concerning the transfer test, a significant difference was found in the scores of the three groups (F(2, 83) = 11.29, p < .001, η2 = 0.21). Further least significant difference (LSD) post hoc testing indicated that the students in either of the co-viewing groups, both with and without interaction, performed better than those who learned alone. Furthermore, students in the CV + I group demonstrated better transfer performance than those in the CV group. These results indicate that both co-viewing and co-viewing with interaction improve students’ transfer performance (see Fig. 5b). The findings regarding learning performance (i.e., retention and transfer) largely support Hypothesis 2.

Fig. 5 Retention (a) and transfer performance (b) among the three groups

3.3 Learning efficiency

The ANOVA on the students’ learning efficiency showed a marginally significant difference among the three groups (F(2, 83) = 2.63, p = .078, η2 = 0.06). Further LSD post hoc tests showed that students in the CV + I group had significantly higher learning efficiency than those in the LA group (see Fig. 6a). The learning efficiency results partially support Hypothesis 3.

3.4 JOL

To assess students’ metacognition, an ANOVA was conducted on the JOL scores. The results indicated a marginally significant difference between the group scores (F(2, 83) = 2.94, p = .058, η2 = 0.07). Specifically, students in the CV + I group reported significantly higher JOL than those in the CV group (MD = 9.55, p = .028; see Fig. 6b). No other significant differences were found. These findings show the positive effect of learner-learner interaction on JOL, which partially supports Hypothesis 4.

Fig. 6 Learning efficiency (a) and JOL (b) among the three groups

3.5 Informal interviews

3.5.1 Interviews of learning alone group

When asked about their experience during the interview, 25 out of the 29 students in the LA group expressed that they “recalled the knowledge in the video explanation section”, or that they “thought about the questions quietly without speaking aloud.” Only four of the students reported speaking the answers out loud in addition to thinking about them. Most of the students made a comment along the lines of, “I did not know whether my answer was correct, so I wanted to consider it and think about it in my head first, rather than saying it out loud.”

We also collected the students’ thoughts on whether they would have preferred to have had an online co-viewer during the lesson. A total of 27 students replied that they would have liked to have learned with a co-viewer, giving reasons which included “competition”, “enjoyment”, and “idea-sharing”. One participant commented, “If the co-viewer doesn’t interrupt my learning, then I would prefer to learn with them because that’s more fun than learning alone.” Only two students replied that they preferred not having a co-viewer during the study, both explaining that it was because they feared the co-viewer would be an “interruption.”

3.5.2 Interviews of co-viewing groups (CV and CV + I)

We also asked students in the two co-viewing groups about their learning experience. Both groups expressed generally positive attitudes towards the experience. Most students stated satisfaction with co-viewing, commenting that “the learning process was interesting,” and that they “preferred this way of learning.” However, a few students did report a negative perception of the experience, reporting “interference” as the reason.

Figure 7 shows the attitudes of the students in the CV group regarding the influence of co-viewing while learning. Twelve students claimed that the presence of the co-viewer did not influence their learning, four expressed a negative perception of the experience (e.g., feeling distracted), and the remaining 13 students reported that the online co-viewer had a positive influence on their learning, using descriptors such as “confident”, “relaxed”, and “not lonely” to describe the experience. As one student put it, “The presence of a co-viewer gave me a lot of confidence, and he seemed to encourage me to learn together with him through his eye gaze and facial expressions.”

Fig. 7. Students’ attitude towards the influence of co-viewing (CV group). Categories are not mutually exclusive, as one participant could have mentioned more than one category

Concerning students’ attitudes towards the influence of co-viewing coupled with interaction on their LVS learning, all students in the CV + I group believed that the learner-learner interaction had a positive influence on their learning. After coding, 44 comments were collected, with students claiming that the learner-learner interaction was “helpful” (n = 8), and allowed the students to “share ideas” (n = 8), “perfect answers” (n = 17), and “understand the knowledge deeply” (n = 5). For instance, one student commented, “I preferred to listen to the co-viewer’s views first, and then express my own opinions so that I could perfect my answers.” Some students also noted that the interactions were “relaxed” (n = 2), “encouraging” (n = 1), and “motivating” (n = 1). Interestingly, two students mentioned that the interaction condition did make them feel “nervous”, but that they ultimately still found it “helpful.”
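As a quick arithmetic check on the coded comments, the Python sketch below tallies the category counts reported above. It assumes, which the text does not state, that the two “nervous” mentions are counted among the 44 coded comments; the variable names are illustrative.

```python
# Category counts as reported for the CV + I group interviews.
positive_counts = {
    "helpful": 8,
    "share ideas": 8,
    "perfect answers": 17,
    "understand the knowledge deeply": 5,
    "relaxed": 2,
    "encouraging": 1,
    "motivating": 1,
}

# Assumption: the two "nervous" mentions are part of the 44 coded comments.
nervous_mentions = 2

positive_total = sum(positive_counts.values())
coded_total = positive_total + nervous_mentions
print(positive_total, coded_total)
```

Under that assumption the category counts add up to the 44 comments reported, with “perfect answers” the most frequent category.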

4 Discussion

4.1 Empirical contributions

Although co-viewing and interaction are common in online learning, until now, the effects of co-viewing with interaction on elementary students learning from LVS have been unclear. The current study aimed to test the effects of co-viewing on LVS learning in elementary students, and whether learner-learner interaction moderated students’ attention allocation, learning performance (i.e., retention and transfer), learning efficiency, and metacognition (JOL). Our results showed that learner-learner interaction was beneficial in co-viewing, but that CV (without interaction) was, on most measures, no better than learning alone. In other words, our findings confirm the moderating role of learner-learner interaction on the effect of co-viewing when elementary students are learning from LVS.

As expected, our results indicated that the students in the CV + I group achieved the best learning performance (i.e., retention and transfer), showed the highest level of learning efficiency, and reported the highest metacognition of all three conditions. These results align with those of previous studies, which have shown that learner-learner interaction benefits learning performance and metacognition (Castellaro & Roselli, 2015; Filius, et al., 2018; Sunar, et al., 2017; Tenenbaum, et al., 2020). There are several possible reasons for the benefits of co-viewing with interaction. First, as a social cue, the interaction between students can enhance their sense of social presence, shortening the distance felt between them and thus reducing their sense of loneliness, facilitating their completion of courses and strengthening their knowledge construction (Castellanos-Reyes, 2021). The results of the informal interviews conducted as part of the current study are consistent with this explanation: the students in the CV + I group reported more positive feelings (e.g., helpful, relaxed, and confident).

Second, existing studies have already shown that learner-learner interaction is conducive to the sharing of ideas, which further promotes knowledge construction and improves learning efficiency (Castellaro & Roselli, 2015; Filius, et al., 2018). When students do not share the same opinions as their co-viewer, cognitive conflict will occur, which provides conditions for cognitive development. Additionally, the JOL results showed that, compared to the CV group, the CV + I group reported higher confidence in recalling, answering, and explaining their new knowledge. This result might be because the learner-learner interaction enabled students to compare their understandings with those of their co-viewer, and allowed them time to consider their ideas. These processes have been shown to promote metacognition, especially in elementary students whose cognitive functions are still developing (Frith, 2012; Rice, et al., 2016). In line with Piaget’s constructivist learning theory (1985), learner-learner interaction allows students to effectively monitor their learning process, detect problems in their cognition, and correct these issues as soon as possible. As expected, our results regarding learning efficiency also indicated that the students in the CV + I condition showed higher learning efficiency than those in the LA condition. This finding suggests that when students invest a similar amount of mental effort into a task, learner-learner interaction can help them achieve a higher learning performance. As the interview results for the current study demonstrated, the students claimed that the interaction did help them perfect their answers and understand the lesson more deeply.

Importantly, the eye movement tracking results showed that when interaction was allowed, the students allocated more attention to their co-viewer and less time to the LVS itself. However, this distraction of attention did not reduce the students’ learning outcomes. Instead, co-viewing with interaction led to the best learning outcomes (i.e., retention, transfer, and learning efficiency) of all three conditions. Sustained attention in elementary students is limited, so a co-viewer can easily attract their attention, which may have caused attention conflict between the ongoing task and the presence of their co-viewer in the current study (Baron, et al., 1978; Paas & Van Merriënboer, 1994). However, the empirical results suggest that when learner-learner interaction was allowed, the attention split was meaningful. The student was likely attracted to their co-viewer because they provided the student with additional information, for example with eye contact or facial expressions, or through discourse generated through the interactive process, all of which promoted the student’s knowledge construction and metacognition (Castellaro & Roselli, 2015; Filius, et al., 2018).

Interestingly, the positive effects of co-viewing were not found in the absence of interaction, which was inconsistent with our expectations. Compared to the students learning alone, those co-viewing without interaction did not show significant differences in their fixation duration during the LVS, or in their learning performance (i.e., retention), learning efficiency, or JOL. However, the presence of a co-viewer did increase the students’ fixation duration on the co-viewer area. The data gathered through the informal interviews may provide some explanation for this, in that half of the students in the CV group did not express a positive perception of the co-viewing experience. These findings are in alignment with those of Skouteris and Kelly (2006), who also did not find any significant effects of co-viewing in children completing a vocabulary comprehension task. It could be that, as the sustained attention of elementary students is still weak, their attention is easily attracted by a co-viewer, drawing their attention away from the content they are meant to be learning (Baron, et al., 1978; Rice, et al., 2016; Chen & Wang, 2017). As this attraction of their attention did not provide the students with any additional information about the learning content or knowledge construction, however, the positive effect of co-viewing may thus have been counteracted. The test results were also in line with the interview results, with the students with a negative attitude toward CV commenting that the co-viewer’s presence may have distracted them from their LVS learning.

4.2 Theoretical contributions

This study extends the applicability of constructivist learning theory to synchronous online video learning. According to constructivist learning theory, the four learning environment elements are situation, collaboration, conversation, and meaning construction (Piaget, 1985). Therefore, using constructivist learning theory as a starting point, we designed a learning activity in which students could interact with a co-viewer. Our results indicated that learners constructed meaning through interaction with their co-viewer, resulting in improved learning outcomes. This finding verifies the guiding significance of constructivist learning theory in the context of the design of synchronous online learning activities.

Second, the current study is one of the first attempts to thoroughly examine the moderating role of learner-learner interaction during co-viewing of LVS, and shows support for the boundary condition in the drive theory of social facilitation. Previous studies have not shown consistent conclusions regarding the effects of co-viewing, with some studies finding positive influences (Lytle, et al., 2018; Schneider & Pea, 2013; Zajonc, 1965) and others finding negative or zero effects (e.g., Baron, et al., 1978; Paas and Van Merriënboer, 1994; Skouteris and Kelly, 2006; Tricoche, et al., 2020). The results of the current study indicate that there may be no obvious positive effect of the mere presence of a co-viewer while a student learns from a video lecture, which extends the existing literature and provides support for the boundary condition in the drive theory of social facilitation. The social facilitation effect suggests that, in the process of learning, the mere presence of a co-viewer can be enough to enhance a student’s motivation and level of psychological arousal, regardless of whether the co-viewing occurs with or without learner-learner interaction (Karabenick, 1996; Zajonc, 1965). However, the present study showed that merely co-viewing was not enough to facilitate learning from LVS in elementary students. Their learning with a co-viewer is instead moderated by learner-learner interaction.

The study was conducted in an elementary school in central China, and we should be cautious in generalizing the findings to other cities or countries. These results may be influenced by students’ cultural backgrounds. For example, cross-cultural studies have shown that the Chinese are relation-oriented, while Westerners are task-oriented (Markus & Kitayama, 1991; Pi & Hong, 2016). Thus, Chinese students may be particularly sensitive to interaction and prefer to interact with others. In contrast, Western students tend to pay more attention to learning tasks than to interaction. Therefore, interaction may play different roles among Chinese students and Western students.

Finally, the current study used multiple approaches to gather data, combining eye tracking technology with questionnaire surveys, knowledge tests, and informal interviews to evaluate both the learning process and outcomes of the students. Previous studies have not focused on the effects of co-viewing and interaction together on elementary students’ attention allocation, learning outcomes, and metacognition. In fact, due to their lack of development of self-control and sustained attention, elementary students generally have more difficulty focusing on learning materials, and are more easily attracted by things other than the learning content (e.g., co-viewers; Rice, et al., 2016). The present study provides evidence for the beneficial effects of co-viewing and interaction for elementary students when learning from LVS, with the notable finding that co-viewing alone, without interaction, was not shown to promote learning from LVS in elementary students. Instead, it appeared to lead to distraction, which in turn impeded the students’ learning outcomes. However, learner-learner interaction moderated the co-viewing effect, improving the students’ learning performance, learning efficiency, and metacognition.

4.3 Limitations and future directions

Despite its empirical and theoretical contributions, several limitations of the current study must be acknowledged. First, we did not consider different individual characteristics of the students (e.g., personality, emotional state). Studies have found that personality plays a role in the effect of others’ presence. Specifically, extroverted students tend to have a positive attitude towards others’ presence, whereas neurotic students feel the opposite (Ku, et al., 2020; Uziel, 2007). Furthermore, evidence has shown that positive emotional states appear to benefit learning experiences and performance, while negative emotions may hinder them (Miller, et al., 2018). Therefore, future research should examine the moderating role of individual characteristics on co-viewing in elementary students.

Second, to minimize the disruption of their normal school schedules, the students had only a few minutes of rest between each different stage of the study. As a result, the students may have felt fatigued, which could have affected their learning abilities. Previous studies have shown that fatigue is detrimental to learning performance in various tasks (Carron & Ferchuk, 1971; You, et al., 2019). Furthermore, the students were interviewed immediately after they had completed the post-tests, and they may not have had enough time to reflect deeply on their learning experience. Therefore, future research should incorporate increased break times to avoid the students experiencing fatigue, and to allow them to better elaborate on their learning experience.

Third, there is a limitation in the measurement of learning outcomes in the current study. Following previous studies (e.g., Pi and Hong, 2016; Pi, et al., 2022), we only measured learning performance as indicated by retention (i.e., the memorization of concepts) and transfer (i.e., the application of knowledge in new situations). However, as the LVS in the current study taught scientific knowledge, students could have already been familiar with the topic, but may have misunderstood some of the concepts. Therefore, the learning goal should not be limited to just memory and application of factual knowledge. Furthermore, complex questions during interactions should be considered as a scaffolding measure to promote deeper learning (e.g., conceptual change, skills, and values). Moreover, according to Zajonc’s social facilitation theory (1965), when individuals complete simple or skilled tasks, the presence of another can improve their task performance, whereas when they complete complex or challenging tasks, the presence of another will reduce it. Thus, future research should investigate whether the difficulty and complexity of scaffolding influence the effects of co-viewing with or without interaction.

Finally, learning is a complicated process, but this study focused mainly on the effects of a co-viewer and interaction on attention allocation, learning performance (i.e., retention and transfer), learning efficiency, and metacognition. Future research should pay more attention to other relevant factors. For example, researchers may investigate learners’ interaction behaviors and discourse using network analysis methods and explore the connections between variables in depth to better understand the complex learning process (Castellanos-Reyes, 2021).

4.4 Practical implications

The rise of online learning has led to LVS becoming increasingly popular. Synchronous online learning can approximate face-to-face learning, allowing learners to interact with each other in real time, while also making up for the shortcomings of traditional classrooms, which are limited by time and space. Co-viewing LVS via video-conferencing software (e.g., Tencent conference) has become a common method for synchronous online learning. Therefore, it is important to understand how best to design teaching activities and learning contexts that will effectively maintain elementary students’ attention and promote their learning outcomes. The current study tested the effects of co-viewer presence and learner-learner interaction in elementary students’ learning from LVS, and showed that learner-learner interaction benefited co-viewing LVS.

The findings of the current study provide practical implications for the social context of learning from LVS. First, we found that learner-learner interaction in co-viewing LVS facilitated elementary students’ learning performance (i.e., retention and transfer), learning efficiency, and metacognition. Therefore, learner-learner interaction should be encouraged in co-viewing LVS. Second, we observed that students merely co-viewing LVS, without learner-learner interaction, did not fixate longer on the instructional video than those viewing LVS alone, but instead fixated longer on their co-viewer. Therefore, when there is no learner-learner interaction activity, the co-viewer window should be turned off to avoid the split-attention effect of co-viewing.