Introduction

Prior research has highlighted the benefits of YouTube for learning (e.g., see Jackman, 2019; Rosenthal, 2018), including increased engagement, better comprehension, and the flexibility to control the learning experience (Jebe et al., 2019; Kay, 2012; Stockwell et al., 2015). Consequently, educators and teachers have recognized the educational value of YouTube, employing it for instructional purposes (Jung & Lee, 2015; Manca & Ranieri, 2016). One such purpose is the integration of YouTube explanatory videos into formal learning environments.

In recent years, there has been extensive science education research on the explaining quality of YouTube explanatory videos: Kulgemeyer (2020) developed a comprehensive framework for effective explanatory videos, based on guidelines published earlier in the literature (e.g., see Brame, 2016; Findeisen et al., 2019). Further studies have investigated the relationship between surface features, such as likes and views, and the explaining quality (i.e., the instructional quality) of YouTube explanatory videos (Kulgemeyer & Peters, 2016; Bitzenbauer et al., 2023): These studies revealed that the surface features provided by YouTube may not serve as reliable indicators of the explaining quality of a specific video, whereas a statistically significant correlation was found between the number of content-related comments and the explaining quality. Based on these findings, Bitzenbauer et al. (2023) emphasize that it is crucial to support teachers “in selecting videos with high explanation quality from the plethora of (online) resources” (p. 2) through evidence-based selection criteria.

However, to date, there is a dearth of studies investigating the video selection practices of physics pre-service teachers on YouTube, particularly concerning their decision-making factors. It remains unclear whether teachers rely on YouTube’s provided metrics, such as likes, views, or the age of the video, or if they consider the comments section influential. This article addresses this research gap by presenting the findings of a mixed-methods study that explores the decision-making processes of (pre-service) physics teachers when selecting instructional videos on YouTube to be included in learning environments (see “Methods”). The study employs a combination of eye-tracking, think-aloud interviews, and a retrospective questionnaire survey to gain comprehensive insights into the thought processes and strategies employed by pre-service physics teachers during the video selection process.

Research Background

Explanatory videos, also referred to as instructional videos, play a vital role in science education research, for example, serving as concise introductions to and explanations of specific topics of interest (Wolf & Kratzer, 2015). Explanatory videos typically do not exceed 10 min in length and have garnered increased attention in both formal and informal learning environments, especially through platforms such as YouTube (e.g., see Beautemps & Bresges, 2021; Pattier, 2021).

Quality Criteria of Instructional YouTube Videos

Recent scholarly investigations have focused on understanding the factors contributing to the success and popularity of explanatory YouTube videos, particularly in the field of science (Beautemps & Bresges, 2021; Welbourne & Grant, 2016). Notably, the video structure has emerged as a crucial determinant in this regard (Beautemps & Bresges, 2021).

However, the primary objective of explanatory videos is to support student learning, making the quality of explanations of utmost importance (Kulgemeyer & Wittwer, 2023; Pekdag & Le Marechal, 2010). Researchers have explored various frameworks and guidelines to enhance the effectiveness of explanatory videos. For example, Kulgemeyer (2020) proposed a comprehensive framework for creating effective explanatory videos that aligns with guidelines established earlier by Brame (2016) and Findeisen et al. (2019). Furthermore, this framework incorporates insights from multimedia learning research and draws upon studies related to instructional explanations conducted by Geelan (2012) and Wittwer and Renkl (2008). The framework encompasses seven factors comprising 14 features that collectively influence the effectiveness of explanatory videos. These factors include video structure, language-level adaptation, minimal digressions, consideration of prior knowledge, misconceptions, and student interest (Kulgemeyer, 2020).

Kulgemeyer and Wittwer (2023) empirically tested the effectiveness of the framework by comparing student achievement when exposed to videos developed in accordance with the framework against those that did not strictly adhere to the guidelines. The results demonstrated that students exposed to videos closely aligned with the framework exhibited significantly higher levels of declarative knowledge in post-tests (\(d = 0.42\)), although no statistically significant difference was observed in post-test scores related to conceptual knowledge.

The correlation between video metrics provided by YouTube, such as the number of views or likes, and the videos’ explaining quality has yielded mixed results: Kulgemeyer and Peters (2016) conducted an exploratory study focusing on instructional YouTube videos on mechanics topics and found that the number of content-related comments posted by users below a specific video was the only variable that correlated significantly with the explaining quality. Conversely, the number of views, likes, and dislikes did not exhibit significant correlations. Similar findings have been brought forth by Kocyigit and Akaltun (2019), who evaluated 53 online videos using the Global Quality Scale and found that the YouTube metrics did not significantly differ across quality groups. Bitzenbauer et al. (2023) conducted an additional exploratory study that specifically examined explanatory YouTube videos on quantum topics, namely quantum entanglement and quantum tunneling. In contrast to earlier findings, the authors observed a small but significant correlation between the number of likes and the quality of explanations in their sample of quantum topic videos (\(r = 0.37\), \(p < 0.01\)).

Selection Processes of YouTube Explanatory Videos

The increasing abundance of low-quality educational content on YouTube has become a matter of concern for researchers (e.g., see Bohlin et al., 2017; Neumann & Herodotou, 2020; Tan, 2013), highlighting the crucial role that teachers play in selecting explanatory videos of high quality (Chtouki et al., 2012; Jones & Cuthrell, 2011). This issue is further exacerbated by the reliance on popularity-based rankings in search systems: For instance, Chelaru et al. (2012) observed that the top ten videos in the YouTube search results received a disproportionately higher number of views, likes, and comments. Additionally, the study by Chavira et al. (2021) revealed that out of the ten most-viewed videos analyzed in their study, only four were deemed satisfactory in terms of quality.

Despite existing research on YouTube video selection, to the best of our knowledge, no studies have specifically examined the process by which teachers select videos from the list provided by YouTube based on search queries. However, several studies have shed light on user behavior, indicating that individuals often sequentially view the returned videos until they find one that aligns with their needs (e.g., see Fyfield et al., 2021; Tan & Pearce, 2011). The abundance of choices available on YouTube may contribute to choice overload, making it challenging to identify high-quality content (Toffler, 1984). Choice overload describes the phenomenon of increased difficulty in decision-making when faced with a large number of choices (Schwartz, 2016), potentially resulting in decreased motivation to engage with individual options (Iyengar & Lepper, 2000).

Against this backdrop, it becomes even more apparent that it is essential to support teachers in the process of selecting explanatory videos for classroom practice. Two main measures have been at the center of the debate so far:

  1. Ranked lists of educational channels have been published to “help Internet users to narrow down their search space by recommending channels” (Tadbier & Shoufan, 2021, p. 3079). However, “there is no reason to assume that the extensive offer of ranked lists would not lead to choice overload” (Tadbier & Shoufan, 2021, p. 3079).

  2. To tackle these challenges, scholars have suggested the utilization of decision-assistance tools like meta-search engines, which employ aggregation techniques (Dwork et al., 2001; Haveliwala, 2002; Meng et al., 2002), as described by Tadbier and Shoufan (2021).

Research Rationale

While we agree that pre-made lists or similar resources might not optimally support (science) teachers in the process of selecting YouTube explanatory videos for classroom practice, we believe that the existing empirical evidence on the explaining quality of YouTube explanatory videos might indeed be useful for systematizing teachers’ decision-making processes. As sketched in “Quality Criteria of Instructional YouTube Videos,” several studies have brought forth hints of the instructional quality of online explanatory videos and might, hence, provide evidence in this regard:

  • Bitzenbauer et al. (2023) found a statistically significant correlation (\(r = 0.46\), \(p < 0.001\)) between explaining quality and the number of content-related comments in YouTube videos on quantum topics. Similarly, Kulgemeyer and Peters (2016) reported a significant correlation (\(r = 0.38\), \(p < 0.01\)) between explaining quality and the number of relevant comments for videos on Newton’s third law and Kepler’s laws.

  • Bitzenbauer et al. (2023) discovered a significant correlation (\(r = 0.37\), \(p < 0.01\)) between the number of likes and video explaining quality.

  • It is important to exercise caution when considering additional metrics provided by YouTube, such as the number of views, as previous research has not found stable correlations with explaining quality.

Of course, reviewing the comments under each video in search of high-quality explanatory videos is practically unfeasible and time-consuming. Moreover, based on the available evidence, it is challenging to determine a quantitative threshold indicating an adequate number of content-related comments or likes. Nonetheless, considering the current state of research, it appears feasible to explore ideal (i.e., efficient) decision-making processes by systematically analyzing the order in which the different criteria can be employed by teachers. As a starting point for future research aimed at supporting teachers in their decision-making when selecting YouTube explanatory videos for science teaching, we propose the decision tree presented in Fig. 1, which summarizes the hints of instructional quality of explanatory videos according to the current state of research described above. It is noteworthy that the proposed order is not to be considered strict.

This decision tree suggests that teachers first ask themselves quick initial questions during their search, such as whether a given video has an appropriate duration for classroom use or whether it has already received user likes. If both of these surface-level criteria are met, it recommends that teachers explore the comments section. The presence of not only superficial but also content- or video-related comments indicates cognitive stimulation of viewers, as “videos that accumulate plenty of those relevant comments are more successful in catching viewers’ attention as these videos might use either a more stimulating explanation or the explanation delivered is considered as a starting point for further learning progress” (Kulgemeyer & Peters, 2016, p. 12). Moreover, if there are also interactions among users, such as responses to content-related comments, this may provide additional evidence of a high-quality video. Finally, teachers are encouraged to assess the instructional quality of the video by viewing it themselves.
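
To make the order of criteria concrete, the hypothesized decision tree can be cast as a short screening routine, as in the following sketch. This is an illustration only: the function name, inputs, and verdict strings are our own, and, in line with the tree, no quantitative thresholds are encoded; judging the duration and, ultimately, the video itself remains with the teacher.

```python
def screen_video(duration_ok: bool, has_likes: bool,
                 n_content_comments: int, n_comment_replies: int) -> str:
    """Screen one YouTube search result along the hypothesized decision tree.

    All inputs reflect the teacher's own judgment (e.g., whether the
    duration suits the planned lesson); the function only fixes the
    order in which the criteria are consulted.
    """
    # Step 1: quick surface-level questions (duration, likes).
    if not duration_ok:
        return "discard: duration unsuitable for classroom use"
    if not has_likes:
        return "discard: no user likes yet"
    # Step 2: comments section -- content-related comments are the feature
    # that correlated with explaining quality in prior studies.
    if n_content_comments == 0:
        return "weak candidate: no content-related comments"
    # Step 3: interactions among users (replies to content-related
    # comments) count as additional evidence of a high-quality video.
    verdict = "strong candidate" if n_comment_replies > 0 else "candidate"
    # Step 4: the final check is always to watch the video itself.
    return verdict + ": watch and assess instructional quality"


print(screen_video(duration_ok=True, has_likes=True,
                   n_content_comments=12, n_comment_replies=3))
# -> strong candidate: watch and assess instructional quality
```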

Fig. 1: Decision tree to support (pre-service) teachers’ selection processes when searching for explanatory videos suitable for physics teaching, as hypothesized based on the current state of the literature

The suggested decision tree serves as a hypothesis for future studies examining teachers’ selection processes of instructional videos for science teaching, as stated above. The main objectives of this study are twofold: First, we aim to explore how pre-service teachers utilize YouTube metrics when selecting instructional videos for physics teaching. Second, we intend to compare their selection and decision-making processes with the procedure recommended by the existing literature, as represented through the decision tree in this article.

Hence, with our research, we aim to approach the following research question: How—if at all—do pre-service teachers use the features and comments sections provided by YouTube when selecting YouTube explanatory videos for teaching purposes?

We decided to address this question in the context of quantum physics YouTube videos because (a) as mentioned above, related studies have been published previously that we can build upon with our findings and (b) quantum physics deals with difficult-to-grasp topics, and due to their abstract nature, different visualizations or explanations of the same phenomena are common. Thus, a highly varying degree of explaining quality is to be expected when exploring explanatory videos for sub-topics of quantum physics such as quantum tunneling or quantum entanglement.

Methods

Study Design

We investigated our research question by conducting a mixed-methods study comprising eye-tracking, think-aloud interviews, and a concluding questionnaire. The choice of these methods and their interrelation are explained more thoroughly in the following subsections “Eye-Tracking” to “Questionnaire.” The mixed-methods study consisted of three phases (P1 to P3; for an overview, see Fig. 2):

P1: In the first phase, the participants were presented with a pre-constructed image chart showing eight different YouTube video suggestions for a specific topic via the original surface features provided by YouTube (e.g., thumbnail, length, title, views, upload date, channel name, number of subscribers). As additional information, we added the corresponding number of likes each video had already received (cf. Fig. 4). The participants were then asked to select one of the offered explanatory videos that they deemed suitable for physics teaching on this topic with learners without prior knowledge. The videos displayed in the chart could not be opened or watched; instead, the selection had to be based solely on the provided features. In addition to tracking the eye movements during the selection process (cf. “Eye-Tracking”), the participants were prompted to voice their thoughts at all times in the sense of a think-aloud interview (cf. “Think Aloud”).

P2: In the second phase, the image chart was removed, and the participants were allowed to freely explore a previously opened YouTube search tab concerned with a second specific quantum topic. Again, the task was to select one of the videos suggested by YouTube for a hypothetical physics lesson covering the specific topic with learners without prior knowledge. In contrast to the first phase, the participants’ eye movements were no longer tracked, but they were now allowed to open and watch the videos as well as scroll through the comments section. This way, the selection process could place a more pronounced focus on content-related reasons, substantiating the rather superficially guided process in phase 1. As in phase 1, all thought processes had to be verbalized at all times.

P3: After the initial combination of eye-tracking and think aloud, the study concluded with a retrospective questionnaire that asked the study participants to reflect on the importance of the different surface features provided by YouTube in their selection processes (cf. “Questionnaire”).

To ensure that the results of our study are not directly dependent on (and hence, restricted to) a specific (quantum) topic covered throughout the phases, we randomly assigned each study participant to one of three groups A, B, or C prior to starting the data collection. Each participant took part in the study individually and was then—depending on the group assignment—given the task of selecting explanatory YouTube videos on two different quantum topics (namely, quantum tunneling, quantum entanglement, or quantum computing) in study phases 1 and 2 as described above. Kruskal-Wallis tests comparing each of the eye-tracking metrics investigated in this study (cf. “Data Analysis”) across the three study groups revealed no statistically significant differences (for all details on the test statistics, see the supplementary file). This indicates that our results are not directly linked to a specific topic and that it is sensible to analyze the data across the groups; we therefore pooled the data from all 24 participants, treating them as if they had been collected under the same conditions. An overview of our study design is presented in Fig. 2.
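
For illustration, such a topic-group check can be run with standard tools. The following is a minimal sketch assuming the per-participant metrics are stored in a pandas DataFrame with a group column; the column names and toy values are ours, not the study data.

```python
import pandas as pd
from scipy.stats import kruskal

# Toy data: one row per participant; "total_fixation_pct" stands in for any
# of the eye-tracking metrics compared across the topic groups A, B, and C.
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "C", "C"],
    "total_fixation_pct": [34.1, 36.8, 35.0, 33.2, 37.5, 35.9],
})

# One sample per group, then the omnibus Kruskal-Wallis test.
samples = [g["total_fixation_pct"].to_numpy() for _, g in df.groupby("group")]
H, p = kruskal(*samples)
print(f"Kruskal-Wallis H = {H:.2f}, p = {p:.3f}")  # p > 0.05: pooling is defensible
```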

Fig. 2: Study design visualized using a flowchart. The different topics covered in the first two phases are color-coded and, as indicated, were switched among the groups A to C between both phases

Data Collection

Eye-Tracking

Eye-tracking data were collected in study phase 1 using a stationary head-free eye-tracking system from Tobii (Tobii Pro Fusion) together with the accompanying software (Tobii Pro Lab). The eye tracker operates at a sampling frequency of 60 Hz with a nominal spatial accuracy of \(<0.3^\circ \) visual angle. The stimuli were presented on a 24-inch computer screen (1920 \(\times \) 1080 pixel resolution, 60 Hz frame rate). Prior to the study, the participants were introduced to the basics of eye-tracking, and a nine-point calibration procedure was used to ensure accurate tracking. The instructor verified the agreement between the measured gaze positions and the actual points on the screen. If the calibration results were not satisfactory, the calibration was repeated. In instances where the eye tracker failed to detect sufficient calibration data, participants were repositioned in front of the eye tracker. Additionally, potential factors that could have interfered with pupil detection were examined, e.g., lighting conditions, occlusion, or calibration drift. On average, the distance between each participant and the tracker was 60 cm.

Think Aloud

Since previous research has identified the need to complement eye-tracking data with additional verbal data, we supplemented the eye-tracking results by incorporating think-aloud interviews into our study design (Chien et al., 2015; Smith et al., 2010). During think-aloud interviews, “participants think out aloud while performing a given task, or recall thoughts immediately following completion of that task” (Eccles & Arsal, 2017, p. 514). The participants’ verbalizations were recorded, transcribed, and subsequently subjected to further analysis (cf. “Data Analysis”). The goal of this method is to uncover cognitive processes that are not as accessible with the other methods used (Rios et al., 2019). Thus, even though it might interfere with the study objective because the verbalizations impose an overall higher cognitive demand, think-aloud studies are often used as an introspective annex (McKay, 2009; Sasaki, 2013). We leveraged this method by asking the participants to articulate their thought processes at any point in time in both phase 1 (image chart) and phase 2 (free exploration). To use thinking aloud effectively, the participants were provided with an instruction on thinking aloud, formulated following Mackensen-Friedrichs (2004) to ensure a standardized procedure (Sandmann, 2014). Some of the cues given to the participants were (1) speak your thoughts aloud; (2) there should be no pauses in speaking, so verbalize your thoughts without pauses; (3) do not organize your thoughts before speaking, and imagine you are alone in the room; and (4) thinking aloud may be a bit unfamiliar, so you will be supported and repeatedly prompted to express your thoughts.

The additional verbal data obtained from those interviews provide further insights into the cognitive processes, motivations, and decision-making that underlie the observed eye movements, offering a more complete picture of participants’ experiences and interpretations (cf. Brückner et al., 2020). Furthermore, eye-tracking data alone can identify moments of attention shifts or fixations on specific elements, but they may not explain the reasons for these shifts. Since our study addresses selection processes based on visual stimuli, supplemental verbal data can clarify whether a shift in gaze was triggered by interest, confusion, or any other factors, shedding further light on the nature of the participants’ attentional patterns.

Fig. 3: The AOIs were defined covering all the surface features given for the eight video options shown to the participants. The eye-tracking metrics regarding the related AOIs were summarized using aggregated tags

Questionnaire

To further enhance cross-validity, we concluded our study with a final questionnaire in phase 3. Here, participants were presented with a list containing all (surface) features provided by YouTube and were asked to rate whether they considered the different features important to their decision-making processes on a four-point rating scale (strongly disagree, disagree, agree, agree completely). The addressed features were the number of views, likes, comments, and subscriptions as well as thumbnail, channel, video title, video length, video description, upload date, order determined by YouTube’s search algorithm, and specific comments. On the one hand, these ratings allow us to establish a ranking among all surface features in terms of their importance. On the other hand, the retrospective view obtained from the concluding questionnaire contrasts with the introspective view from phases 1 and 2, enabling a triangulation with the findings from both the eye-tracking and the think-aloud interviews.

Sample

A total of \(N=24\) German pre-service physics teachers (9 female, 15 male) participated in this research. The participants were selected such that they (a) were at least in their second academic year and (b) did not rely on strong glasses or contact lenses (diopter \(>1\)). Participation in our study was voluntary, not financially recompensed, and informed consent was obtained from all participants.

Data Analysis

The eye-tracking data were evaluated in terms of three metrics that reflect attention allocation and cognitive demand: First, we analyzed the total fixation duration, which can be described as “the total duration of all fixations on a specific stimulus” (van der Laan et al., 2015, p. 1). High values of this metric indicate a more pronounced focus on certain areas (Hahn & Klein, 2022); thus, it is a commonly reported measure of visual attention (Goyal & Miyapuram, 2015). Second, we investigated the fixation count, a metric that often accompanies the total fixation duration as a measure of attention allocation (cf. Just & Carpenter, 1976; Wang et al., 2014). Lastly, we analyzed the mean fixation duration, which is often interpreted as a measure of cognitive processing demand (Negi & Mitra, 2020). Consequently, higher values of mean fixation duration indicate a “higher cognitive effort to process information” (Hahn & Klein, 2022, p. 10). The areas of interest (AOIs) required for the quantitative analysis were matched with the surface features provided by YouTube (cf. “Eye-Tracking”), as indicated in Fig. 3. For the data analysis, however, the individual AOIs for each of the proposed videos shown in the image chart in phase 1 were not considered separately: Instead, so-called aggregated tags were created that combine the eye-tracking metrics of several related AOIs. For example, the aggregated tag “Likes” summarizes all Like-AOIs across the individual video options, so that statements can be made about how often and how long the number of likes was viewed (across the different video options).
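
The aggregation step can be illustrated with a minimal sketch; the fixation table, its column names, and the AOI naming scheme (e.g., “Likes_v1”) are assumptions for illustration, not the export format used in the study.

```python
import pandas as pd

# Toy fixation-level export: one row per fixation, AOI labels per video option.
fix = pd.DataFrame({
    "aoi": ["Likes_v1", "Likes_v2", "Title_v1", "Thumb_v1", "Thumb_v2"],
    "duration_ms": [180, 210, 250, 320, 275],
})

# Collapse the per-video AOIs (e.g., "Likes_v1", "Likes_v2") into one tag.
fix["tag"] = fix["aoi"].str.split("_").str[0]

per_tag = fix.groupby("tag")["duration_ms"].agg(
    total_fixation_ms="sum",    # total fixation duration per tag
    fixation_count="count",     # number of fixations per tag
    mean_fixation_ms="mean",    # mean fixation duration per tag
)
# Express the total fixation duration as a share of all fixations (cf. Table 1).
per_tag["total_fixation_pct"] = (
    100 * per_tag["total_fixation_ms"] / per_tag["total_fixation_ms"].sum()
)
print(per_tag)
```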

For the subsequent think-aloud interviews, we conducted a qualitative content analysis. To this end, we (a) associated the participants’ verbal expressions with the corresponding surface features provided by YouTube and (b) categorized their decisions for or against each video. The categories for these decisions were developed using both inductive and deductive procedures (Forman & Damschroder, 2001). An overview of all categories and their descriptions is provided in the appendix (cf. Table 5). The category system was applied by two independent raters, and dissenting judgements were resolved through discussion. The interview data were analyzed in three ways: First, we calculated the relative speaking time allocated to each (surface) feature and visually displayed the resulting share of each feature in a bar chart. Second, we visualized the temporal trajectory of each interview through the various categories and plotted them along a common axis, resulting in a temporal topography graphic for each of the two phases. Lastly, we counted the most frequently used arguments among the participants’ reasonings and how often they led to a decision for or against a video. This insight was used to provide an overview of the (surface) features provided by YouTube that are most influential during the decision-making process of pre-service physics teachers.
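
The first of these analyses, the relative speaking time per feature, amounts to a simple share computation over the coded interview segments. The following sketch assumes hypothetical segment data with start and end times; the column names and values are illustrative only.

```python
import pandas as pd

# Toy coded segments: each row is one interview passage assigned to a feature.
segments = pd.DataFrame({
    "participant": ["P1", "P1", "P2"],
    "feature": ["thumbnail", "channel", "thumbnail"],
    "start_s": [0.0, 12.5, 3.0],
    "end_s": [12.5, 20.0, 15.0],
})

segments["duration_s"] = segments["end_s"] - segments["start_s"]
# Relative share of total speaking time per surface feature (in percent).
share = (segments.groupby("feature")["duration_s"].sum()
         / segments["duration_s"].sum() * 100).round(1)
print(share)
```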

The responses to the concluding questionnaire were summarized using a diverging stacked bar chart, constructed by aligning the bars of a stacked bar chart relative to the scale’s center (Robbins & Heiberger, 2011). In addition, each response option was color-coded and assigned a numerical value (strongly disagree \(\widehat{=} \ -2\), disagree \(\widehat{=} \ -1\), agree \(\widehat{=} \ 1\), agree completely \(\widehat{=} \ 2\)) so that a mean agreement value could be calculated for each surface feature, resulting in a ranking among all surface features provided by YouTube (cf. Veith et al., 2022). It is important to note that rating scale data are ordinal in nature; as such, the averages merely serve as a convenient summary of the underlying data.
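
The scoring step can be sketched as follows, with the response labels mapped to the numerical values given above; the toy responses and feature columns are purely illustrative.

```python
import pandas as pd

# Numerical weights for the four response options, as defined in the text.
weights = {"strongly disagree": -2, "disagree": -1,
           "agree": 1, "agree completely": 2}

# Toy answers from three hypothetical participants for two of the features.
responses = pd.DataFrame({
    "thumbnail": ["agree completely", "agree completely", "agree"],
    "nr. comments": ["strongly disagree", "disagree", "strongly disagree"],
})

# Map labels to weights, then average per feature to obtain the ranking.
scores = responses.apply(lambda col: col.map(weights))
mean_agreement = scores.mean().sort_values(ascending=False)
print(mean_agreement)
```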

Results

In the following, we present the results of our study, separated by methodology. First, we provide an overview of the assessed eye-tracking metrics (cf. “Eye-Tracking Results”), and second, we enrich those findings with the results of the think-aloud data (cf. “Think-Aloud Interview Results”) as well as the subsequent questionnaire study (cf. “Questionnaire Results”).

Eye-Tracking Results

The eye-tracking data were collected in study phase 1 where participants were presented with a carefully constructed chart containing eight search results from YouTube (for details, see “Methods”). These options were specifically chosen to exhibit a range of surface features. The participants’ task was to determine which of the eight explanatory videos would be suitable for inclusion in a learning environment related to the topic being investigated. Figure 4 presents an illustrative heat map generated from the eye-tracking data obtained from one of the participants in the study. The heat map provides visual information regarding the areas that received the highest attention during the task.

Fig. 4: Exemplary heat map created from the eye-tracking data of one of the study participants in phase 1 of our study. Qualitatively, participants’ spots of attention are represented in red

Table 1 provides the descriptive statistics for the total fixation duration for each area of interest. With a mean share of 35.39% of the total fixation duration, the thumbnail was by far the most attended-to AOI and the only one with a share above 10%. While the AOIs title (9.58%) and channel (6.17%) also captured the participants’ attention to some extent, the remaining AOIs played a seemingly negligible role during the selection process. This discrepancy is visualized via boxplots in Fig. 5.

Table 1 Descriptive statistics for the total fixation duration (in percent) for each area of interest
Fig. 5: Boxplots for the total fixation duration for each area of interest. The whiskers indicate \(1.5\times \text {IQR}\), where IQR is the interquartile range

Descriptive statistics for the mean fixation duration are provided in Table 2. These data paint a different picture, with similar values across the AOIs. The mean fixation durations average between 176 ms (subs) and 240 ms (date) and thus lie within the typical range of 100–600 ms reported by Hahn and Klein (2022). The boxplots illustrating these data substantiate this relationship: the middle 75% of the data lie between 200 and 300 ms across almost all AOIs (cf. Fig. 6).

Table 2 Descriptive statistics for the mean fixation duration (in seconds) for each area of interest
Fig. 6: Boxplots for the mean fixation duration for each area of interest. The whiskers indicate \(1.5\times \text {IQR}\), where IQR is the interquartile range

Lastly, we investigated the fixation counts for each AOI. Given the striking differences in the total fixation durations of the AOIs but similar values for the mean fixation duration, it follows that the number of fixations must vary in a manner similar to the total fixation duration. The data provided in Table 3 paint a coherent picture in this regard: With a mean share of 45.59% of the total number of fixations, there is again a predominant focus on the AOI thumbnail, with title (11.60%) and channel (7.12%) coming second and third, respectively. Consequently, the boxplots for the total fixation duration and the fixation count are largely congruent (cf. Fig. 7).

Table 3 Descriptive statistics for the fixation count (in percent) for each area of interest

Think-Aloud Interview Results

To analyze the selection process more thoroughly, the results from the eye-tracking study are now complemented with data from the think-aloud study, as described in “Methods.” Figure 8 offers an initial comprehensive overview of the content aspects addressed in the participants’ argumentation: It shows the relative proportion (of total speaking time) of each surface feature in the participants’ utterances in both study phases. In phase 1, the most pronounced focus was placed on the thumbnail with 30.9%, followed by channel and title with 23.0% and 14.2%, respectively. Regarding the free exploration in phase 2, however, the data convey a more differentiated impression: Here, the surface feature channel emerges at the top with 22.0%, with the thumbnail (20.0%) and length (17.4%) taking second and third place. In addition, the surface features likes, subs, date, order, description, and comments are almost negligible, with an allocated speaking time of below 5% throughout both phases.

Fig. 7: Boxplots for the fixation counts for each area of interest. The whiskers indicate \(1.5\times \text {IQR}\), where IQR is the interquartile range

Fig. 8: Bar chart visualizing the relative proportion (of total speaking time) of each surface feature in participants’ utterances in both phases. Phase 1 indicates the allocated speaking time regarding the pre-constructed image chart, while phase 2 indicates the allocated speaking time during free exploration (cf. “Methods”)

A more comprehensive insight into the structure of the participants’ selection process is offered by the bar charts in Figs. 9 and 10.

Fig. 9: Topography of the think-aloud interviews in the first phase, one for each study participant P1 to P24. The red strokes indicate a decision for or against a video. The upper row indicates the color coding for each surface feature. White sections represent small breaks where the participants did not address a specific surface feature

Fig. 10: Topography of the think-aloud interviews in the second phase, one for each study participant P1 to P24. The red strokes indicate a decision for or against a video. The upper row indicates the color coding for each surface feature. Hatched sections represent parts of the interview where the participants watched a video

Here, each individual interview is presented as a bar, and the sections dedicated to the different surface features are color-coded accordingly. These visualizations thus allow for a deeper insight into the temporal topography of each interview. Analyzing this topography, it becomes apparent that blue (thumbnail) and violet (channel) cover the largest area during the first phase, in accordance with the findings presented in Fig. 8. In the second phase, where participants were allowed to click on and even watch videos, this dynamic changes: On the one hand, the violet sections increase, indicating a greater focus on the surface feature channel. On the other hand, the participants had access to more surface features, such as comments or the video description. Decisions for or against a video are indicated by red strokes. For example, a red stroke after a blue section indicates a decision against a video because of the thumbnail. A summary of all decisions and the surface features they were based on is presented in Table 4.

With a total of 26 and 23 decisions, respectively, the surface features thumbnail and video length are by far the most influential. In congruence with the bar chart presented in Fig. 8, the channel, the video title, and the number of views can also be regarded as guides for the selection process, while more specific features such as likes, subs, or comments seem almost entirely irrelevant for decision-making. Lastly, it is noticeable that most decisions could not be attributed to specific surface features. In particular, 30 out of the 51 positive decisions do not seem to be related to surface features provided by YouTube. We elaborate on this finding in more detail in “Discussion.”

To obtain a more in-depth view of the decisions that could be related to a surface feature, the participants’ arguments during the interviews were categorized (cf. “Methods”). Since the thumbnail feature led to the most decisions in both phases, we present a bar chart of the most frequently used arguments in Fig. 11.

Table 4 Overview of the decisions for or against a video based on the respective (surface) features, sorted by study phases
Fig. 11: Bar charts visualizing the relative share of each category (T1 to T5) addressing the thumbnail that was used as an argument for or against a video. The respective category system is provided in the appendix (cf. Table 5)

Fig. 12: Top five most frequently used arguments for or against a video during the think-aloud interviews. The respective category system is provided in the appendix (cf. Table 5)

Fig. 13: Diverging stacked bar chart visualizing the participants’ agreement that the respective surface feature is important for decision-making. The respective average rating for each (surface) feature (cf. “Methods”) is given as a label on each bar, where 2 corresponds to “agree completely” and \(-2\) to “strongly disagree.” The abbreviation “nr. Comments” stands for the number of comments under a video

With the most frequently used arguments being T3 (the thumbnail indicates an interesting video) and T4 (the thumbnail indicates a boring or nonprofessional video), it becomes apparent that arguments addressing affective aspects dominate over content-related reasonings. An overview of the overall top five most frequently used arguments is provided in Fig. 12. We discuss these findings in “Discussion.”

Questionnaire Results

In the final part of our study, we retrospectively asked the participants to evaluate the extent to which they agree that the respective features had been relevant for their selection process. The responses are visualized using a diverging stacked bar chart in Fig. 13. In congruence with our previous findings, the thumbnail takes sole first place with a rating of 1.80 (where 2 corresponds to “agree completely” and \(-2\) to “strongly disagree”). With 21 out of 24 participants agreeing completely that the thumbnail was important for their selection process, the thumbnail even exceeds the video length in terms of relevance for decision-making in the participants’ retrospective views. In contrast, the number of comments (average rating of \(-1.88\)) and the comments themselves (average rating of \(-1.52\)) occupy the last places and do not seem to contribute meaningfully to the decision-making process.

Discussion

The Process of Selecting Instructional Videos for Physics Teaching

In phase 1 of our study, pre-service physics teachers were given the task of selecting one explanatory video on quantum physics from a set of eight options. Participants were provided with excerpts from the YouTube search results and had access to various metrics such as views, likes, and channel information. The analysis of participants’ eye movements revealed a significant emphasis on video thumbnails, with more than one-third of their total fixation time and counts directed towards this area of interest (AOI). Surprisingly, there were no statistically significant differences in mean fixation duration between the different AOIs, even though this measure “is often considered an indicator of cognitive processing demand” (Hahn & Klein, 2022, p. 10). This finding hence contrasts with the results of Hsieh and Chen (2011), who suggested that viewing content with different information types requires varying cognitive resources.

To gain further insight into the participants’ selection processes, we examined their think-aloud data. This analysis, consistent with the eye-tracking data, revealed that during both phase 1 (selection of one out of eight options based on surface features) and phase 2 (free exploration), participants predominantly voiced their considerations in relation to the AOI thumbnail. An in-depth categorization of argumentation structures and decision-making uncovered four key observations:

  1. The video duration played a significant role in participants’ choices, aligning with didactic perspectives, as they were selecting videos for instructional purposes.

  2. The video content had only a minor influence on participants’ decisions during the free exploration phase. Instead, choices were primarily guided by the thumbnail, duration, channel, and title features, indicating a reliance on surface features and pragmatic decision-making among pre-service physics teachers. The tendency to select videos they were already familiar with or had a connection to, such as those from known channels, is in accordance with findings from cognitive psychology (Chen, 2016).

  3. A notable portion of the decisions made during the study could not be attributed to surface features, comments, or video content based on either eye movements or verbalizations. In these cases, the study participants either struggled to explicitly articulate their decisions due to multiple simultaneous considerations or simply did not articulate them at a deeper level. Similar cases have been observed in physics education research on teachers’ professional competences, where prior work has found that experienced teachers’ actions in the classroom are guided by informed decisions and teaching routines that cannot easily be verbalized (e.g., see Borko & Livingston, 1989; Livingston & Borko, 1989; Stender, 2014). Whether similar principles contribute to an explanation of the observations made in our study requires further investigation.

  4. Although the participants had the opportunity to view the comments associated with each YouTube video during the free exploration phase 2, surprisingly, none of the participants explicitly based their decisions on comments. This observation contrasts with findings from previous research reporting that students relied on comments as an indicator of video quality (e.g., see Fyfield et al., 2021; Tan & Pearce, 2011) and, instead, indicates that the selection process tends to be less systematic. However, previous studies have consistently shown a statistically significant correlation between the explaining quality of YouTube videos and the number of content-related comments (Kulgemeyer & Peters, 2016; Bitzenbauer et al., 2023). It is therefore noteworthy that these comments did not play a significant role in the decision-making process of our participants.

The eye-tracking and think-aloud data were complemented by retrospective questionnaire responses: When asked about the features that influenced their video selection for instructional purposes, the majority of respondents indicated thumbnail, duration, channel, or title, while video descriptions and the quantity or quality of comments played a minor role, even in retrospective evaluation.

The finding that participants primarily explored the top results of the YouTube search list aligns with previous studies (e.g., see Fyfield et al., 2021; Tan & Pearce, 2011): Over half of the participants reported approaching video selection sequentially, following the order of videos in the search list (cf. Fig. 13). This pragmatic approach leads to quick decisions (made within less than 10 min in the free exploration phase 2 of this study) that are mainly based on surface features (e.g., thumbnails) or familiarity (e.g., channel). The analysis of the individual argumentation categories (see Figs. 11 and 12) supports this assumption.

In conclusion, our findings suggest that the decision-making process of (pre-service) physics teachers when searching for suitable YouTube explanatory videos (here, on quantum topics) for instructional purposes is primarily driven by pragmatism, efficiency, and reliance on familiar features. The available empirical evidence regarding the explaining quality of online videos seems to be overlooked by (pre-service) physics teachers, representing a missed opportunity to streamline the selection process. The decision tree proposed in “Research Rationale,” which synthesizes the existing empirical evidence, may therefore assist teachers in efficiently identifying high-quality explanatory videos on YouTube (see also “Conclusion”).

Contrasting the Selection Processes with the Proposed Decision Tree

The observations made in this study indicate that the selection processes of (pre-service) physics teachers when searching for explanatory videos suitable for physics teaching are predominantly unsystematic, rely on superficial or familiar aspects, and are characterized by pragmatic choices. These tendencies give rise to two interconnected issues in real instructional preparation, where videos with high explaining quality are sought:

  1. Teachers may require considerable time to find suitable videos due to their unsystematic approach.

  2. There is a risk that teachers select videos of lower quality.

In the study reported in this article, the decision-making processes of the participants in most cases diverged from the proposed decision tree. In light of this, it becomes necessary to support teachers in systematizing their selection process to overcome the identified problems in practice. The decision tree proposed in “Research Rationale” might be a valuable tool in this regard, as it reflects the state of the literature. While we are aware that future studies are required and might lead to a refined version of the decision tree, the significance of the decision tree in its current form lies in its capacity to systematize the selection process without imposing quantitative guidelines or thresholds. This acknowledges the absence of empirical evidence supporting such measures and recognizes that decisions should be made by teachers on a case-by-case basis, taking into account the specific topic. Future studies should investigate whether the use of the decision tree indeed facilitates the efficient and successful identification of high-quality explanatory videos on various topics. Additionally, it will be crucial to determine whether decision steps need to be supplemented or specified.

Limitations

The present study has several limitations that should be considered when interpreting the results. First, the focus on explanatory videos related to three specific quantum topics may restrict the generalizability of the findings. Although we designed the study with three separate groups, each tasked with selecting videos for instructional situations on two different quantum topics, this control measure may not account for potential variations that could arise if explanatory videos on further (physics) topics, e.g., classical mechanics topics, were included. Further research is needed to validate the reported results across a broader range of topics.

Second, understanding the cognitive processes of pre-service physics teachers during video selection is a complex empirical endeavor, and the chosen data collection methods—even though they complement each other—come with inherent limitations. While the analysis of eye-tracking data is based on the eye-mind assumption (Just & Carpenter, 1980), previous research has emphasized the importance of complementing eye movement analysis with additional verbal data to gain a comprehensive understanding (Brückner et al., 2020; Chien et al., 2015; Chiou et al., 2022; Mason et al., 2013; Smith et al., 2010; Wu & Liu, 2021). To address this concern within our mixed-methods approach, we employed introspective think-aloud interviews. Additionally, the retrospective questionnaire used for internal validation allows participants to reflect on their experiences; however, it may also trigger ad hoc generated associations and thoughts regarding the different YouTube surface features (for similar arguments, see Winkler et al., 2021, 2023).

Third, it is important to consider that while this study focuses on the selection processes employed by (pre-service) physics teachers in finding YouTube explanatory videos on quantum physics suitable for teaching, the selection situations created within the study design differ from real classroom planning scenarios. In particular, the participants in our study had unlimited time for decision-making, whereas real instructional planning is significantly constrained by time. However, the analysis of the think-aloud data reveals that decision-making occurred within less than 10 min in phase 2 of the study (free exploration), which aligns with a reasonable time frame in natural classroom planning situations.

Lastly, the time-consuming nature of the study led to a relatively small sample size. However, the primary aim of this study was to gain in-depth insights into the video selection process rather than to produce generalizable findings on a surface level. Future research with larger sample sizes could provide a broader perspective on the topic.

Conclusion

This mixed-methods study explored how pre-service teachers select instructional videos on YouTube for physics teaching, focusing on the role of surface features (likes, views, thumbnails) and comments. The results indicate that the decision-making processes of (pre-service) physics teachers when searching for suitable YouTube explanatory videos are primarily driven by pragmatism, efficiency, and reliance on familiar features.

Based on the current state of research into the explaining quality of online explanatory videos, we proposed a decision tree that reflects what an efficient and successful selection process might look like. Although the decision-making processes of the study participants often differed from the proposed decision tree, it serves as a hypothesis for future research aimed at supporting teachers in systematizing their selection process: Further studies should explore whether the decision tree facilitates the efficient and successful identification of high-quality explanatory videos on various topics and how it might be adapted and refined, e.g., with regard to different subject areas and teaching contexts. Future studies might also examine how the decision tree functions as a tool for lesson preparation, e.g., in related courses in science teacher education. Lastly, it seems particularly crucial to consider the evolving nature of online platforms in future research: For example, research could examine how the decision tree (or an evolved version thereof) can be adapted to accommodate changes in platform features and the emergence of new video metrics. Collaborative research involving educators, researchers, and platform developers may further enhance the decision tree’s practicality and usability for (pre-service) teachers, facilitating their video selection process and ultimately benefiting student learning experiences in physics and beyond.