Research methods in science education are commonly differentiated into quantitative and qualitative methods (Krüger et al., 2014). The former allow for the confirmatory testing of statistical hypotheses, whereas the latter allow for the more exploratory generation of novel hypotheses. This division is artificial and can be attributed to the imperfect capabilities of modeling complex systems that involve human learning processes. It would be desirable to better integrate both methods and preserve the predictive capabilities of quantitative methods and the exploratory capabilities of qualitative methods. It has been suggested that complex algorithmic approaches such as machine learning can better model assessment in science education (Breiman, 2001; Zhai, 2021) and eventually provide a new methods paradigm. “Machine learning is about inductively solving problems by machines, i.e., computers.” (Rauf, 2021, p. 8). Inductive learning requires appropriate data for the machines to improve on relevant tasks. Given advances in data storage and accessibility, machine learning (ML) models have dramatically improved their performance on many tasks such as image classification and spoken and written language analytics (Goodfellow et al., 2016; Goldberg, 2017). Scholars in the fields of education and discipline-based educational research have also argued that ML methods can advance educational research (Singer, 2019; Baig et al., 2020), even “revolutionize” assessments (Zhai et al., 2020). The education sector is, among others, a field where datasets of unprecedented size are becoming available (Baig et al., 2020).

ML methods have been utilized in science education research in different contexts. Mostly, science education researchers employed supervised ML methods, where a model is trained to map responses to predefined outputs (Zhai et al., 2020). However, oftentimes problems in science education research are less well defined and only small datasets can be collected with reasonable effort. For example, in research on university-based teacher education, such as noticing and attention to classroom events, typically only small samples are available (Chan et al., 2021; Wilson et al., 2019). Noticing comprises, among other things, the careful observation of events in a teaching situation. In science education research it has been highlighted that preservice science and mathematics teachers attend to many different events and contents in a teaching situation (Talanquer et al., 2015). To capture the complexity of noticing, science education researchers therefore used open-ended, constructed response formats to assess noticing and attention to classroom events (Barth-Cohen et al., 2018; Luna et al., 2018; Chan et al., 2021). The responses are then analyzed with some form of content analysis. However, it is not only the differences in attention between teachers that add to the complexity of assessing noticing and attention processes, but also the teachers’ use of language in constructed-response items. Language use has been characterized as “noisy, ambiguous, and unsegmented” (Jurafsky, 2003, p. 39). Hence, probabilistic approaches are required to analyze language-related processes and products. A probabilistic approach that also captures complexity is ML. The application of ML-based modeling could provide researchers with means to gain novel insights into these complex constructs (Zhai et al., 2020). Yet, it is not clear in what ways ML-based approaches can be utilized to identify meaningful patterns in teachers’ constructed responses with respect to noticing and attention to classroom events.

In the present study we therefore evaluate potentials and challenges of using a pretrained language model-based clustering approach to analyze preservice physics teachers’ open-ended, constructed responses in the context of describing a standardized teaching situation. We critically examine to what extent the application of ML in our research context can bridge the divide between quantitative and qualitative methods and provide a more integrative approach.

Utilizing NLP and ML to Model Complex Datasets

Applications of ML and natural language processing (NLP) have attracted a lot of interest in the field of science education research (Zhai et al., 2020). ML refers to computers’ inductive problem solving based on data (Zhai, 2021; Rauf, 2021). Two major types of ML are supervised and unsupervised ML (Jordan & Mitchell, 2015). In supervised ML, human-annotated data are provided for the models to learn a mapping from input to output in order to classify or predict unseen data (Marsland, 2015). Unsupervised ML, on the other hand, encompasses algorithms that reduce the dimensionality of complex datasets and extract patterns in them. Both types of ML can be used to analyze natural language. The study of natural language by means of computers is called NLP; it refers to the systematic and structured processing of natural language data. Natural language can be contrasted with artificial languages such as programming languages or mathematics, which are more aligned with formal logic. The attribute “natural” relates to the fact that this form of language can be characterized as “noisy, ambiguous, and unsegmented” (Jurafsky, 2003). It has been argued that it is not possible to specify clear-cut rules for natural language (i.e., a grammar) that explain phenomena of language comprehension and production: “we can’t reduce what we want to say to the free combination of a few abstract primitives” (Halevy et al., 2009, p. 9). Hence, probabilistic approaches such as ML methods are increasingly incorporated into NLP research in addition to rule-based approaches, given the capacity of probabilistic approaches to systematically process complex language data, extract patterns in them, and classify instances of language use (Goldberg, 2017).

ML research experienced a resurgence with the successful application of deep neural networks to learn input-output mappings, which outperformed simpler (shallow) ML models on most tasks in image and language analysis (Goodfellow et al., 2016). A heuristic in ML research states that problems that are easy for humans, such as character recognition or speech perception, are difficult for machines to solve (Goodfellow et al., 2016). Simple ML algorithms like logistic regression excel in problems where the input representation through features is particularly informative, e.g., the age of a student. The selection and engineering of inputs typically requires effort from the human researcher, because in real-world contexts data are typically not represented in such an aggregated form. Simple ML models lose performance when more complex data such as images or language form the input (Goodfellow et al., 2016). Deep neural networks have been found to be capable of representing the input as part of the modeling, which allowed ML and NLP researchers to apply these models to problems where complex data has to be represented in the first place. Thus, human feature selection and engineering is partly replaced by automated feature representation in deep neural network approaches, oftentimes at the cost of interpretability of the model decisions.

A major facilitator for the deep learning revolution in the last decades was the availability of annotated data. For one, researchers spent tremendous effort annotating data manually in order to train deep neural networks that are capable of language comprehension and production, or image classification. For the now famous ImageNet competition, researchers manually labeled over three million images in two years with the help of crowdsourcing (Mitchell, 2020). Similar efforts have been undertaken in NLP. To advance machine translation, ML researchers were fortunate to find annotated datasets from the Cold War era, when translations were important for intelligence, and from the European Parliament, which comprises many different nations (Mitchell, 2020). However, curating and annotating such datasets consumes resources, such as money and compute time, that are not widely available. Consequently, for most researchers in domains like science education, no such well-developed datasets will be available for their specific research questions.

However, the ML paradigm of transfer learning, which became important with increasingly complex deep neural networks (Devlin et al., 2018), might solve this problem. Transfer learning enables sharing of previously trained ML models for different tasks (Ruder, 2019). Much as humans learn language from experiences, feedback and reinforcement (Bruner, 1985) and build on learned structures (Rumelhart et al., 1986), the paradigm of transfer learning posits that weights pretrained in one context can be reused to improve model performance in different contexts/domains and on different tasks (Ruder, 2019). NLP researchers used transfer learning in the context of language modeling. While in image processing models are oftentimes pretrained on the ImageNet dataset to improve downstream performance (Devlin et al., 2018), language models in NLP research can be trained on large corpora such as web text or Wikipedia (Devlin et al., 2018; Ruder, 2019). NLP researchers then pretrain language models that are capable of representing language in a way that researchers can use in downstream tasks (Mikolov et al., 2013). Typically, these language models are trained with the objective to simply predict context words. The pretrained language models can then be used to generate an informative representation of language to enhance task performance (Mikolov et al., 2013; Devlin et al., 2018).
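
As a minimal illustration of this pretraining objective of predicting context words (in the spirit of Mikolov et al., 2013), the following sketch trains a small skip-gram word2vec model with gensim on a toy corpus. The corpus, parameter values, and library choice are illustrative assumptions and not part of the present study.

```python
# Minimal sketch of the word2vec training objective (predict context words);
# toy corpus for illustration only, not the data analyzed in this study.
from gensim.models import Word2Vec

toy_corpus = [
    ["the", "teacher", "drops", "two", "masses"],
    ["the", "students", "pose", "hypotheses"],
    ["free", "fall", "is", "independent", "of", "mass"],
]

# Skip-gram model: weights are optimized so that a word's embedding
# predicts its surrounding context words.
model = Word2Vec(sentences=toy_corpus, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=50)

# The learned weights (word vectors) can be reused in downstream tasks.
vector = model.wv["fall"]                      # 50-dimensional embedding
print(model.wv.most_similar("fall", topn=3))   # nearest words in the toy space
```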

Modeling Unstructured Data in Science Education Research with NLP and ML

“Perhaps when it comes to natural language processing and related fields [that model human behavior], we’re doomed to complex theories that will never have the elegance of physics equations” (Halevy et al., 2009, p. 8). The “unreasonable effectiveness” (Wigner, 1960) of mathematics has been recognized for physics; the educational sciences, however, are far from having theories of such elegant form, given the complexity of the involved problems. In this context, NLP and ML probably have much to offer for fields where complex theories prevail. Yet, more sophisticated NLP and ML applications such as deep neural networks might impose requirements on training dataset size and model implementation that cannot be met in science education research. The size of the training dataset should be judged against the complexity of the task and the complexity of the ML model. The review by Zhai et al. (2020) shows that typical applications of NLP and ML in science education research comprise fewer than 30,000 training samples, and the reviewed studies focus exclusively on simpler ML models such as logistic regression, support-vector machines, or naive Bayes. More generally, data collection in domains such as science education is costly and time-consuming, because large coordination efforts are necessary to recruit enough subjects. There are no studies in science education that collected data from millions of subjects, the order of magnitude that seems to be required to train more general-purpose deep learning models from scratch. Does this imply that the more complex ML methods in particular are not applicable for science education researchers?

For supervised ML this hypothesis has been refuted in some science education research contexts. Wulff et al. (2022) showed that pretrained language models improve classification performance for discourse elements in preservice physics teachers’ written reflections. The findings in this study suggest that complex ML models trained from scratch can reach the classification accuracy of simpler ML models. Furthermore, the authors show that utilizing pretrained weights for the complex models further enhances classification accuracy and generalizability. Carpenter et al. (2020) showed that deep contextualized embeddings from pretrained language models could improve prediction of students’ reflective depth in a biology learning context. These findings buttress the applicability of complex ML models such as deep neural networks as facilitators for supervised ML. These studies, however, do not suggest that training the more performant deep learning models from scratch is possible with the available science education datasets. Furthermore, it is not clear from these studies to what extent pretrained language models could be used to extract patterns in the datasets.

Prior research on pattern extraction from unstructured data with simpler unsupervised ML models and larger datasets in education and science education contexts focused on standardized documents such as dissertation abstracts or conference papers. Munoz-Najar Galvez et al. (2020) established a data-driven way to systematically analyze the field of education research. They identified paradigm shifts in education research on the basis of 137,024 dissertation abstracts, reconstructing a shift from an outcome-oriented paradigm to an interpretative paradigm. In science education research, Odden et al. (2020) used latent Dirichlet allocation (LDA), a generative probabilistic topic model, to analyze all papers that were extracted from the Physics Education Research Conference Proceedings from 2001 to 2018 (overall 1,302 papers). They outline shifts in the papers’ topics over time. Despite the potential of LDA to summarize research topics and trends over time, the authors recognize some shortcomings of this algorithm. For example, the LDA model groups together segments that use similar vocabulary, even though the segments might differ in meaning (see also: Odden et al., 2021). Other researchers showed that simpler unsupervised ML methods could also be used to explore patterns in comparably smaller datasets in science education. Sherin (2013) used a vector space model and a hierarchical agglomerative clustering algorithm to identify students’ science explanations in interview transcripts. He showed the general applicability of these NLP-based methods in this context, but contended that the algorithms could not account for word-order effects. He also noted the need to determine more systematically the number of topics that are likely present in the data (see also: Xing et al., 2020). Rosenberg and Krist (2020) also successfully applied an unsupervised clustering algorithm to assess students’ considerations of generality in science (see also: Xing et al., 2020; Zehner et al., 2016).

A domain of science education research where unsupervised NLP and ML have not yet been applied widely is university-based science teacher education. In fact, no study reviewed in Zhai et al. (2020) engaged in university-based educational research. Besides supervised ML approaches, which have occasionally been applied in university-based science teacher education (Wulff et al., 2020), unsupervised approaches could provide researchers and instructors with novel insights into relevant constructs because they can explore patterns in unstructured data (Halevy et al., 2009; Hao, 2019).

Science Teachers’ Noticing of Classroom Events

Teachers face the challenge of acting professionally in uncertain situations (Clifton & Roberts, 1993; von Aufschnaiter et al., 2019; Chan et al., 2021). Learning to act professionally in uncertain situations requires teachers to develop the capacity to reflect on their teaching experiences (Korthagen, 1999). An important part of reflective competencies is noticing skills, which relate to perceptual and cognitive thinking processes (Chan et al., 2021). In particular, noticing comprises observation, interpretation, and reasoning about learning-relevant events in classrooms (Sherin & van Es, 2009; van Es & Sherin, 2002a; Chan et al., 2021; Furtak, 2012). Van Es and Sherin (2002) define noticing with regard to three key aspects: “(a) identifying what is important or noteworthy about a classroom situation; (b) making connections between the specifics of classroom interactions and the broader principles of teaching and learning they represent; and (c) using what one knows about the context to reason about classroom interactions” (p. 573). Noticing research has documented the difficulties that novice and even expert teachers have in directing their attention and noticing relevant classroom events (Sherin & Han, 2004; Chan et al., 2021; Talanquer et al., 2015; Levin et al., 2009; Roth et al., 2011). For example, novice science and mathematics teachers struggle to attend to student thinking and the substance of what students are saying (Sherin & Han, 2004; Hammer & van Zee, 2006), and tend to strive for quick and conclusive inferences that are right or wrong, rather than tentative interpretations (Crespo, 2000). This strand of research also showed that science teachers provide more general evaluations rather than specific accounts of student understanding (Hammer & van Zee, 2006). Mathematics and science education scholars generally highlighted the complexity of the noticing construct (Chan et al., 2021; Talanquer et al., 2015). Talanquer et al. (2015) summarize the noticing foci of teachers as: “the object of noticing (e.g., student actions, student thinking), the noticing stance (e.g., evaluative, interpretive), the specificity of noticing (e.g., specific student, whole class), and the noticing focus (e.g., specific concept, general topic)” (p. 587). To design authentic learning opportunities for mathematics and science teachers to enhance noticing skills, valid, reliable, and scalable assessment of attention to classroom events and noticing is necessary.

To assess noticing and attention to classroom events, science education researchers increasingly embraced constructed response items, e.g., open-ended, free-recall written responses (Barth-Cohen et al., 2018; Luna et al., 2018; Talanquer et al., 2015; Chan et al., 2021). Open-response items have been argued to allow a more authentic examination of teachers’ professional competencies as compared to more closed-form questions (Nehm et al., 2012; Zhai, 2021). Much of the noticing research then seeks to analyze inductively what teachers are noticing (Chan et al., 2021). However, the sheer linguistic complexity of the constructed responses (noisy, ambiguous, and unsegmented) and the complexity of the noticing construct make it challenging to integrate all information in the responses and infer noticing skills. From their review of teacher noticing research in science education, Chan et al. (2021) conclude that “methodological trade-offs between different ways of investigating teacher noticing need to be better explored” (p. 37). We suggest that ML-based methods can provide novel means to analyze teachers’ responses inductively “to understand what teachers notice” (Chan et al., 2021, p. 34). Thus, ML methods potentially help researchers to gather ‘knowledge of teachers’ (Fenstermacher, 1994). We also concur with Lamb et al. (2021) that ML models are powerful tools to advance algorithmic understanding of relevant underlying cognitive processes that can explain the process and products of writing. Zhai et al. (2020) argued that ML models can particularly advance understanding and assessment of complex constructs such as noticing and provide means to automate assessment and feedback. Consequently, this study examines potentials and challenges of an ML-based clustering approach when applied in the context of assessing noticing of classroom events for preservice science teachers.

Research Questions

Noticing, or directing attention to relevant classroom events, is highly relevant for mathematics and science teachers and has therefore played a particularly important role in mathematics and science education research. Star and Strickland (2008) suggested that noticing research should focus particularly on what catches teachers’ attention and what is missed. Twenty-five of the 26 science education studies reviewed by Chan et al. (2021) considered attention to classroom events as an essential aspect of noticing; 11 studies even restricted noticing to attention. Attention to classroom events has often been studied through video clips that present teachers with a standardized teaching situation and are typically followed by some form of eliciting teachers’ observations (Zhai, 2021; van Es & Sherin, 2002a; Seidel & Stürmer, 2014; Putnam & Borko, 2000; Darling-Hammond, 2000; Kleinknecht & Gröschner, 2016; Sherin & van Es, 2009).

Noticing research can be characterized as a context where it is notoriously difficult to recruit large samples, rendering quantitative research methods difficult to apply. Reviews suggest that studies typically comprise small samples of up to 241 teachers (Wilson et al., 2019; Chan et al., 2021). This restricts researchers to using mostly qualitative methods with some form of content analysis (Wilson et al., 2019; Chan et al., 2021; Talanquer et al., 2015). As such, it is important to examine to what extent ML-based approaches can be utilized in this context as a means to advance quantifiable hypotheses. In particular, pretrained language models can make ML methods more robust with small samples. Hence, we ask the following overarching research question: To what extent and in what ways can a pretrained language model-based clustering approach extract meaningful patterns in preservice physics teachers’ written descriptions of a teaching situation?

With RQ1, we analyze the validity of the extracted clusters:

  • RQ1: To what extent can a pretrained language model-based clustering approach extract interpretable (RQ1a), specific (RQ1b), and robust (RQ1c) clusters in the preservice physics teachers’ written descriptions of a teaching situation?

We then examined ways in which these clusters provide insights into the composition of the written descriptions. van Es and Sherin (2002) used the concept of analytical chunks in their noticing research, referring to experts’ tendency to organize their essays more coherently in reference to teaching and learning principles. Based on this concept of analytical chunks, we hypothesize that the analysis of interconnections between the clusters in the teachers’ written descriptions provides tools to develop a more quantitative understanding of chunks in the writing. To analyze the organization of the teachers’ written descriptions based on the extracted clusters, we explored dependencies among clusters:

  • RQ2: What kinds of dependencies with respect to textual organization can be analyzed based on the extracted clusters?

Method

Written Descriptions of a Video-Recorded Teaching Situation in Physics

In the present study, preservice physics teachers were instructed to describe, evaluate and reason about a video-recorded lesson, which presented them with an authentic teaching situation in a 9th-grade physics classroom taught by an in-service physics teacher. Overall, the teaching goal of the observed lesson was to introduce influencing factors on the movement of falling objects and the definition of free fall. Table 2 outlines the chronological order of events in the teaching situation. The teaching situation can be broadly divided into two phases. In the first phase, the teacher performed several experiments with falling objects (two masses, and a vacuum tube with screw and feather). The students posed hypotheses on the outcome of the experiments (e.g., which of the two masses of different weight will hit the floor first). In the second phase, the teacher provided the definition of free fall and students devised experiments to investigate what type of movement free fall is. This video-recorded teaching situation was chosen because it presents preservice physics teachers with a complex and authentic teaching situation where many different noticing-relevant general and subject-specific issues could be identified. Teachers could describe mere surface-level, general issues, such as that the students were noisy on several occasions, or more deep-level, subject-specific issues, such as that several students raised concerns with the experimental setup (e.g., missing control of variables) or conceptual difficulties (e.g., whether an ever-accelerating object reaches the speed of light). Following the classification rubric for noticing research in science education by Chan et al. (2021), our approach was meant to characterize teacher noticing (purpose) as assessed through observation of other teachers’ teaching (teaching context), where the observing teachers could not control what happened (role of teacher) and the noticing-relevant events were pre-determined (what to notice) and selected by the researchers (selection of probes) with open-ended prompts (nature of prompt) and divergent answers without a correct answer (type of teacher responses).

The video is about 17 minutes long. The preservice physics teachers were allowed to watch the video only once, without rewinding the recording, in order to simulate the in-the-moment pressures of decision-making (Chan et al., 2021). It was an authentic lesson that was recorded in a German grade 9 high school physics classroom as part of a post-university physics teacher preparation program. In Germany, after the university-based teacher training, teacher trainees are required to pass a one- to two-year program, run by the federal states, that determines whether they are allowed to teach in public schools. A recorded lesson from this post-university teacher preparation program is thus proximal to what the preservice teachers will do in their future careers. Overall, N = 75 preservice physics teachers participated in the study and produced 86 written descriptions (some preservice teachers produced two texts, before and after a seminar). The teachers varied in their teaching experience and came from three different universities throughout Germany (see Table 1). Preservice teachers spent approximately one hour on the entire questionnaire of the online video vignette. The text production took approximately 20 minutes (in addition to the 17 minutes of video observation and another 20 minutes answering further questions). Preservice physics teachers were instructed to first describe what happened in the teaching situation. Afterwards, they were asked to evaluate the situation, devise alternative modes of action, and formulate consequences for their own teaching.

Table 1 Sample description

Given that preservice physics teachers described, evaluated, and reasoned about the observed teaching situation, the sentences that count as descriptions were extracted with an ML-based classifier. The ML-based classifier automatically retrieved descriptive sentences based on a classification algorithm that was described elsewhere (Wulff et al., 2022). This classifier annotated each sentence with one of the following labels: “circumstances”, “description”, “evaluation”, “alternatives”, and “consequences.” Using sentences as the segmentation units was found to be a reasonable strategy in similar contexts of writing analytics (Ullmann, 2019). The descriptive sentences were further filtered to a length greater than four words to remove headlines and similar non-informative sentences; 98% of the original descriptive sentences remained (1,537 sentences in total). The preservice teachers wrote on average 16.0 (SD = 7.9, min: 4, max: 59) words per descriptive sentence. In descriptive sentences, the preservice teachers wrote in various ways about the events in the lesson as outlined in Table 2. A randomly drawn sentence from a preservice physics teacher reads as follows: “The observations [from the students] and differences [to the hypotheses] were collected and summarized by the teacher as free falling movement is independent of the mass.” This sentence and all words and sentences in the following were translated from German to English by the authors, who are familiar with the English language, in particular with specialized vocabulary in physics. Some intricacies emerged with the translations. For example, German has many specific abbreviations in educational contexts, e.g., “SuS” (“Schülerinnen und Schüler”) for female and male students or “LK” (“Lehrkraft”) as an inclusive word for teacher, which have no equivalents in English. We tried to highlight those issues where they occur. Furthermore, German is well known for its compound nouns that can become very long (e.g., “Fallröhrendemonstrationsexperiment”, which can be translated to “demonstration experiment with drop tube”). In German, compound nouns may count as one word in the vocabulary, whereas in English many different words would be added. Consequently, the German vocabulary in terms of distinct words is larger compared to the English vocabulary.Footnote 1
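
The filtering step can be sketched as follows, assuming the classifier output is available as (label, sentence) pairs; the variable names and example sentences are hypothetical and only illustrate the rule of keeping descriptive sentences longer than four words.

```python
# Minimal sketch of retaining descriptive sentences longer than four words;
# the annotated pairs below are invented for illustration.
annotated = [
    ("description", "The teacher drops two masses of different weight."),
    ("description", "Free fall"),                      # headline-like fragment
    ("evaluation", "The lesson goal was reached in my opinion."),
]

descriptive = [
    text for label, text in annotated
    if label == "description" and len(text.split()) > 4
]
# Keeps only descriptive sentences with more than four words,
# removing headlines and similar non-informative fragments.
print(descriptive)
```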

Table 2 Sequencing of the lesson

Clustering Sentences of the Written Descriptions

ML methods that extract patterns in unstructured data such as the constructed responses are categorized as unsupervised ML. Unsupervised ML typically includes some form of dimensionality reduction and clustering, oftentimes with the purpose of making high-dimensional data human-interpretable. Clustering approaches that were not based on pretrained language models enabled science education scholars to identify emergent topics in conferences or students’ writing (Odden et al., 2020; Sherin, 2013); however, they oftentimes require involved preprocessing of the data (Angelov, 2020; Odden et al., 2020; Zehner et al., 2016). Most often, researchers needed to remove frequent words (stopwords), lower-case all words (which might be disadvantageous in German, where capitalization can differentiate word senses), or transform words into their base form to reduce vocabulary size (Odden et al., 2020; Rosenberg & Krist, 2020). Furthermore, researchers noted the difficulty of determining the number of clusters that should be extracted in these approaches (Sherin, 2013), and these approaches oftentimes assume that word order in the sentences is irrelevant (bag-of-words assumption). Finally, these approaches are ignorant of ambiguous word senses. No prior information on the words is incorporated, so that the word “bank” in the phrases “river bank” and “bank robbery” might be treated as the same word even though the meanings differ substantially. Recently, however, advances in NLP and ML research yielded pretrained language models that provide contextualized embeddings for language data and help to cope with some of the aforementioned challenges. These contextualized embeddings potentially enable researchers to model constructed responses in a more language-sensitive way that preserves word ordering and word sense disambiguation as features.

Pretrained language models can generate contextualized embeddings for language input that enhance modeling of the language data (Mikolov et al., 2013; Sherin, 2013; Taher Pilehvar & Camacho-Collados, 2020). Essentially, words are mapped to positions in a high-dimensional vector space, called a distributed representation in the form of embeddings (Taher Pilehvar & Camacho-Collados, 2020). Vector space models thus encode word similarity and efficiently represent words. Given the claim that a word is known by the company it keeps (Jurafsky & Martin, 2014), word embeddings can be learned through ML approaches, where model weights are optimized with the goal that the embedding for a given word predicts the context words (Mikolov et al., 2013). More advanced approaches utilize pretrained language models that result in embeddings that also account for the context (contextualized embeddings) and the position in a segment at which a word occurs (Taher Pilehvar & Camacho-Collados, 2020). Pretrained language models are typically trained on large unstructured datasets (e.g., web text, Wikipedia). Training tasks involve prediction of context words (Devlin et al., 2018). For practical purposes the vocabulary is often restricted to some 30,000 tokens, from which unknown words can be composed. Linguists have estimated that 30,000 words are sufficient to understand many general English texts well (Mitchell, 2020). If a sentence is input into a pretrained language model, the typical output is an embedding for each word in the sentence (given its position and context words). To generate a contextualized embedding for a sentence, the word embeddings can be pooled.
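
A minimal sketch of this pooling step is given below, using a BERT-style model from the Hugging Face transformers library; the specific German checkpoint and the mean-pooling strategy are illustrative assumptions, not necessarily the exact configuration used in this study.

```python
# Minimal sketch: contextualized token embeddings from a pretrained language
# model, mean-pooled into one embedding per sentence.
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "bert-base-german-cased"   # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

sentences = ["Die Lehrkraft lässt zwei Massestücke fallen.",
             "Die Schüler stellen Hypothesen auf."]

encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

# output.last_hidden_state: (batch, tokens, hidden) contextualized token embeddings.
# Mean pooling over tokens (ignoring padding) yields one vector per sentence.
mask = encoded["attention_mask"].unsqueeze(-1)
sentence_embeddings = (output.last_hidden_state * mask).sum(1) / mask.sum(1)
print(sentence_embeddings.shape)   # (2, hidden_size)
```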

As an illustrative example of sentence embeddings based on pretrained language models, consider the following physics-related and general sentences (some noise data points were added, which will be motivated later on): ’Earth exerts a force’, ’The force acts on’, ’The force on earth’, ’We force her’, ’They force him’, ’How to force him’, ’Grass is green’, ’The sunset can be red’, ’Green is grass’ (called Segments 1 to 9, respectively). “Force” in the first three sentences relates to the physics meaning (used as a noun). In the following three sentences, “force” occurs as a verb that encapsulates a rather aggressive behavior. The final three sentences are included as sentences that are entirely different in meaning. “Force” in the first three sentences thus has a different word sense than in sentences 4 to 6 and should be distinguished by a clustering approach. In Fig. 1(a) a two-dimensional representation of the sentence embeddings gleaned from a pretrained language model is depicted. As can be seen from the separation of data points in space, the pretrained language model’s embeddings can in fact disentangle the senses to a certain degree. To further inspect the embedding space, a clustering approach can now determine which sentences are likely related to each other (Angelov, 2020).
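
A sketch of how such a two-dimensional representation can be produced is given below, assuming a sentence-transformer checkpoint for the embeddings and UMAP for the dimensionality reduction; both choices are illustrative and not necessarily those underlying Fig. 1.

```python
# Minimal sketch: embed the example segments and project them to 2D.
# Requires the sentence-transformers and umap-learn packages.
from sentence_transformers import SentenceTransformer
import umap

segments = ["Earth exerts a force", "The force acts on", "The force on earth",
            "We force her", "They force him", "How to force him",
            "Grass is green", "The sunset can be red", "Green is grass"]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint
embeddings = embedder.encode(segments)               # (9, embedding_dim)

# Project to two dimensions for visualization; small n_neighbors because the
# example set is tiny.
coords = umap.UMAP(n_neighbors=3, n_components=2,
                   random_state=42).fit_transform(embeddings)
print(coords.shape)   # (9, 2)
```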

Fig. 1 a Two-dimensional representation of the example segments and noise. b Surface plot of probability density of the data points. c Minimal spanning tree with data points as nodes (colors indicate the mutual reachability distance). d Dendrogram of clusters for varying density values (colored circles indicate clusters)

Extracting clusters from contextualized embeddings can be done with hierarchical density-based spatial clustering of applications with noise (HDBSCAN) (Campello et al., 2013). HDBSCAN is a way to determine the number of dense volumes (i.e., clusters) in the embedding space. Density-based clustering methods consider the probability density of a collection of data points (Kriegel et al., 2011). In Fig. 1(b) the probability density distribution for the data points in Fig. 1(a) is depicted. To extract clusters, an imaginary water level can be introduced into the probability space. The water level represents a threshold for cluster extraction. Emerging islands, i.e., regions above the water level, represent clusters. If the water level rises, less probability mass lies above it, and thus fewer clusters are extracted. A suitable water level has to be chosen in order to extract an appropriate number of clusters.

To perform the actual clustering, the nearest neighbors for each data point are determined, and the closest distances between nearest neighbors are represented as edges in a graph, i.e., the minimal spanning tree (see Fig. 1(c)). A threshold parameter (i.e., the minimal distance) is then varied, and edges that surpass the threshold are removed from the graph. Finally, the minimal spanning tree is mapped into a condensed tree representation (see Fig. 1(d)). The condensed tree depicts the number of data points in a cluster (width of the branches) for varying densities (\(\lambda\)). A way to extract clusters from the condensed tree is by defining a minimal cluster size and examining the stability of the branches over different density values (moving up and down in Fig. 1(d)). It is desirable to have clusters that persist over varying density levels. The stability of a cluster basically relates to the regions of maximum area in the condensed tree (Kriegel et al., 2011; Campello et al., 2013). The algorithm thus determines the number of clusters by examining properties of the clusters. For the illustrative example, the resulting clusters based on this clustering approach (HDBSCAN combined with pretrained language models) are depicted as blue, orange, and green ovoids in Fig. 1(d). The red ovoid could be considered noise, given its instability over density values in Fig. 1(d). If the sentence embedding points in Fig. 1(a) were colored accordingly, closely aligned sentences would in fact receive the same colors.
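
A minimal sketch of this clustering step with the hdbscan library is given below; the synthetic 2D points stand in for projected sentence embeddings, and the parameter values are illustrative.

```python
# Minimal sketch: HDBSCAN on a small set of 2D points
# (standing in for the projected sentence embeddings of Fig. 1(a)).
import numpy as np
import hdbscan

rng = np.random.default_rng(0)
# Three well-separated groups of points plus a few scattered noise points.
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.1, size=(5, 2)),
    rng.normal(loc=(3, 0), scale=0.1, size=(5, 2)),
    rng.normal(loc=(0, 3), scale=0.1, size=(5, 2)),
    rng.uniform(low=-1, high=4, size=(3, 2)),
])

clusterer = hdbscan.HDBSCAN(min_cluster_size=4, gen_min_span_tree=True)
labels = clusterer.fit_predict(points)   # label -1 marks points treated as noise
print(labels)

# The minimal spanning tree (Fig. 1(c)) and condensed tree (Fig. 1(d)) can be
# inspected via clusterer.minimum_spanning_tree_.plot() and
# clusterer.condensed_tree_.plot() (requires matplotlib).
```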

Analysis Procedures

Interpretability of Clusters (RQ1a)

In order to evaluate whether the outputs of the pretrained language model-based clustering approachFootnote 2 represent interpretable clusters, the most representative words for each cluster were considered, and a definition was derived. Visual inspection of the two-dimensional embedding space and the condensed tree representation helped to determine similarities and differences of the clusters. If the five most representative words could be mapped to distinct sections in the observed teaching situation (see Table 2) and were coherent, then we considered this as evidence of a meaningful cluster, because clusters were anticipated to attend to localizable events (e.g., experiments) or actions (e.g., devising hypotheses). We also assessed to what extent the clusters related to physics ideas that were implicitly or explicitly relevant in the observed teaching situation, and what ideas or events were not clustered.
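
One possible way to obtain such representative words, sketched below, is to treat all sentences of a cluster as one document and rank the words by class-based TF-IDF weights; this scoring is an illustrative assumption, as the study does not prescribe the exact computation.

```python
# Minimal sketch: top words per cluster via class-based TF-IDF
# (cluster labels and sentences are assumed to be available).
from collections import defaultdict
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_words_per_cluster(sentences, labels, n_top=5):
    # Concatenate all sentences of a cluster into one "document" per cluster.
    docs = defaultdict(list)
    for sent, lab in zip(sentences, labels):
        if lab != -1:                              # ignore the noise cluster
            docs[lab].append(sent)
    cluster_ids = sorted(docs)
    joined = [" ".join(docs[c]) for c in cluster_ids]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(joined).toarray()
    vocab = np.array(vectorizer.get_feature_names_out())

    # Return the n_top highest-weighted words for each cluster.
    return {c: vocab[np.argsort(tfidf[i])[::-1][:n_top]].tolist()
            for i, c in enumerate(cluster_ids)}
```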

Specificity of Clusters (RQ1b)

It was then evaluated to what extent physics-savvy human raters could use the extracted clusters to manually annotate the video-recorded teaching situation. If human raters struggled to annotate a certain cluster in the video recording, this would provide evidence of an unspecific cluster focus. To annotate the teaching situation on the basis of the extracted clusters, three independent raters with a physics background (one postdoc, two PhD students) who were familiar with the observed teaching situation annotated the entire video sequence in 10-second intervals. The only information they received was the five most representative words for the respective clusters (coding 1), with no further instruction. In a second iteration (coding 2), the human raters discussed and agreed on some coding rules, e.g., that the entire process of an experiment should be annotated if relevant words of a cluster occurred only at the beginning. To evaluate the reliability of this annotation, we first examined a graphical representation of the annotations over time, considering each cluster separately. Krippendorff’s \(\alpha\) was then calculated for each cluster, because Krippendorff’s \(\alpha\) is more appropriate than Cohen’s \(\kappa\) for three raters. A Krippendorff’s \(\alpha\) value of 1 refers to perfect reliability and a value of 0 to absence of reliability. Values between .667 and .800 are usually considered to allow researchers to draw tentative conclusions, i.e., consider the agreements as non-random (Krippendorff, 2004).
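
A minimal sketch of this reliability computation with the Python krippendorff package is given below; the binary codings are invented for illustration and do not reproduce the study’s data.

```python
# Minimal sketch: Krippendorff's alpha for one cluster across three raters.
import krippendorff

# Rows: raters; columns: 10-second intervals; 1 = cluster coded, 0 = not coded.
reliability_data = [
    [1, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
]

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(round(alpha, 3))
```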

Robustness of Clusters (RQ1c)

To analyze the robustness of the clusters, the clustering approach was applied to smaller subsets of the dataset. To test whether small sample sizes suffice, subsets of N = 43 and N = 8 randomly chosen preservice teachers were considered, and the extent to which similar clusters emerge was examined. If meaningful clusters could be identified in these subsets, then we considered the algorithm robust to sample size variations, which could be beneficial for science education researchers who oftentimes only have small samples at their disposal. Furthermore, we compared the outputs of the pretrained language model-based clustering algorithm with a clustering approach that was not based on pretrained language models but had been successfully applied in a science education research context before. We therefore adopted the topic modeling approach outlined by Sherin (2013), who devised an accessible approach for extracting clusters in interview transcripts. He started by segmenting texts into chunks of 100 words (with overlap). Afterwards, a normalized term-document matrix was formed. To circumvent the problem of similar topics (low levels of variability in the data), deviation vectors were calculated. Based on the deviation vectors, hierarchical agglomerative clustering yielded a distribution of topics, depending on the number of topics. Finally, the ten most representative words were found as the highest-ranking words in the centroid vectors for the respective topics. With parameter values adapted to our research context, we extracted clusters from our descriptions based on this approach. Based on a comparison of the ten most representative words for each topic, we evaluated to what extent both clustering approaches yield similar topics. This would yield evidence that the pretrained language model-based approach could also be successfully employed in science education research contexts.
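
The comparison approach can be sketched as follows; computing the deviation vectors by subtracting the mean segment vector is one reading of the procedure and is stated here as an assumption, as are the function and variable names.

```python
# Minimal sketch of a Sherin (2013)-style pipeline: normalized term-document
# matrix, deviation vectors, hierarchical agglomerative clustering.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import AgglomerativeClustering

def sherin_style_clusters(segments, n_clusters=14, stopwords=None):
    # Normalized term-document matrix (rows: segments, columns: words).
    counts = CountVectorizer(stop_words=stopwords).fit_transform(segments)
    tdm = normalize(counts.toarray(), norm="l2", axis=1)

    # Deviation vectors: subtract the mean segment vector to emphasize what
    # distinguishes a segment from the corpus as a whole (assumed reading).
    deviations = tdm - tdm.mean(axis=0)

    # Hierarchical agglomerative clustering into a fixed number of topics.
    clustering = AgglomerativeClustering(n_clusters=n_clusters)
    return clustering.fit_predict(deviations)
```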

Advanced Textual Analytics Based on the Clusters (RQ2)

The applicability of the pretrained language model-based clustering for analytics of the constructed responses was evaluated through an exploratory analysis of their textual organization. Based on episodic memory theory, it can be expected that the preservice teachers provide a chronologically ordered text organization. Hence, the temporal progression of the clusters within the teachers’ written descriptions was analyzed. To depict the temporal progression of the clusters within the written descriptions, the sentences were mapped to their relative position in reference to the other descriptive sentences for each teacher (see similar analysis in: Sherin, 2013). Mapping the sentences to their relative position was expected to produce peaks where clusters are most prevalent in the descriptions. For example, it could be expected that mentioning the introduction with hedgehog and hare or the teacher experiments precedes other clusters such as the discussion of the type of movement, because these events appeared first in the observed teaching situation and teachers are expected to describe the teaching situation chronologically. Distinctiveness in temporal progression would indicate that the extracted clusters in fact captured different aspects of the teaching situation. To further analyze textual organization, we employed a network-analytical approach to calculate the centrality of different clusters and a vector-field approach in which the movements through cluster space can be characterized. In both approaches we evaluate to what extent the respective empirical distributions, i.e., the directed network of clusters and the vector-field representation, are better captured by random processes or by more deterministic processes. If teachers’ written descriptions can be characterized by more deterministic processes, we can conclude that the presented clustering approach can yield insights into textual organization.
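
The relative-position mapping can be sketched as follows, assuming each written description is available as an ordered list of cluster labels; the data structure and the bin width are illustrative choices.

```python
# Minimal sketch: tally how often each cluster occurs at each relative
# position (0.0 to 1.0) within the written descriptions.
from collections import Counter

def relative_position_counts(descriptions, n_bins=11):
    """descriptions: list of per-teacher lists of cluster labels,
    one label per descriptive sentence in textual order."""
    counts = {}
    for labels in descriptions:
        n = len(labels)
        for i, cluster in enumerate(labels):
            rel = i / (n - 1) if n > 1 else 0.0          # relative position
            bin_ = round(rel * (n_bins - 1)) / (n_bins - 1)
            counts.setdefault(cluster, Counter())[bin_] += 1
    return counts

# Example with invented label sequences for two teachers.
print(relative_position_counts([[13, 10, 1, 2, 2], [13, 2, 8, 9]]))
```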

Findings

Validity of the Clustering Approach (RQ1)

Interpretability of the Extracted Clusters (RQ1a)

To evaluate the interpretability of the extracted clusters, contextualized embeddings of the preservice physics teachers’ descriptive sentences were generated with the pretrained language models and clusters were extracted with the HDBSCAN algorithm. This approach yielded 14 clusters and a noise cluster (cluster -1). The absolute sizes (number of sentences in a cluster) are depicted in Table 3. We also provided a definition of the clusters based on the most representative words for each cluster, and we determined how many sentences per written description on average were categorized into each cluster (see Table 3). The largest share of sentences was coded as -1.Footnote 3 The graphical representation of the embedding space with clusters highlighted in colors can be seen in Fig. 2. The embedding space can be fundamentally separated into two overarching groups (indicated by the black line): (1) clusters that relate to physics-related events or topics that occurred during the teaching situation and (2) clusters that encapsulate general actions and specific, non-subject-related events. In group 1, cluster 2 addresses the central experiment of the lesson where a feather and a screw are observed falling in a vacuum tube. Cluster 2 had the second largest share of sentences in the descriptions (see Table 3). Relatedly, cluster 10 likely represents the students’ hypotheses that the screw has a higher weight, whereas the feather has a high air resistance. Clusters 0 and 1 represent the other experiment, in which two mass pieces (equal shape, different mass) are dropped simultaneously to deduce that free fall is independent of mass. Clusters 8 and 9 refer to the teacher’s question about which type of movement a free fall is and how this type of movement can be experimentally determined.

In group 2, on the other hand, clusters 6 and 7 represent the teacher’s and students’ actions of summarizing and posing hypotheses/claims, respectively. Given the similarity of clusters 6 and 7, they were also close in embedding space. Cluster 6 was related to the posing of hypotheses by the students, whereas cluster 5 was related to the process of summing up the hypotheses by the teacher. In fact, this was a recurrent thread in the lesson: the teacher asked the students to hypothesize about the results in advance of an experiment, which is why the cluster was coded at several points. Cluster 3 also refers to the teacher’s responses to students’ answers. Cluster 4 represents the students’ actions of raising their arms and responding to the teacher’s questions. Cluster 13 captured the beginning of the lesson, where the teacher reminds the students of the previous lesson regarding the race between the hedgehog and the hare. Finally, cluster 12 referred to the instruction by the teacher that the students may copy the definition of free fall into their folders.

In sum, the clusters encapsulate both short and rather specific events in the teaching situation (e.g., writing the definition of free fall in the folder) and more abstract ideas such as summarizing hypotheses, which occurred more than once in the scene. They also include more general clusters (summarizing students’ hypotheses, e.g., cluster 5) and more physics-related contents (characterization of the type of movement, e.g., clusters 8 and 9). Preservice physics teachers wrote on average 3.4 sentences on cluster 2, which comprised the largest share (after the noise cluster), followed by clusters 1 and 11 with 1.8 sentences on average. Thus, physics-specific clusters were more extensively included in the written descriptions. However, the overall low average count of one sentence for a cluster could indicate that the preservice teachers oftentimes only briefly elaborated on an event. It is also noteworthy that some important events in the teaching situation are not captured in a cluster. During the lesson the students asked, for example: “Why is it called free fall for a parachute jumper?”, “Would an infinitely accelerating mass surpass the speed of light?”, or “Would two plates, one made of cardboard the other made of metal, actually arrive on the floor at the same time?”

Fig. 2 Two-dimensional representation of clusters. A point represents the projection of a sentence embedding into the two dimensions. Colors represent belonging to a cluster. Gray points represent “noise”, i.e., not belonging to any cluster. Larger points indicate cluster centroids

Table 3 Number, share (i.e., number of segments) and top five words of the extracted topics. M, \(M\!d\) are mean and median number of sentences for a cluster in a written description, respectively. \(S\!D\) is the standard deviation; range is minimum and maximum number of sentences

Specificity of the Extracted Clusters (RQ1b)

To examine to what extent the extracted clusters map to discernible events and topics in the teaching situation, human raters used the clusters, as represented through the most informative words, to annotate the video recording of the teaching situation (RQ1b). Figure 5 depicts all codings from the three independent annotators, separated by cluster over time. To estimate human interrater agreement, we calculated the Krippendorff \(\alpha\) values for the clusters. After the first round of rating the video-recorded teaching situation (coding 1), the Krippendorff \(\alpha\)’s indicate that some clusters (e.g., 0, 1, 2, 8, 12, and 13) could be identified with good reliability given only the five most representative words and no annotator training. Cluster 12 related to the introduction of the definition of free fall by the teacher. This, apparently, was a localizable event in the teaching situation. Cluster 0 related to the experiment with two masses (similar reasoning applies to cluster 1). The teacher used two masses only once as an experiment; hence, this formed a recognizable event for the human raters. Cluster 13 related to the very beginning of the lesson. The words “hedgehog” and “hare” are unique to this event. The human annotators reached poor reliability on clusters with more general words (e.g., 3, 7, and 11). The words “respond”, “feedback”, “summarize”, and “teacher” could be applied to many different events in the teaching situation. They represent high-inference categories, because the teacher and students did not specifically say that they “responded” or “summarized” ideas.

After coding 1, the three annotators made their coding rules more explicit and discussed them. On this basis, the video recording of the teaching situation was annotated again by all three annotators (coding 2). Some improvements could be seen after the discussion. Most notably, clusters 1, 2, 4, and 9 substantially improved in interrater agreement (see Table 4). Cluster 9 showed the most substantial improvement. This cluster related to the measurement and determination of the type of movement. The raters agreed to include all student suggestions at the end of the teaching situation because this represented a coherent phase, which caused the improvement in agreement. However, other clusters (3, 6, 7, 10, 11) seemed to remain too vague to be annotated based on the five most representative words.

Table 4 Values for interrater agreement as measured through Krippendorff’s \(\alpha\) for each cluster

Robustness of the Extracted Clusters (RQ1c)

To evaluate the robustness of the extracted clusters, we probed to what extent the clustering algorithm would still yield interpretable and comparable clusters for smaller sample sizes. The clusters extracted from the entire dataset formed the baseline for comparison (see Fig. 2). As sample sizes in noticing research in science education are typically smaller, subsets of N = 43 and N = 8 were drawn, and the entire clustering approach was performed for these subsets of the data. The resulting cluster embeddings and condensed trees can be seen in Fig. 4. In particular, we mapped the extracted clusters, based on the top five words, to the baseline clusters extracted from the entire dataset. It is noteworthy that the spatial outline and the actual extracted clusters can be mapped well onto each other. This is even possible for a sample size of only N = 8 teachers. The two overarching groups (general and physics-specific) could be identified for the subsamples as well. Based on the condensed trees, some similarities in cluster evolution over different density values can be inferred as well. For example, clusters 8 and 9 seem related in all condensed trees, as they evolve from a common branch. Both clusters comprise sentences on the type of movement, which are physics-specific. Interestingly, in Fig. 3, clusters 10 and 11 also fall on the same branch as 8 and 9. This might be attributed to the fact that clusters 10 and 11 consider the influence of air resistance on free fall, which is closely related to movement as well. While clusters 0 and 1 are linked in Fig. 3 (both relate to the experiment with the two masses), this link does not exist in Fig. 4. For these clusters, the five most representative words are probably not informative enough to allow for a clear distinction. Clusters 4, 5, and 6 relate to the students’ and teachers’ actions of posing hypotheses (see Fig. 3). While they neatly evolve from one parent branch in Fig. 3, only one of the respective clusters was present in the smaller samples. However, they also separate early (at low densities) from the other clusters (see Fig. 4).

Further evidence for the robustness of the presented clustering approach based on pretrained language models can be gleaned by comparison with a clustering approach that was formerly successfully employed in science education research and was not based on pretrained language models. To implement a clustering approach based on hierarchical agglomerative clustering, a protocol similar to that outlined in Sherin (2013) was followed. However, we did not segment our texts into 100-word chunks, but rather used the sentences as the smallest segments. We considered this useful, because we expected the grain size of our clusters (i.e., discernible events in the teaching situation) to be smaller compared to the grain size of the clusters in Sherin (2013), i.e., explanations. Our overall vocabulary comprised 2,786 unique German words; 232 stopwords were removed. This enabled us to calculate deviation vectors and apply clustering. A number of 14 clusters was found to be reasonable for our data (see Supplementary Material for a detailed table).

Fig. 3 Condensed tree representation of the extracted clusters

Fig. 4 Scatter plots and condensed trees for cluster evaluation of smaller samples (N = 8 and N = 43 teachers)

Table 5 depicts the resulting clusters with the most representative words for each cluster vis-à-vis the clusters from the pretrained language model-based clustering approach. Most of the resulting clusters can be mapped to the clusters that were extracted based on the pretrained language model-based clustering approach. Cluster 0SFootnote 4 addresses students’ formulating hypotheses and summarization by the teacher. This relates to clusters 3, 5, and 7. Clusters 1S and 2S relate to the vacuum tube experiment, where cluster 1S focuses on the execution and cluster 2S on the observation and results. This maps to cluster 2. Cluster 3S relates to the dependency of air resistance and fall velocity, and possibly relates to clusters 10 and 11. Cluster 4S is not entirely clear, and cluster 5S deals with the teacher repeating the experiment, which has no apparent equivalent cluster. Cluster 6S focuses on students’ raising their arms and responding, which could be mapped to cluster 4. Cluster 7S relates to the writing down of the definition of free fall, which can be linked to cluster 12. Cluster 8S likely mixes the response of one female student with the remark of another, male student about whether a falling object would reach the speed of light. No apparent link can be made to the pretrained language model-based clusters. Cluster 9S relates to the experiment with two masses and would most likely map to clusters 0 and 1. Cluster 10S addresses the transition from the introduction to the experiments, with no apparent corresponding cluster. Cluster 11S, again, deals with the experiment with two masses and links to clusters 0 and 1. Cluster 12S addresses a student’s answer to the question about what kind of movement the free fall is. The closest resemblance is with cluster 8. Finally, cluster 13S addresses the vacuum tube experiment, in particular its repetition. No apparent equivalent exists in the pretrained language model-based clustering approach. Finally, we calculated the proportion of sentences in each cluster from the approach by Sherin (2013) that were classified as noise in the pretrained language model-based clustering approach. The respective proportions for each cluster were: 0.46 (0S), 0.28 (1S), 0.45 (2S), 0.40 (3S), 0.60 (4S), 0.60 (5S), 0.48 (6S), 0.38 (7S), 0.62 (8S), 0.35 (9S), 0.61 (10S), 0.32 (11S), 0.34 (12S), and 0.09 (13S). Clusters 4S, 5S, 8S, and 10S had particularly large shares of noise-clustered sentences. Interestingly, these clusters could not be easily mapped to the clusters from the pretrained language model-based clustering approach (however, cluster 13S, with a particularly low proportion, could also not be assigned). They also consistently included generic words (e.g., teacher or students), which were assigned to the noise cluster in the pretrained language model-based clustering approach (see Fig. 5).

Fig. 5 Codings of the video sequence (coding 2) with identified clusters for three independent raters

Table 5 Comparison of clusters extracted from the pretrained language model-based clustering approach and the clustering approach that was adopted from Sherin (2013), and the respective mapping

Exploring Textual Organization with the Extracted Clusters (RQ2)

To evaluate to what extent the extracted clusters provide quantifiable information on the textual organization of the written descriptions, we first plot the occurrence of clusters throughout the written descriptions, then examine the non-random organization of the clusters, and finally examine properties of the cluster embeddings. The occurrence of clusters throughout the written descriptions is depicted in Fig. 6. The vertical bars indicate the textual position of the respective maximum occurrence of a certain cluster. The textual positions of the maxima are spread evenly throughout the written descriptions, so that all parts of the written descriptions are attributed with a cluster. Furthermore, the clusters occur at expected positions, given the events in the teaching situation. For example, cluster 13 addressed the beginning of the lesson, and it occurred most frequently at the very beginning of the written descriptions (see Fig. 6). In the observed teaching situation, three experiments were carried out one after the other: free fall of a screw and a feather (cluster 10), free fall of two masses of the same size but different weights (cluster 1), and, finally, free fall in a vacuum tube (cluster 2). Cluster 10 appeared at the beginning of the texts. Cluster 1, in contrast, appeared somewhat later, which maps to the temporal sequence of events in the observed teaching situation, since the two experiments referenced in these clusters were carried out shortly after each other in the first half of the video. Cluster 2 was addressed frequently and extensively throughout the descriptions. In fact, cluster 2 relates to the most noteworthy experiment (vacuum tube) in the entire teaching situation, which might explain its preponderance in the written descriptions.

A problem (cluster 0) occurred during the second experiment (cluster 1). The shapes of the curves for clusters 0 and 1 match well (as is also evident in Fig. 3). Before the first experiment, the teacher summarized the "main hypotheses"; the corresponding cluster 5 for this event also occurred chronologically at the beginning. The other actions, i.e., the formulation and discussion of hypotheses (clusters 6 and 7), the reaction to pupils' answers (cluster 3), and the pupils' answers themselves (cluster 4), occurred throughout the teaching situation, which is reflected in the considerably high frequency throughout the first half of the written descriptions in Fig. 6. Cluster 11 related to the discussion of the connection between air resistance, mass, and fall velocity. This was also related to the experiments seen (observations were described and interpreted; hypotheses regarding the connection were posed and tested). The temporal progression was appropriate: fewer occurrences at the beginning and more towards the middle of the texts. Cluster 12 addressed summarizing the findings of the three experiments. It occurred quite often at the beginning of the descriptions, which does not correspond to the chronological sequence of events. The reason for this could be that some preservice physics teachers began their descriptions with the goal or result of the sequence. Otherwise, cluster 12 had its second peak before clusters 8 and 9, which again fits the temporal sequencing of events in the teaching situation. At the end of the sequence, the teacher asked what kind of movement the free fall is. The corresponding clusters were the question itself (cluster 8) and the discussion about it (cluster 9). They occurred most often in the middle of the texts, which corresponded to the end of the written descriptions. The noise cluster (cluster -1) occurred almost equally distributed throughout the written descriptions. The respective counts for each relative position were: 57 (0.0), 71 (0.1), 91 (0.2), 73 (0.3), 79 (0.4), 71 (0.5), 66 (0.6), 68 (0.7), 88 (0.8), 76 (0.9), and 20 (1.0). This provides evidence that no particular position in the written descriptions was more prone to include noise sentences than other positions. The lower counts at the beginning and end positions resulted from the calculation of the relative position index.
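For transparency, the following sketch shows one way the relative position index and the per-position counts (such as those reported for the noise cluster) could be computed. The rounding scheme and the data structure holding the ordered sentence labels per document are assumptions, not the study's exact code.

```python
# Sketch: relative position index of each sentence in its document and counts
# of cluster occurrences per position (as used for Fig. 6 and the noise counts).
# `docs` maps a document id to the ordered cluster labels of its sentences;
# names and the rounding scheme are assumptions.
from collections import Counter


def position_counts(docs, cluster):
    """Count occurrences of `cluster` at relative positions 0.0, 0.1, ..., 1.0."""
    counts = Counter()
    for labels in docs.values():
        n = len(labels)
        for i, label in enumerate(labels):
            if label == cluster:
                # relative position of sentence i, rounded to one decimal
                rel_pos = round(i / (n - 1), 1) if n > 1 else 0.0
                counts[rel_pos] += 1
    return dict(sorted(counts.items()))
```

Rounding to one decimal makes the 0.0 and 1.0 bins half as wide as the interior bins, which is consistent with the lower counts reported for the beginning and end positions.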

Fig. 6 Progression of extracted clusters relative to other descriptive sentences in the documents. Top: absolute count of occurrence for a cluster at a given document position. Bottom: relative frequency for a cluster at a given document position. Vertical lines indicate the overall peaks in occurrence for each cluster

To analyze the sequential interdependence of the clusters, directed network graphs were generated based on the incoming and outgoing connections for each cluster (see Fig. 7). A connection between two clusters was established when one cluster occurred in the sentence preceding or following a sentence of the other cluster. Edges (i.e., the interconnections between two clusters) in the networks were weighted by the cluster sizes to highlight connections that appeared often irrespective of cluster size. The edges with the largest values were labeled with the respective values (see the small numbers on the edges in Fig. 7(a)). The empirical network graph highlights that certain clusters are central in the network (see Fig. 7(a)). Clusters -1, 2, 4, 6, and 11 had the greatest importance in the network. In particular, cluster 2 represents the vacuum tube experiment, and cluster 4 is the general cluster in which students raise their arms and respond. Hence, both physics-specific and general clusters were highly interconnected in the physics teachers' written descriptions.
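A directed cluster-transition network of this kind can be assembled from the sentence-level labels, for example with networkx. The size normalization shown below (dividing by the geometric mean of the two cluster sizes) is one plausible reading of the weighting described above and should be treated as an assumption.

```python
# Sketch: directed cluster-transition network from the sentence-level labels.
# Edges connect the cluster of one sentence to the cluster of the next sentence
# within the same document; weights are normalized by cluster sizes so that
# frequent transitions between small clusters remain visible. Assumes networkx;
# variable names and the normalization are illustrative.
from collections import Counter
import networkx as nx


def cluster_network(docs, cluster_sizes):
    """Build a directed, size-normalized transition graph between clusters."""
    transitions = Counter()
    for labels in docs.values():
        for a, b in zip(labels, labels[1:]):
            transitions[(a, b)] += 1

    G = nx.DiGraph()
    for (a, b), count in transitions.items():
        # normalize the raw count by the geometric mean of the cluster sizes
        weight = count / (cluster_sizes[a] * cluster_sizes[b]) ** 0.5
        G.add_edge(a, b, count=count, weight=weight)
    return G
```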

Analyzing the interconnections between pairs of nodes shows that clusters -1, 2, 9, 3, and 10 were self-referenced particularly often. Except for clusters -1 and 3, these clusters related to physics-specific events such as the vacuum tube experiment, the type of movement, and weight and air resistance. Moreover, clusters 8 and 9, clusters 0 and 1, and clusters 1 and 6 were interconnected particularly often. The former two connections can be directly attributed to the close connection of these clusters in meaning. The connection between cluster 1 (experiment with two masses) and cluster 6 (students' hypotheses) can be explained by the fact that the teacher linked this experiment with posing hypotheses.

Finally, the preservice physics teachers' movements through embedding space, by means of addressing specific clusters in their texts, were analyzed with streamline plots (see Fig. 7(b)-(d)). Streamline plots are vector field representations. We define the connecting vector between two sentences that belong to any of the clusters as a "velocity" vector, indicating the movement through cluster embedding space. The resulting vector field is represented in Fig. 7(b). A tendency to "move" through cluster embedding space toward the center can be verified, because the streamlines point toward the center. Comparing Fig. 7(b) with (c), which represents a vector field where every velocity magnitude and direction were chosen at random, makes evident that Fig. 7(b) does not represent a random vector field. When positional information is added to generate the velocity vector directions (see Fig. 7(d)), the resulting vector field resembles the empirical vector field. The entropies (see Footnote 5) for comparing the velocities in plots (b) with (c), and (b) with (d), in x- and y-direction, respectively, were .45 and .28, and .03 and .10. This indicates that the vector field in Fig. 7(d) better approximates the empirical vector field. Thus, the preservice physics teachers do not randomly walk through the cluster embedding space, but rather deliberately compose their texts by attending to the different clusters that were extracted with the pretrained language model-based clustering approach.
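As a rough illustration of how such a vector field could be constructed, the sketch below derives "velocity" vectors from consecutive sentence positions in a two-dimensional embedding, averages them on a regular grid, and prepares the arrays that matplotlib's streamplot expects. The binning and averaging choices are assumptions and not necessarily those used for Fig. 7.

```python
# Sketch: "velocity" field over the 2-D cluster embedding space. `points_per_doc`
# holds, per document, the ordered 2-D coordinates of its clustered sentences
# (e.g., after dimensionality reduction); grid size and averaging are assumptions.
import numpy as np


def velocity_field(points_per_doc, bins=20):
    """Average displacement vectors between consecutive sentences on a grid."""
    starts, vels = [], []
    for pts in points_per_doc:
        pts = np.asarray(pts, dtype=float)
        if len(pts) > 1:
            starts.append(pts[:-1])                 # start point of each step
            vels.append(np.diff(pts, axis=0))       # displacement to next sentence
    starts, vels = np.vstack(starts), np.vstack(vels)

    # Bin the start points onto a regular grid and average velocities per cell
    x_edges = np.linspace(starts[:, 0].min(), starts[:, 0].max(), bins + 1)
    y_edges = np.linspace(starts[:, 1].min(), starts[:, 1].max(), bins + 1)
    U, V, counts = (np.zeros((bins, bins)) for _ in range(3))
    ix = np.clip(np.digitize(starts[:, 0], x_edges) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(starts[:, 1], y_edges) - 1, 0, bins - 1)
    for row, col, (vx, vy) in zip(iy, ix, vels):
        U[row, col] += vx
        V[row, col] += vy
        counts[row, col] += 1
    nonzero = counts > 0
    U[nonzero] /= counts[nonzero]
    V[nonzero] /= counts[nonzero]

    X, Y = np.meshgrid(0.5 * (x_edges[:-1] + x_edges[1:]),
                       0.5 * (y_edges[:-1] + y_edges[1:]))
    return X, Y, U, V
```

A call to matplotlib.pyplot.streamplot(X, Y, U, V) would then render the streamline plot; a random baseline as in Fig. 7(c) could be obtained by resampling the velocity components at random.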

Fig. 7 Directed network graphs of clusters and streamline plots of cluster embeddings: (a) empirical directed network based on the actual connections between clusters present in the written descriptions; (b) streamline plot of actual connections between clusters; (c) streamline plot with randomly distributed directions; (d) streamline plot where directions are sampled from the pool of existing connections

Discussion

Attention to learning-relevant classroom events and students' thinking is an important skill for teachers to implement a student-centered pedagogy (van Es & Sherin, 2002b; Chan et al., 2021; Levin et al., 2009). However, assessing teachers' attention to classroom events is complex, because the uncertainty of teaching situations is oftentimes related to the inherent complexity of the ongoing processes, and because describing one's attention processes is intricately tied to teaching knowledge and other filters (Chan et al., 2021). Constructed response formats have been argued to facilitate more authentic assessment of attention processes, and computer-based analytical tools such as ML methods have been found to provide promising means to further our understanding and assessment of complex constructs such as attending to classroom events (Lamb et al., 2021; Zhai et al., 2020). In this paper we sought to examine the potentials and challenges of a pretrained language model-based clustering approach for the purpose of extracting patterns, i.e., clusters, in preservice physics teachers' written descriptions of an observed teaching situation. We examined the validity of the extracted clusters (RQ1) and explored novel ways in which the clusters enable textual analytics that allow researchers to examine quantitative hypotheses on textual organization (RQ2).

To assess the validity of the extracted clusters, the interpretability (RQ1a), the specificity (RQ1b), and the robustness (RQ1c) of the clusters extracted from the pretrained language model-based clustering approach were evaluated. The clustering approach identified 14 clusters that can be grouped into physics-specific and more general clusters. With regard to their contents, all clusters could be related to distinct events in the teaching situation. The clusters encapsulated short, concrete events (recapitulating the last lesson) as well as more abstract ideas (summarizing hypotheses). We found that the more specific, event-related clusters could be reliably coded by the raters. However, the more general clusters (related to posing and summarizing hypotheses) that were applicable to several parts of the teaching situation yielded lower reliability scores and are thus more inferential. The extracted clusters were also robust to variation in sample size and clustering method. A sample of only N=8 preservice physics teachers' written descriptions yielded a similar distribution of clusters. This likely resulted from grounding the clustering in embeddings from the pretrained language model. A further indication of robustness resulted from the comparison with a previously employed clustering approach in science education research (Sherin, 2013). We found that many of the clusters extracted from the pretrained language model-based clustering approach mapped to the clusters that resulted from applying the clustering approach by Sherin (2013).

Given that the clusters were well interpretable and could be mapped to the teaching situation, we conclude that the algorithm identified meaningful and distinguishable clusters in the preservice teachers' descriptions. The variety of foci and levels of abstraction in the extracted clusters is well represented within the different foci of noticing summarized by Talanquer et al. (2015). Moreover, the differentiation of more general clusters and physics-specific clusters resonates with the well-established construct of teachers' knowledge, in particular the notions of general pedagogical knowledge and content knowledge (Shulman, 1986; Carlson et al., 2019). Pedagogical content knowledge as an "amalgam of content and pedagogy" (Hume, 2009) might be conceptualized as the relevant knowledge to connect the clusters and discuss pedagogical implications of the physics-specific and more general clusters. The pretrained language model provides the relevant structures to classify sentences along this dimension. The contextualized embeddings from the pretrained language model provide science education researchers with means to extract robust clusters in their datasets. Furthermore, the pretrained language model-based clustering approach integrates the data preprocessing into the modeling and introduces a novel criterion for cluster extraction (stability of clusters over density variation) that provides the human analyst with another important measure for appropriate cluster selection.

The findings in the context of RQ1 also indicate that the preservice physics teachers' descriptions included very general clusters and a comparably large amount of noise-clustered sentences. This observation might relate to the finding that novice teachers tend to include broad and general statements in their observations, merely as placeholders (Mena-Marcos et al., 2013). Mena-Marcos et al. (2013) found that more knowledgeable teachers also include more precise statements in their reflections. Furthermore, the preservice physics teachers tended to include only few sentences on each cluster. This indicates that, on average, not much space was spent describing an event in detail. This might relate to the finding that novice mathematics and science teachers in particular struggle to attend to the specific contents of what was said (Sherin & Han, 2004; Levin et al., 2009; Roth et al., 2011). Rather than describing the concrete hypotheses that the students uttered, many teachers might abstract from the specific contents and simply note that the students posed hypotheses. Yet, developing noticing skills would require the preservice physics teachers to detail the concrete ideas of the students and the teacher in order to make an informed evaluation of the substance of the classroom interactions (Levin et al., 2009). However, the unspecific contents might also relate to our instructional approach. For example, it should be tested whether preservice teachers can attend to specific events if they can watch the video multiple times and take notes for themselves.

In the context of RQ2, we evaluated to what extent the extracted clusters could be used to assess the textual organization of the written descriptions. The absolute and relative frequencies of sentences in certain clusters with regard to their relative position in the written descriptions were analyzed through visual means. We found that the maximum counts for the clusters matched their expected positions in the teaching situation well. This suggests that the preservice physics teachers, on average, composed their written descriptions according to the chronological occurrence of the events in the teaching situation. This finding resonates with episodic memory theory, which suggests that free recall of events occurs in temporal order (Conway, 2009; Kahana et al., 2008). Further evaluation of the textual organization of clusters by means of network graphs enabled us to document that certain clusters are cued together more closely than would be expected by chance and cluster size. This means that clusters that were semantically or chronologically related were linked by the preservice physics teachers more often. This relates to the contiguity effect, namely that neighboring items (here: events in a teaching situation) are recalled successively (Kahana et al., 2008). Furthermore, the streamline plot analyses revealed that the preservice physics teachers' movement through cluster embedding space was non-random and dependent on the position in this space. On a local scale, the position in cluster space thus determines the propensity with which the preservice physics teachers move in a certain direction in this space. Analysis of textual organization can extend the assessment of analytical chunks as outlined by van Es and Sherin (2002). van Es and Sherin (2002) differentiate expertise in noticing along a trajectory in which experts include more interconnections among their evidences (here: clusters and the interconnections between them in the descriptions). The extracted clusters, alongside the network representation, would directly yield a quantification of noted events and thus provide a tool to diagnose expertise levels in noticing.

Limitations

Even though the utilization of a pretrained language model allowed us to integrate data preprocessing into the ML-based modeling, there are assumptions underlying pretrained language models that have to be critically examined. For example, the resulting contextualized embeddings are determined by the choice of the pretrained language model and cannot be easily adjusted. Problems with pretrained embeddings have also been reported. Given that they are trained on text from the Internet, certain biases related to gender or ethnicity are present in the embeddings (Caliskan et al., 2017; Bhardwaj et al., 2020). As such, it has to be critically examined to what extent these biases might be propagated into educational assessments, where they can be disadvantageous.

Another feature of the pretrained language model-based clustering approach was the algorithm-derived extraction of the number of clusters present in the data. Even though extracting clusters based on their stability over density variation might be an additional tool for researchers to determine a viable number of clusters, there are still many hyperparameters that can be tuned and that yield different numbers of clusters. Given the scope of this paper, we did not systematically vary the hyperparameters to find a final number of clusters. We rather sought to establish that the proposed number of clusters was well interpretable in reference to the observed teaching situation. However, the large proportion of noise datapoints also indicates that a large share of the data is not accounted for in the clustering.
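The terminology used above (a noise cluster labeled -1, condensed trees, and stability over density variation) is characteristic of HDBSCAN; assuming such a clusterer is applied to the (possibly dimensionality-reduced) sentence embeddings, the following sketch illustrates how strongly the number of extracted clusters can depend on a single hyperparameter. The parameter values are arbitrary examples, not the settings used in this study.

```python
# Sketch: sensitivity of the number of extracted clusters to one HDBSCAN
# hyperparameter. The use of HDBSCAN is an assumption based on the noise-cluster
# and condensed-tree terminology; `embeddings` stands for the sentence embeddings
# and is not reproduced here.
import hdbscan


def cluster_counts(embeddings, sizes=(5, 10, 15, 20, 30)):
    """Number of clusters (excluding noise) for different min_cluster_size values."""
    results = {}
    for size in sizes:
        clusterer = hdbscan.HDBSCAN(min_cluster_size=size)
        labels = clusterer.fit_predict(embeddings)
        results[size] = len(set(labels)) - (1 if -1 in labels else 0)
    return results
```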

With regard to the contents of the clusters, it was noticeable that the clustering approach did not capture some relevant student questions from the observed teaching situation in a distinct cluster, even though some preservice teachers included them in their descriptions. Attending to these student questions in the teaching situation required physics knowledge. One student asked whether the different movements of the feather and the screw (the feather was zig-zagging whereas the screw moved straight to the ground) could explain the differences in falling time. This is a relevant question that hints at the missing control of variables in the experiment. Some preservice physics teachers included this question in their descriptions; however, no separate cluster appeared to capture it. This is a consequence of the instability and scarcity of this observation as represented in the preservice physics teachers' written descriptions. Omitting contents from the clusters is in fact a goal for unsupervised ML approaches that seek to reduce a complex dataset (Jordan & Mitchell, 2015). For the purpose of assessing skills related to attention to classroom events, however, adjustments in the clustering procedure should be made to allow more clusters to emerge, because identifying this student question demonstrates close attention to student thinking and an understanding of the problematic aspects of the teaching situation, and would be considered to correspond to high levels of noticing skills.

Conclusions

Many domains such as physics have embraced ML methods to extract information from unstructured data, e.g., to sift through collider data (unfeasible for humans) to detect outliers (i.e., noise-clustered datapoints), even with the same clustering approach that has been applied in this study (Arpaia et al., 2021). Given the novel potentials to extract information from unstructured data and the increasing availability of such data, science education researchers should critically examine the potentials and challenges of these novel ML-based methods in their research contexts as well. This study showed that a pretrained language model-based clustering approach can be used as an assessment tool to analytically induce what teachers attended to in an observed teaching situation, and it evaluated the potentials of ML for analyzing open-ended responses. We suggest that the applied pretrained language model-based clustering approach can be enhanced by further fine-tuning the pretrained language model weights to science-specific language. This would enable more involved language analytics such as analogical reasoning or synonym detection (Mikolov et al., 2013). It has been shown that pretrained language models capture some knowledge about quantities (e.g., the magnitude of the weight of a prototypical dog) and some knowledge graphs about entities (e.g., "Bob Dylan is a songwriter") (Wang et al., 2020; Zhang et al., 2020). In fact, representing natural language in vector spaces can enable novel research approaches to answer research questions in science education research (Sherin, 2013). Once such pretrained language models are trained and publicly available, advanced analytics of written descriptions will be enabled. The presented clustering approach could then be applied as a recommender tool to automatically provide feedback to teachers on which events and contents they addressed and which they failed to attend to.
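As a minimal illustration of the recommender idea, the following sketch compares the set of clusters addressed in one teacher's description against the full set of event clusters and reports the missing ones. It assumes that new sentences can be assigned to the existing clusters (e.g., via an approximate prediction step); the cluster names in the usage comment are hypothetical.

```python
# Sketch of the suggested feedback use: report which event clusters a teacher's
# description did not address. Assigning new sentences to existing clusters is
# assumed to be available; cluster names below are illustrative.
def missing_events(addressed_clusters, all_clusters, cluster_names):
    """Return human-readable names of event clusters not addressed in a text."""
    missed = sorted(set(all_clusters) - set(addressed_clusters))
    return [cluster_names.get(c, f"cluster {c}") for c in missed]


# Hypothetical usage:
# missing_events({0, 1, 2, 4}, range(14),
#                {2: "vacuum tube experiment", 12: "definition of free fall"})
```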

The pretrained language models enabled an informed, contextualized representation of the language data through embeddings. Representing language data through embeddings will also enable researchers to map language to other modalities such as graphical/visual data or mathematical expressions (see Krstovski & Blei, 2018). Using multiple representations and translating between them has been considered a constitutive feature of scientific literacy (Brookes & Etkina, 2009). However, it will be necessary to develop theoretically grounded ontologies and epistemologies of what preservice science teachers can observe and how they reason about it (Brookes & Etkina, 2009). Once pretrained language models are developed and ontologies and epistemologies can guide analyses, the presented clustering approach in conjunction with these models can help to make analyses more comparable, scalable, and robust.

With the help of the clustering approach in this study, quantitative hypotheses on text composition could be explored. For example, we suspect that preservice physics teachers include both general and specific language statements in their written descriptions, devote little space to any particular cluster, and compose their texts in the chronological order of the appearance of the events. Writing a sentence that can be classified into a specific cluster, to a certain extent, predisposes the teachers to move through the cluster embedding space in certain directions, and noticing certain events predisposes them to also include temporally related events. These hypotheses need to be tested more systematically, because they can enhance the assessment of noticing-related cognitive mechanisms such as careful observation and attention to classroom events. We even wonder to what extent the teachers' trajectories through the embedding space can be captured by more physics-involved concepts such as movement through a potential, where equations of motion and conservation laws determine the teachers' writing. We are not aware that these hypotheses have been tested. ML-based methods will enable these analyses.

In line with the argument put forth by Singer (2019), we encourage science education researchers to adopt more observational studies that are grounded in data science, assessment, and measurement (Singer, 2019). Insights in physics today also come from simulation studies and observational (non-manipulable) experiments. The 2021 Nobel Prize awarded for work on complex systems' behavior and insights in astrophysics is testimony to this. We believe that science education researchers can gain novel insights into studied phenomena through ML-based, computational approaches such as the one presented in this study, where an unstructured body of textual data was analyzed. Zhai et al. (2020) and Lamb et al. (2021) argued that ML-based computational models can capture the complexity of cognitive processes and "revolutionize" science assessment. We concur with these arguments and emphasize the necessity to develop an understanding in the science education research community of unsupervised ML approaches and pretrained language models in particular, given the preponderance of observational data that is available in educational contexts. Unsupervised ML methods thus have great potential to bridge the gap between quantitative and qualitative methods in science education. Pretrained language models, more particularly, capture human-like semantics as measured through implicit association tests and thus represent cognitive structures of humans (Caliskan et al., 2017). Hence, pretrained language models are arguably the most promising candidates to model language-based processes. Given that, in our case, the ML-based approach scaled seamlessly (neither human annotations nor preprocessing of the textual data was necessary to extract clusters) and is publicly available to researchers, it would be desirable to increase efforts to share data and models in order to make the most of the available resources.