Adapting a Reflection Model for Written Reflections
In analyzing their scaffolding system for reflective journals, Lai and Calandra (2010) found that novice preservice teachers appreciated the given structure for their written reflection in the form of a reflection-supporting sequence model. Similarly, in this study, a reflection-supporting model was developed to help structure preservice physics teachers’ written reflections. Korthagen and Kessels (1999) proposed a reflection-supporting model in the context of university-based teacher education that outlines the process of reflection in five sequential stages in a spiral model, where the last step of a former cycle comprises the first step of a subsequent cycle. According to this model, teachers start with recollecting their set objectives and recall their actual experiences. Afterwards, they outline essential aspects of the experience, reconstruct the meaning of the experience for their professional development, and identify larger influences that might have been relevant for the teaching enactment. Then, teachers generate alternatives for their actions with advantages and disadvantages for these alternatives. Finally, they derive consequences for their future teaching and eventually enact these action plans.
This rather broad model was adapted to the specific context of writing a reflection after a teaching enactment. This model was chosen as the theoretical framework because it is grounded in experiential learning theory and outlines essential steps to abstract from personal experience for the use of one’s own professional development. The adapted model was particularly tailored to writing. In this line, a written reflection entails functional zones (as introduced in text linguistics by Swales (1990)) that each contributes to the purpose of the text. These functional zones pertain to elements of reflection that were gleaned from the model by Korthagen and Kessels (1999). The functional zones related to:
-
1.
Circumstances of the taught lesson (CircumstancesFootnote 1). Here, the preservice teachers were instructed to provide details on the setting (time, place), class composition (e.g., amount of students), and their set learning goals.
-
2.
Description of a teaching situation in the lesson that the teachers wanted to reflect on (Description). Here, teachers were prompted to write about their actions and the students’ actions.
-
3.
Evaluation on how the teachers liked or disliked the teaching situation, and why (Evaluation). Here, teachers were held to judge their own performance and the students’ performance.
-
4.
Alternatives of action for their chosen activities (Alternatives). In this step, teachers should consider actions that they could have done differently to improve the outcomes.
-
5.
Consequences they draw for their own further professional development (Consequences). In the final step, preservice teachers were encouraged to relate their reflection of the situation to their future actions.
We submit that these functional zones provide a scaffold for the process of written reflection. From argumentation theory, it is expected that valid conclusions (i.e., consequences for personal professional development in teacher education) rely on evidence (Toulmin 2003), which in the case of teaching situations should be students’ responses to teaching actions (Clarke and Hollingsworth 2002). In the written reflections, preservice teachers are supposed to generate evidence through recall of essential environmental features (Circumstances) and the objective description of the teaching situation. The objective description of the teaching situation is a difficult task in and of itself given that perception is theory-laden so that structuring and external feedback are indispensable (Korthagen and Kessels 1999). It is furthermore important for teachers to abstract from the personal realm and integrate theory into their reflection (Mena-Marcos et al. 2013). This should be facilitated through evaluation of the teaching situation, generation of alternative modes of action, and outlining personal consequences.
From an empirical stance, the functional zones are constitutive of reflective writing as well. In particular, Hatton and Smith (1995) found that most of the labelled segments in their analyzed written reflections were descriptive and thus corresponded with the description phase in our model—eventually rendering Description (and also Circumstances) the most prevalent element in preservice teachers’ written reflections (Poldner et al. 2014, also). Furthermore, Mena-Marcos et al. (2013) found that preservice teachers often included appraisals into their written reflections. This suggests that evaluation is another prevalent element in the reflections. Expert teachers with regard to deliberate reflection particularly used the descriptive phase to establish the context for exploring alternative explanations, which corresponds with the Alternative step in our model (Hatton and Smith 1995). Consequences are a particularly expert-like element in preservice teachers’ written reflections, given that (genuine) thinking about consequences requires an openness for changing one’s behavior in the future. Thinking about consequences should be trained because the goal of reflection is personal professional development (von Aufschnaiter et al. 2019; Poldner et al. 2014). Also, Ullmann (2019) found evidence that “future intention” (a category that essentially encapsulates Consequences) is well present in written reflections (though not assessed for the population of teachers) and could be accurately classified. With regard to distinctiveness of the five elements, it has been shown that elements like the ones in our model could be well distinguished (Ullmann 2019). What has been observed though was that Description and Evaluation are difficult to keep apart for novice, preservice teachers (Kost 2019).
Furthermore, the functional zones could be related to teachers’ professional knowledge. Hence, the teachers’ knowledge bases of content knowledge (CK), pedagogical knowledge (PK), and pedagogical content knowledge (PCK) were added to the model as another dimension (Nowak et al. 2018). Finally, the written reflection could be differentiated by the extent to which reasoning, e.g., justification, for an element of the reflection was present or not (see Fig. 1). However, the professional knowledge and the amount of reasoning present in the reflections were not a concern of this study, but rather only the contents of the functional zones (Circumstances, Description, Evaluation, Alternatives, and Consequences), which are called elements of reflection in the remainder.
The goal of the computer-based classifier was to accurately label segments as one of the elements of reflection. The elements of reflection seemed to be a reasonable choice for a classifier because these elements were not too specific, which would typically cause human-computer agreement to decrease (Nehm et al. 2012). Our prediction is that the classifier will be able to label segments in a written reflection according to these five elements.
Collecting Written Reflections in Preservice Teacher Education
Preservice physics teachers at a mid-sized German university were introduced to the reflection model during their final 15-week-long school placement. We decided to build the classifier upon preservice teachers reflections compared with expert teachers written reflections because the classifier was meant to be applied in teacher training programs. It was anticipated that the degree of concreteness of writing by preservice teachers would be different from expert teachers so that a classifier that was built upon expert teachers’ written reflections might fail to perform well on preservice teachers’ written reflections.
The preservice physics teachers were introduced to the reflection-supporting model and the five reflection elements in a 1.5-h seminar session. In this seminar session, preservice teachers learned the reflection elements and wrote a sample reflection based on an observed video recording of an authentic lesson. They were instructed to write their later reflections according to this reflection model. The contents of their written reflections were the teaching enactments that the preservice teachers made. Each preservice teacher handed in approximately four written reflections throughout the school placement. Writing about teaching enactments is a standard method in assessment of preservice teachers’ reflections because it allows for careful thinking, rethinking, and reworking of ideas, and yields a product externalized from the person that is open for observation, discussion, and further development (Poldner et al. 2014). A writing assignment can be considered an open-ended, constructed response format. Constructed response formats, such as teachers’ writing, were argued to be a viable method to assess reflective thinking (Poldner et al. 2014). Compared with selected-response formats like multiple-choice questions, writing can engage preservice teachers in a more ecologically valid and natural way of thinking and reasoning (Mislevy et al. 2002; Poldner et al. 2014).
Written reflections were collected from N = 17 preservice physics teachers (see Fig. 2). All preservice teachers were in their final school placement and were recruited from two subsequent, independent semesters. In the course of this school placement, teachers were given multiple opportunities to hand in written reflections on their teaching enactments and receive expert feedback. Overall, N = 81 written reflections were collected.
To assess the predictive performance of the computer-based classifier, a held-out test data set was gathered in the same type of school placement 2 years later, until after building the computer-based classifier was finished. Classification of the held-out test data set and calculating agreement with human raters would serve as a means to evaluate the problem of overfitting of the model to the training dataset, given that multiple classifiers and feature configurations were considered. Instructions for the reflection-supporting model and the five elements of reflection were the same in this cohort compared with the other cohort. This cohort comprised N = 12 preservice physics teachers who were not related to the former cohort. N = 12 written reflections on the teaching enactments of the preservice teachers were collected.
In both cohorts, preservice teachers handed in their written reflections after going through a process of writing and editing these texts. The preservice teachers received a brief feedback by their mentors on breadth and depth for some of their written reflections.
Next, all the written reflections were labelled according to the elements of the reflection model (see Fig. 2). When teachers provided information, such as dates or meaningless information such that this was their second written reflection, these segments were labelled “irrelevant.” They were removed from further analyses because only a fraction (approximately 4%) of the segments was affected. Human interrater agreement for labelling the segments was used to assess to what extent teachers’ written reflections adhered to the reflection model and how reliably the elements could be identified by the human raters. Two independent human raters each labelled one written reflection of N = 8 different teachers (approximately 10% of the train and validation data sets, similar to Poldner et al. (2014)) twice. Each human rater read the entire written reflection. While reading, they identified and labelled segments of the text according to the reflection elements of the model. A segment of the text was defined as a test fragment that pertained to one of the reflection elements and addressed a similar topic, such as making a case for the functioning of the group work during class. The human raters reached an agreement, as assessed through Cohen’s κ, of κ = 0.74, which can be considered substantial (Krippendorff 2019) and comparable with similar human coding tasks in research on written reflections (Poldner et al. 2014). We also considered a sentence-level coding with a subsample of the dataset, where agreement remained substantial with a noticeable decrease in agreement, κ = 0.64.
To evaluate where disagreements between the human raters occurred most often, a confusion matrix was created (see Table 1). Disagreements occurred most often with regard to Description, Circumstances, and Evaluation. Eleven percent of segments were misspecified by the two human raters with regard to these three elements. Alternatives and Consequences were never confused with Circumstances or Description, but occasionally with Evaluation.
Table 1 Confusion matrix for human interrater agreement After interrater agreement was assured, one human rater went through all written reflections and identified and labelled segments. This resulted in the following amount of segments per label: Circumstances, 759; Description, 476; Evaluation, 392; Alternatives, 192; Consequences, 147. The mean (SD) coding unit length was 2.0 (1.8) sentences, ranging from 1 to 29 sentences, with the longest being an extended description of a teaching situation. The mean (SD) number of words in a segment was 27 (29).
Building a Computer-Based Classifier for Written Reflections
Data Transformation and Feature Selection
To build the computer-based classifier, the written reflections have to be subdivided into segments which are transformed into features (predictor variables) that will allow the classifier to predict probabilities for each of the reflection elements for this segment. The choice of features is critical for classification accuracy and generalizability to new data (Jurafsky and Martin 2014). In this study, the used words in a segment were anticipated to be an important feature for representing the segments given that Description likely elicits process verbs like “conduct (an experiment),” Evaluation might be characterized by sentimental words, such as “good” or “bad,” and Alternatives requires writing in conditional mode (Ullmann 2019). Consequently, word count was used as a feature to represent a segment. Word count accounts for the words that are present in a segment through occurrence-based encoding in a segment-term matrix that is used to classify the segment (Blake 2011). In the segment-term matrix, rows represent segments and columns represent the words in the vocabulary. The cells are filled by the number of occurrences for a certain word in a given segment. This model ignores word order in the segments (bag-of-words assumption). On a conceptual level, word count combines multiple types of analyses, such as lexicon/topic (which terms are used?) or syntax/discourse (which terms co-occur in the same segment?) (Chodorow and Burstein 2004). Word count is often used as a starting point for model building, such that a classifier can be improved by comparing with other features, e.g., length of segment or lemmatized/stemmed word forms (Ullmann 2019).
To further assess performance sensitivity of the classifier to other feature configurations, further feature engineering was done in a later research question. In addition to the word count feature (see above), the following features were applied to all segments irrespective of category and were compared with each other:
-
1.
Meaningless feature (baseline model): To get an idea to what extent a meaningless feature could accurately classify the segments, a baseline model was implemented with the feature of normalized (by segment length) vowel positions. Normalized vowel positions are not expected to relate to any category in a meaningful way so that the algorithm had no meaningful input for classification.
-
2.
Word count feature: In this feature configuration (see above) the occurring words were encoded through a segment-term matrix. Punctuation and non-character symbols were removed from the documents.
-
3.
Some words feature: In addition to the word count configuration from above, common words (called stopwords) were removed from the segments to reduce redundant and irrelevant information. Furthermore, words were included in a lemmatized form. Lemmatization transformed words into their base form, e.g., “was” will be mapped to “be.” However, modality was kept for verbs because this was considered an important feature for expressing alternatives or consequences, i.e., ideas in hypothetical form. Feature engineering of this kind lowers the dependency on idiosyncratic features of certain segments; however, it also removes potentially useful information, such as tense.
-
4.
Doc2Vec feature: Finally, an advanced algorithm called Doc2Vec (Mikolov et al. 2013; Rehurek and Sojka 2010; Le and Mikolov 2014) was utilized to represent segments. Doc2vec is an unsupervised machine-learning algorithm that utilizes neural networks. In this algorithm, two complex models (continuous bag-of-words [CBOW], and skip-gram) are combined to retrieve word-vectors from the weight matrix after training a simple neural network in the CBOW and skip-gram steps. The word-vectors encode the context in which the respective word appears (CBOW) and predict the context based on the word (skip-gram). Alongside words, a special label for each document (segment) is also fed to the neural network, so that document-vectors that represent segments can also be retrieved from the weight matrix. This vector was used in the present study as a feature of the segment. Two particularly important hyperparameters for the Doc2Vec feature are window size and embedding dimensionality. Window size refers either to the size of the context (before and after the word) that is used to represent a word in the CBOW step, or to the size of context that a word predicts in the skip-gram step. Embedding dimensionality refers to the dimensionality of the space to which words and segments are mapped (Le and Mikolov 2014). With the (exploratory) purpose of the present study of testing the applicability of this algorithm to represent segments from written reflections, a (standard) window size of 15 was used alongside an embedding dimensionality of 300, which is considered a common configuration of hyperparameters for this model (Jurafsky and Martin 2014). Note that other configurations likely result in different values for performance metrics.
Classifier Algorithm
In the present study, supervised ML techniques were utilized in order to build the classifier. In the build phase of the computer-based classifier, segments were transformed into feature vectors and multiple classifier algorithms were fit with the goal of finding the model with the most performant combination of features and classifier (Bengfort et al. 2018). To find this model, the sample of N = 81 written reflections was split into training and validation datasets. To assess predictive performance of the most performant model for unseen data, it was finally fit to the held-out test dataset that consisted of N = 12 written reflections that were collected after the model building phase.
Classifier performance metrics included precision, recall, and F1-score for each category. Precision is the amount of correctly labelled positive data (true positives) for a category, given the amount of positive labelled data by the machine (true positives + false positives), and recall is the proportion of the number of items correctly labelled positive (true positives) given the number of items which should have been labelled positive (true positives + false negatives) (Jurafsky and Martin 2014). Due to the importance of both precision and recall, the harmonic mean of precision and recall is calculated, which is called F1-score. The harmonic mean is more conservative compared with the arithmetic mean, and gives stronger weight to the lower of both values, i.e., it is shifted towards it. When F1-score is reported for entire models, the macro average and weighted average F1-scores across categories were used. Macro average refers to the average for precision, recall, and F1-score over the categories. Weighted average additionally accounts for the support for each category, i.e., number of segments for each category.
As a means to choose a classifier algorithm (see Fig. 3), multiple established classifiers were trained (Ullmann 2019) and used to predict the labels in the validation data set with default hyperparameter configuration (Bengfort et al. 2018). This approach indicates if some algorithms were more applicable to classify the segments. The choice for one classifier for further analyses can be justified on the basis of these results. Four classifier algorithms were implemented: Decision Tree Classifier, multinomial logistic regression, Multinomial Naïve Bayes, and Stochastic Gradient Classifier. The Decision Tree Classifier creates a hierarchical model to classify data. The input features (words) are used in a rule-based manner to predict the labels of the segments (Breiman et al. 1984). Decision Tree Classifier has beneficial attributes, such as explainable decisions or flexibility with input data. However, it is prone to generate overcomplex trees with the problem of overfitting. Multinomial logistic regression is a versatile model to build a classifier where weights are trained based on optimization methods to maximize a loss function (Jurafsky and Martin 2014). A softmax function is used to calculate the probabilities of any of the five categories. Multinomial logistic regression is specifically useful because it balances redundancy in predictor variables and has convenient optimization properties (convex function, i.e., it has one global minimum) (Jurafsky and Martin 2014). The Multinomial Naïve Bayes classifier builds on Bayes’ rule where the posterior probability for a category of a segment is calculated through the prior probability of the label across all segments and the likelihood that, given the category, the features (e.g., words) occur. Yet, it is often infeasible to compute the evidence for any combination of words in a given segment so that the simplifying bag-of-words (or naïve) assumption is made, namely that the classification is unaffected if a certain word occurs in 1st or any other position (Jurafsky and Martin 2014). Multinomial Naïve Bayes has the particular advantage to perform well for small datasets but has less optimal characteristics when features are correlated. Finally, Stochastic Gradient Classifier is a classifier for discriminative learning where an objective function is iteratively optimized. Stochastic Gradient Classifier optimizes the evaluation of the loss function for linear classifiers, e.g., logistic regression.
To fit the classifiers, the programming language Python 3 (Python Software Foundation), the scikit-learn library (Pedregosa et al. 2011), and gensim (Rehurek and Sojka 2010) were used. Furthermore, the libraries spaCy (version 2.2.3) and nltk (Bird et al. 2009) were used to pre-process segments and extract features.