Since the 1970s, increasing attention to the evaluation of teachers’ generic competencies has resulted in the development of a variety of observation instruments that are widely used in elementary and secondary education, such as the Stallings Observation System (Stallings and Kaskowitz 1974), the Framework for Teaching (Danielson 1996), the International System for Teacher Observation and Feedback (Teddlie et al. 2006), the International Comparative Analysis of Learning and Teaching (Van de Grift 2007), and the Classroom Assessment Scoring System (Pianta et al. 2008). Other instruments used to examine teacher behavior include, for example, teachers’ self-reports, (semi-structured) interviews, and student questionnaires (e.g., Kyriakides et al. 2002; Kyriakides 2008; Maulana et al. 2015; Muijs 2006). However, despite its labor-intensive nature, classroom observation is viewed as a less biased form of data collection for examining teacher behavior (Pianta and Hamre 2009; Wragg 1994).

The development and implementation of observation instruments can be very useful in more effectively shaping teacher education and professional development programs and in evaluating classroom-based interventions (e.g., Darling-Hammond 2012; Lavigne and Good 2015; O’Leary 2014; Yoder and Symons 2010). However, most of these instruments focus on teachers’ generic competencies rather than teachers’ subject-specific competencies. Therefore, scholars such as Grossman and McDonald (2008), Desimone (2009), and Schoenfeld (2013) emphasized the importance of adding subject-specific observation instruments to research on teaching and teacher education.

Although some recently developed observation instruments focus on more specific teacher competencies, such as classroom talk (Mercer 2010), project-based learning (Stearns et al. 2012), and the reform of learning and instruction (Sawada et al. 2002), only a few observation instruments focus on teachers’ subject-specific strategies, such as English reading (Gersten et al. 2005), content- and language-integrated learning (De Graaff et al. 2007), English language arts (Grossman et al. 2010), and mathematics instruction (Hill et al. 2012; Matsumura et al. 2008; Schoenfeld 2013).

To date, however, there are no validated and reliable observation instruments that evaluate secondary-school history teaching. This is unfortunate, especially because, as noted by Bain and Mirel (2006), Grant and Gradwell (2009), and Achinstein and Fogo (2015), current teacher education and professional development programs may not equip history teachers to achieve the aims set by history curricula. Observation instruments that evaluate history teachers’ subject-specific strategies could identify history teachers’ specific needs and thus further improve teacher education and professional development programs for history teachers.

Van Hover et al. (2012) attempted to construct a validated observation instrument to evaluate secondary-school history teaching. Their Protocol for Assessing the Teaching of History (PATH) is promising, but information about the measure’s reliability is lacking. In contrast to PATH, the observation instrument that we developed focuses on a single but highly important history teacher competency: promoting students’ ability to perform historical contextualization. Historical contextualization is considered an important component of historical thinking and reasoning and is incorporated into history curricula worldwide (Lévesque 2008; Seixas and Morton 2013; Van Drie and Van Boxtel 2008). In previous research, we examined how students performed on a historical contextualization task and found that secondary-school students of different ages experience difficulties in performing such tasks (Huijgen et al. 2014).

We therefore need greater insight into how history teachers promote students’ ability to perform historical contextualization in classrooms. The purpose of the present study is to construct a reliable high-inference observation instrument and scoring design to assess history teachers’ competency in promoting historical contextualization in classrooms. In this study, we first present the theoretical framework and our research questions. Then, we present our methodology and results. Finally, we discuss our findings and present the practical implications of the results and directions for future research.

Theoretical framework

Teaching historical reasoning competencies

Scholars and other educational professionals widely agree that secondary-school history education should involve more than the simple learning of facts (e.g., Lévesque 2008; Van Drie and Van Boxtel 2008; Wineburg 2001). Therefore, historical reasoning competencies, such as determining causality, investigating sources, asking rich historical questions, and performing historical contextualization, have become increasingly important in Western history education over the last two decades (Erdmann and Hasberg 2011; Seixas and Morton 2013). Some scholars also stress the importance of historical reasoning competencies for promoting students’ democratic citizenship (e.g., Barton 2012; Saye and Brush 2004). To develop these competencies, students in history classes must be involved in engaging learning tasks and activities (Levstik and Tyson 2008; Gerwin and Visone 2006; Grant and Gradwell 2010), and history lessons should extend beyond factual recall to achieve deep subject understanding (Bransford et al. 2000; Darling-Hammond et al. 2009).

However, both novice and experienced history teachers seem to struggle when they are asked to develop engaging learning tasks and teach students historical reasoning competencies (Monte-Sano 2011; Van Hover and Yeager 2004; VanSledright 2010; Virta 2002). Many history lessons might, therefore, have a strong focus on historical content knowledge (Saye and Social Studies Inquiry Research Collaborative (SSIRC) 2013; VanSledright 2011).

Observing history education

To explore the challenges and problems that history teachers face, qualitative research studies have been conducted (e.g., Bain and Mirel 2006; Fogo 2014; Monte-Sano and Cochran 2009; Virta 2007). However, few quantitative research studies using standardized instruments have been conducted to explore history teachers’ competencies (Adler 2008; Ritter 2012). To our knowledge, only two studies have used observation instruments to examine how teachers actually teach historical content knowledge and historical reasoning competencies. The use of standardized observation instruments in research on history education is thus an underexamined topic, as Van Hover et al. (2012) noted:

While the field of history education elucidates a clear and ambitious vision of high-quality history instruction, a current challenge for history educators (including teacher educators, curriculum specialists, and school-based history and social science supervisors) becomes how to illuminate and capture this when observing classrooms to research history instruction or to provide useful discipline-specific feedback to preservice (and inservice) history teachers (p. 604).

Nokes (2010) used an observation instrument and focused on history teachers’ literacy-related decisions about the types of texts they used and how students were taught to learn with these texts. Eight secondary-school history teachers were observed over a 3-week period using two frequency-count observation instruments, one to record the types of texts and one to record teachers’ activities and instruction; however, detailed information about the instruments’ validity and (inter-rater) reliability is lacking. The other study was conducted by Van Hover et al. (2012), the only researchers who attempted to construct a subject-specific observation instrument, called the PATH, with the goal of evaluating and improving history instruction. PATH has the same structure as Pianta and Hamre’s (2009) Classroom Assessment Scoring System-Secondary (CLASS-S) and consists of six dimensions: (1) lesson components, (2) comprehension, (3) narrative, (4) interpretation, (5) sources, and (6) historical practices. Each dimension includes indicators and behavioral markers that are scored “high,” “middle,” or “low” by observers. The authors tested the inter-rater reliability of PATH and found positive indicators, but detailed information about the instrument’s validity and reliability is lacking.

Historical contextualization: a conceptualization

Rather than constructing an observation instrument for all historical reasoning competencies, we focus on how history teachers promote historical contextualization in classrooms. This focus provides us with the opportunity to spend sufficient time on item development and to test whether it is possible to observe history teachers’ subject-specific strategies using an observation instrument. We chose historical contextualization because it is considered a key competency of historical reasoning (Davies 2010; Lévesque 2008; Seixas and Morton 2013) and is, therefore, included in the formal history curricula of many countries, such as Australia, Belgium, Canada, Finland, Germany, the Netherlands, Spain, and the UK (Huijgen et al. 2014).

In history education, it is possible to contextualize historical sources and phenomena, including persons, events, and developments (Havekes et al. 2012). Historical contextualization is the ability to situate a historical phenomenon or person in a temporal, spatial, and social context to describe, explain, compare, or evaluate it (Van Boxtel and Van Drie 2012). Wineburg and Fournier (1994) defined historical contextualization as building a context of circumstances or facts that surround a particular historical phenomenon to render it more intelligible. Endacott and Brooks (2013) viewed historical contextualization as “a temporal sense of difference that includes deep understanding of the social, political, and cultural norms of the time period under investigation as well as knowledge of the events leading up to the historical situation and other relevant events that are happening concurrently” (p. 43). Historical events and historical agents’ decisions must be placed in the specific socio-spatial and socio-temporal locations in which they emerged. For example, students must know that in ancient Roman times, Julius Caesar could not have had breakfast in Rome and dinner in the Gaul region of France on the same day because the transportation modes needed for such a trip were not available (Lévesque 2008).

Teachers’ strategies for promoting historical contextualization

Research has been conducted to conceptualize the instructional practices that effective teachers employ to promote historical contextualization in classrooms (e.g., Doppen 2000; Rantala 2011; Van Boxtel and Van Drie 2012). To teach historical reasoning competencies such as historical contextualization, teachers must not only possess expert levels of subject content knowledge but also activate students to acquire knowledge and help them apply this knowledge to gain different historical reasoning competencies (Haydn et al. 2015). Additionally, Hattie’s (2008) meta-analysis indicated that effective teachers activate student learning, and other meta-analyses on effective teaching seem to confirm this finding (e.g., Kyriakides et al. 2013; Seidel and Shavelson 2007). Exposure to information alone is not sufficient for students to gain deep subject-specific understanding and historical reasoning competencies. Based on research that focused on historical contextualization, we identified four main teaching strategies for promoting historical contextualization in classrooms: (1) reconstructing the historical context, (2) fostering historical empathy, (3) performing historical contextualization to explain the past, and (4) raising awareness of present-oriented perspectives when examining the past.

First, the historical context of a phenomenon must be reconstructed to perform historical contextualization. Foster (1999) argued that students must possess historical context knowledge, including knowledge about chronology, before they can perform historical contextualization. Reisman and Wineburg (2008) also stressed the importance of background knowledge for the performance of historical contextualization. To reconstruct the historical context, students and teachers can use different frames of reference, such as a chronological, spatial, or social frame of reference (e.g., De Keyser and Vandepitte 1998; Pontecorvo and Girardet 1993; Van Boxtel and Van Drie 2012). The chronological frame includes knowledge of time and period, significant events, and developments (Dawson 2009; Wilschut 2012). The spatial frame focuses on knowledge of (geographical) locations and scale (Havekes et al. 2012). The social frame includes not only knowledge of human behavior and the social conditions of life but also knowledge of socio-economic, socio-cultural, and socio-political developments (Van Boxtel and Van Drie 2004).

To reconstruct the historical context, teachers and students should explore the different frames of reference. For example, in previous research, we found that most students who used and combined different types of knowledge (e.g., chronological, spatial, economic, political, and cultural knowledge) obtained higher scores on a historical contextualization task than students who used a single type of knowledge (Huijgen et al. in press). Teachers could use different sources to build these frames of reference, such as movies (Marcus 2005; Metzger 2012), written documents, objects, and images (Fasulo et al. 1998; Van Drie and Van Boxtel 2008).

Second, although some scholars claim that historical empathy is idealistic and can never be fully achieved because most historical agents are dead (e.g., Kitson et al. 2011; Riley 1998; Wineburg 2001), most scholars agree that historical empathy could promote historical contextualization (e.g., Cunningham 2012; Davis 2001; Endacott and Brooks 2013; Lee and Ashby 2001; Skolnick et al. 2004). Historical empathy focuses on empathizing with people in the past based on historical knowledge that explains their actions. Colby (2008) noted that the primary purpose of historical empathy is to enable students to transcend the boundaries of presentism by developing a rich understanding of the past from multiple viewpoints. In history lessons, teachers could focus on a historical agent to gain insight into the views and values of people who lived in the past (e.g., Foster 1999; Wooden 2008) or discuss historical agents’ decisions with a group of students (Kohlmeier 2006). Teachers could also foster historical empathy by promoting the formation of affective connections with the historical agent based on students’ own similar yet different life experiences (Endacott and Pelekanos 2015; Kitson et al. 2011) and focusing on understanding historical agents’ prior knowledge and positions (Berti et al. 2009; Hartmann and Hasselhorn 2008; Huijgen et al. 2014).

Third, students should be able to explain the past based on their historical context knowledge (Lévesque 2008; Seixas and Morton 2013; Wineburg 2001). For example, students must explain why the Great Depression of 1929 spread to Europe or the differences between governance in ancient Greece and governance in the Middle Ages. To answer such historical questions, students must link the Great Depression and the different types of governance to their historical context (Seixas 2006). Furthermore, the successful performance of different historical reasoning competencies, such as identifying indirect and direct causes (Stoel et al. 2015), understanding change and continuity (Haydn et al. 2015), reasoning with historical sources (Reisman and Wineburg 2008), and asking historical questions (Logtenberg et al. 2011), requires an analysis of the broader historical context. Teachers should, therefore, create opportunities for students to practice these competencies with these types of questions. Hallden (1997) suggested that teachers should focus their instruction on the relationship between historical factual details (lower-level context) and large historical developments (larger context). Kosso (2009) also noted that “Individual events and actions are understood by being situated in the larger context. However, the larger context is understood by being built of individual events. It is a hermeneutic circle and perhaps the only way to understand other people” (p. 24). Presenting and evaluating historical phenomena from different perspectives is also considered an effective approach (e.g., Ciardiello 2012; Levstik 1997; McCully 2012; Stradling 2003). For example, to understand and explain the Cuban Missile Crisis of 1962, students should examine this phenomenon from not only a capitalist Western perspective but also a communist Soviet perspective.

Finally, teachers should raise awareness of students’ present-oriented perspective and the consequences of this perspective when examining the past (Barton and Levstik 2004; Huijgen et al. 2014; Wineburg 2001). Students must know that the past differs from the present (Seixas and Peck 2004); however, social psychology research illustrates that young students especially find it very difficult to take another person’s perspective, particularly when that other person does not have the same knowledge that the students have (Bloom and German 2000; Wellman et al. 2001). This inability could cause problems in history education, as students must be aware that much of the information that they know was not available to people in the past. Students’ present-oriented thinking, or presentism, is considered one of the main reasons why they fail to achieve historical contextualization; it could cause misconceptions among students, leading them to reach incorrect conclusions about historical phenomena (Lee and Ashby 2001; Huijgen et al. 2014; VanSledright and Afflerbach 2000).

Although we can never be perfectly non-presentist (e.g., Pendry and Husbands 2000; Wineburg 2001), teachers should foster students’ awareness of their own contemporary values and beliefs and the consequences of this perspective when explaining the past. To achieve this goal, teachers could present the past as a source of tension for students (e.g., Savenije et al. 2014; Seixas and Morton 2013), present conflicting historical sources (Ashby 2004), avoid presenting the past as progress (Wilschut 2012), and promote intellectual conflict regarding historical phenomena that might be difficult for students to understand and explain (Foster 2001; Huijgen and Holthuis 2015). Furthermore, to prevent students from viewing the past from a present-oriented perspective, teachers should explicitly model or scaffold how historical contextualization can be performed successfully, for example, by providing learning strategies. Explicit teaching of domain-specific strategies, such as how to perform historical contextualization, could promote students’ ability to explain historical events (Stoel et al. 2015). Reisman and Wineburg (2008) stressed the importance of explicitly providing students with illustrations of contextualized thinking, for example, videos in which professional historians model their contextualization processes.

Research questions

A subject-specific observation instrument could provide insight into the instructional strategies and methods that history teachers employ to promote students’ ability to perform historical contextualization. Therefore, we aimed to construct a reliable subject-specific observation instrument and scoring design that measures how history teachers promote historical contextualization in classrooms. To address this central aim, we specify the following three research questions:

  1. What is the observation instrument’s dimensionality when used to observe how history teachers promote historical contextualization?

  2. What are the reliability outcomes when the observation instrument is used to observe how history teachers promote historical contextualization?

  3. How many lessons and observers are necessary to establish a reliable and optimal scoring design?

Method

Structure of the observation instrument

To design and construct our observation instrument, we used the guidelines described by Colton and Covert (2007), which focus on the development of valid and reliable instruments in the social sciences. Our instrument can be characterized as a high-inference observation instrument. In contrast to low-inference instruments (such as time sampling and time logs), high-inference instruments require observers to make more qualitative judgments (Chávez 1984). However, such instruments are more susceptible to subjectivity; therefore, thorough inter-rater reliability procedures are necessary.

We modeled our instrument on Van de Grift’s (2007, 2009) International Comparative Analysis of Learning and Teaching (ICALT) observation instrument. We chose this instrument’s format because it also seeks to observe teachers’ professional strategies and calculate scores based on these strategies. Similar to the ICALT instrument, our instrument utilizes a four-point Likert scale to score the items. In our instrument, scores 1 and 2 represent a negative verdict, while scores 3 and 4 represent a positive verdict. Score 1 should be used only if teachers do not use a particular strategy in their lessons.

Formulating and refining the items

Based on the four main strategies identified in our theoretical framework (reconstructing the historical context, fostering historical empathy, performing historical contextualization to explain the past, and raising awareness of a present-oriented perspective) and a review of the literature on teaching historical contextualization, we formulated observable items to assess classroom teachers’ behavior with regard to historical contextualization. Furthermore, during two national teacher professionalization conferences, we asked 25 history teachers (after an introduction to the concept of historical contextualization) to each formulate 20 items that assess classroom teachers’ behavior with regard to historical contextualization. Combining these items with the items that we formulated resulted in a total of 121 items.

Meta-analyses on effective teaching illustrate that promoting different types of interactions in classrooms (i.e., student-student interactions and teacher-student interactions) could promote student learning (Kyriakides et al. 2013; Seidel and Shavelson 2007). Therefore, we formulated three items (“the teacher asks evaluative questions,” “the teacher uses classroom discussion,” and “the teacher uses group work”), focusing on more generic teacher strategies and different (social) interactions in the classrooms. We included these three generic items because history education research shows that these types of interaction could promote historical reasoning competencies (e.g., Brooks 2008; Van Drie et al. 2006; Van Drie and Van Boxtel 2008; Stoel et al. 2015). Therefore, the total list included 124 items.

By excluding duplicate items and items that might be very difficult to evaluate, we shortened the list to 82 items. For example, we first included individual items for all time indicators (e.g., year, period, and century), but we then incorporated these items into one item, “the teacher gives time indicators.” Another example is that we excluded items focusing on the specific economic, political, and social circumstances (e.g., form of government, welfare, scientific knowledge, wars, and laws) of historical phenomena. Because these specific circumstances are difficult to observe in one history lesson, we included only items such as “appoints political/governance characteristics at the time of phenomena” and “appoints social-cultural characteristics at the time of phenomena.” This method might result in a less nuanced image of a lesson, but we preferred to develop an instrument that allows us to observe all behavior indicators in a single history lesson.

Next, we organized two expert panel discussions to further shorten the list of 82 items and ensure the instrument’s face and content validity. The first panel discussion was held with two history teacher educators and seven secondary-school history teachers. The second panel discussion was held with one history teacher educator and four secondary-school history teachers. All experts had more than 7 years of work experience. The experts were asked to (1) remove unnecessary items that did not measure history teachers’ competency in promoting historical contextualization, (2) remove redundant items that covered the same teacher behavior, (3) reformulate unclear items, and (4) formulate new items that they thought were missing. In total, the experts excluded 24 items, reformulated 12 items, and created no new items, resulting in a list of 58 items.

Subsequently, we trained ten student history teachers on the use of the observation instrument, and they observed one videotaped history lesson using the instrument. We calculated Cronbach’s alpha (jury alpha) for their observation scores to explore the instrument’s internal consistency. This jury alpha was 0.58 (poor internal consistency). After deleting ten items that threatened internal consistency, the jury alpha increased to 0.81 (good internal consistency). Examples of the deleted items are “appoints relations between historical phenomena,” “uses substantive concepts when explaining historical phenomena,” and “uses general schemas to explain historical phenomena.” We asked the experts in the first panel session to determine whether the ten deleted items could jeopardize the instrument’s face and content validity; they found no threats.
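To illustrate the computation behind such a jury alpha, the sketch below computes Cronbach’s alpha on a simulated score matrix. This is only a minimal sketch with hypothetical data, not the authors’ actual calculation; in particular, the orientation assumed here (the 58 instrument items as cases and the ten observers as the parallel “raters”) is one plausible reading of the jury-alpha procedure. Dropping the items with the weakest agreement and recomputing alpha would mirror the deletion step described above.

```python
import numpy as np

def cronbach_alpha(data: np.ndarray) -> float:
    """Cronbach's alpha for a (cases x raters) score matrix.

    alpha = k/(k-1) * (1 - sum of column variances / variance of row totals)
    """
    k = data.shape[1]                           # number of raters (columns)
    col_vars = data.var(axis=0, ddof=1)         # variance of each rater across cases
    total_var = data.sum(axis=1).var(ddof=1)    # variance of the case totals
    return k / (k - 1) * (1.0 - col_vars.sum() / total_var)

# Hypothetical scores: 58 instrument items (rows) scored by 10 student teachers
# (columns) on the instrument's four-point scale.
rng = np.random.default_rng(42)
signal = rng.integers(1, 5, size=(58, 1))       # shared per-item signal
noise = rng.integers(-1, 2, size=(58, 10))      # rater disagreement
scores = np.clip(signal + noise, 1, 4).astype(float)
print(f"jury alpha ~ {cronbach_alpha(scores):.2f}")
```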

The same experts were also asked to observe three videotaped history lessons taught by three different history teachers using the 48 items. After discussing each lesson, three items (“explains the importance of placing phenomena in a chronological framework,” “explains the importance of placing phenomena in a spatial framework,” and “explains the importance of viewing phenomena from different dimensions”) led to strong disagreement among the experts; thus, we deleted these items. This resulted in a total list of 45 items in the first version of the Framework for Analyzing the Teaching of Historical Contextualization (FAT-HC).

Research design

Following Hill et al. (2012), we adopted generalizability theory to explore the instrument’s dimensionality and to determine its reliability (Brennan 2001; Cronbach et al. 1972; Shavelson and Webb 1991). Compared to classical test theory, generalizability theory is more informative and useful in educational settings because classical test theory considers only one source of measurement error at a time. Additionally, classical test theory does not provide specific information on how many forms, items, occasions, or observers are required (Shavelson et al. 1989). A generalizability study (G-study) can accommodate any observational situation and is restricted only by the practical limitations of data collection and software (Lei et al. 2007). A G-study views a behavioral measurement (for example, an observed score) as a sample from a universe of admissible observations. Each aspect (called a facet) of the measurement procedure is considered a possible source of error. A G-study provides estimates of the variance contributed by persons, observers, occasions of measurement, and each of the possible interactions between these facets. Generalizability theory distinguishes a decision study (D-study) from a G-study. A D-study uses information from a G-study to construct a scoring design that minimizes error for a particular purpose (Shavelson and Webb 1991). In addition to a G-study, a D-study can identify the optimal data collection for a desired score reliability (Hill et al. 2012).
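To make this concrete, in the fully crossed teacher × lesson × observer design used later in this study, the standard generalizability-theory model (cf. Brennan 2001; Shavelson and Webb 1991) decomposes each observed score, and the total variance, into components for every facet and interaction:

```latex
% Observed score in a fully crossed t (teacher) x l (lesson) x o (observer) design
X_{tlo} = \mu + \nu_t + \nu_l + \nu_o + \nu_{tl} + \nu_{to} + \nu_{lo} + \nu_{tlo,e}

% With one score per cell, the three-way interaction is confounded with
% residual error, and the total variance decomposes as
\sigma^2(X_{tlo}) = \sigma^2_t + \sigma^2_l + \sigma^2_o
                  + \sigma^2_{tl} + \sigma^2_{to} + \sigma^2_{lo} + \sigma^2_{tlo,e}
```

A G-study estimates these components from the data; a D-study then examines how the error terms shrink when scores are averaged over multiple lessons and observers.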

Table 1 Teachers’ characteristics

Sample and data collection

Non-probability sampling was used to select five teachers to observe and five observers (see Tables 1 and 2 for the teachers’ and observers’ characteristics). For comparison, in the Netherlands, the average age of male teachers is 46 years and that of female teachers is 42 years, and the gender distribution of teachers is 48 % female and 52 % male. In total, there are 1785 history teachers with a master’s degree and 3944 history teachers with a bachelor’s degree working in the Netherlands (Dutch Ministry of Education 2011). The teachers in the sample worked at different schools, and these schools did not differ significantly from the total population with regard to student enrollment, location (rural or urban), or graduation rate (Statistics Netherlands 2014). The national students’ mean score on the formal history exam for general secondary education and pre-university education was 6.35 on a ten-point scale.

Table 2 Observers’ characteristics
Table 3 Variance decomposition for the item level

We videotaped two different lessons for each teacher (n = 5), and all lessons were taught in the two highest tracks of secondary education in the Dutch educational system. We observed only lessons for upper secondary-school students in these two tracks because the Dutch formal exam program considers the ability to perform historical contextualization an important aim for these students (Dutch Ministry of Education 2015). A total of 267 students, with a mean age of 16.2 years (SD = 0.7), were involved. The mean duration of the analyzed lessons was 39 min (SD = 2.4 min). Each observer individually evaluated the ten videotaped lessons using the developed observation instrument, yielding a total of 50 observations.

Training observers to use the instrument

All observers received 4 h of training. In this training, we used three videotaped history lessons taught by three history teachers (one female teacher with more than 15 years of work experience, one male teacher with 4 years of work experience, and one male teacher with more than 25 years of work experience) from three different schools as training materials. One lesson was about the ancient Roman period, one was about the Middle Ages, and one was about the Second World War. These three lessons were not used in our data analyses. The observers received an explanation of the 45 items and evaluated the videotaped lessons using a training version of the observation instrument that included more in-depth explanations of the items. After the observers observed each videotaped lesson, their results were discussed, and some items were clarified by the trainers to minimize inter-rater bias.

Data analysis

To explore the instrument’s dimensionality, we conducted a G-study at the item level with seven facets in a crossed design. To estimate the reliability of our instrument and produce a composite of scores with maximum generalizability, we conducted a new G-study and employed multivariate generalizability using a “t × l × o” design, where t represents the observed history teachers, l represents the number of observed lessons, and o represents the number of observers. To determine the optimal number of observers and lessons needed in a scoring design to achieve acceptable reliability, we conducted a D-study using the information from this G-study.
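As an illustration of this estimation step, the following sketch computes variance components for a fully crossed t × l × o design using the standard ANOVA expected-mean-squares solution. The paper does not name the software used, so this is a minimal sketch of the technique on simulated data, not the authors’ analysis; the function name and the simulated scores are ours.

```python
import numpy as np

def g_study(x: np.ndarray) -> dict:
    """Estimate variance components for a fully crossed t x l x o G-study.

    x has shape (teachers, lessons, observers), one mean item score per cell.
    Uses the ANOVA expected-mean-squares solution for a fully random model;
    negative estimates are truncated to zero, as is conventional.
    """
    nt, nl, no = x.shape
    grand = x.mean()
    mt = x.mean(axis=(1, 2))   # teacher means
    ml = x.mean(axis=(0, 2))   # lesson means
    mo = x.mean(axis=(0, 1))   # observer means
    mtl, mto, mlo = x.mean(axis=2), x.mean(axis=1), x.mean(axis=0)

    ss_t = nl * no * ((mt - grand) ** 2).sum()
    ss_l = nt * no * ((ml - grand) ** 2).sum()
    ss_o = nt * nl * ((mo - grand) ** 2).sum()
    ss_tl = no * ((mtl - mt[:, None] - ml[None, :] + grand) ** 2).sum()
    ss_to = nl * ((mto - mt[:, None] - mo[None, :] + grand) ** 2).sum()
    ss_lo = nt * ((mlo - ml[:, None] - mo[None, :] + grand) ** 2).sum()
    ss_res = ((x - grand) ** 2).sum() - ss_t - ss_l - ss_o - ss_tl - ss_to - ss_lo

    ms = {
        "t": ss_t / (nt - 1), "l": ss_l / (nl - 1), "o": ss_o / (no - 1),
        "tl": ss_tl / ((nt - 1) * (nl - 1)),
        "to": ss_to / ((nt - 1) * (no - 1)),
        "lo": ss_lo / ((nl - 1) * (no - 1)),
        "res": ss_res / ((nt - 1) * (nl - 1) * (no - 1)),
    }
    var = {
        "res": ms["res"],
        "tl": max((ms["tl"] - ms["res"]) / no, 0.0),
        "to": max((ms["to"] - ms["res"]) / nl, 0.0),
        "lo": max((ms["lo"] - ms["res"]) / nt, 0.0),
        "t": max((ms["t"] - ms["tl"] - ms["to"] + ms["res"]) / (nl * no), 0.0),
        "l": max((ms["l"] - ms["tl"] - ms["lo"] + ms["res"]) / (nt * no), 0.0),
        "o": max((ms["o"] - ms["to"] - ms["lo"] + ms["res"]) / (nt * nl), 0.0),
    }
    return var

# Hypothetical data matching the study's layout: 5 teachers x 2 lessons x 5 observers.
rng = np.random.default_rng(1)
scores = 2.5 + rng.normal(0, 0.5, size=(5, 1, 1)) + rng.normal(0, 0.3, size=(5, 2, 5))
components = g_study(scores)
total = sum(components.values())
for facet, v in components.items():
    print(f"{facet:>3}: {100 * v / total:5.1f} % of variance")
```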

Results

The instrument’s dimensionality

Based on our theoretical framework, we consider our instrument to be one-dimensional because all items should measure teachers’ ability to promote historical contextualization. The first data analysis indicated that five items (“the teacher asks evaluative questions,” “the teacher uses classroom discussion,” “the teacher uses group work,” “the teacher compares phenomena with the present,” and “the students compare phenomena with the present”) displayed a low correlation (<0.30) with the other items. These five items also had standard deviations above 1.00 and were excluded from further data analysis, resulting in a total list of 40 items in the final version of the FAT-HC observation instrument (see Appendix A).
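A screening of this kind can be reproduced with a simple correlation and dispersion check over the 50 observations. The sketch below is a hypothetical illustration only: the paper does not state whether item-total or item-rest correlations were used, and the data here are simulated.

```python
import numpy as np

def item_rest_correlations(scores: np.ndarray) -> np.ndarray:
    """Correlation of each item with the mean of the remaining items.

    scores: (observations x items) matrix, e.g., 50 observations x 45 items.
    """
    n_items = scores.shape[1]
    corrs = np.empty(n_items)
    for j in range(n_items):
        rest = np.delete(scores, j, axis=1).mean(axis=1)  # mean of other items
        corrs[j] = np.corrcoef(scores[:, j], rest)[0, 1]
    return corrs

# Hypothetical 50 x 45 matrix (5 observers x 10 lessons, 45 items, 1-4 scale).
rng = np.random.default_rng(7)
scores = np.clip(np.round(rng.normal(2.5, 0.8, size=(50, 45))), 1, 4)
low_corr = np.where(item_rest_correlations(scores) < 0.30)[0]
high_sd = np.where(scores.std(axis=0, ddof=1) > 1.00)[0]
print("items below the 0.30 correlation threshold:", low_corr)
print("items with SD above 1.00:", high_sd)
```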

To further explore the instrument’s dimensionality, we conducted a G-study at the item level with seven facets in a crossed design using the collected data of the five observers who each evaluated two lessons taught by five teachers (50 observations in total). If our instrument is, in fact, one-dimensional, the item facet should explain the main part of the overall variance, and the other facets (including the interaction effects) should explain a lesser part of the variance (e.g., Brennan 2001; Shavelson and Webb 1991). As shown in Table 3, the item facet accounted for the largest share of the variance (47.25 %), indicating that our instrument is one-dimensional with regard to observing how history teachers promote historical contextualization in classrooms.

The instrument’s reliability

To determine the reliability of our instrument, a new G-study was conducted using the same data set (50 observations). The analysis was conducted on the final version of our observation instrument, which consisted of 40 items (see Appendix A). Table 4 displays the results of this G-study and presents the variance decomposition to assess the instrument’s reliability. A reliable instrument should have a high proportion of the variance explained by differences between the observed teachers and a low proportion of the variance explained by lessons and observers.

Table 4 Variance decomposition for the observation instrument

The difference between the observed teachers accounted for 59.12 % of the variance, the difference between the observers accounted for 4.58 %, and the difference between the lessons accounted for 1.63 %. The residual was 34.67 %. These results show that the influence of the observers and lessons was very low, indicating that the observers and lessons can be considered interchangeable and that the observers understood the observation items. Interaction effects between the different facets (observers * lessons, observers * teachers, and teachers * lessons) were also calculated and did not display any variance, indicating small differences between the observers’ observations of the different teachers and lessons.

The optimal reliable scoring design

To identify the optimal number of observers and lessons needed for a reliable scoring design, we conducted a D-study based on the results of our G-study, which estimated the instrument’s reliability. Because we are interested in the absolute level of an individual’s performance independent of others’ performance, we calculated the index of dependability (Φ) to identify the optimal number of observers (Shavelson and Webb 1991). The Φ should be ≥0.7 for research purposes, ≥0.8 for formative evaluations, and ≥0.9 for summative evaluations (Brennan and Kane 1977).

The results of our D-study can be found in Fig. 1. A scoring design with one observer evaluating one lesson taught by a teacher yields a Φ of 0.59 (poor reliability), and this value increases to Φ = 0.72 when one observer evaluates two lessons taught by the same teacher. Because we are interested in research purposes and formative evaluations, the optimal scoring design would use two observers who each evaluate two different lessons taught by the same teacher (Φ = 0.83) or three observers who each evaluate the same lesson taught by a teacher (Φ = 0.80).
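These coefficients follow directly from the variance components reported in Table 4. A short sketch of the D-study calculation (using the standard absolute-error formula for Φ, with the interaction components set to zero as reported above) reproduces the reported values:

```python
# D-study: index of dependability (Phi) for a design with n_l lessons and
# n_o observers, from the Table 4 variance components (in % of total variance;
# the interaction components were estimated at zero and are omitted).
var_t, var_l, var_o, var_res = 59.12, 1.63, 4.58, 34.67

def phi(n_l: int, n_o: int) -> float:
    # Absolute error variance shrinks as scores are averaged over lessons/observers.
    abs_error = var_l / n_l + var_o / n_o + var_res / (n_l * n_o)
    return var_t / (var_t + abs_error)

for n_l, n_o in [(1, 1), (2, 1), (2, 2), (1, 3)]:
    print(f"{n_l} lesson(s) x {n_o} observer(s): Phi = {phi(n_l, n_o):.2f}")
# Prints 0.59, 0.72, 0.83, and 0.80, matching the reported coefficients.
```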

Fig. 1 Results of the D-study

Conclusion and discussion

The aim of the present study was to develop a reliable observation instrument and scoring design to assess how history teachers promote historical contextualization in classrooms. This study resulted in the FAT-HC observation instrument. Using expert panels, we found positive indicators of the instrument’s content validity. Furthermore, generalizability theory analysis provided indicators that the instrument is one-dimensional when used to evaluate how history teachers promote historical contextualization. Generalizability theory analysis also showed that a large proportion of the instrument’s variance was explained by differences between the observed teachers and a small proportion by differences in lessons and observers, which demonstrates the instrument’s reliability (Brennan 2001; Hill et al. 2012; Shavelson and Webb 1991). Our D-study showed that, for research purposes, a reliable scoring design in which one observer evaluates two lessons taught by the same teacher is the most efficient option. For formative teacher evaluations, a scoring design in which two observers each evaluate two lessons or three observers each evaluate one lesson is most effective.

Van Hover et al. (2012, p. 604) noted that instruments that provide “useful discipline-specific feedback to preservice (and inservice) history teachers” are lacking. Additionally, Darling-Hammond et al. (2012) emphasized that most current teacher evaluation programs do little to help teachers improve their teaching. The FAT-HC instrument could provide insight into teachers’ subject-specific needs, making it a valuable addition to existing generic observation instruments (Grossman and McDonald 2008). For example, if a teacher obtains low scores on the instrument, attention could be devoted to specific items of the instrument in teacher education or professional development programs. Pre-observation and post-observation interviews could also be structured based on the instrument’s items, resulting in more concrete feedback for the observed teacher.

The instrument could also help researchers examine the instructional strategies and methods that teachers employ to promote historical contextualization in classrooms. In the history education literature, there is a clear view of high-quality teaching and learning of history; however, research instruments that capture this view when observing history teachers at work do not exist (Van Hover et al. 2012). Furthermore, our instrument could be used to gain more insight into the association between history teachers’ instructional strategies and student achievement. Do teachers who activate their students to reconstruct a historical context promote students’ historical understanding better than teachers who do not? The instrument could also be used to evaluate intervention studies, for example, to examine the effects of training teachers in the strategies incorporated into the observation instrument.

In addition to functioning as a research and feedback instrument, the instrument could be used as a framework by teachers who want to reshape and improve their instruction on historical contextualization. Slavin (1996) noted that teachers who explicitly model and scaffold their instruction contribute to their students’ academic success. The instrument’s strategies and items could provide direction for designing meaningful learning tasks and scaffolds for students. This is important, especially because, as noted by Grant and Gradwell (2010), many history teachers focus on recalling factual knowledge despite the fact that the teaching and learning of history includes far more activities, such as investigating sources and evaluating the past (VanSledright 2008). Bain and Mirel (2006) and SSIRC (2013), therefore, argued that instruction models are needed that help teachers learn how to promote students’ ability to perform historical contextualization or other historical reasoning competencies. In a post-observation interview, one of our observed teachers noted that he now uses the instrument as a checklist when designing his lessons. Prior to the study, he tended to overlook the spatial context of historical phenomena; now, he structurally includes the geographical context in his lessons when reconstructing the historical context of phenomena.

Despite the positive indicators of the instrument’s reliability, some limitations must be acknowledged. We used a research design with only five observers and five teachers, who participated voluntarily and, thus, might be more eager to learn (Desimone 2009; Desimone et al. 2006). More observers, teachers, and lessons (cf. Hill et al. 2012) are needed to provide greater insight into the instrument’s dimensionality, reliability, and optimal scoring design. Including teachers and observers with more varied backgrounds (e.g., differences in gender, student performance, age, and educational qualification) might also provide useful insights to further strengthen the instrument and scoring design.

Furthermore, when examining the instrument’s reliability, nearly 35 % of the variance (the residual) could not be explained by teacher, observer, or lesson variance. Future research and analyses must be conducted to decrease the residual variance and achieve greater reliability. The observers also noted that it is difficult to evaluate 40 items when observing one history lesson. Because the observation instrument must be practical and suitable for observing a single lesson, more research is needed to decrease the number of items while maintaining good reliability. A larger G-study including a D-study that focuses on how many items are necessary to achieve reliability could provide these insights (Brennan 2001). We also used videotaped lessons. Although videotaped lessons have many benefits and are widely used for constructing and validating observation instruments (e.g., Yoder and Symons 2010), they differ from live classroom observations. Future research should therefore include live observations to assess possible differences in the instrument’s reliability between live and videotaped lessons; live video classroom observations (e.g., Liang 2015) could also be an interesting method for examining such differences.

To further assess the instrument’s construct validity, intervention studies with a quasi-experimental design and pre- and post-tests are needed to further test the framework’s efficacy in promoting historical contextualization. The use of other methods to assess teacher factors, such as student questionnaires and teachers’ self-reports on historical contextualization, could also provide important insights into the instrument’s construct validity (e.g., Kyriakides 2008; Muijs 2006). Additionally, Rasch modeling could provide information on the instrument’s reliability and on which item behaviors history teachers find more difficult to perform and which they find easier (e.g., Fischer and Molenaar 1995; Maulana et al. 2014; Van de Grift et al. 2014).

In conclusion, Ball and Forzani (2009) noted that current teacher education programs are often centered on teachers’ beliefs and knowledge and argued that such programs should mainly focus on the tasks and activities of teaching. They concluded that far more research is needed to gain insight into the tasks and activities of teaching across different subjects. We hope that our instrument can contribute to further insights into teachers’ subject-specific activities for the teaching and learning of historical contextualization. Our instrument is not designed to assess history teachers; rather, it should function as a tool to improve history instruction. Marriott (2001) noted that “Teachers seldom have a clear idea about their strengths and weaknesses. This is often because they have not been systematically observed and constructively debriefed” (p. 6). History teachers could observe each other using the instrument, discuss their lessons and findings, and collaboratively design new lessons with the instrument as a framework, which might result in a giant step forward in the teaching and learning of history.