Possible biases in observation systems when applied across contexts: conceptualizing, operationalizing, and sequencing instructional quality

Capturing and measuring instructional patterns by using standardized observation manuals has become increasingly popular in classroom research. While researchers argue that a common vocabulary of teaching is necessary for the field of classroom research to move forward, instructional features vary across classrooms and contexts, which poses serious measurement challenges. In this article, we argue that potential biases embedded in observation systems have to be identified and addressed in order for interpretations of results across different classrooms and contexts to be valid and relevant. We identify three aspects of possible systematic biases (related to the grain size of conceptualization, operationalization, and sequencing of lessons) and how these may influence ratings of instructional quality when an established observation system (the Protocol for Language Arts Teaching Observations [PLATO]) is applied in the contexts of Nordic mathematics classrooms. We discuss implications of such possible biases and make suggestions for how they may be addressed.


Introduction
A growing number of studies demonstrate that a teacher's instruction has a substantial effect on student achievement (Baumert et al., 2010; Kersting et al., 2012; Lipowsky et al., 2009), and around the world, different measures of instructional quality are used in attempts to understand and improve teaching. The ambition of measuring instructional quality in systematic and reliable ways has led to the development of a range of research-based tools to observe teaching. Such standardized observation systems (Hill et al., 2012; Liu et al., 2019) are more than just a "sheet of paper with rubrics and checklists detailing specific scales" (Liu et al., 2019, p. 64); rather, they are integrated systems of measurement aligned with training procedures and scoring specifications (Hill et al., 2012). An increased interest in observation systems as lenses into features of quality instruction, as well as an emphasis on comparative research (Alexander et al., 2000; Clarke & Xu, 2018; Stigler & Hiebert, 1999; van de Grift, 2014), has caused observation systems to travel. For example, the quality of teacher-child interactions in Norway (Vattøy & Gamlem, 2019), Finland (Virtanen et al., 2017), and Chile (Bruns et al., 2016) is assessed using the Classroom Assessment Scoring System (CLASS) manual developed in Virginia, USA. Likewise, the Teaching and Learning International Survey (TALIS) video study developed a common observation system for studying features of mathematics instruction in rather different contexts, such as Chile, Japan, Mexico, Germany, Colombia, England, Shanghai, and Spain (Pons, 2018). However, how contextual factors may impact rating is rarely accounted for in the design of standardized observation systems, which implicitly assumes that instructional quality can be measured in a consistent, fair, and valid way across classrooms and contexts (Cohen & Goldhaber, 2016b; Liu et al., 2019).
In addition, the main sources of literature and empirical research on instructional quality stem from the USA and Central Europe, which therefore have been most influential in framing, conceptualizing, and providing measures of instructional quality (Fischer et al., 2019; Teddlie et al., 2006). Hence, there is a call for more studies focusing on possible theoretical and methodological biases in observational instruments (Grissom & Youngs, 2016), especially related to how the classroom context influences ratings. There are many obvious advantages to using predefined and standardized observation instruments (Klette, 2015; Praetorius & Charalambous, 2018; Huber & Skedsmo, 2016), but the aim of the present article is to investigate how standardized conceptualizations, operationalizations, and sequencing of instructional quality may produce biases when applied in different classroom contexts. We apply the Protocol for Language Arts Teaching Observations (PLATO) in mathematics classrooms in a Nordic context as an illustrative case, while arguing that such biases are relevant independent of national context. Our goal is to (a) contribute to the emerging methodological and conceptual discussion on ways to study, understand, and measure instructional quality; (b) assess limits of transferability of a specific observation system (the PLATO instrument); and (c) contribute to the developing literature on how contextual factors produce possible biases in measurements of instructional quality. This is highly relevant for researchers and practitioners alike, as observation systems are spreading rapidly, with great implications for our understanding of what counts as high-quality instruction.

Defining context and bias
As underscored by Bell and colleagues, when you measure instruction, you also capture context (Bell et al., 2012). Thus, when assessing features of instructional quality, we need to discuss how the classroom context influences scoring with a specific system. Context is an ambiguous term that may refer to numerous aspects of instruction (subject, time of year, student demographics, etc.) and operate on different levels, from micro (e.g., classroom context) to macro (e.g., national context). For the purpose of the present paper, context refers primarily to the micro level: how mathematics instruction is organized and played out in the individual classroom (e.g., activity formats and instructional practices). However, as instruction tends to have similar patterns within national contexts (Hiebert et al., 2003), aspects of instruction on a micro level-for example, how teachers organize lesson activities-also have implications for how results in comparative cross-cultural studies are interpreted on a macro level. Therefore, context sensitivity in observation systems here refers to the extent to which the operationalization of instructional features is broad enough to capture somewhat differently enacted practices of the same construct.
Bias is a generic term covering all kinds of factors that threaten different aspects of validity in research. In this paper, we follow Kane's (2013) argument-based approach to validity, where validity is treated as a property of the argument, i.e., the degree to which inferences can be made about characteristics of classroom instruction from the scores produced by an observation system. In cross-cultural studies, researchers may differentiate between construct bias, method bias, and item bias (van de Vijver & Leung, 1997; van de Vijver & Tanzer, 2004). Construct bias occurs when the measured construct is not identical across the studied groups. For observation systems, this could mean that a feature of instruction is conceptualized differently in an observation manual than how the same feature is enacted in the observed classrooms. For example, if differentiation is defined as teachers giving tasks at different levels to students, while in the classroom teachers give differentiated assistance to students working on the same task, the observation system may not capture the intended construct. Method bias arises from the characteristics of the instrument or how the instrument is administered (van de Vijver & Leung, 1997; van de Vijver & Tanzer, 2004). This includes sample bias, which can occur in comparative classroom studies due to differences in how curricula and teaching units are structured across samples from different contexts (Praetorius et al., 2019). For example, if one classroom is in the middle of a student presentation week during data collection and this class is engaged in high-quality discourse, while another classroom is filmed during a test-preparation week where students mostly work individually, the samples might be incomparable in terms of making statements about quality of discourse. Another instance of method bias is administration bias, related to differences in language or culture between participants and researchers (van de Vijver & Tanzer, 2004).
Regarding observation systems, such a bias may occur if raters are not familiar with the language, norms, or culture in the classrooms they rate, which may lead to inaccurate interpretations of features of instruction. Item bias refers to distortions on the item level (van de Vijver & Tanzer, 2004). In observation systems, an item may be seen as one of many operationalized indicators of an overarching construct, often called a domain, of instructional quality. For instance, a domain called teacher responsiveness to student academic progress might be operationalized on the item level to include an indicator of teachers following up on homework. However, homework may not be relevant or even appropriate in some classroom contexts, in which case this particular item would be biased.
Taken together, if any such biases (construct, method, or item) result in observation systems systematically privileging or disadvantaging specific instructional practices and providing an inaccurate picture of teachers' instruction in certain contexts, they may be considered biased for that context (Cohen & Goldhaber, 2016a). Following Kane (2013), in such a scenario, interpretations of quality instruction based on scores derived by observation systems alone have low validity.

Standardized observation systems
All observation systems prioritize certain features of instruction and exclude others, embodying a certain community's view of instructional quality. They are often designed and validated for specific instructional purposes, such as capturing subject-specific instruction-for example, in mathematics (e.g., Mathematical Quality of Instruction [MQI] by Hill et al., 2008), science education (e.g., Inquiring into Science Instruction Observation Protocol [ISIOP] by Minner & DeLisi, 2012), or language arts (e.g., PLATO by Grossman, 2015). In addition, some systems are generic, designed to capture instruction across subjects (e.g., CLASS by Pianta et al., 2008) or tailored to curriculum standards (e.g., the US Common Core standards in Danielson, 2013). While interest in using standardized observation systems among classroom researchers is increasing, the majority of classroom observation studies use non-standardized and informal instruments (Bostic et al., 2019; Stuhlman et al., 2010), often in the form of field notes and other inductive, in situ descriptions, arguing that the instrument should be sensitive to their specific research questions and ambitions. This may be called a "bottom-up" approach to conceptualizing instructional quality, building the definition from the data. Such informal and non-standardized measurements of instruction may be useful for local purposes and for identifying features in classrooms that would otherwise be overlooked by a standardized manual. However, they make it difficult to systematically capture and analyze patterns of instruction across multiple classrooms or aggregate consistent knowledge across studies, and they may represent researchers' idiosyncratic understandings of quality instruction rather than an empirically driven, systematic understanding.
Standardized observation systems have significant methodological benefits in general and in video studies in particular, as they facilitate a clear categorization and standardization of the massive amount of data that video studies tend to generate. Standardized observation systems are "top-down" approaches, consisting of pre-determined conceptualizations of instructional quality that serve as lenses for understanding instruction in classrooms (Klette, 2015; Praetorius & Charalambous, 2018). Broadly scoped, standardized observation instruments are useful because they can provide similar data across a broad range of contexts, which supports the goal of integrating and accumulating knowledge (Klette & Blikstad-Balas, 2018). There is, however, a fundamental trade-off in this goal, because obtaining consistent data across a range of contexts necessarily means losing context sensitivity (Knoblauch & Schnettler, 2012; Snell, 2011), which may conceal important local aspects of instruction. Bell et al. (2019) and Liu et al. (2019) identify scoring specifications, rater quality procedures, and sampling specifications as three critical areas when evaluating the properties of an observation system. Scoring specifications define which constructs will be attended to and how these will be operationalized and codified, as well as the scales assigned for scoring (e.g., present/not present, a three-point or seven-point criterion-referenced scale). Rater quality procedures refer to how raters are trained to ensure that they are able to use the scales and rubrics in a precise and reliable way. Sampling specifications address how the observation sample relates to the larger universe of teaching and instruction, as well as sequencing procedures for scoring. This includes whether lessons should be observed during specific parts of the school year or the whole school year, as well as the time segments of observations (every 7 min, every 15 min, the whole lesson, etc.).
Biases attached to rater quality are discussed by others (e.g., Bell et al., 2012;Liu et al., 2019), while for the purpose of the present article, we concentrate on scoring specifications (conceptualization and operationalization of scales) and sampling specifications (time sequencing of lessons) when discussing possible inherent biases in the PLATO observation system.

Scoring specifications: conceptualization and operationalization of instructional quality
A common international theory of instructional quality does not exist; rather, there are a number of different conceptualizations developed in particular cultural settings and influenced by, for example, prioritized educational and teaching goals and specific subjects. Despite differences in the views of learning underpinning different systems, scholars seem to agree upon some key features that are critical when measuring aspects of quality instruction (see, for example, Klette, 2015; Kunter et al., 2007; Nilsen & Gustafsson, 2016). These dimensions, or domains, include instructional clarity (clear learning goals, explicit instruction), cognitive activation (cognitive challenge, quality of task, content coverage), discourse features (teacher-student interaction, student participation in content-related talk), and supportive climate (managing behavior and time in the classroom, creating an environment of respect and trust). While agreement is taking shape on an overarching level, Charalambous and Praetorius (2018), who systematically investigated 12 different observation systems, point out that different systems name the same constructs using different terms and name different constructs using the same term. This in turn underscores how construct bias may occur when a system is applied across settings where the construct is understood differently (van de Vijver & Leung, 1997). In addition, even scholars using the same instrument may interpret the dimensions differently, partly due to unclear foci and poorly refined descriptions of operationalization (Schlesinger & Jentsch, 2016), which challenges the ambition of observation systems as a means to accumulate knowledge across studies. Operationalizing conceptualizations of instructional quality into observable behavior involves choices of grain size (e.g., how discrete/targeted the practices to be measured are).
The overarching construct of instructional quality includes several domains; in PLATO, these are the following four: instructional scaffolding, disciplinary demand, representation and use of content, and behavioral management. Each domain is operationalized into indicators, or elements. For example, in PLATO, the domain instructional scaffolding is operationalized into four elements: quality of a teacher's feedback, strategy use and instruction, modeling, and accommodation for student learning (Grossman, 2015). These elements are further decomposed into rating scales. On such a scale, the quality of instructional practices is operationalized and decomposed into an even narrower grain size (e.g., the type and frequency of feedback that constitute different points on the scale). Some observation systems, like PLATO, operate with a scale of 1-4, while others have a scale of 1-7. Other systems only have conceptual codes describing a practice as present or not present. As discussed, item bias may be introduced either at the level of elements or at the level of scales-for example, if the operationalization does not correspond to instructional norms in a certain context (recall the aforementioned example of homework).
Classroom observation systems capture teachers' work in action, while instructional practices in a particular classroom are always influenced by situational factors unknown to or uncontrollable by the researcher (Kennedy, 2010). These factors include the cultural context and its norms associated with teaching (Bishop, 1988), teacher beliefs and values (Leder et al., 2003), student composition (Steinberg & Garrett, 2016), and expectations and support from school leaders (Kraft & Papay, 2014). There are also other contextual factors shaping instruction, such as teachers' learning goals for specific lessons (Hiebert & Grouws, 2007) and the activity structure of lessons (Luoto et al., 2022). The grain size of the operationalization of instruction influences how context sensitive an observation system is and whether the operationalization of instructional quality is broad enough to capture somewhat different practices. For instance, in a CLASS validation study by Virtanen and colleagues (Virtanen et al., 2017) in a Finnish grade-six context, teachers received systematically low ratings on responsiveness to students' academic and social challenges due to frequent use of individual seatwork. Virtanen and colleagues argue that in settings where students are commonly expected to engage in individual seatwork or group work on their own, teachers have few opportunities to actually demonstrate strategies of responsiveness in line with the manual's definition. Low scores may thus indicate unresponsiveness even if there were no student challenges to respond to, and if such scores are not discussed in relation to a contextual understanding of such practices, results may be interpreted as teachers neglecting challenges. In other words, CLASS assumes that more responsiveness helps students learn more than less responsiveness in all situations, while whether this holds across contexts is an empirical question.
If CLASS systematically rates individual seatwork with little teacher response as low quality even when students learn just as much, this would indicate a bias. Similarly, Muijs and colleagues (Muijs et al., 2018), using the International System for Teacher Observation and Feedback (ISTOF) (Teddlie et al., 2006), argue that this measure of instruction, intended to be internationally valid, did not reflect teacher-centered and direct-instruction approaches with enough nuance. Hence, teachers using methods effective for particular goals, such as direct instruction for practicing basic skills, may be rated as low quality by the ISTOF (see Muijs et al., 2018). Correspondingly, Jones (2016) argues that teachers using special education techniques are disadvantaged when their instruction is measured with the constructivist Framework for Teaching (FfT) manual (Danielson, 2013), as teaching practices that may very well be adequate for particular student groups are not considered high-quality instruction by this manual. While scores might accurately reflect instruction according to FfT standards, the interpretation of the scores is biased if FfT is used as a measure of quality instruction. In addition, Gill and colleagues (Gill et al., 2016) studied instruction with five different observation instruments (including PLATO) and found that student characteristics (racial/ethnic minority or lower-achieving students) influenced observation scores in English language arts classrooms for the FfT and CLASS instruments. They hypothesize that this might signal a systematic bias, if the instruction given was indeed appropriate for the needs of specific groups of students. However, they speculate that it may also be the teachers themselves who hold biases and low expectations for minorities and low achievers, which lead them to alter their instruction to be less effective-in which case it is not a bias in the instruments (Gill et al., 2016).
Consequently, whether there is a bias in the conceptualization and operationalization of the actual instrument is not always clear or easy to ascertain.

Sampling specifications in capturing instructional quality
Sampling specifications address how the rated observation samples relate to teaching in general. This includes whether lessons should be observed during specific parts of the school year or throughout the whole school year, as well as the time sequencing of lessons, which will be the focus here. Time sequencing concerns whether to score the whole lesson or divide the lesson into several sections, and different observation systems apply different sampling specifications in this respect (Bell et al., 2012). While some systems score every 15 or 7 minutes (e.g., PLATO and MQI), an alternative is to code "meaningful chunks" (e.g., the Teaching for Robust Understanding [TRU] framework; see Schoenfeld & the Teaching for Robust Understanding Project, 2016)-in other words, the part of a lesson with a certain topic of interest (e.g., only group work). Some observation systems also provide a joint score for the whole lesson-for example, by aggregating ratings into higher-level scores across all observed sequences, which is common in the CLASS manual (Bell et al., 2012). Still others score each element whenever relevant-for example, by rating a teacher's feedback practices only when the teacher is indeed providing feedback on content relevant to the lesson. Sequencing thus determines how many different segments-rating points-there are within a lesson. Scoring individual segments may not give an accurate picture of the lesson or a lesson unit, as activities may last longer than the scored segment and fall into different time segments, which may impact the scoring. However, if the whole lesson is scored, the qualities of different instructional activities within the lesson may be concealed by a common score, which may also produce inaccurate results.
In either case, the way an observation instrument sequences lessons may introduce method bias (van de Vijver & Tanzer, 2004)-for example, if lessons are scored differently because of varying activity structure rather than actual differences in instruction.
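To make the sequencing trade-off concrete, the following sketch contrasts per-segment scoring with a single whole-lesson aggregate. All scores are invented (on a 1-4 scale, as in PLATO), and the functions and aggregation rules are illustrative assumptions rather than part of any published observation system:

```python
# Hypothetical illustration: how time sequencing can change the picture a
# rating gives of the very same lesson. Scores are invented, 1-4 scale.

def segment_scores(minute_scores, segment_length=15):
    """Split per-minute scores into fixed segments; rate each by its peak."""
    segments = [
        minute_scores[i:i + segment_length]
        for i in range(0, len(minute_scores), segment_length)
    ]
    return [max(seg) for seg in segments]

def whole_lesson_score(minute_scores):
    """A single aggregate score for the whole lesson (here: rounded mean)."""
    return round(sum(minute_scores) / len(minute_scores))

# A 45-min lesson where one strong feedback episode (score 4) falls inside
# the middle segment, surrounded by low-evidence seatwork (score 1).
lesson = [1] * 20 + [4] * 5 + [1] * 20

print(segment_scores(lesson))     # -> [1, 4, 1]
print(whole_lesson_score(lesson)) # -> 1
```

Under these assumptions, per-segment scoring surfaces the strong middle episode ([1, 4, 1]) while the whole-lesson aggregate flattens it to a 1; and whether such an episode falls inside one segment or straddles a segment boundary would likewise shift the per-segment profile.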

The present study
Biased or context-insensitive scoring and sampling specifications in standardized observation systems may seem unavoidable, especially when capturing instruction across different classroom contexts. While we agree with Grossman and McDonald (2008) who argue that a common vocabulary of instruction is needed to move the field of classroom research forward, no single observation system alone is likely to capture instruction in a way that is sensitive to the multitude of classroom complexities. Therefore, we also concur with Schlesinger and Jentsch (2016) that connecting different observation systems under a common frame is "promising for obtaining a comprehensive picture of teaching" (p. 355). To reach agreement on common vocabulary and construct common frames, we need to discuss how to identify and address systematic biases in aspects such as scoring specifications (conceptualization and operationalization) and sampling specifications (time sequencing). Researchers argue that valid comparative inferences require psychometric approaches demonstrating cross-country measurement invariance to ensure there are no biases involved in the interpretations of differences across groups (e.g., Fisher et al., 2019). However, since the privileged classroom context and activities embedded in an observation system are often under-communicated (Cohen & Goldhaber, 2016a;Liu et al., 2019), the source of such biases may not be easy to identify without investigating patterns in ratings across several lessons with similar and different activities and lesson structures. We are interested in the inherent assumptions about how instruction is enacted, which may make rubrics relevant or irrelevant for different contexts, and how we can ensure that findings are interpreted in a valid way. 
Therefore, this paper examines how aspects of conceptualization and operationalization, together with sampling specifications, embody possible biases due to how instruction is organized and carried out, illustrated with examples from Finnish and Norwegian mathematics classrooms. The overall research question is as follows: In what way may standardized observation systems of instructional quality produce biases when applied across different contexts? This is followed by three sub-questions:

1. In what way may the conceptualization of instructional quality produce possible biases when applied in different contexts?

PLATO
PLATO was originally developed for language arts (LA) instruction by Pamela Grossman at Stanford University (Grossman, 2015). It was one of the manuals applied in the Measures of Effective Teaching (MET) study (Kane & Staiger, 2012), and it has been used for mathematics instruction (e.g., Cohen, 2013; Luoto et al., 2022; Stovner, 2018), Content and Language Integrated Learning (CLIL) instruction in mathematics and science (Mahan et al., 2018), and the teaching of English as a second language (Brevik, 2019). As mentioned, PLATO comprises four domains: instructional scaffolding, representation of content, disciplinary demand, and behavioral management. These domains are further divided into 12 elements with detailed rubrics operationalizing features of instruction into observable behavior (Grossman, 2015). PLATO includes socio-constructivist, cognitive, acquisitionist, and process-product approaches to learning. The socio-constructivist approach is, for example, evident in the domain disciplinary demand, where PLATO elements favor joint public discourse with teacher moves enabling student contributions. The cognitive approach is apparent within the representation of content domain, which emphasizes practices prompting content explanations that address conceptual understanding and connections between old and new content knowledge. Other elements within this domain also rely on an "acquisition perspective" of learning (Sfard, 1998), assuming that learning occurs when the teacher delivers clear and correct explanations to students. A cognitive approach may also be considered a theoretical underpinning of the domain instructional scaffolding, which focuses on teachers' use of strategy instruction, feedback, and modeling to decompose content and make it available to students, in line with explicit instruction (Archer & Hughes, 2011).
The process-product approach, which links teaching behaviors directly to student learning (e.g., Brophy & Good, 1986), is mainly reflected in the behavioral management domain, as teachers who have dutiful students and who tightly control time are rewarded with high scores. In PLATO, the sequencing grain size of lessons is 15-min segments, and each of these segments is rated separately on each element of instructional quality. The scale is 1-4, where 1 typically means no evidence of a certain feature, 2 means vague evidence, 3 means clear evidence, and 4 means clear and consistent evidence. Video- or text-based instructions, together with detailed rubrics, assist raters in understanding how domains and elements are embodied in classroom practices (Grossman, 2015). Rating with PLATO requires that raters have completed training and passed a certification test with a minimum of 80% agreement with master coders on each element.
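The layered structure described above (domains containing elements, each element scored 1-4 per 15-min segment) can be sketched as a simple data structure. The element lists below are deliberately abbreviated to the elements named in this article (PLATO has 12 elements in total), and the validation helper is our own illustrative construction, not part of the PLATO manual:

```python
# Illustrative sketch of PLATO's rating structure: domains contain elements,
# and every 15-min segment receives a 1-4 score on each rated element.
# Element lists are abbreviated; only elements named in this article appear.

PLATO_DOMAINS = {
    "instructional scaffolding": [
        "feedback",
        "strategy use and instruction",
        "modeling",
        "accommodation for student learning",
    ],
    "disciplinary demand": [
        "intellectual challenge",
        "uptake of student responses",
    ],
    "representation of content": [],  # elements not discussed in this article
    "behavioral management": [],      # elements not discussed in this article
}

SCALE = {
    1: "no evidence",
    2: "vague evidence",
    3: "clear evidence",
    4: "clear and consistent evidence",
}

def check_segment(scores):
    """Validate that one segment's scores use known elements and the 1-4 scale."""
    known = {e for elements in PLATO_DOMAINS.values() for e in elements}
    for element, score in scores.items():
        if element not in known:
            raise ValueError(f"unknown element: {element}")
        if score not in SCALE:
            raise ValueError(f"{element}: score must be 1-4, got {score}")
    return scores

# One 15-min segment's (invented) ratings:
segment = check_segment({"feedback": 3, "intellectual challenge": 2})
```

The point of the sketch is only to show where item bias can enter: each entry in an element list is an operationalized indicator, and a mismatch between any such indicator and local instructional norms biases the scores that segment receives.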

Data sources and sample
The present analyses derive from the use of the PLATO observation system in Norwegian and Finnish mathematics classrooms. For comparative reasons, eight classrooms from the capital area of Norway (Oslo) were matched with eight classrooms from the capital area of Finland (Helsinki) in terms of mathematical content (algebra, numbers, and geometry), socioeconomic status of the school area, and urban/suburban location of the school. Then, 26 Oslo lessons and 21 Helsinki lessons (91 and 71 segments, respectively) were systematically analyzed and compared (Luoto et al., 2022). All lessons were videotaped with two cameras synchronized with two microphones. One camera focused on the teacher and the other on the whole class; the teacher wore one microphone, and a second microphone was placed in the classroom to capture student talk.
From coding these classrooms, we selected three illustrative examples to demonstrate possible biases when applying PLATO to Norwegian and Finnish mathematics classrooms. The three examples address aspects of rating and sampling specifications related to three different elements in the PLATO observation system: intellectual challenge, classroom discourse (the element uptake of student responses), and feedback. In the first example, a possible bias was investigated due to low inter-rater agreement on intellectual challenge during individual seatwork in both Oslo and Helsinki classrooms. In the second example, a possible bias was investigated because Helsinki mathematics classrooms had markedly lower scores on the classroom discourse elements compared to Oslo classrooms. The third example, concerning the element feedback, is related to the time-sequencing scoring procedure within the PLATO manual. This is a recurring issue across several observation systems that illustrates how different lesson structures yield different results depending on whether quality feedback occurs within the same segment or is spread out across segments.

Analyses
This paper is mainly a theoretical and methodological contribution discussing possible biases in observation systems when applied across contexts. The examples we draw on come from our experiences when scoring Oslo and Helsinki lessons (N = 47) with the PLATO observation system for the LISA study. In this prior coding, all lessons were divided into 15-min segments and scored by trained and certified PLATO raters on all 12 elements and activity format (whole-class instruction, individual seatwork, and group work). The analyses for the present paper concentrate on the elements intellectual challenge, uptake of student responses, and feedback, in combination with activity formats. After the initial PLATO coding, aggregated to classrooms and teachers, we identified patterns across classrooms and national contexts that needed further examination: discrepancies between coders and large differences across the two contexts of Oslo and Helsinki. When such "odd patterns" were identified, we examined the underlying reasons for PLATO scores by qualitatively analyzing transcripts from these segments together with scoring justifications and rating procedures. For the element intellectual challenge, there were discrepancies in scores among raters, and to determine why, we qualitatively analyzed the justifications for scoring and discussed the video segments where disagreement was present. For the element uptake of student responses, there were large differences across Oslo and Helsinki classrooms, and to examine why, we analyzed the rating specifications (conceptualization and operationalization) for this specific element along with analyses of transcripts and justifications of instances with uptake. The feedback example did not derive from an odd pattern but was included because it illustrates issues with sequencing lessons. For the analysis of this element, we studied the rubric and the scoring, and how feedback patterns tended to be distributed across lessons.

Limitations
Since PLATO was originally developed for Language Arts (LA) instruction, some issues with scoring and sampling specifications might have to do with a mismatch between the nature of mathematics instruction and PLATO, rather than with construct bias in PLATO's view of instructional quality. However, the reviewed elements of intellectual challenge, uptake of student responses, and feedback are operationalized in the rubrics in a generic way that also resonates with what mathematics education literature proposes as quality instruction (see Stovner, 2018).
The possible biases were analyzed through an iterative process and are not based on a statistical analysis of all possible biases. However, this article discusses issues relevant for observation systems in general (e.g., conceptualization, operationalization, and sequencing), and the examples we chose demonstrate how contextual factors (for example, the common organization of mathematics instruction as individual seatwork in classrooms around the world; Alexander, 2001) may give rise to possible biases when applying observation systems that carry different assumptions about instruction.

Example 1: conceptualization of intellectual challenge during individual seatwork
Intellectual challenge is a PLATO element building on research on cognitive activation (Baumert et al., 2010), describing how a teacher may cognitively activate students through more or less rigorous activities and assignments. For mathematics instruction, lower-level intellectual challenge tasks (scores 1 and 2) are rote tasks, where students are simply asked to apply given procedures to similar procedural tasks, while higher-level intellectual challenge tasks (scores 3 and 4) have students engaging in "high-level thinking," such as reasoning, justifying, and generalizing (e.g., Lipowsky et al., 2009). An additional distinction in the intellectual challenge element is that the score can be adjusted if the teacher reduces or increases the challenge by changing the activity initially presented. This happens, for example, if the teacher solves the problems for the students or adds demanding new questions (Stein et al., 1996). In PLATO, intellectual challenge is conceptualized in a way that assumes that a common score can be given to the activities occurring during each 15-min segment. However, in the mathematics lessons analyzed (N = 47), a considerable amount of class time (30% in the Norwegian sample and 60% in the Finnish sample) was spent on individual seatwork where students worked on tasks in textbooks (Luoto et al., 2022). The conceptualization of intellectual challenge in PLATO thus became problematic, since students in both contexts, but especially in the Finnish classrooms, often worked on different types of tasks within the same or sometimes different books, and the teachers often differentiated instruction by individually guiding students on what challenge level to choose. Furthermore, teachers often lowered the challenge for struggling students while maintaining it for others. The tasks listed in Fig. 1 below are all from the same Finnish classroom and the same chapter of the mathematics textbook, illustrating how different levels of intellectual challenge may occur simultaneously.
In Fig. 1, task 1 demonstrates how the same type of procedural task can be more or less difficult (a is presumably easier than b and c), while all three are rote tasks where students follow given procedures (score 1 on PLATO). Task 2 and task 3 are more challenging: task 2 requires students to analyze and test how they can justify mathematical arguments, thus moving beyond simple rote tasks, while task 3 requires students to evaluate information and generalize it into a mathematical expression. However, whether task 2 and task 3 require "higher-level thinking" may depend on whether students have practiced such tasks so often that they have essentially become "rote," following predetermined procedures. This example undermines the assumption that a single score can measure intellectual challenge in classrooms where students individually work with tasks at different challenge levels, while teachers differentiate and lower the challenge levels for some but not all students. Moreover, no other PLATO element captures the rigor of tasks, nor is there any element that captures the level of differentiation.
As mentioned, the issue of coding intellectual challenge in such situations became obvious through the low inter-rater agreement. When raters discussed this low agreement, it appeared to stem from different reference points for which tasks were considered when assessing the intellectual challenge at hand, as well as from a lack of specification in the rubric on how to rate multiple simultaneous challenge levels. It should be noted that rating intellectual challenge was much less problematic in situations where the class worked on joint tasks presented in plenum, indicating that this element of PLATO privileges, or even necessitates, teaching methods where all students work on the same tasks.

Example 2: operationalization of teachers' use of uptake-quality of discussions during individual seatwork
In the PLATO manual, classroom discourse is divided into two sub-elements: uptake of student responses and opportunities for student talk. The focus in this example is uptake of student ideas, which captures the extent to which the teacher or other students "pick up" on students' comments and ideas. Low-level teacher uptake consists of either no response or automatic and brief responses to student ideas (score 1 or 2), while high-level teacher uptake is characterized by elaborating, revoicing, and asking for clarification or evidence of student ideas (score 3 or 4), in line with high-quality discourse in mathematics education research (e.g., Franke et al., 2007; O'Connor & Michaels, 1996). Classroom discourse is, by definition in PLATO, either whole-class, small-group, or partner talk, excluding individual teacher-student conversations. This means that uptake is always rated a 1 for teacher-student conversations during individual seatwork.

[Fig. 1 Examples of the nature and diversity of tasks from a Finnish classroom]

However, as we show in the following excerpt, teacher-student conversations during individual seatwork may also reflect high-level uptake. In this excerpt, students work individually, and a student calls for the teacher and asks for help with a task that instructs: "Make an expression of the area of a rectangle with the sides 2a and a." In the dialogue that ensued, the teacher asks for clarification and justification of the student's ideas (lines 7, 9, and 13), meeting the PLATO criteria for high-level uptake. However, because of the definition of classroom discourse as a joint activity involving several students, high-level uptake during individual teacher-student conversations is, by default, rated a 1, equivalent to no response to student ideas. The element feedback might capture aspects of these types of conversations but not the nature and quality of the uptake: teachers may receive a high score on feedback without ever asking students to explain or justify their reasoning. Hence, the qualities of teacher uptake in commonly occurring teacher-student discussions, especially in Finnish mathematics classrooms, remain uncaptured and overlooked in the PLATO observation system. Consequently, as content-based individual student-teacher conversations are not captured by any other element during individual seatwork, the way uptake of student responses is defined and operationalized may lead to bias at the item level.
This may also invite the somewhat misleading interpretation that there are few conversations about content in classrooms dominated by individual seatwork, while in reality, the way teachers engage in discourse with individual students varies from teacher to teacher, and how they do so may be an important indicator of instructional quality in this type of classroom, where learning is often individualized.

Example 3: sampling specifications and teachers' structuring of lessons-the case of scoring feedback
In PLATO, the element feedback captures the quality and quantity of feedback provided in response to student ideas and work. Low-quality feedback (score 1 or 2) is vague, misleading, and/or absent. High-quality scores (score 3 or 4) require that the feedback goes beyond procedural thinking and addresses students' understanding and how to improve their work, aligning with mathematics education literature stressing that feedback should address how students can move forward with their learning (Small & Lin, 2018). In terms of quantity, PLATO requires at least two instances within one 15-min segment for a score of 3 or 4. Teachers may, however, sequence lessons in different ways. While some teachers tend to "slice up" the lesson into different, clearly separated activities (for example, by first introducing the lesson and learning goals, then explaining and demonstrating something, then letting students solve a specific problem, then providing feedback), others may have a more flexible approach, transitioning between several of these activities multiple times during a lesson. In measuring feedback, these two ways of sequencing lessons would possibly yield very different results. Here, we use the scoring of two different 45-min lessons in which the teacher provides high-quality feedback (as defined by PLATO) on three different instances (F1, F2, and F3). In the first lesson (case A), all high-quality feedback instances happen in the last segment, while in the second illustrative lesson (case B), they are spread evenly across the lesson. These two cases illustrate how the same three instances of high-quality feedback are recognized and rewarded differently depending on whether they occur within the same segment (Fig. 2).
In case A, the element feedback would typically be scored 2-2-4, provided there was any generic feedback at all during the first two segments (there usually is) and the quality of the three feedback instances is in line with the manual's description of level 4 feedback. In case B, the quality of the feedback is the same, yet the measurement would differ, as the high-quality feedback occurs across three different segments; this lesson would typically score 2-2-2. In the Norwegian mathematics lessons in the LISA-study, the last segment of a lesson generally contains more high-quality feedback than the first segments (Stovner et al., 2021), as in case A above, while there was no consistent clustering of high-quality feedback in the last segment of the Finnish lessons, which resemble case B in lesson structure. Therefore, we may see patterns of no high-quality feedback at all in some classrooms, yet when approaching the lessons qualitatively, we may actually see exemplary instances of feedback; they are just isolated in different sequences of the lesson and become invisible in the rating. If it is the sequencing of lessons that systematically causes different ratings, rather than the amount and quality of feedback, there might be a method bias in the PLATO observation system.
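The arithmetic behind cases A and B can be sketched in a few lines of code. This is a deliberately simplified illustration, not PLATO's actual rubric: we assume a segment scores 4 when it contains at least two high-quality feedback instances and 2 otherwise (i.e., some generic feedback is present), and we reduce each 45-min lesson to three 15-min segments. The function name and minute-stamps are our own constructions for the example.

```python
# Simplified sketch of how a per-segment quantity threshold can penalize
# distributed feedback. Hypothetical rule (not PLATO's full rubric):
# a segment scores 4 with >= 2 high-quality feedback instances, else 2.

def score_segments(feedback_minutes, lesson_length=45, segment_length=15):
    """Map minute-stamps of high-quality feedback to per-segment scores."""
    n_segments = lesson_length // segment_length
    counts = [0] * n_segments
    for minute in feedback_minutes:
        counts[min(minute // segment_length, n_segments - 1)] += 1
    return [4 if c >= 2 else 2 for c in counts]

# Case A: all three instances (F1, F2, F3) clustered in the last segment.
case_a = score_segments([33, 38, 42])
# Case B: one instance in each segment.
case_b = score_segments([5, 20, 40])

print(case_a)  # [2, 2, 4]
print(case_b)  # [2, 2, 2]
```

The same three instances of high-quality feedback thus yield different segment profiles purely as a function of where they fall in the lesson, which is the method-bias concern raised above.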

Discussion
The aim of this paper was to investigate how standardized conceptualizations, operationalizations, and sequencing of instructional quality may produce biases when applied in different classroom contexts. While PLATO, originally developed in the USA, has been validated for psychometric properties at a broader grain size (Cor, 2011), no one has systematically examined how context-sensitive this manual is to different activity structures and instructional patterns. However, since its initial phase, PLATO has been used in mathematics (e.g., Cohen, 2013) and other school subjects (e.g., science and history education) in different US school settings (Cohen & Brown, 2016; Kane & Staiger, 2012) and, more recently, in different national contexts (e.g., Luoto et al., 2022; Brevik, 2019; Tengberg et al., 2021). We argue that applying observation systems across different contexts enables researchers to identify the ways established systems might need to be adapted or complemented with additional analyses or systems in order to represent different classroom contexts in a valid way. We will now discuss the identified possible biases and their implications and suggest concrete steps to address these issues when studying instructional practices in a Nordic context and beyond.

[Fig. 2 Examples of how teachers may sequence lessons. Case A: the teacher has a clearly sequenced lesson, where all instances of high-quality feedback occur in the last segment. Case B: the teacher lets the students work independently or in pairs on the task at hand, walks around, and provides individual guidance; one instance of high-quality feedback occurs in each of the three segments.]

Instructional format and structure of lessons
Biases in an observation system mean, among other things, that the measured construct is not identical across classrooms, which leads to a risk of producing inaccurate pictures of classroom instruction (e.g., Cohen & Goldhaber, 2016b). This may be due to construct, method, or item bias (van de Vijver & Tanzer, 2004). Our analyses indicate that the way some PLATO elements operationalize instruction, privileging whole-class instruction as the preferred instructional format, risks misrepresenting the quality of instruction in classroom contexts dominated by individual seatwork. This lack of context sensitivity, concerning the interplay between activity format and the measured instructional features, is a challenge permeating all three examples reviewed in this article. While none of these examples is exclusive to a Nordic context, they may be more prevalent in contexts where certain activity formats, such as individual seatwork, dominate.
The first and second examples concerning conceptualization and operationalization indicate a potential mismatch in how PLATO presumes instruction to unfold, emphasizing peer classroom exchange and joint discussions as quality indicators when decomposing instruction into observable behavior. The conceptualization of intellectual challenge, overlooking differentiated instruction, and the operationalization of teacher uptake, disregarding uptake during teacher-student conferences, indicate that these elements are not contextually sensitive enough to capture these features of instructional quality during individual seatwork. Interestingly, Virtanen and colleagues (Virtanen et al., 2017) mention similar issues with the CLASS in Finnish classrooms, as teachers scored low on emotional support due to the high amount of individual seatwork.
The third example, related to sampling specifications, shows how the sequencing of lessons, together with quality and quantity criteria, may produce different ratings for the same instruction. This is also related to instructional format, since teachers may be penalized or rewarded by PLATO depending on the instructional format and how they organize their lessons. However, as we discuss in the following, the issues raised in these examples do not necessarily mean that the way instructional quality is conceptualized and operationalized is biased.

Implications of "biases" and how we may address them
Conceptualizing and operationalizing instruction
Examples 1 and 2 both illustrate that if the conceptualization and operationalization of an aspect of instruction are not broad and flexible enough, it is very difficult to capture and fairly compare practices across contexts. The common use of individual seatwork, especially in Finnish mathematics classrooms, combined with teachers' common practice of differentiating both their help and the tasks, challenged us to rethink how to conceptualize and capture intellectual challenge and uptake of student ideas in these classrooms. Differentiation and adaptive teaching are ideals highlighted in the national curricula of both Finland and Norway (Finnish National Agency for Education, 2014; The Norwegian Directorate of Education & Training, 2015), and such national guidelines might reflect other views of quality instruction than PLATO does. However, the way intellectual challenge is conceptualized is not a bias per se, as it does not portray an inaccurate picture of instruction or systematically produce scores that are too low or too high. Rather, PLATO's conceptualization of quality instruction as whole-class instruction with common tasks lacks specifications on how to capture differentiated tasks. One alternative is to adapt PLATO by adding multiple simultaneous scores, showing a web of different challenge levels and teacher instruction. This would be much more complicated than a single score for each 15-min segment, but it would capture a variation of challenges, which may be more relevant in a context where differentiated instruction is a frequent practice and tasks are commonly adapted to each student's individual progress. Yet another alternative would be to replace or complement the intellectual challenge element with a more relevant instructional feature in situations where the teacher instruction and tasks are differentiated. The obvious candidate is an element capturing differentiated instruction.
Such an element can be found in other established observation systems such as ISTOF (Teddlie et al., 2006), emphasizing the degree to which teachers consider students' differences.
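One way to make the "multiple simultaneous scores" idea concrete is to record which challenge levels co-occur in a seatwork segment instead of forcing a single score. The sketch below is purely illustrative; the function name, data format, and summary are our own assumptions, not part of PLATO or any existing adaptation of it.

```python
# Illustrative sketch: representing a seatwork segment as a distribution of
# co-occurring challenge levels rather than one segment-level score.
# All names and the aggregation are hypothetical.
from collections import Counter

def challenge_profile(student_levels):
    """Summarize a seatwork segment in which students work at different
    PLATO-style challenge levels (1-4), as shares per level."""
    counts = Counter(student_levels)
    total = len(student_levels)
    return {level: round(counts[level] / total, 2) for level in sorted(counts)}

# A seatwork segment of the kind described above: most students on rote
# tasks (levels 1-2), a few on reasoning tasks (level 3).
profile = challenge_profile([1, 1, 2, 2, 2, 2, 3, 3])
print(profile)  # {1: 0.25, 2: 0.5, 3: 0.25}
```

A profile like this preserves the "web of different challenge levels" that a single segment score collapses, at the cost of requiring raters to track task levels per student or per student group.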
As example 2 shows, the quality of teacher uptake during individual student-teacher conversations remains uncaptured with PLATO during seatwork. While this instructional format seems common in Nordic mathematics classrooms, and especially in Finnish mathematics classrooms (Luoto et al., 2022), capturing the nature of discussions during individual seatwork might be an important indicator of instructional quality. In this case, we argue that there might be an item bias (van de Vijver & Tanzer, 2004), as the way teacher uptake is operationalized in PLATO does not capture the intended practice across different activity formats and may characterize contexts with much individual seatwork as settings with very little teacher uptake. However, if teacher-student discussions in whole-class settings were to prove more effective for student learning than individual teacher-student discussions, it would be legitimate to score whole-class sessions higher. In any case, there might be a need to adapt the PLATO framework when using it to capture teacher uptake, either by including individual teacher-student discussions in the existing rubric or by creating a separate sub-category for capturing uptake during instances of individual seatwork.

Sequencing instruction
As the example of feedback illustrates, sequencing lessons when scoring may result in an inaccurate picture of quality feedback practices. The way teachers organize different activities within a lesson may naturally yield different scores in different segments, and teachers who provide feedback of the same quality and amount (case A and case B) may end up receiving ratings at different points of the quality scale. There may thus be a method bias (van de Vijver & Tanzer, 2004) if some organizational formats are systematically scored low because a certain quantity of instructional features must occur within a short time span in order to count as high quality. However, the way PLATO measures feedback is not biased if clustered feedback is especially important for student learning; solving this issue requires empirical investigation of whether clustered feedback is more effective for student learning. There are both practical and analytical reasons for sequencing and slicing lessons into smaller sections. It allows for an understanding of how different parts of a lesson target different content, activities, and instructional practices, and it enables analyses of how different instructional features (e.g., feedback and intellectual challenge) correlate within the same segments. Rating based on a whole lesson has also been shown to be problematic; for example, a "holistic score" would hide the variance in when specific instructional practices occur and co-occur. Sequencing lessons into "meaningful chunks" (e.g., into specific activity formats) is challenging due to the high-inference coding of when exactly a meaningful chunk starts and ends. Future research combining all three versions of sequencing (short segments, whole lessons, and meaningful chunks) could shed light on the real implications of sequence grain size.

Addressing the unresolvable contradiction of common conceptualizations and context sensitivity
As discussed above, we may use identified biases to further define and refine the conceptualization and operationalization of features of instruction so as to make observation systems relevant and sensitive to a certain context. However, a context-sensitive approach actually contradicts a "common vocabulary" (Grossman & McDonald, 2008) and a common "frame for teaching." This might seem like a paradigmatic challenge to overcome, but in line with Schlesinger and Jentsch (2016), we argue that using several validated frameworks and adapting them to contextual purposes, applying a mix of "bottom-up" and "top-down" approaches, may be done while still using the same instructional language. This is challenging, however, as researchers have to be extremely explicit and attentive to how potential adaptations may undermine and disqualify comparisons, and to how to interpret results when using different frameworks that may essentially favor different views of learning. As Alan Schoenfeld (2016) discusses, no matter how clearly items are improved and clarified, even experts in the same field may never agree on how to conceptualize and operationalize teaching, due to different value systems. However, if observation systems cannot produce valid representations of instructional quality across differently structured lessons, classroom researchers will continue to rely on mostly informal and unstandardized observations that may reflect researchers' idiosyncratic preferences regarding instructional quality, from which it is difficult to aggregate knowledge (Praetorius et al., 2019; Klette & Blikstad-Balas, 2018).
The issue of which instructional features to focus on, and how to conceptualize and operationalize these features in observation systems, will continue to challenge scholars of classroom research. What we can do is be much more transparent about the trade-offs and the potential biases embedded in the chosen observation systems, as illustrated in this article by the way the quality of teacher uptake of student ideas was not captured during segments dominated by individual seatwork. Identifying possible biases and conceptual mismatches between classroom contexts and observation systems is essential for valid interpretations of results, and it helps clarify what adaptations or complementary frameworks may be needed to better understand instruction, as well as the kind of empirical research needed to either confirm or refute that instruction scoring low actually is "low quality" in that context. Examining biases across different observation systems applied to the same data can also illuminate inherent assumptions about teaching and the degree to which different systems give rise to different challenges in specific contexts. We argue that this is a necessary step for moving the field of systematic classroom observation forward toward a common "frame for teaching," and discussions about different types of biases ought to be standard for any cross-cultural interpretation. We may never completely solve these issues, as there might always be tensions depending on how we conceptualize and operationalize instructional constructs and sequence the scoring of lessons. The challenge worth pursuing is not to identify the ultimate grain size for describing and measuring instruction; rather, it is how to ensure that important aspects of instruction relevant to a certain context are captured, and that teachers with the same instructional practices who structure their lessons differently are not rated differently.

Concluding remarks
This study focused on how the way standardized observation systems conceptualize, operationalize, and sequence instruction might produce biases when applied in different contexts. There are many other types of possible biases in observation systems, as well as in other methods attempting to capture instruction across contexts (see van de Vijver & Tanzer, 2004), and it would be legitimate to ask whether we should import observation systems across contexts at all. Our answer, despite the challenges we have raised, is yes. However, as we have argued, adopting a standardized and validated instrument also comes with the responsibility of assessing whether the observation system captures the quality of instruction in meaningful and relevant ways in the specific context. This is particularly important if the instrument travels across national borders, where norms for what high-quality instruction is may differ; this is always both an empirical question of which instructional practices we identify as effective and a cultural question of what we value as good teaching.
Funding Open access funding provided by University of Oslo (incl. Oslo University Hospital).

Data availability The video datasets generated and analyzed during the current study are not publicly available because they are strictly regulated and participants have not given consent for data sharing beyond the project.