Introduction

Teaching quality has a pivotal influence on students’ academic outcomes (Darling-Hammond 2000; Patrinos and Angrist 2018; UN Sustainable Development Goal (SDG) 4, 2019). Research shows that students’ perceptions of their teachers’ behaviours are essential for describing perceived teaching quality in the intertwined dynamic psychosocial structure of learning environments (de Jong and Westerhof 2001; Levy et al. 2003; Maulana and Helms-Lorenz 2016a; Seidel and Shavelson 2007). Teaching quality and teacher behaviours have been defined and operationalised in different ways (Kyriakides et al. 2013; Muijs and Reynolds 2018; Reynolds et al. 2014; Scheerens 2016; Scheerens and Bosker 1997), implying that understanding teacher behaviour is still an ongoing process. This is mainly because of the complex dissonance between the educational setting and interrelated factors of teaching behaviours that should not be treated separately (Kim et al. 2019; Klassen and Tze 2014; Kyriakides et al. 2009; Seidel and Shavelson 2007). Overall, past studies define teaching quality as teachers’ behaviour that has significant and positive impacts on student outcomes (Fauth et al. 2019; Hattie 2009; Holzberger et al. 2019; Kyriakides et al. 2020; Lee and Mamerow 2019; Maulana et al. 2015a; Maulana et al. 2016b). Therefore, understanding how students perceive their teachers’ behaviour could strengthen teaching quality. However, studies of teacher professional development have revealed that teaching quality is exponentially related to the number of years of job experience; it remains relatively low among early career teachers and generally takes time to develop and reach a sufficient level (Brekelmans et al. 2005). Thus, countries such as Turkey face an additional challenge because of the high proportion of young teachers (TALIS 2018).

The present study was based on a framework from a theory-driven and evidence-based research approach of observable teaching behaviours (Maulana and Helms-Lorenz 2016a). According to this framework, observable teaching behaviours cover the six teaching domains of Learning Climate (CLM), Classroom Management (ORG), Clarity of Instruction (CLR), Activating Teaching (ACT), Differentiation (DIF) and Teaching Learning Strategies (TLS). These six domains synthesise various research traditions including teacher effectiveness (Creemers 1994; Scheerens and Bosker 1997), learning environments (Opdenakker et al. 2012) and teacher support (Klem and Connell 2004). Addressing this conceptualisation, the My Teacher Questionnaire (MTQ) was initially developed in the Netherlands and has been found useful for measuring perceived teaching behaviour in international research (de Jager et al. 2017; Inda-Caro et al. 2018; Maulana et al. 2019; van de Grift et al. 2017) and for preservice and inservice teacher professional development (Maulana et al. 2015a; Maulana and Helms-Lorenz 2016a; Maulana et al. 2016c; van de Grift 2014a; van de Grift et al. 2014b).

The Republic of Turkey aims to develop human capital for an improved future and the well-being of its citizens by investing in education (MEB 2017; World Bank 2011). One central way to achieve the goal is to support novice teachers’ performance up to the level of experienced teachers by monitoring and coaching their teaching behaviours continuously. However, a valid, reliable, low-cost and user-friendly instrument to assess teaching behaviours is scarce in the Turkish context. The aim of the present study was to adapt the MTQ for use in Turkish secondary education.

Theoretical framework

Teacher behaviours in learning environments

The teacher is a crucial actor in educational settings (UN Sustainable Development Goal (SDG) 4, 2019). Teaching behaviour makes a difference to students’ engagement, learning, achievement and well-being (Martin and Dowson 2009; Muijs et al. 2014; Pineda-Báez et al. 2019; Reeve 2006). Researchers and practitioners agree that teacher behaviours are complex (Hattie 2009; Kyriakides et al. 2009; Muijs et al. 2014). Growing attention has been directed towards identifying components of teacher behaviours that have substantial impacts on students’ outcomes (Scheerens 2016; Seidel and Shavelson 2007) and ways to promote sustainable teaching quality in learning environments (Harbour et al. 2015; Kyriakides et al. 2002; Maulana et al. 2019; Panayiotou et al. 2014).

Within the expanding educational research literature, van de Grift et al. (2014b) and Maulana et al. (2015a) unite theory and evidence-based practice to conceptualise and operationalise observable teaching behaviours into six domains. Learning Climate (CLM) is characterised by a psychosocially-safe learning environment that stimulates students’ learning and development. This includes behaviours such as fostering respect, encouraging self-confidence, facilitating healthy interpersonal relationships, and providing a base for healthy growth. Classroom Management (ORG) illustrates teacher behaviours associated with efficient time management for students’ activity and minimisation of physical and psychosocial barriers in teaching–learning time, while processing knowledge in an appropriate manner for students’ comprehension level. Clarity of Instruction (CLR) deals with behaviours such as informing students about the lesson objectives and their expected gain, using multiple instructional strategies in a clear unity, facilitating students’ prior knowledge, and checking whether lesson objectives are achieved and whether students understand a given task as intended. Activating Teaching (ACT) indicates teaching behaviours that facilitate students’ active learning. Activating students’ knowledge makes them aware of the relevance of content to their learning and their expected performances. Differentiation (DIF) covers teaching behaviours related to higher-level operations and strategies at the cognitive and affective levels to support individual student needs to link existing and desired skills for their own learning and metacognition. This serves as a base for students to achieve higher-level cognitive skills. Teaching Learning Strategies (TLS) concern teacher behaviours that deliberately demonstrate, teach and scaffold learning processes aimed at improving self-regulation of learning processes.

Empirical studies show that the six domains of teaching behaviour follow a stage-like order on a unidimensional continuum (van de Grift et al. 2014b; van der Lans et al. 2018, 2019). More-complex teaching behaviours require sufficient experience, practice and knowledge, even though a small number of novice teachers are capable of displaying highly-skilled teaching behaviours. The first three teaching behaviours (CLM, ORG, CLR) are viewed as basic competences for teaching, while the other three are viewed as more complex behavioural domains (van de Grift et al. 2014b).

Context of the study: Turkey

The Turkish Ministry of National Education (MEB) is responsible for the educational administration of the national curriculum. The third level of compulsory secondary education, which is the focus of this study, is the four-year (ages 15–19) educational context that prepares students for further study. The schooling at this level consists of 40 class hours per week that vary depending on the track, curriculum and elective courses (EURYDICE 2020). Over the years, significant improvements in education have been made in Turkey (MEB 2019a; TUK 2020). However, a number of educational challenges remain apparent, as revealed by international testing studies (e.g. PISA, TIMSS) (MEB 2019b). This suggests some needed alterations, hard work and roadmaps for developing teaching quality in general and understanding how the students perceive their teachers’ behaviours (MEB 2017).

Recently, the Teaching and Learning International Survey (TALIS) of 2018 revealed that learning environments are perceived positively by Turkish students and teachers. Nevertheless, teachers reported that they spent 72% of classroom time on actual teaching and learning, which is lower than the OECD average (78%). In addition, teachers did not broadly use effective instructional practices such as student cognitive activation approaches (Burge et al. 2015). Turkish teachers’ average age was 36 years, which is below the average of 44 years for the remaining 48 countries. Only 6% of Turkish teachers were aged 50 years and above (OECD average 34%). Alignment of these insights calls for rapid implementation of support programs for novice teachers’ professional development and sustainable teaching quality in learning environments.

In Turkey, the majority of studies of teaching behaviours focus on effective, good or ideal teaching from students’ perspectives (Telli et al. 2008) and teacher candidates and teachers (Bozkuş and Taştan 2016; Çakmak 2009; Karakelle 2005; Kozikoglu 2017). A recent study focused on effective teaching criteria in subject teaching, such as mathematics (Yıldırım and Yıldırım 2019). Jointly, some comparative studies have involved teachers’ behaviour in terms of professional development (Özkan et al. 2019) and teacher questioning styles (Çalık and Aksu 2018). Although the aforementioned studies highlight the importance of positive learning environments and teacher behaviours in general, teaching quality was not an explicit focus in the secondary-education setting. Therefore, little is known about teaching behaviour in secondary education from students’ perspectives. Student questionnaires have been recognised in the learning environment literature as highly valuable for tapping into what is happening in the classrooms based on the lens of students (de Jong and Westerhof 2001; Fraser 2012).

The nature of the teaching profession requires practical, yet theory-based, solutions (European Commission 2013; Ingvarson 2019). The present study is particularly important because it attempted to provide evidence regarding the psychometric quality of a student questionnaire that can be used to assess perceived teacher behaviours in secondary schools. In the long term, information gathered in this way could be used to enhance and support teaching quality and to increase the ‘true’ potential of the teacher’s presence in real-time learning.

Research aims

To provide sustainable teaching quality in diverse learning environments, teaching behaviours should be supported and monitored in the professional context. Professional feedback should be provided to improve teaching (e.g. lesson studies, research lessons, professional learning communities). Knowing that higher levels of teaching quality are related to more teaching experience (van de Grift et al. 2014b), and that the Turkish teaching force is younger than the OECD average (TALIS 2018), a practical, highly-reliable and valid measure is needed to provide prompt professional feedback, in real time, to boost teaching performance. We aimed to develop an instrument that is concise and at the same time adequately represents the construct of effective teaching behaviour. These practical characteristics are highly important in contemporary classroom assessments to maintain sufficient participation rates and reduce response fatigue (Brick 2018; Groves 2006). To our knowledge, a student questionnaire that meets the mentioned characteristics is not yet available in the Turkish context. The present study filled this gap by examining an existing, valid and reliable measure to tap perceived teaching behaviours (Maulana and Helms-Lorenz 2016a) and adapting it for use in the Turkish context. To reach this goal, we applied Mokken Scaling (MS).

Mokken Scaling (MS)

Test construction is based on one of two test theories: Classical Test Theory (CTT) or Item Response Theory (IRT). The present study applied IRT, whose two main models, parametric (IRT, e.g. Rasch) and nonparametric (NIRT, e.g. Mokken), try to explain the structure in the manifest item and test responses by assuming the existence of a latent structure (θ) on which persons and items have a position. In this respect, both have the same assumptions. However, to do this, the parametric approach defines the shape of the Item Response Function (IRF) and uses transformations that result in measures on an equal interval scale, while the nonparametric approach explores measurement properties by evaluating the relationship between items and the latent structure (θ) being measured (e.g. kernel smoothing, isotonic regression estimation) (Meijer and Baneke 2004; Meijer et al. 2014). Thus, NIRT supports the interpretation of total scores (i.e. sum scores) to meaningfully order persons and items on the latent structure (θ) without any parametric transformations while identifying unexpected answering behaviour in response patterns. Several scholars recognise that these psychometric properties of NIRT are particularly useful in contexts in which the underlying response processes are not well understood, such as with non-cognitive data, and for avoiding misleading results of parametric IRT models (Chernyshenko et al. 2001; Meijer and Baneke 2004). This is important for enhancing our understanding of different learning environments (e.g. multi, hybrid, informal) and exploring the social, physical, psychological and pedagogical contexts in which learning occurs and which affect students’ affective outcomes.

Mokken Scaling (MS; Mokken 1971) describes the relationship between trait scores and item responses, as IRT models do, but explores the shape of the IRF without forcing the data into a particular structure (e.g. a logistic ogive shape) that they do not have (Meijer et al. 2014; Molenaar 2004). Empirical data almost never satisfy the strong IRT model assumptions fully. NIRT (e.g. Mokken scaling) helps to explore the reasons why the data fit the model and it reveals the reasons why a specific logistic IRF model fails to fit the data (e.g. Meijer and Baneke 2004). NIRT also provides information about the psychometric quality of items in a particular population (Meijer et al. 2014). MS is based mainly on Guttman (1945) scaling and, because of its explorative nature, it is described as a probabilistic theory-driven NIRT (van Schuur 2003). MS provides advantages and flexibility to researchers for exploring the nature of data as long as basic ordering requirements are met. Additionally, the availability of the frequently-updated R software with graphical features and its package ‘mokken’ (van der Ark 2007, 2012) supports the popularity of MS among educational researchers (Wind 2017, 2019). MS uses two NIRT models: the Monotone Homogeneity Model (MHM), based on three assumptions (monotonicity, unidimensionality and local independence); and a stricter Double Monotonicity Model (DMM), obtained by adding a fourth assumption, namely, evidence of Invariant Rater Ordering (IRO). Based on the same requirements, Molenaar (1982, 1997) proposed dichotomous and polytomous formulations of these two models by specifying the polytomous DMM with Item Step Response Functions (ISRFs).

Based on the theoretical outline above, we studied effective teaching from the perspective of observable teaching behaviour based on teacher effectiveness and learning environments frameworks (Maulana et al. 2015b; van de Grift 2007). The MS polytomous DMM was employed to adapt the MTQ for assessing effective teaching behaviours in a limited time under diverse teaching conditions in Turkey.

Methods

Participants

The sample consisted of 12,036 students (Grade 9–12, age 15–19 years) from 446 classes/teachers from 24 coeducational general public schools accessible for students from various socio-economic backgrounds. Schools were located in two cities (7,995 students, 66.4%) and rural areas (4,041 students, 33.6%) from the highly-populated north-west part of the country (Marmara). This region geographically connects Europe and Asia. Each school participating in the study provided between 8 and 29 classes/teachers (M = 18.85, SD = 4.78). There were 8,458 students (70.3%) from 296 classes/teachers from one city and its districts and 3,578 students (29.7%) from 150 classes/teachers from the other city and its districts. More than half of the students (N = 6,544, 54.40%) were female, while 306 students (2.5%) did not report their gender. According to national statistics, a total of 1,668,086 students (913,404 are female, 54.76%) attend general public secondary schools (MEB 2019a, p. 129). Therefore, the gender distribution of our sample was representative of the country. Students were distributed by grade as follows: 4,248 (35.3%) in grade 9, 3,470 (28.8%) in grade 10, 2,905 (24.1%) in grade 11 and 1,413 (11.7%) in grade 12. The distribution by subject taught was: 4,784 students (39.7%) for Beta Subjects-science track (biology, chemistry, physics, mathematics); 4,259 students (35.4%) for Alpha Subjects-the language track (i.e. English, German, Turkish); 2,567 students (21.3%) for Gamma Subjects-social sciences; 176 students (1.5%) for Physical education; and 220 students (1.8%) for the Music/Art track. Class size varied from 7 to 39 students (M = 26.29, SD = 6.31).

Ethics approval was granted by the authorities concerned. Throughout the study, students, teachers and schools were randomly selected on a voluntary basis. All questionnaires were completed during normal class hours (40 min) without the presence of teachers. Data (with multiple measures) were collected in 2017 (9,046 students, 75.2%) and 2018 (2,990 students, 24.8%) during fall (October–December) and spring (March–May) as a part of the International Comparative Analysis of Learning and Teaching (ICALT3) project comparing the perceived teaching quality. This study focused only on modifying the MTQ for the Turkish secondary-education context.

Measures

Two instruments were used. The My Teacher Questionnaire-MTQ (Inda-Caro et al. 2019; Maulana and Helms-Lorenz 2016a) was the main instrument and measured students’ perception of teaching behaviour. The Student Engagement Scale (Skinner et al. 2009) was a criterion measure for checking convergent validity considering the theoretical connection between the two constructs (Maulana and Helms-Lorenz 2016a). Response alternatives were on a 4-point Likert scale, with higher responses indicating higher quality levels. Surveys were conducted employing a paper and pencil method.

The MTQ contains 41 items measuring perceived behaviour in the six domains: Learning Climate (CLM) (5 items, e.g. “My teacher answers my questions”, α = 0.75); Classroom Management (ORG) (8 items, e.g. “My teacher applies clear rules”, α = 0.83); Clarity of Instruction (CLR) (7 items, e.g. “My teacher explains the purpose of the lesson clearly”, α = 0.86); Activating Teaching (ACT) (10 items, e.g. “My teacher encourages me to think”, α = 0.86); Differentiation (DIF) (4 items, e.g. “My teacher knows what I have difficulty with”, α = 0.79); and Teaching Learning Strategy (TLS) (7 items, e.g. “My teacher teaches me to check my solutions”, α = 0.85). Prior studies have shown that the items of the MTQ (Maulana et al. 2016c; van de Grift et al. 2014b) can be ordered in a unidimensional structure (Maulana et al. 2015a, 2019; van der Lans et al. 2019) and that the instrument is valid and reliable across countries (Maulana et al. 2019).

Students’ engagement was assessed using 10 items in two scales: Behavioural Engagement-BEHE (5 items; e.g. “In this class I pay attention”, α = 0.84), and Emotional Engagement-EMEN (5 items; e.g. “In this class I feel good”, α = 0.80).

Translation process

Following International Test Commission (ITC 2018) guidelines, instruments were translated separately from English into Turkish by two native Turkish speakers majoring in English as a Foreign Language (Translation-1). The translations were then double checked, proofread and finally back translated by three different independent experts who were qualified and experienced in these languages and knowledgeable about the instrument development and adaptation (Back translation-2). Translated items were checked for the content and the appropriateness of the translation. Concurrently, a Turkish secondary-school language teacher with over 15 years of teaching experience reviewed the measures for the semantic structure (Committee approach-3).

Through Translation-1 and Back translation-2, MTQ items were independently double checked with the original Dutch version (source for the translation) by a native Dutch speaker and a multilingual teacher educator. This combination was preferred for maximising the suitability of the test adaptation and recognising the differences (i.e. linguistic, cultural, psychological) and equivalence (Grisay 2003; van de Vijver and Tanzer 2004).

Data analysis

Descriptive statistics were generated for items and subscales. Construct validity of the MTQ involved (1) data examination, (2) scaling as recommended by Sijtsma and van der Ark (2017) and Wind (2017) and (3) predictive validity (Crișan et al. 2020). For the student engagement measure, Principal Component Analysis (PCA) (varimax rotation) and Confirmatory Factor Analysis (CFA) were performed with the ML estimator for both models using the R package ‘lavaan’. Model fit was checked using the Comparative Fit Index (CFI), Tucker–Lewis Index (TLI), Standardised Root Mean Square Residual (SRMR) and Root Mean Square Error of Approximation (RMSEA). RMSEA < 0.05 and TLI and CFI close to 0.95 were considered to indicate good fit (Schreiber et al. 2006). Missing data were reported and deleted listwise for both measures. Analysis was performed using the programs SPSS25, R (version 3.6.1) and MSP5 for Windows (Groningen: ProGamma).

Results

(1) Data examination for the MTQ involved, first, the Kolmogorov–Smirnov test [df (9415) = 0.00, p = 0.005], which indicated that the data did not follow a normal distribution but were skewed to the left in all cases. Second, the Graded Response Model (GRM), an extended IRT model for ordered polytomous observed variables, was applied to understand the response behaviours and how the set of items performed (Samejima 1968; Perner and Imiya 2005). The MTQ items were visualised using the R package ‘psychotree’ to explore the unidimensionality assumption further (Maulana et al. 2015a, 2019; van der Lans et al. 2019). The total information estimated by this model indicated the presence of a nonnormal distribution with the highest frequency towards the maximum scores (3–4) in the data (Fig. 1).

Fig. 1

Observed raters’ response patterns across the classified groups, gender versus school subjects (Nstudent = 12,036, MTQ 41 items)

Subsequently, the NIRT approach was carried out to identify the items which satisfied the four assumptions of Unidimensionality (UD) (all items are related to a single latent variable-θ), Monotonicity (M) (as person locations on the latent variable increase the probability for correct response, X = 1, does not decrease), Local independence (LI) (answers on items depend solely on the latent trait and not on some other characteristics of the individual or its environment), and Non-intersecting ISRFs (the conditional probability for a rating in category k or higher on Item i has the same relative ordering across all values of the latent variable θ) using MS polytomous DMM. Under the DMM, IRFs can take on a variety of shapes as long as they do not intersect.

Regarding (2) Scaling during the first stage, the data were scanned for missing scores, inadmissible scores and outliers for MS. The number of Guttman errors showed that the Guttman pattern was consistent (Meijer et al. 2016). Missing values varied between 0.5 and 2.0% at the item level. The missing values were less than 5% and within acceptable range to be considered as missing at random (Tabachnick et al. 2013) and less than the figure of 10% that is unproblematic for MS (Sijtsma and van der Ark 2017). MS properties (i.e. element accuracy, scalability coefficients, and confidence intervals around scalability coefficients) have been shown to be sensitive to sample size. The applied sample (11,230 students, 6.7% missing) is sufficient to perform MS polytomous DMM with the real data (Crișan et al. 2020; Watson et al. 2018).

During the second stage, items were examined for scalability (H coefficient) and dimensionality using both the Automated Item Selection Procedure (AISP) and the Genetic Algorithm (GA) in the R package ‘mokken’ because these different searching algorithms can provide different results (Meijer et al. 2015). Loevinger’s H coefficient indicates an unscalable scale if H < 0.3, a weak scale if 0.3 ≤ H < 0.4, a medium scale if 0.4 ≤ H < 0.5, and a strong scale if 0.5 ≤ H ≤ 1.0 (Mokken 1971; Sijtsma and Molenaar 2002). Higher H values imply a more reliable ordering of both items and persons (Hemker and Sijtsma 1993). Items were selected stepwise, consistent with the procedure proposed and taken by earlier initiatives for MS item reduction (Molenaar and Sijtsma 2000). H was used to select scales with both Type = Search normal and Test. The default settings were used in each algorithm. The procedure was run for positive constant c1 initially set at ILowH = 0.00 as a control condition (Crișan et al. 2020; Meijer et al. 2015; Sijtsma et al. 2011) and then re-run with c increased by increments of 0.05 up to 0.80 (the upper bound 1). Meanwhile, some items were separated into more subscales (Hemker et al. 1995; Moorer et al. 2001). The H value at each c value and the number of suggested scales were examined to confirm and test the unidimensional structure (Sijtsma and Molenaar 2002).
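The scalability coefficients referred to above can be illustrated with a minimal sketch. It assumes the covariance form of Loevinger's H for polytomous items, in which each observed inter-item covariance is divided by the largest covariance that the two items' marginal score distributions allow. The function names and the toy data are hypothetical; this is not the routine implemented in the ‘mokken’ package or MSP5, and it does not perform item selection.

```python
from itertools import combinations
from statistics import mean


def covariance(a, b):
    """Population covariance of two equal-length score lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def scalability(items):
    """items: one list of scores per item, same respondents in the same order.

    Returns (H for the whole scale, list of item-level Hj).
    Assumes every item has some variance; a constant item would divide by zero.
    """
    k = len(items)
    cov = [[0.0] * k for _ in range(k)]
    cov_max = [[0.0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        cov[i][j] = cov[j][i] = covariance(items[i], items[j])
        # Largest covariance given the marginals: pair the sorted scores
        # of item i with the sorted scores of item j.
        cov_max[i][j] = cov_max[j][i] = covariance(sorted(items[i]), sorted(items[j]))
    h_items = [sum(cov[j]) / sum(cov_max[j]) for j in range(k)]
    h_scale = sum(map(sum, cov)) / sum(map(sum, cov_max))
    return h_scale, h_items


def label(h):
    """Rule-of-thumb labels for H quoted in the text."""
    if h < 0.3:
        return "unscalable"
    if h < 0.4:
        return "weak"
    if h < 0.5:
        return "medium"
    return "strong"


# Hypothetical 4-point responses of six students to three items.
toy_items = [[1, 2, 3, 4, 4, 2], [1, 1, 2, 3, 4, 3], [2, 1, 1, 2, 3, 2]]
h_scale, h_items = scalability(toy_items)
print(round(h_scale, 2), label(h_scale), [round(h, 2) for h in h_items])
```

In this form H equals 1 when the response patterns contain no Guttman errors and 0 when the items are uncorrelated, which is why raising the lower bound c retains only items that order respondents more consistently.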

During the first round of item refining, this procedure was used to remove unscalable items [e.g. My teacher talks interestingly, ACT, H < 0.3, item-pair scalability (Hij) and item scalability (Hj = 0.08) were positive, MItem = 2.65] and the remaining 40 items were scaled into one dimension with H ≥ 0.3 (Scale: H = 0.52, ρ = 0.97, Hj = 0.42–0.60). At c = 0.40, two items were removed because of the lower bound (e.g. My teacher makes sure that I treat others with respect, CLM, Hj = 0.37) and the remaining 38 items were scaled on one dimension (Scale: H = 0.54, ρ = 0.97, Hj = 0.42–0.60). With c set at 0.50, seven items were removed because Hj < 0.50 (e.g. My teacher makes sure that others treat me with respect, CLM, Hj = 0.49). The remaining 31 items were classified into one dimension (Scale: H = 0.57, ρ = 0.97) while two items formed a second scale (Scale: H = 0.65, ρ = 0.80). These two items (My teacher lets me summarise the content of the lesson, TLS, and My teacher lets me explain the content of the lesson to other students, TLS) were removed stepwise. Afterwards, 31 items at c = 0.50 fitted the unidimensional measure (Scale: H = 0.56, ρ = 0.97) with Hj varying between 0.51 and 0.61 (strong scale). During the second round, the items’ factor loadings on the scale were calculated. The items with the lowest factor loading were deleted stepwise when the scale internal consistency (Molenaar Sijtsma rho-ρ) was lower than 0.70 and H ≤ 0.50 at c = 0.50 (e.g. My teacher makes sure that I use my time effectively, ACT). During the third round, content-based and item correlations were examined to identify redundant items. If items were similar in content or in the same domain (van de Grift 2007, 2014a), the item with the lowest H score was deleted (e.g. My teacher answers my questions, CLM, Hj = 0.51). After removing 12 items stepwise, 19 items remained for further evaluation.

During the third round, monotonicity and local independence assumptions were examined. The last assumption, ISRFs, was checked based on PMatrix information. Nine items with Crit > 80 (for Crit see footnote 2) showed a strong violation and were discarded in succession (e.g. My teacher motivates me, ACL, Crit 116) (Molenaar and Sijtsma 2000). After these rounds of item reduction, 10 items (MTQ10) remained and fitted the unidimensional structure and satisfied all assumptions for the MS polytomous DMM (Table 1).

Table 1 Summary of items and monotonicity checks for the 10-item My Teacher Questionnaire (MTQ10) and its item distribution over the six domains of the original MTQ

During the third stage, scale properties were investigated. MS provides the scale reliability statistic, Molenaar Sijtsma rho-(ρ), which is comparable to Cronbach’s α (Molenaar and Sijtsma 1984). A value of ρ > 0.7 is considered acceptable (Kline 2000; Nunnally and Bernstein 1994). Items generally scored higher (M = 2.98, SD = 0.0072, skewness = −0.488 SD = 0.023, kurtosis = −0.559 SD = 0.046). Cronbach’s α and rho-(ρ) were 0.93. MTQ10 properties are presented in Table 2. Meeting these four assumptions provides evidence that the MTQ10 is sufficiently unidimensional, represents the teaching behaviour (construct) more concisely and is reliable (Wind 2019).
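As an illustration of the reliability statistic that ρ is compared with, the following minimal sketch computes Cronbach's α from a complete-case response matrix. It does not compute the Molenaar Sijtsma ρ, and the function name and the toy responses are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (complete cases only).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of sum scores)
    """
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 4-point responses of five students to three items.
toy_rows = [[4, 4, 3], [3, 3, 3], [2, 3, 2], [4, 3, 4], [1, 2, 1]]
print(round(cronbach_alpha(toy_rows), 2))
```

Because α uses only the item variances and the variance of the sum score, it needs listwise-complete data, in line with the listwise deletion described in the data analysis section.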

Table 2 Scale properties, univariate frequencies and item means and reliability for MTQ10 [Nstudent = 11,230 (Nstudent = 12,036, missing 806, 6.7%)]

Furthermore, scale equivalence was examined (ITC 2018, p. 116). MTQ10 satisfied all MS polytomous DMM assumptions, which indicates that the 10-item set does not exhibit Differential Item Functioning (DIF) (Moorer et al. 2001). Thus, the analysis was extended with Differential Scale Functioning (DSF). The scale was tested with three randomly-formed subsamples (N1student = 3,744; N2student = 3,743; N3student = 3,743) according to grade level, school subject and gender for equal functioning to determine whether the scale composition and properties are generalisable. There was no indication of DSF across these groups. Results for the school subjects are given in Fig. 2.

Fig. 2

MTQ10 Differential Scale Functioning (DSF) across school subjects

Finally, for (3) predictive validity, the MTQ10 was validated by consensus ad idem (Downing 2003; Nunnally and Bernstein 1994). Initially, face and content validity were examined by an expert group. Next, MTQ10 met the four assumptions of the MS polytomous DMM and showed the unidimensional structure with high reliability (α and ρ = 0.93) (Wind 2019). Irrespective of these results, researchers agreed to cross-validate the MS results as a sine qua non of assessment (Crișan et al. 2020). Thus, the predictive validity was determined between the measures, MTQ10 and Student Engagement (criterion measure).

For Student Engagement, first, PCA was performed on the 11,388 × 10 matrix (5.4% missing). The Bartlett test (χ2 (45) = 48,738.334, p < 0.001) and the Kaiser–Meyer–Olkin measure (KMO = 0.854) indicated that the data were suitable for PCA (Field 2009). The scales correlated with each other (r = 0.607, p < 0.001) and explained 59.02% of the variance in Model 1 (10 items, see Table 3). All the items fell into their respective factors with two exceptions. The Behavioural Engagement (BEHE) item “In this class, I participate in class discussion” loaded on Emotional Engagement-EMEN (0.43) and marginally (0.32) on the BEHE. The exact opposite pattern was found for the EMEN item “In this class, when we work on something, I feel interested”, which loaded on the BEHE (0.44) and loaded marginally on the EMEN (0.34).
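The degrees of freedom reported for the Bartlett test follow from the number of items, p(p − 1)/2, which gives 45 for 10 items. A minimal sketch of the standard test statistic, χ2 = −(n − 1 − (2p + 5)/6) ln|R| with R the item correlation matrix, is given below. The function names and the random placeholder data are assumptions for illustration only; the p value and the KMO measure are not computed.

```python
import math
import random
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlations between item columns (lists of equal length)."""
    z = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        z.append([(x - m) / s for x in col])
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def log_determinant(matrix):
    """Log-determinant by Gaussian elimination (assumes a positive-definite matrix)."""
    a = [row[:] for row in matrix]
    size = len(a)
    logdet = 0.0
    for i in range(size):
        logdet += math.log(a[i][i])
        for r in range(i + 1, size):
            factor = a[r][i] / a[i][i]
            for c in range(i, size):
                a[r][c] -= factor * a[i][c]
    return logdet


def bartlett_sphericity(columns):
    """Return (chi-square statistic, degrees of freedom)."""
    n, p = len(columns[0]), len(columns)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * log_determinant(correlation_matrix(columns))
    return chi2, p * (p - 1) // 2


# Random placeholder responses (not study data): 60 students, 10 items.
rng = random.Random(1)
demo = [[rng.randint(1, 4) for _ in range(60)] for _ in range(10)]
chi2_demo, df_demo = bartlett_sphericity(demo)
print(df_demo)
```

Dropping two items leaves p = 8 and therefore 28 degrees of freedom, which matches the value reported for the reduced engagement measure.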

Table 3 Descriptive information for student engagement, Cronbach α and CFA goodness-of-fit indices

It is possible that many students interpreted this item as a mixture of behavioural and emotional engagement as the word ‘interested’ in Turkish language and culture also implies an affective state. Second, these two items were excluded from the analysis and PCA was performed on the 11,513 × 8 matrix (4.35% missing). The Bartlett test (χ2 (28) = 39,043.277, p < 0.001), the Kaiser–Meyer–Olkin measure (KMO = 0.816) and the reliabilities (BEHE α = 0.84, EMEN α = 0.76) were satisfactory. The scales correlated with each other (r = 0.50, p < 0.001) and explained 64.43% of the variance in Model 2 (after removing two cross-loaded items, 8 items, see Table 3). Considering the discussion about the Kaiser criterion (Fabrigar et al. 1999; O’Connor 2000) and comments about checking the unidimensional structure in multiple ways (Ziegler and Hagemann 2015), we conducted Horn Parallel Analysis (Horn 1965), which is considered among the most accurate methods (Dinno 2009; Glorfeld 1995).

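Parallel analysis compares the eigenvalues of the observed data with eigenvalues extracted from random data of the same size (Horn 1965); Glorfeld’s (1995) extension uses a high percentile of the random eigenvalues, rather than their mean, as the retention threshold. For this study, Horn’s Parallel Analysis (Horn 1965) was performed using the R package ‘paran’, which showed that the two-factor structure was retained for Model 1 and Model 2 (Fig. 3).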
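To make the retention rule concrete, the following is a minimal Python sketch of parallel analysis with a percentile threshold. It is not the ‘paran’ implementation used in the study; the function name `parallel_analysis`, its parameters and the synthetic two-factor data are all assumptions introduced for illustration.

```python
import numpy as np


def parallel_analysis(data, n_iter=200, percentile=95, seed=0):
    """Return (observed eigenvalues, random thresholds, components retained)."""
    data = np.asarray(data, dtype=float)
    n_obs, n_vars = data.shape
    rng = np.random.default_rng(seed)

    # Eigenvalues of the observed correlation matrix, largest first.
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

    # Eigenvalues of correlation matrices from uncorrelated normal data
    # with the same number of rows and columns as the observed data.
    random_eigs = np.empty((n_iter, n_vars))
    for i in range(n_iter):
        noise = rng.standard_normal((n_obs, n_vars))
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]

    # Percentile threshold (in the spirit of Glorfeld's extension)
    # instead of the mean random eigenvalue.
    thresholds = np.percentile(random_eigs, percentile, axis=0)

    # Count leading components whose eigenvalue exceeds its threshold.
    retained = 0
    for obs, thr in zip(observed, thresholds):
        if obs <= thr:
            break
        retained += 1
    return observed, thresholds, retained


# Hypothetical demo data: two uncorrelated factors with four items each.
rng = np.random.default_rng(1)
factors = rng.standard_normal((500, 2))
loadings = np.zeros((2, 8))
loadings[0, :4] = 0.8
loadings[1, 4:] = 0.8
scores = factors @ loadings + 0.6 * rng.standard_normal((500, 8))

observed, thresholds, n_retained = parallel_analysis(scores)
```

With real questionnaire data, missing responses would need to be handled before the correlation matrix is computed; the sketch assumes complete cases.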

Fig. 3 Plot for Horn’s Parallel Analysis for Model 1 (1a) and Model 2 (1b)

Third, CFA was performed (R package lavaan). The fit indices showed slightly low values for CFI and TLI and a high value for RMSEA (Schreiber et al. 2006). Results for both models (see Skinner et al. 2009 for details) are presented in Table 3. For predictive validity, the Pearson correlation coefficients (varying between 0.41 and 0.47) and the Corrected Attenuation (CA; see footnote 3) (varying between 0.52 and 0.64) (Spearman 1904; Nunnally and Bernstein 1994, pp. 240–241) were calculated between the MTQ10 and Student Engagement (two models) (see Table 4).
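The correction for attenuation divides an observed correlation by the square root of the product of the two reliabilities. The short Python sketch below shows the arithmetic; the function name `disattenuate` and the input values are illustrative assumptions and are not taken from Table 4.

```python
import math


def disattenuate(r_xy, rel_x, rel_y):
    """Spearman's correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical inputs: an observed correlation and two scale reliabilities.
corrected = disattenuate(r_xy=0.45, rel_x=0.90, rel_y=0.80)
```

Because both reliabilities are at most 1, the corrected coefficient is never smaller in magnitude than the observed one, which is why the CA values exceed the corresponding Pearson correlations.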

Table 4 Pearson correlations and the Corrected Attenuation (CA) between MTQ10 and student engagement (two models)

Discussion

This study’s aim was to shorten the MTQ (Inda-Caro et al. 2019; Maulana and Helms-Lorenz 2016a) to assess perceived teaching behaviours in Turkey. When the MS polytomous DMM was applied, the resulting MTQ10 showed strong psychometric characteristics, internal consistency and construct validity. Its unidimensional structure is consistent with previous findings and the original version of MTQ (Maulana et al. 2015a, 2019; van der Lans et al. 2019) and is consistent across groups (random samples, gender, subject, grade level). MTQ10 met all the MS polytomous DMM assumptions. The observed violations of monotonicity were minor [Table 1, Crit (see footnote 2) less than 80], which could be because of sampling fluctuations (Molenaar and Sijtsma 2000), and the Guttman Error in Response Pattern was consistent (Meijer et al. 2016, see Appendix). MTQ10 had adequate validity and strong reliability (Cronbach’s α and Molenaar Sijtsma rho-ρ are 0.93).
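For readers less familiar with the reliability coefficient reported here, Cronbach’s α can be computed from the item variances and the variance of the total score. The following Python sketch assumes complete responses; the function name `cronbach_alpha` and the toy data are hypothetical and unrelated to the study sample.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (complete cases only)."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item (columns) and of the total score.
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses from four students to two items.
demo = [[1, 1], [2, 3], [3, 2], [4, 4]]
alpha = cronbach_alpha(demo)
```

The Molenaar-Sijtsma ρ reported alongside α is estimated differently and is not covered by this sketch.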

The applied methodology, MS polytomous DMM, confirmed the psychometric quality of the MTQ10. First, in parametric IRT, item factor analysis is used to test the assumption of unidimensionality (Reise and Waller 2009). MS selects items that circumvent the assumption by upper and lower asymptotes, because the H coefficient is used as a criterion for including items in a scale. Items with asymptotes substantially different from 0 and/or 1 were rejected in a stepwise manner (for not being discriminating enough). This means that the ceiling and floor effects are eliminated. Therefore, in the NIRT literature, it is suggested that nonparametric approaches for assessing unidimensionality are preferred over parametric ones (Meijer and Baneke 2004; Sijtsma and Molenaar 2002). Second, students’ and teachers’ preferences for short questionnaires are well recognised (Maulana and Helms-Lorenz 2016a; Maulana et al. 2019). However, reliability increases with test length, and shorter tests often consist of items with relatively low inter-item correlations. It could be difficult to optimise both reliability and predictive validity at the same time (Magnusson 1967; Nunnally and Bernstein 1994).
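The dependence of reliability on test length is commonly formalised with the Spearman-Brown prophecy formula. The text does not name this formula, so its use here is an assumption, and the numbers below are hypothetical rather than properties of the MTQ.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`.

    A factor below 1 shortens the test; the formula assumes that the items
    added or removed are parallel to the remaining ones.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical case: halving a scale whose reliability is 0.90.
shortened = spearman_brown(reliability=0.90, length_factor=0.5)
```

In this hypothetical case the predicted reliability falls from 0.90 to about 0.82, which illustrates why item selection for a short form has to be done with care.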

In the context of MS, Hi values or discrimination parameters optimise both predictive validity (through content heterogeneity) and reliability (through test length) (Crișan et al. 2020). In this respect, the present study shows that the psychometric quality of MTQ10 is sufficiently strong to give prompt feedback to teachers. Third, MS, similar to other NIRT methods, measures constructs at the ordinal level (categorical variable). However, the distinction between continuous and categorical variables is not always clear-cut (Tabachnick et al. 2013, pp. 7, 204). Data collected with the MTQ10, which satisfies the MS assumptions, can be treated directly as continuous, which is straightforward and easy to apply in practice. Last, educational settings are intertwined dynamic systems and difficult to disentangle. The necessity of system thinking (see footnote 4) is evident for understanding the basic level of the setting. Research has shown the theoretical and empirical links in this structure (i.e. student engagement relates to teaching quality and ultimately learning outcomes) (Maulana and Helms-Lorenz 2016a; Pianta et al. 2012). Thus, in addition to construct validity, evidence of the predictive validity of MTQ10 for student engagement (Skinner et al. 2009) was also established. The results (Fig. 3) confirm the theoretical structure, with all the items falling into their respective factors with at least three loadings (Zwick and Velicer 1986), with only two exceptions in the PCA results. Table 3 provides descriptive statistics, correlations and the fit parameters for Model 1 and Model 2 (Skinner et al. 2009).

The concatenated psychosocial components in educational settings could be reliable at a certain measurement time, but they might fluctuate over short periods of time. This cross-sectional survey study did not capture this possible fluctuation over time (Akkerman and Bakker 2019; Christenson and Reschly 2012; Downing 2003; Reeve and Lee 2014). Student engagement measures might suffer from the chosen research design by yielding lower Pearson correlation coefficients, while the Corrected Attenuation (CA) (Nunnally and Bernstein 1994) generally showed adequate convergent validity (r ≥ 0.60, Hinkle 2003, see Table 4).

Additionally, as noted by Skinner et al. (2009), more multidimensionality could be present than was identified in their study. This is the most probable reason for the slightly lower values of the fit indices. Considering that validation is a continuing process (Messick 1995), further studies with MTQ10 would benefit from more powerful approaches with instrument batteries, such as the Multitrait-Multimethod Matrix (MTMM) (Campbell and Fiske 1959) and the clinical study design (i.e. retrospective analysis and time series), to understand the puzzle in perceived teaching behaviours more comprehensively.

In summary, understanding student perceptions of teaching behaviour requires advanced knowledge and statistics to analyse large data sets with fine measures. Future research should investigate the response process and how participants benefit from the interaction over time to understand the obstacles to teaching quality. Such research designs should consider how to tackle the procedures that can fluctuate and develop over time while controlling or eliminating the error sources associated with the test takers (Lüdtke et al. 2009; Mainhard et al. 2019). Students’ perceptions of teaching behaviour provide new opportunities for providing real-time feedback to teachers to boost their own teaching behaviours in learning environments or in co-operation with coaches. In this manner, teaching practice, especially on the complex level of teaching behaviours, can benefit from tailored interventions.

The MTQ10 is also subject to limitations. First, perhaps the most important limitation is that the MS procedure for item selection is sample dependent. Although multiple methods were employed in the item-refining process (see Results), the sample was very large and representative of the secondary-school context (see footnote 5) in Turkey, and the MTQ10 performed well in terms of DSF, the results could still be sensitive to the population and learning environment (i.e. laboratory, classroom, outdoor). Moreover, methodologically, Meijer and Egberink (2012) strongly advised that care be taken to investigate the inclusion and exclusion of outliers in the sample because H is sensitive to outliers. This means that researchers should carefully examine the data before performing any analyses. Second, although this study assumed that general teaching behaviours apply to all subjects, the MTQ10 does not cover any subject-specific teaching factors. Hence, investigations of subject-specific didactics could require the inclusion of subject-specific measures. Third, this study ignored possible sources of bias in sampling fluctuation, test takers’ response behaviours, and their perceptions in responses (ITC 2018; Mokken 1971).

In conclusion, particularly because of its strong content validity (Downing 2003; Nunnally and Bernstein 1994), the MTQ10 is a robust and practical measure for assessing perceived teaching behaviours in secondary schools in Turkey. Overall, teaching is traditionally a highly respected profession in Turkish society (Dolton et al. 2018). The profession faces universal transformations (Dijkema et al. 2019; Papanastasiou and Karagiorgi 2019). Today’s teachers need viable collaboration and professional feedback to educate not only future citizens, but also their future colleagues. More practice-oriented training is also anticipated. The MTQ10 has the potential to deepen our understanding of students’ perceptions of teaching behaviour. It can be used to assess, formulate and set tailored interventions. The results of the present study are anticipated to support the teaching profession and contribute to the understanding of teaching behaviours as perceived by students.

Notes

  1. Sijtsma and Molenaar (2002, p. 68) define a Mokken scale as a set of items for measuring a common trait that is determined by reasonable discriminative power c, which is a user-specified value. Reasonable discriminative power is defined by a lower bound c = 0.3 that is not strictly necessary. c = 0 provides interesting information about which items comply with the minimum requirements of the Monotone Homogeneity Model (MHM). Intermediate values of c are between 0.40 and 0.60 (Meijer et al. 2014; Sijtsma and Molenaar 2002, p. 68).

  2. Crit is an effect size measure, a critical value, calculated by combining the coefficient values of ItemH, #ac, #vi, #vi/#ac, maxvi, sum, sum/#ac, zmax and #zsig into a single statistic (Molenaar and Sijtsma 2000, p. 74). If Crit > 80, there is serious doubt about the validity of the model for this item. If Crit < 40, the violations reported could be ascribed to sampling variation. If 40 ≤ Crit ≤ 80, a decision can depend on further consideration of the pros and cons. Crit values provide an idea about the seriousness of model violations in the data analyses (Meijer et al. 2014).

  3. Corrected Attenuation (CA) (Disattenuation) is a statistical procedure developed by Charles Spearman in 1904 to allow researchers to estimate the relationship between two constructs as if they were measured perfectly reliably and free from random errors that occur in all observed measures (Nunnally and Bernstein 1994).

  4. System thinking is the ability to understand how an entire system works; how an action, change or malfunction in one part of the system affects the rest of the system; and adopting a ‘big picture’ perspective on work. It includes judgement and decision-making, system analysis and system evaluation, as well as abstract reasoning about how the different elements of a work process interact (NRC 2010, p. 63–64).

  5. Schools were invited to participate in the study by providing at least 20 teachers, each with one of their classes. Schools with fewer than 20 classes were also invited but, in those cases, all teachers were asked to participate voluntarily.
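The Crit thresholds described in note 2 amount to a three-way decision rule, which can be sketched in Python as follows. The function name `classify_crit` and the wording of the returned labels are assumptions made for illustration; the Crit value itself would come from Mokken scaling software and is not computed here.

```python
def classify_crit(crit):
    """Map a Crit value to the interpretation given in note 2."""
    if crit < 40:
        return "violations attributable to sampling variation"
    if crit > 80:
        return "serious doubt about model validity for this item"
    # Values from 40 to 80 inclusive call for a judgement by the researcher.
    return "further consideration needed"


# Hypothetical Crit values for three items.
labels = [classify_crit(value) for value in (12, 55, 93)]
```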