Teacher specialization and student perceived instructional quality: what are the relationships to student reading achievement?

At an international level, teachers’ work is increasingly circumscribed and regulated. Notions of accountability have shifted from primarily inputs to primarily outcomes, and investment in strengthening teacher performance evaluation has expanded. At the same time, investment in enhancing the quality of teacher education programs is contested in many countries. Occupational professionalism, that is, a traditional, historic form characterized by discretionary decision-making, collegial authority, and trust in the practitioner, has been replaced by organizational professionalism that incorporates target-setting and performance review. The overarching question in this study concerns the meaningfulness and appropriateness of using student perceived instructional quality for the estimation of teaching quality in comparison to teacher specialization. The study investigates relations between fourth grade students’ reading achievement levels, teacher specialization, and student perceptions of instructional quality, based on the Swedish PIRLS 2011 data. Two-level structural equation modeling with latent variables revealed a positive relationship between teacher specialization relevant for the grade and subject taught, and student reading achievement. By contrast, there was no association between student perceptions of instructional quality and student reading achievement, or between instructional quality and teacher specialization. The results raise questions about the benefit of student evaluations of teacher classroom practices from both a validity perspective and a teacher professionalization perspective. However, the cross-sectional data used does not allow for causal inference, and further research on the relationships between teacher specialization, student perceived instructional quality, and student achievement is therefore needed.


Background
In recent decades, research has increasingly emphasized the importance of developing and maintaining a high-quality teaching force. There is now compelling evidence that teachers account for a significant portion of variance in achievement between classrooms (Darling-Hammond 2000; Goldhaber and Anthony 2007; Hanushek 2011; Hanushek and Rivkin 2012; Hattie 2009; Hedges and Greenwald 1996; Kyriakides et al. 2009; Muijs et al. 2014; Nye et al. 2004; Rockoff 2004). Nonetheless, there are conflicting views on how best to guarantee teacher quality. While some researchers argue that investing in high-quality preparation is the most promising approach, others suggest that better facilitating entry to teacher education would attract stronger candidates (Boyd et al. 2009; Ingersoll 2007). Issues about how best to design and organize teacher preparation have become increasingly prominent in undertakings to improve teacher quality. In this debate, there is an important underlying assumption that assessment of teacher quality in the classroom context can filter out poor-quality teachers, and effectively stimulate instructional improvement. However, in a review of existing literature, Hallinger et al. (2014) found little support for the belief that teacher evaluation represents a high-impact school improvement strategy.
In fact, a recurring theme in the instructional and school development literature emphasizes the potential costs and negative consequences of the frequent monitoring of teachers' work. Close surveillance and evaluation of teacher classroom behavior has been linked to teacher stress (Perryman et al. 2011), to de-professionalization (Zeichner 2010), to decreasing attractiveness of the teacher profession (Ingersoll et al. 2016), and to teacher attrition (Borman and Dowling 2008).
An important rationale for the scope of the current study is the increasing focus on efforts to categorize, scrutinize, and estimate teacher instructional behavior (Seidel and Shavelson 2007). The present study explores and compares two measures of teacher quality (one distal, the other more proximal to teaching practice) and their relation to student achievement. Specifically, the study aims to investigate the relations between teacher specialization (defined as teacher education relevant for subject and grade), student perceptions of instructional quality, and fourth grade students' reading achievement.

Occupational and organizational teacher professionalism: opposing framings of teachers' work
In school systems that rely on accountability regimes, specialized teachers with deep subject knowledge and the ability to translate this knowledge into highly effective teaching find themselves under scrutiny. Teacher agency is under increasing threat from national governments, as statutory bodies are tasked with the control of curriculum content, and the assessment of teacher pedagogy and professionalism. Quite simply, reforms have changed what it means to be a teacher (Day 2007).
The concept of professionalization is widely used. It generally denotes an idea that certain occupations move towards strong occupational control (Abbott 1991). Professions are recognized as the mediators and applicators of knowledge in specific domains. The exclusive nature of skills and knowledge means that professionals have a mandate to make their own choices and decisions about proper interventions (Brante 2013). Professionalization has been described as the control of work, of scientific transformation, and the licensing of ethical conduct for the relation between professionals and clients. The classical way of understanding professional work comprises a high level of trust (Lilja 2014).
In recent decades, schools and teachers in many countries have increasingly become subject to evaluation of varying kinds. Quality control and audit are regular elements. In education, professional work is both increasingly controlled and increasingly fragmented. It is characterized by external, low-trust accountability based on private sector strategies (Bottery 2006; Lilja 2014). In addition, the nature of audit culture demands quantification and measurement. Quality auditing builds on measures of performance that, in the main, eschew the use of peers or recourse to expert knowledge. Although schools are held accountable for the quality of their provision, it is often stakeholders beyond the teaching profession who determine what is to be prioritized. Consequently, the trust on which an expert system such as teaching depends is placed at risk (Perry 2006).
Abbott (1991) asserts that professions move in many directions, rather than in a single direction as implied by the term professionalization. Evetts (2006, 2011) distinguishes between occupational and organizational professionalism. Occupational professionalism, according to Evetts, is the more traditional, historical form and includes discretionary decision-making in complex cases, collegial authority, and the occupational control of work. It is based on trust in the practitioner by both clients and employers. Organizational professionalism, on the other hand, incorporates accountability, target-setting, and performance review.
Although national reforms differ in content, direction, and pace, there are important similarities. Day (2007) identifies three factors of particular relevance from a teacher quality perspective. First, reforms are proposed because governments believe that by intervening to change the conditions under which students learn, it is possible to accelerate improvements, raise standards of achievement, and thereby increase economic competitiveness. Second, the reforms result in an increased workload for teachers. Third, the reforms do not always pay attention to teacher identities. These identities are central to motivation, efficacy, commitment, job satisfaction, and effectiveness.
Day (2007) summarizes developments in England and Wales, where schools are subject to market pressures through parental choice of school, greater financial autonomy, evaluation, and target-setting. As outlined by Perryman et al. (2011), since the mid-1980s, teachers in England have lost a degree of relative autonomy, i.e., one in which they were self-accountable through informal reflection and peer review. This relative autonomy has been replaced by loss of control, with stress being a common outcome. From an American perspective, Lavigne (2014) describes how hiring, firing, and tenure-granting policies are increasingly based on teacher evaluations. In a similar vein, Ingersoll (2005) explains how a teacher deficit perspective overlooks the organizational and occupational contexts of the work of teachers, which is characterized by low stature and social standing.
As in other countries, the influence that Swedish teachers have over their own work has diminished substantially. Teacher professionalism has been challenged or replaced by a teacher identity that no longer stresses the importance of specific and well-defined subject knowledge (Stenlås 2009). Teachers are accountable for perceived as well as genuine shortcomings, and risk being "named, blamed, and shamed" (Dovemark and Holm 2017; Lindqvist et al. 2009). Occupational professionalism is contested by the frequent use of quality assessment from outside of the profession. While trust in the quality of teacher education programs and in relevantly prepared teachers indicates a meritocratic view on teacher quality and occupational professionalism, reliance on close monitoring and the assessment of teacher classroom behavior suggests an organizational professionalism that is characterized by a loss of autonomy and trust.

The nature of effective teaching
Much research has been devoted to identifying actions and conditions that affect student outcomes (Seidel and Shavelson 2007). Though categorization varies across studies, a number of teaching process variables have been shown to positively affect student outcomes. Reinforcement, feedback, maximization of learning time, adaptive instruction, cooperative learning, and mastery learning are all examples of instructional strategies that have been found effective (Creemers and Kyriakides 2010; Darling-Hammond and Bransford 2005; Scheerens and Bosker 1997; Seidel and Shavelson 2007; Hattie 2009). It has also been argued that monitoring teacher behaviors is preferable, or indeed necessary, for guaranteeing positive student learning outcomes. In the domain of literacy, Cunningham and Zibulsky (2009) noted that in the past three decades, the field of teacher knowledge has grown considerably, with studies specifically categorizing the knowledge and skills that teachers must acquire and apply. In this context, the development of instruments that can provide reliable and valid estimates of teacher knowledge has received considerable attention. Still, this work is based on the premise that it is possible to examine how the teacher knowledge base is associated with student outcomes, and as a consequence of this, to develop empirically validated best practices. However, solid evidence of the transformation from teacher knowledge into teacher practice is yet to be presented.

The role of teacher education for obtaining high-quality teaching
Despite the lack of compelling evidence of the nature and effects of transformation processes, there is an almost universal quest to improve teacher quality, and therefore also educational quality. With this comes the demand by policy-makers for higher quality teacher education (Imig and Imig 2006). According to Cochran-Smith et al. (2010), teacher education in the USA has begun to shift from preparing highly qualified teachers, to preparing highly effective teachers. This has resulted in a deprioritizing of university-based aspects of education, with an emphasis instead on practice. Notions of accountability have shifted from inputs to outcomes. As Darling-Hammond (2017) has noted, the knowledge base for teaching and the role of universities in preparing teachers is contested. In Australia and the USA, initiatives such as 'Teach for Australia' and 'Teach for America' involve the recruitment of candidates who enter teaching with just a few weeks of pre-service training. Cochran-Smith et al. (2015) suggest that there has been very little examination of alternatively certified teachers' professional preparation in the USA. They are placed in classrooms largely based on the assumption that their previous knowledge, and tightly compressed preparation, is sufficient to equip them for teaching. At the same time, there is growing evidence that expert knowledge is to a large degree attained through formal, high-quality teacher education during which content knowledge, pedagogical content knowledge, and general pedagogical knowledge are acquired (Adamson and Darling-Hammond 2012; Baumert et al. 2010; Croninger et al. 2007; Depaepe et al. 2013; Kleickmann et al. 2013; Ball et al. 2008; Nye et al. 2004). Results indicate that teacher specialization with respect to subject and grade taught is important for effective teaching. Kunter et al. (2013) found that teacher characteristics that were not specific to the profession, such as general academic ability, had no relation to students' mathematics achievement or enjoyment. Instead, domain-specific knowledge had positive effects not only on student achievement, but also on students' motivation. Simply being a smart student, they suggest, does not make somebody a good teacher. Further, effects of qualifications appropriate for grade level and subject can vary depending on subject domain and grade level (Wayne and Youngs 2003). For science and mathematics, and particularly at secondary level, a sizeable proportion of empirical results from Europe and the USA support the importance of specialized teachers (Baumert et al. 2010; Goe 2007). Wayne and Youngs (2003) argue that in the case of degrees, findings about the influence of coursework and certification have been inconclusive for all subjects other than mathematics. One reason for this result may be that mathematics is mostly learned in school, and that outcomes are more sensitive to instruction than is the case for reading (Nye et al. 2004). While studies on the effects of teachers on reading at primary level are few (see, for example, Snow et al. 2005), evidence from the USA indicates an effect of relevant teacher education on reading achievement in lower grades (Clotfelter et al. 2007; Darling-Hammond 2014; Ferguson 1991; Nye et al. 2004). With data for Sweden, Johansson et al. (2015) estimated substantial effects of teacher education relevant for subject and grade on third graders' reading achievement levels, and significant effects of teacher specialization on fourth graders' reading achievement levels have also been found (Myrberg et al. 2018).

The role of student perceived instructional quality for obtaining high-quality teaching
Much endeavor has been dedicated to the development of new models of teacher performance evaluation. "The dynamic model of educational effectiveness" was developed by Creemers and Kyriakides (2008), and has been used to measure teacher performance as perceived by students. Kyriakides et al. (2014) argue that secondary students are able to provide valid data on the classroom behavior of their teachers, and recently, studies have explored the validity of student ratings of instructional quality. The construct validity of ninth grade student ratings of instructional quality was investigated by Wagner et al. (2013). They found that while the structuring of teaching and classroom management could be generalized over classrooms and subjects (English and German languages), student motivation, the clarity of teaching, and the degree of student involvement could not. Similarly, Gaertner and Brunner (2018) investigated stability of student perceptions of instructional quality on class level over subjects, student grade levels, and for specific subjects. Results indicated that student ratings provided measures of teaching constructs that were invariant across time, and for particular subjects, but not by grade levels. It was also suggested that young students may interpret certain item formulations differently than older students.
A few studies have related student ratings of instructional quality to student achievement. In a longitudinal study by Fauth et al. (2014), student ratings of teaching quality were related to science learning among third graders. While classroom management predicted achievement, supportive climate and cognitive activation did not. Here too, the researchers considered student ratings to be useful measures of teaching quality. Panayiotou et al. (2014) investigated relationships between concrete teacher actions in the classroom as reported by students, and student achievement gains in mathematics and science in six European countries. Though student prior achievement had by far the largest explanatory power, teacher behavior accounted for a small but significant part of the variation at student and class level. Results indicated that the student questionnaire was not equally interpreted between countries for all dimensions of teaching quality. The study design did not include data on teacher education.
In a comparative study of Nordic countries, and with large-scale data from TIMSS and PIRLS (mathematics, science, and reading) for fourth graders, Scherer and Gustafsson (2015) found that individual students tended to evaluate the teacher positively in the domains where they had performance strengths, and that student perceptions of how easy the teacher was to understand had significant relations to achievement in reading and maths between classrooms. On the other hand, Blömeke et al. (2016) noted substantial between-country differences in the relationship between student ratings of instructional quality, and fourth grade student mathematics achievement. Similarly, Nortvedt et al. (2016) investigated effects of the quality of teaching as measured by fourth grade students on reading achievement in 34 countries. They too found a largely inconsistent pattern, ranging from significant negative relationships to significant positive relationships. These researchers suggest that the varying sign and strength of the relationship between student assessments of instructional quality and achievement across countries is influenced by response styles and other, as yet unknown factors.

Summary of previous research
A traditional, occupational teacher professionalism, characterized by meritocracy, collegial authority, discretionary decision-making, and a high level of trust, is increasingly contested. In many countries, it is being replaced by an organizational professionalism that emphasizes accountability, target-setting, and performance review. External low-trust accountability is accompanied by a growing interest in reliance on student assessments of teaching quality.
Irrespective of the decreasing attractiveness of the teaching profession stemming from the frequent monitoring of teachers' work, it is important to note that there is little evidence supporting students' abilities to make accurate estimations of teacher quality (and which could potentially be related to student attainment). On the whole, results are inconsistent, and large between-country differences in the relation between student perceptions of instructional quality and student achievement levels are to be found. An overall inconsistency in results means that it is wise to question the reliability of student assessments of teacher effectiveness. Furthermore, due to the large contextual differences between educational systems, country-specific analyses are warranted.
While there is a general agreement on the importance of teachers for student achievement, research on the effects of teacher education is still inconclusive. While some studies indicate that teacher education has little or no impact on teacher effectiveness, others have found it to be positively related to student outcomes. However, the relation between teacher education and student achievement is likely to be subject and grade specific. In particular, there is compelling evidence that teacher preparation is positively associated with student achievement in mathematics, and especially so in upper secondary grades. For teacher effects on reading, the picture remains unclear, though a growing body of research supports the importance of well-educated teachers. It has been suggested that more detailed and precise measures of teachers' education tend to be better predictors.
Few studies have investigated the relation between student perceptions of the quality of teaching, and formal teacher qualifications. Against a backdrop of contradictory opinions on the need for investments in maintaining and developing high-quality teacher education, there is a continuing need to shed light on the effects of teacher education on student achievement. The purpose of this study is therefore to investigate the relationship between teacher specialization, student assessed instructional quality, and student reading achievement. More precisely, the research questions are:
1) To what extent are teacher specialization and student reading achievement related?
2) To what extent are teacher specialization and student assessed instructional quality related?
3) To what extent are student assessed instructional quality and student reading achievement related?

Data and method
The empirical base for the study is Swedish data from the regularly recurring reading achievement study, Progress in International Reading Literacy Study (PIRLS), carried out by the International Association for the Evaluation of Educational Achievement (IEA). PIRLS assesses fourth graders' reading achievement in well over 50 countries.
In the present study, we make use of Swedish data from the 2011 assessment, which contains a number of important add-ons to the international questionnaires. A national extension with additional background questions in the teacher questionnaire provides more detailed information on teachers' education. Students' assessments of the quality of teacher instruction were also expanded in the Swedish design. A representative sample of 4622 Swedish students and 218 teachers participated in the 2011 round of PIRLS.

Variables
The current study uses information from questionnaires from students, parents, and teachers. Student and teacher data was used to explore relationships between indicators of teacher specialization, student ratings of instructional quality, and student achievement. Data from parents served as control variables. In the following sections, the variables used are presented, starting with the teacher variables.

Teachers
As previously mentioned, a national extension in the Swedish PIRLS 2011 teacher questionnaire provided unique data on aspects of teachers' education. Six education variables were considered for the current study: (1) type of teacher education, (2) emphasis on reading pedagogy, (3) preparation in teaching reading comprehension as a part of teacher education, (4) emphasis on Swedish language, (5) number of semesters studying Swedish language, and (6) focus on primary school-years during initial training. These six indicators define the latent variable "TchSpec." Together, they aim to capture specialization towards subject and grade. Cronbach's alpha for the six indicators (standardized) is .72. The variables are presented in Table 1, along with descriptive statistics.
The variable indicating type of teacher education, "Tch_Ed," is based on an item comprising nine different teacher education programs. Mainly as a consequence of several teacher education reforms during recent decades, the type of teacher education varied substantially in the sample. Teacher education programs vary in the degree of relevance for teaching reading to fourth grade students. We therefore categorized the different education programs and recoded them into a dichotomous variable based on a categorization made in a previous study (Myrberg et al. 2018). The first category (code 1) comprises teachers with an education relevant for both subject and grade, that is, teaching reading in fourth grade. The other category (code 0) comprises all other teachers. For example, mathematics and science teachers who held an education directed towards teaching in fourth grade, but were not educated for teaching reading, were assigned to the latter category.
The mean values of the indicators express the proportion for the dichotomized variables. As can be seen in Table 1, for a majority of teachers, education was aimed at the primary level. Further, a large proportion of teachers reported that they took courses with an emphasis on teaching methods. Also, many teachers answered that their education had a major emphasis on reading pedagogy and Swedish language. In addition to the teacher education variables, information provided by teachers on the total number of years of teaching experience, "TchExp," was used as a control in the analyses. In general, teachers were highly experienced, with on average over 16 years of teaching. While many students change teacher between the third and the fourth grade in Sweden, it is notable that more than half of the teachers (54%) indicated that they had taught the same class for two or more semesters. A total of 218 teachers responded to the questionnaire, of whom 84% were female. The average age was around 45 years and 34% of the teachers were in the age group 40-49. The response rates for the teacher competence items were approximately 90%.

Students
Six items from the student questionnaire, whereof one item was a national option, were used to operationalize the latent variable student assessed instructional quality (Instr_qual). A 4-point Likert scale ranging from "agree a lot" to "disagree a lot" was used for these items. Cronbach's alpha for the six items was .79. The variables are presented in Table 2.
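As an illustration of the reliability coefficient reported for these scales, the following minimal sketch computes Cronbach's alpha from item-level responses. The response matrix is invented for illustration and is not PIRLS data; the numeric coding of the four response categories as 1-4 is likewise an assumption.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                      # number of items
    columns = list(zip(*rows))            # one tuple of scores per item
    item_variances = [variance(col) for col in columns]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 6 students x 6 items, coded 1-4 (assumed coding).
responses = [
    [1, 1, 2, 1, 1, 2],
    [2, 2, 2, 3, 2, 2],
    [1, 2, 1, 1, 2, 1],
    [4, 3, 4, 4, 3, 4],
    [3, 3, 2, 3, 3, 3],
    [2, 1, 2, 2, 1, 2],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

The function uses sample variances (n - 1 in the denominator) throughout; the coefficient is unchanged if population variances are used consistently instead.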
A measure of student reading achievement was used as an outcome variable ("ReadAch"). The IEA provides five plausible values (PVs) for each individual's reading ability on a continuous scale (Martin et al. 2003). Based on Item Response Theory (IRT), reading achievement results for all students are placed on a common scale, even though they have not taken all the test items. In IRT methodology, both individual and item attributes are taken into account when modeling a test result, since the individual's achievement is considered to be a latent trait (e.g., Embretson and Reise 2000). IRT does not assume that the test scores include errors of measurement, only standard errors. As recommended by Rubin (1987), five separate analyses should be carried out, one for each PV. Final estimates are obtained by averaging the results from the five runs. Since we performed two-level modeling, standard errors were pooled, taking into account both the between- and within-PV variances.
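The pooling over plausible values can be sketched with the standard combining rules for multiply imputed data. This is a minimal sketch of those rules, not a reproduction of the software's internal computation; the five estimates and standard errors below are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pool_plausible_values(estimates, standard_errors):
    """Combine per-PV estimates and standard errors into one pooled result."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    # Within-PV variance: average of the squared standard errors.
    within = mean([se ** 2 for se in standard_errors])
    # Between-PV variance: sample variance of the m point estimates.
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


# Hypothetical regression coefficients from five separate runs, one per PV.
pv_estimates = [0.21, 0.24, 0.19, 0.23, 0.22]
pv_standard_errors = [0.08, 0.09, 0.08, 0.08, 0.09]

estimate, standard_error = pool_plausible_values(pv_estimates, pv_standard_errors)
print(f"pooled estimate = {estimate:.3f}, pooled SE = {standard_error:.3f}")
```

The pooled standard error is never smaller than the square root of the average squared standard error, because the between-PV term adds the uncertainty that stems from achievement being estimated rather than observed.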
In order to account for stratification, case weights were used. To facilitate analyses with student and teacher data, the IEA has provided weights for each hierarchical level, i.e., student and classroom. Weights at the classroom level are calculated by means of the product between a class weighting factor and a class weighting adjustment, as well as the product between a school weighting factor and a school weighting adjustment. At the student level, weights are a product of the student weighting factor and student weighting adjustment (Foy 2013). In the present study, analyses were carried out using the multiple imputation option in Mplus 8 (Muthén and Muthén 1998-2018). This conveniently generates averaged results.
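The weight construction described above can be sketched as follows. The field names are hypothetical and do not correspond to variable names in the PIRLS database, the numeric values are invented, and the sketch assumes that the classroom-level weight multiplies the school-level product by the class-level product.

```python
from dataclasses import dataclass


@dataclass
class WeightComponents:
    """Hypothetical container for the weighting factors and adjustments."""
    school_factor: float
    school_adjustment: float
    class_factor: float
    class_adjustment: float
    student_factor: float
    student_adjustment: float


def classroom_weight(w):
    # Assumption: school-level product times class-level product.
    return (w.school_factor * w.school_adjustment) * (w.class_factor * w.class_adjustment)


def student_weight(w):
    # Student-level weight within the classroom.
    return w.student_factor * w.student_adjustment


example = WeightComponents(
    school_factor=10.0, school_adjustment=1.1,
    class_factor=2.0, class_adjustment=1.0,
    student_factor=1.0, student_adjustment=1.05,
)
print(classroom_weight(example), student_weight(example))
```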

Parents
Information on students' socioeconomic background (SES) was obtained from the home questionnaire completed by parents or guardians. Five indicators were used. One item indicates parents' estimation of the economic situation of the family in comparison with other families on a five-point scale, ranging from very well-off to not at all well-off. This item was a Swedish national option. Two items indicate the number of books in the household: the total number of books other than children's books (on a five-point scale ranging from 0 to 10 books to more than 200 books), and children's books only (on a five-point scale ranging from 0 to 10 books to more than 100 books). The item indicating mother's occupational status was reported on a five-point scale: remunerated work at least full-time, remunerated work part-time only, unremunerated work, other, and not applicable. It was decided not to use fathers' occupational status as the distribution in that variable was uneven and affected by ceiling effects, with 87% indicating working full-time. It should also be noted that the home questionnaire was mostly completed by mothers or female guardians (82%, versus 33% for fathers; some had completed the questionnaire together). Finally, an indicator of mother's and father's education was used. The question was stated "What is the highest completed education by the mother and father respectively?" Seven alternatives (1-7), ranging from less than 9 years of compulsory schooling to post graduate education (PhD) were provided. Parents' education, "ParEd," was computed as a mean score of both parents' education. For example, if a student's mother held a PhD, and the father a bachelor degree, the estimated "score" is 6.5. Cronbach's alpha for the five SES indicators was .70.

Further controls
Additionally, student language background was used as a control variable. This item related to how often Swedish was spoken at home. Three alternatives were given: "always or almost always," "sometimes," and "never." This item was coded 1-3, where 3 indicated "always or almost always." As shown in Table 2, most students often spoke Swedish at home. PIRLS is a cross-sectional study that does not offer any pre-test of student achievement. However, parents have estimated their children's knowledge of written language before school start. This makes it possible to at least partly control for prior knowledge levels, and adds to the information on student background. By taking this information into account, the reliability of estimations of teacher effects increases.
To take students' early literacy skills into account, five indicators were retrieved from the parent questionnaire. On a four-point Likert scale (1 = Very well, 2 = Moderately well, 3 = Not very well, 4 = Not at all), parents estimated how well students could (1) recognize letters, (2) read words, (3) read sentences, (4) write letters, and (5) write words at the time of school start. Typically, this would be 3 years before the PIRLS assessment. Cronbach's alpha for this scale was .91 (Table 3).

Analytical approach
The main method of analysis is multilevel Structural Equation Modeling (SEM) with latent variables (e.g., Hox 2002). Compared to single indicator approaches, analyses that use latent variables have advantages in large samples, both theoretically and technically (Kline 2016). Since they are not directly observable, concepts within educational science are often difficult to frame with a single indicator. However, with latent variables, theoretical constructs can be represented in a more comprehensive manner. Moreover, latent variables are in a technical sense free from measurement errors. This is because the variance in the indicators that is not explained by the latent factor is separated out. Thus, the variance of the indicators is divided into one "true" part (the common variance captured by the latent variable), and one unexplained part (which is due to measurement error or score unreliability). From a validity point of view, this enables an analysis of better quality than would be the case in, for example, multiple regression analysis (MRA). In MRA it is assumed that all predictors are measured without error, which is rarely the case.
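The separation of common variance from measurement error can be written compactly with a one-factor measurement model. The notation below is introduced here for illustration and is an assumption, not the paper's own: $x_j$ is the $j$-th observed indicator, $\xi$ the latent variable, $\lambda_j$ the factor loading, and $\delta_j$ the residual, assumed uncorrelated with $\xi$.

```latex
\begin{align}
  x_j &= \lambda_j \, \xi + \delta_j , \qquad \operatorname{Cov}(\xi, \delta_j) = 0 , \\
  \operatorname{Var}(x_j) &= \underbrace{\lambda_j^{2} \operatorname{Var}(\xi)}_{\text{common (``true'') part}}
    + \underbrace{\operatorname{Var}(\delta_j)}_{\text{unexplained part}} .
\end{align}
```

Structural relations are then estimated between latent variables such as $\xi$, so that the residual variances $\operatorname{Var}(\delta_j)$ do not attenuate the estimated relationships.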
Educational assessment data often has a nested observational structure, e.g., students are clustered in classrooms. This means, for example, that students' reading achievement scores within a classroom are not independent. The shared experiences and common influences among students within a classroom (same teachers, same classroom climate, peer effects) are not repeated in any other classroom, and this dependence needs to be handled statistically. Assuming independence in analyses of hierarchical data would underestimate the standard errors. This in turn would lead to a too frequent rejection of the null hypothesis (Kline 2016). Hierarchically structured data is difficult to analyze; however, beginning in the 1980s, appropriate analytical methods have been developed through extensions of the basic regression model (e.g., Muthén 1989, 1991). Still, it was not until the early twenty-first century that computer programs with built-in capabilities for handling hierarchical data were developed. The analyses were conducted using SPSS version 24 and Mplus version 8 (Muthén and Muthén 1998-2018).
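The consequence of ignoring clustering can be made concrete with the intraclass correlation (ICC) and the design effect. The sketch below is an illustration only, not part of the analyses reported here: it uses the one-way ANOVA estimator of the ICC and the common approximation design effect = 1 + (m - 1) * ICC for clusters of average size m. Function names and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import mean


def icc_oneway(groups):
    """One-way ANOVA estimate of the intraclass correlation.

    `groups` is a list of lists, one list of scores per classroom.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective cluster size, which also handles unequal classroom sizes.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


def design_effect(icc, cluster_size):
    """Approximate variance inflation caused by clustering."""
    return 1 + (cluster_size - 1) * icc


# Hypothetical values: ICC of .20 and 21 students per classroom.
deff = design_effect(0.20, 21)
# Standard errors computed under independence would be too small
# by roughly this factor.
se_inflation = sqrt(deff)
print(deff, round(se_inflation, 2))
```

Under these assumed values, standard errors that ignore the classroom structure would be too small by a factor of about two, which is why tests based on them reject the null hypothesis too often.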

Model evaluation
For several decades, there has been a discussion around how to best assess model fit in SEM. One reason for the long-standing debate is that there are no golden rules or explicit cut-off values that indicate whether or not a model should be rejected or accepted (Bentler 2007; Fan and Sivo 2007; Goffin 2007; Markland 2007; Marsh et al. 2004). A reason for the lack of strict cut-off rules is that these values depend on factors such as the types of factor structures, sample sizes, and the size of factor loadings. In the current study, the χ² goodness-of-fit test was used. Considering that the χ² is sensitive to sample size, it was combined with three other fit indices. RMSEA (Root Mean Square Error of Approximation) takes both the number of observations and free parameters into account. An RMSEA value of 0.05 indicates a close fit, while a value of 0.08 has been suggested as acceptable (Loehlin 2004). The CFI (Comparative Fit Index) is a fit index that depends on the average size of the correlations in the data. CFI should be as close to 1.0 as possible, and 0.95 is considered an acceptable value. SRMR (Standardized Root Mean Square Residual), which is a measure of residual correlations computed separately for within and between levels, was also used. It has been suggested that the value of SRMR should be 0.08 or less for the model to be accepted (Brown 2006; Hu and Bentler 1999). When interpreting these cut-off values, it should be cautioned that these guidelines may not completely apply to multilevel SEM, as they have mainly been studied when SEM has been carried out using data at a single level.
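To show how the indices above relate to the χ² statistic, the following sketch computes RMSEA and CFI from χ² values using their textbook single-level formulas and compares them with the guideline values cited in the text. It is an illustration under stated assumptions: the function names and input values are hypothetical, some programs use N rather than N - 1 in the RMSEA denominator, and multilevel software may compute the indices differently.

```python
from math import sqrt


def rmsea(chi2, df, n):
    """Textbook RMSEA from the model chi-square, its df, and sample size n."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative Fit Index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model, 0.0)
    if d_baseline == 0.0:
        return 1.0
    return 1.0 - d_model / d_baseline


def meets_guidelines(rmsea_value, cfi_value, srmr_value):
    """Check the cut-offs named in the text (guidelines, not strict rules)."""
    return {
        "RMSEA <= 0.08": rmsea_value <= 0.08,
        "CFI >= 0.95": cfi_value >= 0.95,
        "SRMR <= 0.08": srmr_value <= 0.08,
    }


# Hypothetical fit results; NOT values from the present study.
r = rmsea(chi2=30.0, df=20, n=401)
c = cfi(chi2_model=30.0, df_model=20, chi2_baseline=520.0, df_baseline=20)
print(round(r, 3), round(c, 3), meets_guidelines(r, c, srmr_value=0.04))
```

Because χ² - df is divided by a term that grows with the sample size, RMSEA is less affected by large samples than the χ² test itself, which is one reason the indices are reported together.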

Procedure
In the first step of the analysis, measurement models were formulated. Latent variables for teacher specialization (TchSpec) and student assessed instructional quality (Instr_qual) were formulated. As regards student socioeconomic background (SES) and student early literacy activities (EarlyLit), mean scales were used. This was because latent measurement models did not have an acceptable fit. The measurement models are presented in the "Results" section (Figs. 1 and 2). The latent teacher specialization variable was fitted at teacher level only. However, the other variables could be formulated at both student and teacher level (classroom means). Due to the many categorical items in the variable TchSpec, the WLSMV (robust weighted least squares) estimator was required. Hence, the model was fitted at one level only. This was because Mplus 8 does not allow WLSMV for two-level modeling. In a next step, and in order to facilitate two-level analyses with the teacher data, factor scores of TchSpec were saved and merged to the student level dataset. Since PIRLS data has a nested observational structure, with teachers being linked to groups of students, we could assign the factor scores of individual teachers to their students. The zero-order correlations for the variables used at the teacher level are presented in the table below (Table 4).
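The merging step described above can be sketched as follows. This is a minimal illustration of attaching a teacher-level factor score to each student record and of forming classroom means; it does not reproduce the SPSS and Mplus workflow of the study, and all identifiers, field names, and values are hypothetical.

```python
from collections import defaultdict


def attach_teacher_scores(students, teacher_scores, key="teacher_id", name="TchSpec"):
    """Return copies of the student rows with the teacher-level score attached."""
    merged = []
    for row in students:
        new_row = dict(row)
        # Students whose teacher has no saved factor score get None.
        new_row[name] = teacher_scores.get(row[key])
        merged.append(new_row)
    return merged


def classroom_means(students, variable, key="teacher_id"):
    """Aggregate a student-level variable to classroom (teacher) means."""
    by_class = defaultdict(list)
    for row in students:
        by_class[row[key]].append(row[variable])
    return {cls: sum(vals) / len(vals) for cls, vals in by_class.items()}


# Hypothetical factor scores saved from a teacher-level measurement model.
teacher_scores = {"T01": 0.42, "T02": -0.31}

# Hypothetical student records linked to their teachers.
students = [
    {"student_id": 1, "teacher_id": "T01", "read_ach": 540.0},
    {"student_id": 2, "teacher_id": "T01", "read_ach": 560.0},
    {"student_id": 3, "teacher_id": "T02", "read_ach": 500.0},
]

merged = attach_teacher_scores(students, teacher_scores)
print(merged[0]["TchSpec"], classroom_means(students, "read_ach"))
```

Every student taught by the same teacher receives the same score, so the merged variable varies only between classrooms, which is what allows it to be used as a between-level predictor in the two-level models.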

Structural modeling
In a next step, we ran structural models according to a stepwise procedure. The first models sought to determine the relationships between TchSpec, Instr_qual, and student achievement. The purpose was to investigate whether the more specialized teachers taught higher achieving classes, but also to see if students perceived the more relevantly educated teachers as providing better instructional quality. Thereafter, the explanatory variables, teacher experience, language background, SES, and early literacy abilities were introduced in the model one by one. Because the more specialized teachers may have been clustered together with groups of students with more advantageous backgrounds, we used SES and language background as controls. Furthermore, the students' early literacy abilities (EarlyLit) were used as a proxy for students' prior achievement. As a next step, and in order to take into account any differential effects, we introduced a number of interactions.

The rationale and procedure for testing interactions
A set of cross-level interactions was tested in order to explore any possible compensatory effect of specialized teachers. In other words, we ran tests to see if teachers with a more relevant education had a differentiated impact with respect to student SES. An interaction between achievement and student assessed instructional quality was also tested to investigate whether high or low student achievement levels had a relation to students' estimations of the quality of instruction. In such an interaction, the slopes and intercepts are assumed to vary across classrooms, and are thus specified as random latent variables at the teacher level. These specifications correspond to cross-level interactions where the regression of achievement on SES at the student level varies as a function of teacher competence at the between-level part of the model. The results from these tests are presented under the heading "Structural models" in the "Results" section.
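In generic multilevel notation (e.g., Hox 2002), a cross-level interaction of this kind can be written as below for student i in classroom j. The symbols are assumptions introduced for illustration and do not reproduce the exact Mplus specification used in the study, which also included latent variables and further covariates.

```latex
\begin{align}
\text{Read}_{ij} &= \beta_{0j} + \beta_{1j}\,\text{SES}_{ij} + e_{ij}
  && \text{(student level)} \\
\beta_{0j} &= \gamma_{00} + \gamma_{01}\,\text{TchSpec}_{j} + u_{0j}
  && \text{(random intercept)} \\
\beta_{1j} &= \gamma_{10} + \gamma_{11}\,\text{TchSpec}_{j} + u_{1j}
  && \text{(random slope)} \\
\text{Read}_{ij} &= \gamma_{00} + \gamma_{01}\,\text{TchSpec}_{j}
  + \gamma_{10}\,\text{SES}_{ij}
  + \gamma_{11}\,\text{TchSpec}_{j}\,\text{SES}_{ij}
  + u_{0j} + u_{1j}\,\text{SES}_{ij} + e_{ij}
  && \text{(combined)}
\end{align}
```

Here the coefficient \gamma_{11} carries the cross-level interaction: a negative value would indicate that the SES slope is flatter in classrooms with more specialized teachers, that is, a compensatory effect. The variance of u_{1j} expresses how much the SES slope differs between classrooms; if it is close to zero, there is little slope variation for teacher variables to explain.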

Results
The first step in the modeling procedure was to fit the two latent measurement models to the data. In Fig. 1, the teacher education variable (TchSpec) is presented with its factor loadings. Six indicators formulated the latent variable TchSpec. A covariance was included between the residuals of "T_pedag" and "Tch_ed," and "T_primary" and "T_semest." The covariance between the error terms was negative, which shows that they share unique variance which is not absorbed by the latent factor, where high values in one of the variables correspond to low values in the other variable. The negative covariance between the indicators "T_primary" and "T_semest" is reasonable, as teacher education programs for primary teachers are usually shorter than programs directed towards middle school. Nevertheless, since both these variables were considered important for reasons of construct validity, they were kept in the model of TchSpec. The model obtained good fit to the data. Fit statistics for all the presented models are displayed in Table 3. The factor loadings for the latent variable TchSpec are all significant, and are moderate to high. Factor scores were saved for the latent variable and merged onto the student level data set, thus creating a continuous variable used in the further analyses applying two-level modeling.
Fig. 2 Measurement model of student assessed instructional quality. All factor loadings are significant at p < .01
Table 4 Zero-order correlations for variables at between level (teacher level)

In the next step, student assessed instructional quality was modeled using seven indicators from the student questionnaire. The measurement model is presented in Fig. 2.
The measurement model of student assessed instructional quality "Instr_qual" could be formulated at two levels. At teacher level, the classroom averages for the items are estimated. The factor loadings are all significant and fairly high, especially so at the between level. Fit statistics for the latent variables are presented in Table 5.

Structural models
In the next step, we investigated if (1) teacher specialization (TchSpec) was related to student reading achievement levels, if (2) teacher specialization was related to student assessed instructional quality, and if (3) student assessed instructional quality was related to student reading achievement levels. In the first model (model 1), a positive relation between teacher specialization and student achievement of .19 was found. In model 2, student assessed instructional quality was introduced both as a dependent and an independent variable. Unexpectedly, student assessed instructional quality was uncorrelated with teacher specialization. Further, no significant relation was found between instructional quality and student reading achievement. Consequently, students (or classrooms) with higher achievement did not rate their teachers higher. In model 3, teaching experience (T_exp) was introduced as an explanatory variable. While the previously estimated coefficients remained about the same, "T_exp" had a positive relation to Instr_qual of .21. This indicates that the students taught by more experienced teachers perceived them as providing better instructional quality. Notably, with the effect of teacher specialization under control, the relation between experience and student achievement was zero. The correlation between "T_exp" and TchSpec was modest, albeit statistically significant, amounting to .18. In model 4, students' early literacy abilities were accounted for. While "EarlyLit" had a substantial positive relation with student achievement, it did not have any influence on the relation between teacher specialization and achievement. In model 5, SES was taken into account, mainly as a control for potential selection effects (well-educated teachers clustered together with students from more advantageous backgrounds).
When SES is introduced in models linking measures of teacher competence to student achievement, it can often be anticipated that the SES effects overshadow any other effects. Interestingly, in this model, the effect of TchSpec on "Read ach" was reduced, but nevertheless remained significant at p < .10. The relation between "T_exp" and Instr_qual did not change when "SES" was accounted for. In model 6, we included "Lang" in the model. However, language background seemed to be confounded with both "EarlyLit" and "SES," since no significant effect could be observed for any of these three variables. Therefore, in order to avoid multicollinearity, and to shed more light on the relationship between "Read ach," "SES," and "Lang," in model 7, "EarlyLit" was deleted from the model. It could be noted that "Lang" did not have an influence which went beyond "SES." In line with previous evidence, "SES" had a strong relation to achievement. Results reported for the teacher level are presented in Table 6. In order to test the robustness of results, we used two modeling approaches. First, we ran ordinary regression analysis using cross-products on an aggregated teacher level data set. However, multicollinearity was observed, and the accuracy of estimates was not deemed trustworthy. Then, we tested a set of interactions in a multilevel model (see Hox 2002). Here, random slopes were specified for the following relations: Read_ach on Lang; Read_ach on SES; Read_ach on EarlyLit; Read_ach on Instr_qual. Thereafter, the between-level variables TchSpec and T_exp were used to investigate possible interaction effects. However, there was no significant variation in the slopes, and no changes occurred when we related the teacher variables to the slopes. This indicates that relations did not vary between classrooms.
For example, the effect of SES on reading achievement was similar across classrooms. It should be noted, though, that the number of clusters in this study might have been too limited to identify differential effects with the use of the random slope technique. In summary, the main result from the relationships tested in a series of structural models is that teacher specialization was linked to Swedish fourth grade students' reading achievement levels, while student perceptions of instructional quality were not.
Table note: Standardized coefficients and their standard errors (in parentheses). * Coefficients are significant at p < .05

Discussion and conclusion
This study explored and contrasted two measures of teacher quality and their respective relations with fourth grade students' reading achievement levels. A Swedish national extension made possible the use of a number of indicators of teacher specialization with respect to preparation for the subject and the grade taught. A positive, significant relation between teacher specialization and student achievement was observed, which is in accordance with both previous results from Sweden (Myrberg 2007; Johansson et al. 2015; Myrberg et al. 2018), and a substantial amount of international research (e.g., Adamson and Darling-Hammond 2012; Kleickmann et al. 2013; Nye et al. 2004).
Teacher specialization is a distal measure of teacher quality. Student perception of instructional quality is, however, a proximal measure, and one increasingly used to evaluate teacher performance. It has been suggested that young students are well placed to evaluate the qualities of teaching, and particularly so with regard to aspects of classroom management (for example, management of time and disorderly student behavior) (Fauth et al. 2014; Kyriakides et al. 2014; Panayiotou et al. 2014). We used student questionnaire data covering a range of aspects of instructional quality, and related it to student achievement. However, contrary to some previous research, no association between student perceptions of instructional quality and achievement was detected. It should, though, be noted that student perceived instructional quality as measured in the present study was intended to capture cognitive activation, academic focus, and clarity of instruction, rather than aspects of classroom management and student misbehavior.
Comparative studies have concluded that, as regards the relationship between students' perceptions of instructional quality and achievement, results are generally inconclusive (Blömeke et al. 2016;Nortvedt et al. 2016). In particular, current knowledge on the predictive power of primary school students' perceptions is limited. As pointed out by Nortvedt et al. (2016), a highly inconsistent pattern can be observed between countries in the association between fourth graders' reading achievement and their perceptions of instructional quality, with the causes of between-country differences poorly understood.
The results of the current study support the idea that while teacher specialization can be linked to effective teaching practices affecting student achievement, students might not be able to recognize important aspects of instructional quality influencing achievement levels.

Implications
Our results have implications for practice, policy, and research beyond the Swedish context. As Hallinger et al. (2014) have noted, the broadening consensus on the importance of teaching quality has emerged during an era of expanding educational accountability. It is an era too where a number of negative effects emanating from the frequent measurement and surveillance of teacher classroom behavior, such as for example, stress and decreased job satisfaction, are also recognized (Zeichner 2010).
It has previously been proposed that proximal measures of effective teaching practices, such as student perceptions of instructional quality in classrooms, are preferable, or indeed necessary, in studies of instructional effectiveness (Seidel and Shavelson 2007). Kyriakides et al. (2014) have, for example, suggested that student questionnaire data can be used to identify individual teachers' professional needs, and to guide the development of area-specific courses for each teacher and school. In this way, data can be used for individual school improvement.
However, the absence of an association between student perceptions of instructional quality and achievement in this study raises questions about the meaningfulness or appropriateness of using (younger) students' accounts of teaching quality to evaluate teacher effectiveness. Instead, we would like to point to the potential value of teacher specialization, a measure more distal to classroom practice. The basis of this argument is twofold. First, a growing body of research demonstrates substantial, positive effects of teacher education in general, and teacher specialization in particular, on student achievement. Second, the close surveillance and constant evaluation that teachers experience in many countries is associated with distrust and de-professionalization. A working environment where student evaluations are used to measure teacher quality increases the risk of teachers adopting strategies intended to satisfy student opinions, rather than strategies based on long-term, didactical deliberation (Perryman et al. 2011; Dovemark and Holm 2017). In order not to further damage the attractiveness of the teaching profession, and not to further erode teacher professionalization, distal measures of teacher quality may be preferable if basic reliability criteria are met. If teacher quality can be assured by employing teachers who are appropriately educated for the job, it would positively affect the possibilities for teachers to formulate and further develop quality standards in collegial collaboration. Development of effective teaching practices could thus be redirected to the teacher community, and the (external) influence of other stakeholders could be restricted. This would probably strengthen professional claims, and likely increase the attractiveness of the teaching profession, as well as the quality of education (Borman and Dowling 2008; Cochran-Smith et al. 2010; Ingersoll et al. 2016).
Bottery (2006) highlights a need for education professionals to add their voice in a larger dialogic project. We would like to add that in the complex professional work of teaching, it is not possible or even desirable to make all the theoretical and empirical considerations underpinning actions visible, transparent, and auditable. Expert teachers use a wide variety of approaches, tools, and methods that neither can nor should be totally transparent or entirely understandable to others. Instead, well-educated and specialized teaching professionals should be given the opportunity to develop successful teaching practices collegially, and in forms characterized by autonomy and trust.

Limitations and suggestions for future research
The classroom was the primary level of interest in this study, and two-level analysis was used. This approach takes account of the variation in individual student background characteristics likely to interfere with teacher effects. Ideally, the organization of students within classes within schools should be accounted for by means of three-level analysis. However, the sampling procedure of PIRLS in Sweden does not allow for such analyses, as only some 20 of the schools included in the study participated with more than one classroom. Restriction of analysis to two levels could have affected results. This is because the aggregated knowledge-level of the teachers in a particular school can be anticipated to exert an influence that extends beyond the individual classroom. This having been said, a substantial body of previous research suggests that the variation between classrooms is likely to be far more significant than the variation between schools (Hill and Rowe 1996;Luyten et al. 2005).
When interpreting results, it should be considered that teacher effects are likely to vary across grades and subjects. Luyten (2003) found support for larger teacher effects in primary school than in secondary school. Also, previous research indicates that mathematics seems more sensitive to classroom instruction than reading, probably because reading skills are more likely to also be acquired outside of school (Clotfelter et al. 2007; Goe 2007; Nye et al. 2004). Teacher effectiveness is most certainly subject- and grade-specific; it might also be differentiated with respect to student characteristics. There is still a relatively small number of studies that have investigated teacher effects on achievement. Consequently, further research on this relationship would be of significant value.
The cross-sectional design of PIRLS does not allow for causal inferences to be made, as prior knowledge cannot be accounted for in the estimation of the strength of relationships between independent and dependent variables. Nevertheless, we were able to control for student language and social background, accounting for effects of social selection. In a further attempt to control for student prior knowledge, information provided by parents on students' language skills when formal schooling started was included. A significant portion of variance could be accounted for in this way, thus strengthening the validity of estimated effects.
Ideally, studies investigating teacher effects would employ rigorous designs. As the resources needed for experimental and longitudinal designs aimed at studying the effects of teacher education are restricted, the possibilities attaching to large-scale international comparative studies should be highlighted. Especially at country level, school systems can to a large extent function as their own controls, as many contextual factors within school systems do not change substantially over time. Questionnaires could therefore be further developed to gather more precise and detailed measures of teacher education, its length, and content. We believe that research would benefit from this type of development. It is imperative that increased knowledge is generated about these important features of effective teachers, and the role that teacher specialization has for student achievement.
Funding This research was supported by grants from the Swedish Research Council (grant number 721-2013-2207).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.