Research in Higher Education, Volume 54, Issue 8, pp 825–850

Experimental Effects of Student Evaluations Coupled with Collaborative Consultation on College Professors’ Instructional Skills

  • Mariska H. Knol
  • Rachna in’t Veld
  • Harrie C. M. Vorst
  • Jan H. van Driel
  • Gideon J. Mellenbergh

Abstract

This experimental study concerned the effects of repeated students’ evaluations of teaching, coupled with collaborative consultation, on professors’ instructional skills. Twenty-five psychology professors from a Dutch university were randomly assigned to either a control group or an experimental group. During their course, students evaluated them four times, immediately after a lecture (a class meeting in which lecturing was the teaching format), by completing the Instructional Skills Questionnaire (ISQ). Within 2 or 3 days after each rated lecture, the professors in the experimental group were informed of the ISQ results and received consultation. Each consultation, three in total, resulted in a plan to improve their teaching in the next lectures. Controls received neither their ISQ results nor consultation during their course. Multilevel regression analyses showed significant differences in ISQ ratings between the experimental group and the control group, specifically on the instructional dimensions Explication, Comprehension and Activation. In addition, the impact of each of the three consultations and differences between targeted and non-targeted dimensions were analyzed. This study complements recent non-experimental research on a collaborative consultation approach with experimental results, in order to provide evidence-based guidelines for faculty development practices.

Keywords

Faculty development · Students’ evaluations of teaching · Collaborative consultation · University teaching · Lecturing · Feedback

Introduction

Interventions employed to improve university teaching include (1) workshops, seminars and programs; (2) consultation; (3) instructional improvement grants; (4) resources such as newsletters, manuals or sourcebooks; and (5) colleagues helping colleagues (Weimer and Lenze 1997). Previous reviews of the effects of these interventions all stress the importance of more research in this field employing more rigorous designs (Levinson-Rose and Menges 1981; Prebble et al. 2004; Stes et al. 2010; Weimer and Lenze 1997). The aim of the present study is therefore to present an experimental study, in which we focus specifically on the effects of individual consultation based on students’ evaluations of teaching (SET consultation). Individual peer or expert consultation is one of the most commonly used interventions in faculty development and support (Knapper and Piccinin 1999; Penny and Coe 2004; Prebble et al. 2004). Based on the available research, Lenze (1996) identified consultation as an instructional development strategy preferable to the other approaches stated above. It has been shown to considerably increase the impact of student ratings on teaching practices (Menges and Brinko 1986; Penny and Coe 2004; Weimer and Lenze 1997). On the other hand, the variation in effect size and in actual procedure (implementation) is large (Cohen 1980; Menges and Brinko 1986; Penny and Coe 2004). As SET consultation is a widely used but relatively expensive intervention, we need to know more about the effects of specific models and procedures in order to guide current faculty development practices in the field.

In terms of consultation models, we investigated a collaborative approach to consultation on the instructional skills of professors at a Dutch university. In terms of consultation procedures, we studied the effects of SET consultation during a course, instead of at the end of the course, to assess the role of the timing of the feedback and intervention. Students in this study rated four specific lectures (class meetings in which lecturing was the teaching format) during the course, to improve the specificity, comparability, and quality of the feedback. We studied the specific effects of one intermediate consultation and the additional effects of a second and third intermediate consultation on student ratings. We investigated these issues in a two-group experimental design and used multilevel regression analyses to take random effects into account. At the time of their participation, the professors had not approached, and were not involved with, a teacher-training center with the aim of improving their teaching. This implies that the effects were investigated among professors in general, rather than among professors who were particularly motivated to change. First, we provide an overview of previous research on the effects of SETs and SET consultation, and a theoretical framework for a collaborative approach to consultation.

Research on SETs and SET Consultation

At many universities, course evaluations based on students’ evaluations of teaching are common practice. One of the main purposes of collecting SETs is to provide professors with feedback so they can improve their teaching practices. As with all learning processes, feedback is considered one of the most powerful tools to achieve progress (Hattie and Timperley 2007). SETs provide a unique perspective on teaching practices, and have proven to be valid and reliable in many different settings (Marsh 2007b).

Despite the effort that goes into collecting SETs for every course at the end of every term, SETs do not often improve teaching practice (Kember et al. 2002). Over a period of 13 years, Marsh and Hocevar (1991) and Marsh (2007a) found no improvement in the teaching effectiveness of 195 faculty members, as judged by their student ratings. This implies that simply collecting student ratings does not automatically help to improve the quality of teaching (see also Richardson 2005).

The lack of a positive effect of course evaluation systems on teaching practices may be understood in the light of the basic rules of effective feedback. Specifically, feedback should be well-timed, specific, reliable, and should concern changeable behavior (McLaughlin and Pfeifer 1988). Evaluations provided at the end of term are arguably ill-timed, as they do not provide professors with an immediate opportunity to benefit from this feedback. Furthermore, course evaluations often contain mainly unspecific items (e.g., “rate your professor”), which do not provide concrete feedback and serve merely as a general monitor of teaching quality.

Still, well-timed feedback alone is insufficient. Multiple studies and meta-analyses have shown more improvement in student ratings with mid-term feedback compared to end-of-term feedback, but the effects are small according to both short- and long-term analyses (Cohen 1980; Lang and Kersting 2007; Menges and Brinko 1986). Besides the quality and timing of the feedback, the fundamental validity issue in student evaluations concerns the interpretation and use of the data (McKeachie 1997). Theall and Franklin (2001) found that faculty often misinterpret, misunderstand, or misuse SETs, and that consequently SETs seldom contribute to actual improvements in teaching. When SETs are augmented with consultation (SET consultation), the effects are considerably larger compared to (mid-term) feedback alone (Cohen 1980; Menges and Brinko 1986; Penny and Coe 2004). According to Penny and Coe’s (2004) meta-analysis based on 11 studies, SET consultation resulted in a weighted mean effect size of 0.69. Menges and Brinko (1986) found an even larger average effect size of 1.1.

However, the variation in effects and procedures is large. The confidence interval in the study by Penny and Coe ranged from 0.43 to 0.95, which suggests considerable variation in the effectiveness of SET consultation. Furthermore, the results of Marsh and Roche (1993), who found no effect of mid-term SET consultation, were excluded from that meta-analysis. The range in effectiveness was also noted by Menges and Brinko (1986), who found effect sizes ranging from 0 to 2.5.

Penny and Coe (2004) attempted to identify the factors that contributed to successful SET consultation. They could not find clear, statistically significant differences amongst consultation approaches, due to the small number of experimental studies available. They concluded: “… the most robust finding may be that more research is needed” (p. 236). “Considerably more research on the effects of consultative feedback in settings other than North America is sorely needed. The sample for our meta-analysis was too small to provide adequate statistical power to demonstrate clearly the effectiveness of consultation or to identify moderator variables… Although our review uncovered some strategies that may be important for consultative feedback, there is need for research that directly assesses the effects of these strategies.” (p. 248)

Two additional issues underline the importance of more experimental research in this field. First, the large majority of studies in Penny and Coe’s meta-analysis were published in the period 1975–1986. Exceptions are Marsh and Roche’s study from 1993 and one study by Hampton and Reiser, published in 2004. The results of these two studies on the effects of SET consultation are inconsistent: Hampton and Reiser did find effects of mid-term consultation, whereas Marsh and Roche found effects only for end-of-term consultation, not for mid-term consultation. Additional experimental research is required to identify the specific effects of specific procedures, and ultimately to formulate guidelines for faculty development practices in current university settings.

Second, most studies in the 1975–1986 period could not take the multilevel structure of the data into account by means of multilevel regression modeling. Current statistical software allows us to analyze student rating data while taking variance at different levels into account. Failure to take the multilevel structure into account often results in incorrect conclusions (Snijders and Bosker 1999; Hox 2002).

In the past decade, a few non-experimental studies on SET consultation practices were published (e.g., Rindermann et al. 2007; Dresel and Rindermann 2011; Piccinin et al. 1999). These studies showed positive results, but varied in procedures. Rindermann et al. (2007) used one mid-term collaborative consultation in a private school for speech therapy, and found a medium effect on their total instructor scale with all faculty (16 in total) included in the analyses (without the three best faculty they found a large effect). Piccinin et al. (1999) used data from participants who had approached a teacher training centre to improve their teaching. Faculty members were assigned to three different interventions based on their needs: SET consultation; SET consultation plus observation; and SET consultation plus observation and student consultation. Results showed positive, but different, patterns of increase in ratings over time in each group. Dresel and Rindermann (2011) provided 12 German faculty members (teaching 98 courses over a period of 2 years) with SET consultation in the first year, and found moderate to large effects. In this study, however, the intervention lasted a full day. The study demonstrated that the effects of SET consultation generalize to other courses. Using multilevel analyses, the authors controlled for potential bias and unfairness variables at the professor, course and student level.

The results of these recent non-experimental studies are important: they have good external validity, they address potential biasing variables, and they examine effects in non-English-speaking countries for the first time. Dresel and Rindermann (2011) also illustrated the importance of using appropriate multilevel procedures. Whether non-experimental results are due to the intervention, however, remains an open question. Non-experimental designs suffer from potential selection bias and are often open to alternative explanations of the results (such as the Hawthorne effect). Dresel and Rindermann (2011) underline the difficulty of conducting research that is both internally and externally valid. Therefore, we stress that different studies with different characteristics need to be conducted. With this study we aim to augment recent non-experimental studies with up-to-date experimental results. In addition, we aim to provide more knowledge on the effects of the amount of consultation.

Theoretical Framework for Collaborative Consultation

Penny and Coe (2004) defined instructional consultation as a structured, collaborative, problem-solving process that uses information about teaching performance as a basis for discussion about improving teaching practice. They concluded from the literature that consultation for teaching improvement should be voluntary, individualized, confidential, reflective, and carried out for formative purposes, not for summative evaluation.

Penny and Coe categorized experimental studies by the approach to consultation used. They distinguished a diagnostic (N = 2), an advisory (N = 6) and an educational (N = 3) approach. They found a medium effect size for the first approach and larger effect sizes for the last two, which involved more extensive interventions. Specifically, these more extensive interventions included at least one other source of information on teaching behavior, e.g., observation or videotaping, and/or additional educational activities, such as seminars and workshops.

When it comes to approaches to consultation, the two most common models are the prescriptive model and the collaborative model (Brinko 1990). In the prescriptive model, the consultant identifies, diagnoses, and solves problems. In the collaborative approach, the consultant serves as a facilitator, encouraging faculty members to reflect on the current situation, their teaching effectiveness, and possible alternative teaching strategies to achieve their goals. The diagnostic approach, as described by Penny and Coe, is a prescriptive approach.

Psychological theories on behavior and behavioral change help identify conditions for effective consultation. Here we focus specifically on aspects of Eagly and Chaiken’s attitude-behavior theory (1998), which includes the theory of reasoned action (Ajzen and Fishbein 1980), theory of planned behavior (Ajzen 1991), and theories on self-efficacy (Bandura 1977, 1997). Empirical support for these theories is reported in several studies (Madden et al. 1992; Sheppard et al. 1988; Van den Putte 1993). Figure 1 depicts a combination of these theories, including the most relevant variables for our present purposes.
Fig. 1

Combination of attitude-behavior models, conditions for behavioral change

We consider these theories as they provide us with a frame of reference on consultation from a teacher-centered perspective. According to these theories, behavior starts with an intention. Intentions relating to (new) behavior are based on the professor’s personal attitude (either positive or negative), and the professor’s self-efficacy concerning this (new) behavior. A positive or negative attitude depends on the expected outcomes of this behavior, and on the value of these expected outcomes. Finally, these values depend on personal goals (Eagly and Chaiken 1998).

By considering the attitude-behavior theories in the teaching context, we identified the conditions that should be met in the consultation process to facilitate effective change in teaching practice based on SETs. According to these theories, planned improvement in teaching is more likely to be successful if the professor’s expected outcomes are consistent with the professor’s values and instructional goals, and if the professor feels sufficiently confident to achieve the desired outcomes. For example, given a low rating on ‘engaging students during the lecture’, a professor will choose to interact with the students if he or she believes that this will indeed result in more active participation by the students, and if the professor values this outcome in the process of teaching. In addition, the professor should feel confident enough about actually engaging his or her students in this fashion. On the other hand, a professor may view interaction with students during the lecture as a diversion that serves no purpose given his or her teaching objectives. Based on these theories, we postulate that consultation is effective if
(a) the professor’s current behavior (according to student ratings) is explored,
(b) the professor’s values and goals are explored,
(c) discrepancies between the current effects of the professor’s behavior and his or her instructional values and goals form the starting point for discussion and plans for improvement,
(d) plans for behavioral change are considered important by the professor, and
(e) the professor feels confident about executing the plans for improvement.
This implies that, in order to achieve lasting behavioral change, the professor should accept the outcome of each step of SET consultation, i.e., (1) the interpretation of the student ratings, (2) the selection of specific ratings to improve, (3) the diagnosis and analysis of the causes of the selected low ratings, (4) possible strategies for improvement, and (5) the selection of final strategies. In the present study, we used a teacher-centered consultation protocol, in which each step of the consultation started with the professor’s opinion and ended with the professor’s conclusion. The protocol is consistent with a collaborative approach to consultation.

In summary, this study addresses the following research questions:

(1) What are the effects of a collaborative approach to SET consultation, provided during a course, on seven specific dimensions of college professors’ lecturing skills?
(2) What are the effects of one consultation, and the additional effects of a second and third consultation during a course, on college professors’ lecturing skills?
(3) If effects occur, is there a difference in effect between dimensions that are targeted for improvement during consultation and dimensions that are not targeted, which would indicate whether the effects can be attributed to the feedback or to the consultation?
Context Information

The current experiment was conducted at a Dutch university. In the Netherlands, bachelor programs generally take 3 years and are focused on a specific field of study from day one (no general college courses are taught). This study concerned the third (final) year of the bachelor program in psychology. At this university, students select a specific field of interest within psychology (e.g., social psychology, clinical psychology) in their third year. Courses take 8 weeks (varying in workload) and students attend six to eight courses per semester. Regular SETs are anonymous and collected at the end of each course (most often during the exam). The results are used as feedback for professors, coordinators, management and quality control committees. To obtain tenure, it is mandatory to attain a teaching certificate, and obtaining positive SETs is one of the criteria for attaining such a certificate.

Method

Participants

Professors

At the department of psychology of a Dutch university, we selected twenty-seven 8-week courses from the second semester of the third and final bachelor year (the equivalent of the senior year). Each course included one weekly lecture (a class meeting in which lecturing was the teaching format), with a minimum of four lectures given by the same professor (27 professors in total). A standard lecture at this university takes 90 min, with a 15-min break after 45 min. Course level and teaching format were therefore comparable. The selected courses were all designed for psychology majors. Of the 27 professors, 25 agreed to participate (14 males, 11 females), and they were randomly assigned to either a control group or an experimental group. All participating professors had a PhD, and were full-time ranked assistant, associate, or full professors.

Students

The students in the participating courses completed the Instructional Skills Questionnaire (ISQ), the instrument used to rate the professors, on four measurement occasions (see below). In total the ISQ was completed 1,954 times, with 1,225 forms containing a student ID number. There were 604 unique professor–student ID combinations. The 729 forms with a missing student ID number were each given a unique ID number, which resulted in a total of 1,333 unique professor–student combinations. Some students attended more than one course and therefore rated professors in both the control and experimental groups (according to the available student ID numbers, the numbers of students attending one to five courses were 342, 90, 23, 2, and 1, respectively). Since students did not know that the professors were participating in an experiment, this was not expected to be of any influence.
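The ID handling above is simple to express in code. A minimal pandas sketch (the file name and column names are hypothetical, not from the paper):

```python
import pandas as pd

# Hypothetical illustration of the ID handling described above: forms
# without a student ID receive a fresh unique ID, so every form maps to
# exactly one (professor, student) combination.
df = pd.read_csv("isq_forms.csv")  # assumed columns: professor_id, student_id, ...

missing = df["student_id"].isna()
df.loc[missing, "student_id"] = [f"anon_{i}" for i in range(missing.sum())]

# In the study this yielded 1,333 unique professor-student combinations.
print(df.groupby(["professor_id", "student_id"]).ngroups)
```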

The mean class size was 19.7 students (median = 14, SD = 13.08, min = 6, max = 62). The mean class sizes in the control group and experimental group were 13.8 (median = 11.5, SD = 7.5, min = 6, max = 42) and 25.7 (median = 25, SD = 14.8, min = 8, max = 62), respectively.

Consultants

The first and second authors served as consultants, to ensure full insight into the practical and procedural aspects of the feedback and consultation protocol. The collaborative consultation approach, as defined by Brinko (1990), was adapted for this study. According to Brinko, collaborative consultants function as partners: they encourage their clients to identify, diagnose, and provide solutions to the issues they raise. The consultants were trained in coaching and social skills, including creating a safe learning environment, structuring the consultation, encouraging reflection, and formulating concrete plans.

Design

The experiment was a randomized two-group design with four repeated measures. Twenty-five professors were randomly assigned to either the control group (N = 13) or the experimental group (N = 12). In both groups, students evaluated four 90-min lectures of their professors during the course with the ISQ, at four measurement occasions. The ISQ measures seven dimensions of the professors’ lecturing skills, specifically Structure, Explication, Stimulation, Validation, Instruction, Comprehension, and Activation. Professors in the experimental group received consultation based on their ISQ ratings within 2 or 3 days after each rated lecture. In total, three consultation sessions took place between the ratings. Professors in the control group received all ISQ ratings at the end of the course.

Dependent Variables

Feedback should be specific, multidimensional, reliable, and concern changeable behavior. In terms of specificity, we used the ISQ, which contains items covering seven dimensions of the professor’s lecturing behavior. These dimensions are based on previous practice and research on student evaluation. Marsh and colleagues developed a well-studied course evaluation instrument (SEEQ), containing nine dimensions: Instructor Enthusiasm, Organization/Clarity, Group Interaction, Individual Rapport, Breadth of Coverage, Learning/Value, Examination/Grading, Assignments/Readings, and Workload/Difficulty (Marsh 1984, 1987; Marsh and Hocevar 1991). The first five of these dimensions concern the professor’s teaching behavior. De Neve and Janssen (1982) developed a questionnaire for the evaluation of lectures (Evalec) containing five specific dimensions: Validating, Stimulating, Interacting, Directing, and Structuring behavior. Based on the literature and on these two instruments, Vorst and Van Engelenburg (1994) developed a course evaluation instrument (UvAlon) for a Dutch university containing six general dimensions (Learning/Value, Entry Level, Time Invested/Workload, Difficulty, Literature, and Examination) and seven specific dimensions for teaching behavior (Structure, Explication, Stimulation, Validation, Instruction, Conversation and Interaction). The psychometric quality of this instrument was investigated and confirmed in different studies (Vorst and Van Engelenburg 1994; SCO Kohnstamm Institute 2002, 2005). We adapted this instrument into a one-lecture instrument: the resulting ISQ comprises 28 specific items on instructional behavior, measuring the seven dimensions identified by Vorst and Van Engelenburg. We chose the label Comprehension instead of Conversation, and Activation instead of Interaction, to more accurately convey their meaning. Each of the seven dimensions is measured with four items: two positively worded (indicative) and two negatively worded (contra-indicative). The response format of the ISQ items is a 5-point Likert scale. In more detail, the seven ISQ dimensions are:
(1) Structure: the extent to which the subject matter is handled systematically and in an orderly way. Example item: “The subject matter is discussed in a logical order.”
(2) Explication: the extent to which the professor explains the subject matter, especially the more complex topics. Example item: “The professor explains the subject matter clearly.”
(3) Stimulation: the extent to which the professor interests students in the subject matter. Example item: “The lecture is boring.” (contra-indicative)
(4) Validation: the extent to which the professor stresses the benefits and the relevance of the subject matter for educational goals or future occupation. Example item: “The teacher explains the importance of the topics that are discussed.”
(5) Instruction: the extent to which the professor provides instructions about how to study the subject matter. Example item: “The professor is unclear about which aspects of the subject matter are important.” (contra-indicative)
(6) Comprehension: the extent to which the professor creates opportunities for questions and remarks regarding the subject matter. Example item: “The professor encourages students to ask questions about the subject matter.”
(7) Activation: the extent to which the professor encourages students to think about and work with the subject matter. Example item: “During this lecture there is hardly any occasion to discuss the subject matter.” (contra-indicative)
The dimension score is the student’s mean of the four specific dimension items. The total mean score is used as an estimate of Total Instructional Skills.

Missing item responses (0.3 %) were imputed with the student’s mean of the other three items of that specific dimension. Correlations between the seven dimensions range from 0.10 to 0.66, with a median of 0.30. Cronbach’s alphas, computed on all professors’ mean scores on the first measurement occasion, range from 0.66 to 0.95 with a mean of 0.82.
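The scoring and imputation rules can be made concrete with a small sketch. This assumes hypothetical item column names, and it assumes that contra-indicative items are reverse-scored before averaging, which the paper does not state explicitly:

```python
import pandas as pd

# Sketch of ISQ dimension scoring; four items per dimension, of which two
# are contra-indicative (here marked with a "_neg" suffix, hypothetical).
def dimension_score(responses: pd.DataFrame, items: list[str]) -> pd.Series:
    sub = responses[items].copy()
    for col in items:
        if col.endswith("_neg"):
            # Assumed reverse-scoring of contra-indicative items (1-5 scale).
            sub[col] = 6 - sub[col]
    # Impute a missing item response (0.3 % of responses) with the
    # student's mean of the other three items of the same dimension.
    sub = sub.apply(lambda row: row.fillna(row.mean()), axis=1)
    # The dimension score is the student's mean of the four items.
    return sub.mean(axis=1)
```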

Procedure

All psychology professors at this Dutch university who taught a third-year undergraduate course designed for psychology majors in the second semester, with at least four lectures (i.e., class meetings in which lecturing was the teaching format), were invited to participate voluntarily in the study. Professors first received an email with project information and a request to meet with one of the researchers. At the subsequent meeting, the researcher explained the procedure. Professors who agreed to participate were randomly assigned to either the control group or the experimental group. There was no open access to the professors’ previous SETs. Therefore, the director of the undergraduate school of the psychology department confirmed, based on those previous SETs, that the distribution of teaching quality was comparable in the two groups. A multilevel t test on baseline mean ratings of Total Instructional Skills, measured by the ISQ on the first evaluation occasion, confirmed that there were no differences in teaching quality between professors in the control group and the experimental group. Prior to their course, all professors then received procedural instructions by email.

During the first lecture of each course, one of the researchers invited the students to participate in a research project by completing the ISQ four times during the course (at the end of the first, third, fifth and seventh lecture). Students were instructed to focus on the current lecture, while completing the questionnaire. They were asked to provide their student ID number. The researcher guaranteed that professors would only receive anonymous ratings. In addition, students were assured of anonymity in their evaluations with a similar statement on the ISQ form. To further ensure anonymity, not the professor but one of the students in each course collected the completed ISQ forms on each evaluation occasion and brought them to the researchers. The students did not know that their professors were participating in a randomized experiment involving SET consultation.

Control Group

Professors in the control group received the ISQ ratings pertaining to all four measurement occasions at the end of their course. The procedure followed with the students was the same as for the experimental group.

Experimental Group

In the experimental group (i.e., SET consultation group), each professor met with a consultant after every rated lecture, for a total of three consultations, to discuss the ISQ ratings. The professor and the consultant also met prior to the study (the introductory meeting) and after the final lecture for a final evaluation.

Introductory Meeting. The introduction allowed the consultant and professor to get acquainted. The consultant explained the project, the procedure of feedback and consultation, and topics such as the collaborative consultation approach, the additional responsibilities of the professor, and the conditions deemed necessary for effective feedback (e.g., commitment in terms of time and motivation to achieve improvement).

Consultation. The consultation protocol was based on the collaborative approach. The consultant was responsible for the consultation process, following each step of the protocol; the consultant’s role was to facilitate behavioral change. The professor had ultimate control over the content of the consultation, i.e., the professor determined which items of the student ratings questionnaire were addressed, the formulation of areas for improvement, and the action plan. Still, consultants were free to be directive at any stage of the consultation, e.g., by providing alternative interpretations of student ratings, alternative views when exploring problems in teaching effectiveness, and alternative strategies for improvement. Nevertheless, according to the theory of planned behavior (Ajzen 1991), it is important that professors recognize and identify with the newly formulated views on teaching and plans for improvement. We therefore organized the professor–consultant–professor approach such that every step of the consultation started and ended with the professor’s opinion and conclusions.

Consultation 1. The consultations involved a five-step procedure. The steps are (1) the evaluation of the previous lecture, (2) the evaluation of the student ratings, (3) the selection of items of the ISQ to improve, (4) the analysis of the current situation and problems that explain the selected ratings, and (5) the formulation of strategies for improvement. The consultation started with discussing how the professor experienced the lecture. Then the consultant explained the different ISQ dimensions. The consultant provided the professor with a profile based on the mean ratings on every dimension, and on the specific item scores. The consultant assisted with the interpretation of the results. The professor then undertook to link the results to his or her own experience and goals during the class. Results that were in any way surprising, unexpected, or unsatisfactory to the professor were discussed. The professor then selected the questionnaire items that he or she identified as being open to improvement. The consultant encouraged further reflection on the selected questionnaire items, and discussed (new) set goals, line of thought, possible internal conflicts, and practical difficulties. Once the desired goals, and current problems in achieving these goals were explicated, the consultant encouraged the professor to think about possible plans for improvement. If necessary, the consultant also provided suggestions. Eventually the professor decided on the final concrete plan of action. Finally, the consultant asked whether the professor had enough time to prepare, and whether he or she considered the plans to be sufficiently realistic, feasible, and relevant to pursue.

Consultations 2 and 3. Consultations 2 and 3 followed the same procedure as consultation 1, except that they started with a discussion of the previous plans. At the beginning of consultations 2 and 3, the professor reported on his or her experiences in implementing the previously made plans. The consultant encouraged the professor to reflect on reasons for success or failure.

Final session. In the final session, the professor and consultant again discussed the previous lecture and the results of the final student ratings. They finished the consultation with an evaluation of the program and plans for the future.

Statistical Analyses

The data analysis required multilevel regression modeling, because measurement occasions are nested within students, and students are nested within professors (Snijders and Bosker 1999). Also, we wanted to investigate differences in ratings between professors whilst taking variation between students into account. Finally, individual professors and students might vary in ratings at the first measurement occasion (intercept variance) and in their increase or decrease of ratings over time (slope variance). With multilevel regression analyses we are able to accommodate random intercept and slope variation while analyzing the fixed (mean) effects of the intervention.

The effects of SET consultation were analyzed on each of the seven dimensions (Structure, Explication, Stimulation, Validation, Instruction, Comprehension, Activation) and on the Total Instructional Skills score. We analyzed each of these dependent variables by fitting multilevel models to the data with Time as the level 1 variable (t), Students as the level 2 variable (i) and Professors as the level 3 variable (j). With the first model, we analyzed the intra-class correlations for the professor level and the student level, i.e., the proportions of the total variance that are due to differences between professors and to differences between students. The second and third models were used to analyze whether the slope for the professor and student level was indeed random. If so, this random slope variance needed to be taken into account in analyzing the effects of the intervention, by adding it to the next model. The fourth model was used to analyze the effect of SET consultation. With the fifth model we analyzed the specific effect of the first consultation and the additional effects of the second and third consultations on the dependent variable Total Instructional Skills. If the fifth model showed an effect of SET consultation on a specific time interval (T1T2, T2T3 and/or T3T4), the sixth model was used to analyze the specific effects of targeted versus non-targeted dimensions on this time interval for each dimension.

Intra-Class Correlation

The first model, the intercept-only model, contained a random intercept for professors and students (Model 1). Model 1 is defined through the equations:
$$ \text{Level 1}: Y_{tij} = \beta_{0ij} + e_{tij}, $$
(1.1)
$$ \text{Level 2}: \beta_{0ij} = \beta_{00j} + u_{0ij}, $$
(1.2)
$$ \text{Level 3}: \beta_{00j} = \gamma_{000} + v_{00j}. $$
(1.3)
Here the student rating Ytij on a dimension on occasion t, of student i in the class of professor j, is modeled by the intercept β0ij and a residual error term etij. In this model the intercept varied between students and between professors. Thus, in the second and third level Eqs. (1.2 and 1.3), the intercept β0ij is decomposed into a residual error term for students u0ij (random intercept at the student level), a residual error term for professors v00j (random intercept at the professor level) and a fixed effect parameter γ000 (the overall mean). The variances of the three residual error terms are denoted by
$$ \operatorname{var}(e_{tij}) = \sigma^2, \quad \operatorname{var}(u_{0ij}) = \tau_0^2, \quad \operatorname{var}(v_{00j}) = \varphi_0^2. $$
(1.4)

This model was used to calculate the intra-class correlation for the professor level by dividing the professor-level variance (φ0²) by the total variance (σ² + τ0² + φ0²). For the proportion of the total variance that is due to differences between students, we divided the student-level variance (τ0²) by the total variance (σ² + τ0² + φ0²).
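A minimal sketch of Model 1 and the ICC computation using statsmodels’ MixedLM, assuming a long-format DataFrame df with one row per completed questionnaire and hypothetical columns rating, professor_id and student_id; the student level is fitted as a variance component nested within professors:

```python
import statsmodels.formula.api as smf

# Intercept-only model with random intercepts for professors (groups)
# and for students nested within professors (variance component).
m1 = smf.mixedlm(
    "rating ~ 1", data=df,
    groups="professor_id",
    vc_formula={"student": "0 + C(student_id)"},
).fit(reml=False)

phi0_sq = m1.cov_re.iloc[0, 0]  # professor-level intercept variance
tau0_sq = m1.vcomp[0]           # student-level intercept variance
sigma_sq = m1.scale             # occasion-level residual variance

total = sigma_sq + tau0_sq + phi0_sq
icc_professor = phi0_sq / total  # paper reports a mean of 0.21
icc_student = tau0_sq / total    # paper reports a mean of 0.38
```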

Random and Fixed Effects

In the second model, the linear main effect of Time (occasions 1, 2, 3 and 4, coded as 0, 1, 2 and 3) and the main effect of Condition (control group = 0, experimental group = 1) were added (Model 2). Model 2 is defined by the equations:
$$ \text{Level 1}: Y_{tij} = \beta_{0ij} + \beta_1 \text{Time}_{ij} + \beta_2 \text{Condition}_j + e_{tij}, $$
(2.1)
$$ \text{Level 2}: \beta_{0ij} = \beta_{00j} + u_{0ij}, $$
(2.2)
$$ \text{Level 3}: \beta_{00j} = \gamma_{000} + v_{00j}. $$
(2.3)

Here β0ij is the intercept, β1 is the fixed effect parameter for Time, β2 is the fixed effect parameter for Condition and etij is the residual error term. Again the intercept β0ij is allowed to be random over professors and students by the decomposition into one fixed component (γ000) and two random components (u0ij and v00j) in the second and third level Eqs. (2.2 and 2.3).

In the third model, we allowed the slope of professors and students to be random (Model 3). In both groups, some professors may display systematic variation over time on the ratings. Similarly, students within classes may display systematic variation in their ratings over time. Model 3 accommodated this possible variation. It is important to establish whether these random effects are significant, because their presence should be taken into account in studying the effects of the intervention. By comparing models 2 and 3 with a deviance test,1 we evaluated whether it was necessary to retain a random slope. Model 3 (with random slope variances) is defined through the equations:
$$ \text{Level 1}: Y_{tij} = \beta_{0ij} + \beta_{1ij} \text{Time}_{ij} + \beta_2 \text{Condition}_j + e_{tij}, $$
(3.1)
$$ \text{Level 2}: \beta_{0ij} = \beta_{00j} + u_{0ij}, $$
(3.2)
$$ \text{Level 2}: \beta_{1ij} = \beta_{10j} + u_{1ij}, $$
(3.3)
$$ \text{Level 3}: \beta_{00j} = \gamma_{000} + v_{00j}, $$
(3.4)
$$ \text{Level 3}: \beta_{10j} = \gamma_{100} + v_{10j}. $$
(3.5)

Here β0ij is the intercept, β1ij is the random regression parameter for Time, β2 is the fixed effect parameter for Condition and etij is the residual error term. Again the intercept β0ij is allowed to be random over students and professors by including the random components u0ij and v00j. In addition, the regression parameter β1ij for Time is allowed to be random over students and professors by including the random effects u1ij and v10j. The fixed component γ100 represents the overall average regression coefficient for Time (the mean slope).

The slope variances are denoted by
$$ \operatorname{var}(u_{1ij}) = \tau_1^2, \quad \operatorname{var}(v_{10j}) = \varphi_1^2. $$
(3.6)
The intercept-slope covariances are denoted by
$$ \operatorname{cov}(u_{0ij}, u_{1ij}) = \tau_{01}, \quad \operatorname{cov}(v_{00j}, v_{10j}) = \varphi_{01}. $$
(3.7)
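The Model 2 versus Model 3 comparison can be sketched as follows. For brevity this adds a random slope at the professor level only (the paper also allows student-level slopes), with df and column names as in the earlier sketches:

```python
import statsmodels.formula.api as smf
from scipy.stats import chi2

m2 = smf.mixedlm("rating ~ time + condition", df,
                 groups="professor_id").fit(reml=False)
m3 = smf.mixedlm("rating ~ time + condition", df,
                 groups="professor_id", re_formula="~time").fit(reml=False)

# Deviance test: difference in -2 log likelihood, chi-square distributed
# with df equal to the number of added (co)variance parameters.
deviance = 2 * (m3.llf - m2.llf)
p_value = chi2.sf(deviance, 2)  # slope variance + intercept-slope covariance
# Retain the random slope in later models if this test is significant.
```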

Effects of SET Consultation

With the fourth model we analyzed the effect of the intervention by adding the interaction effect Time * Condition (Model 4). Model 4 is defined through the equations:
$$ \text{Level 1}: Y_{tij} = \beta_{0ij} + \beta_{1ij} \text{Time}_{ij} + \beta_2 \text{Condition}_j + \beta_3 \text{Time} * \text{Condition}_{ij} + e_{tij}, $$
(4.1)
$$ \text{Level 2}: \beta_{0ij} = \beta_{00j} + u_{0ij}, $$
(4.2)
$$ \text{Level 2}: \beta_{1ij} = \beta_{10j} + u_{1ij}, $$
(4.3)
$$ \text{Level 3}: \beta_{00j} = \gamma_{000} + v_{00j}, $$
(4.4)
$$ \text{Level 3}: \beta_{10j} = \gamma_{100} + v_{10j}. $$
(4.5)

The parameters are the same as in Model 3. The additional fixed effect parameter for Time * Condition (β3) represents the effect of SET consultation: if Model 4 fits the data better than the previous model according to a deviance test and the parameter of Time * Condition is significant, the control group and experimental group differ significantly in their ratings over time.
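A sketch of Model 4 under the same assumptions; the patsy shorthand time * condition expands to both main effects plus their interaction, whose coefficient corresponds to β3:

```python
import statsmodels.formula.api as smf

m4 = smf.mixedlm("rating ~ time * condition", df,
                 groups="professor_id", re_formula="~time").fit(reml=False)
print(m4.summary())
# A significant time:condition coefficient indicates that the experimental
# and control groups diverge in their ratings over time.
```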

Effects on Each Time-Interval and on Targeted Versus Non-Targeted Dimensions

With the fifth model we specifically analyzed the effect of the first consultation and the additional effects of the second and third consultation on the dependent variable Total Instructional Skills (Model 5). We recoded Time into the dummy variables T1T2, T2T3, and T3T4, representing the comparison of time 1 with 2, 2 with 3 and 3 with 4, respectively. We did not have enough data to fit a model with these additional parameters plus the parameters for all possible random effects. We therefore limited the random effects to the intercept in this model. Model 5 is defined by the equations:
$$ \begin{aligned} \text{Level 1}: Y_{tij} = \, & \beta_{0ij} + \beta_1 \text{Condition}_j + \beta_2 \text{T1T2}_{ij} + \beta_3 \text{T2T3}_{ij} + \beta_4 \text{T3T4}_{ij} \\ & + \beta_5 \text{T1T2} * \text{Condition}_{ij} + \beta_6 \text{T2T3} * \text{Condition}_{ij} + \beta_7 \text{T3T4} * \text{Condition}_{ij} + e_{tij}, \end{aligned} $$
(5.1)
$$ \text{Level 2}: \beta_{0ij} = \beta_{00j} + u_{0ij}, $$
(5.2)
$$ \text{Level 3}: \beta_{00j} = \gamma_{000} + v_{00j}. $$
(5.3)

The parameter β0ij represents the intercept. The intercept is random over professors and students. The fixed effect parameter β1 represents the main effect of Condition. The fixed effect parameters β2, β3 and β4 for T1T2, T2T3 and T3T4 represent the contrasts. The fixed interaction effect parameter for T1T2 * Conditionij (β5) represents the effect of the first SET consultation. The fixed interaction effect parameters for T2T3 * Conditionij (β6) and T3T4 * Conditionij (β7) represent the additional effects of the second and third SET consultation.
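One coding that produces such adjacent-occasion contrasts is cumulative dummies: with occasions coded 1–4 in a hypothetical column occasion, each dummy coefficient then equals the change from one occasion to the next. A sketch, under the same assumptions as before:

```python
import statsmodels.formula.api as smf

df["T1T2"] = (df["occasion"] >= 2).astype(int)  # change from occasion 1 to 2
df["T2T3"] = (df["occasion"] >= 3).astype(int)  # additional change, 2 to 3
df["T3T4"] = (df["occasion"] >= 4).astype(int)  # additional change, 3 to 4

m5 = smf.mixedlm(
    "rating ~ (T1T2 + T2T3 + T3T4) * condition", df,
    groups="professor_id",
    vc_formula={"student": "0 + C(student_id)"},  # random intercepts only
).fit(reml=False)
```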

If there was an effect of SET consultation on a specific time interval, we analyzed the effect of targeted versus non-targeted dimensions on each dependent dimension on that interval (Model 6). These additional exploratory analyses were done to link the effects of the intervention either to the feedback or to the consultation. In Model 6, professors in the experimental group were separated into two groups for each dimension on the specific time interval: a group that targeted the dimension for improvement (Target) and a group that did not target the dimension (No Target). Condition was therefore recoded into the dummy variables Control-versus-NoTarget and Control-versus-Target. We restricted the analyses to the time intervals associated with an effect in Model 5 to limit the number of tests on the data. In addition, to reduce the risk of Type I errors, these effects were tested with a more conservative alpha of 0.01. Time was recoded for the specific time interval (for interval T1T2: T1 = 0 and T2 = 1; for interval T2T3: T2 = 0 and T3 = 1; for interval T3T4: T3 = 0 and T4 = 1). Again, we limited the random effects to the intercept in this model. Model 6 is defined by the equations:
$$ \begin{aligned} \text{Level 1}: Y_{tij} = \, & \beta_{0ij} + \beta_1 \text{Time}_{ij} + \beta_2 \text{Control-versus-NoTarget}_j + \beta_3 \text{Control-versus-Target}_j \\ & + \beta_4 \text{Time} * \text{Control-versus-NoTarget}_{ij} + \beta_5 \text{Time} * \text{Control-versus-Target}_{ij} + e_{tij}, \end{aligned} $$
(6.1)
$$ \text{Level 2}: \beta_{0ij} = \beta_{00j} + u_{0ij}, $$
(6.2)
$$ \text{Level 3}: \beta_{00j} = \gamma_{000} + v_{00j}. $$
(6.3)

The parameter β0ij represents the intercept. The intercept is random over professors and students. The fixed effect parameter β1 represents the main effect of Time on the specific time interval. The fixed effect parameters β2 and β3 represent the contrasts Control-versus-NoTarget and Control-versus-Target. The fixed interaction effect parameter for Time * Control-versus-NoTargetij (β4) and Time * Control-versus-Targetij (β5) represent the effects of SET consultation for non-targeted dimensions and targeted dimensions compared to the control group.
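A sketch of the Model 6 recoding, under the same assumptions; the boolean column targeted, indicating whether the professor targeted the dimension in consultation, is hypothetical:

```python
import statsmodels.formula.api as smf

df["ctrl_vs_no_target"] = ((df["condition"] == 1) & ~df["targeted"]).astype(int)
df["ctrl_vs_target"] = ((df["condition"] == 1) & df["targeted"]).astype(int)

# Restrict to the interval with an effect in Model 5 (here T1T2) and
# recode Time to 0/1 on that interval.
interval = df[df["occasion"].isin([1, 2])].copy()
interval["time"] = interval["occasion"] - 1

m6 = smf.mixedlm(
    "rating ~ time * (ctrl_vs_no_target + ctrl_vs_target)", interval,
    groups="professor_id",
    vc_formula={"student": "0 + C(student_id)"},  # random intercepts only
).fit(reml=False)
# The paper evaluates these interaction terms at alpha = 0.01.
```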

Results

Table 1 shows the number of participating professors, the number of ISQ forms completed, the mean ISQ scores, the standard deviation of the professors, and the standard deviation of the students within classes in the two groups on each measurement occasion. Mean ratings and standard deviations of the professors on each dimension are shown in Fig. 2a–h. At baseline (occasion 1), a multilevel t test revealed no significant mean differences between the groups on any dimension. The experimental and control groups were therefore comparable in teaching skills at baseline. The groups also did not differ with respect to the inevitable student dropout (χ²(3) = 5.834, p = 0.12).
Table 1

Number of participants, mean ISQ scores and standard deviations at the professor level and the student level on measurement occasions 1 to 4 for the control condition and the experimental condition. Cells show the professor-level (between-class) mean (M) and standard deviation (SD prof.), and the minimum/median/maximum of the student-level (within-class) standard deviations (SD student).

Control group, occasions 1 and 2 (13 professors; 231 and 186 completed forms):

| Dimension | M occ. 1 | SD prof. | SD student min/med/max | M occ. 2 | SD prof. | SD student min/med/max |
|---|---|---|---|---|---|---|
| Structure | 3.65 | 0.23 | 0.24 / 0.48 / 0.91 | 3.65 | 0.35 | 0.20 / 0.45 / 0.74 |
| Explication | 3.77 | 0.17 | 0.12 / 0.47 / 0.77 | 3.69 | 0.12 | 0.26 / 0.45 / 0.77 |
| Stimulation | 3.81 | 0.44 | 0.30 / 0.51 / 0.99 | 3.66 | 0.49 | 0.44 / 0.58 / 0.95 |
| Validation | 3.61 | 0.25 | 0.24 / 0.48 / 0.72 | 3.56 | 0.23 | 0.17 / 0.51 / 0.72 |
| Instruction | 3.50 | 0.21 | 0.24 / 0.45 / 0.66 | 3.41 | 0.17 | 0.35 / 0.56 / 0.69 |
| Comprehension | 3.93 | 0.23 | 0.29 / 0.42 / 0.59 | 3.85 | 0.28 | 0.25 / 0.50 / 0.63 |
| Activation | 3.77 | 0.43 | 0.40 / 0.56 / 0.74 | 3.68 | 0.49 | 0.27 / 0.56 / 0.79 |
| **Total Instructional Skills** | **3.72** | **0.20** | **0.19 / 0.31 / 0.47** | **3.64** | **0.18** | **0.21 / 0.34 / 0.55** |

Experimental group, occasions 1 and 2 (12 professors; 378 and 352 completed forms):

| Dimension | M occ. 1 | SD prof. | SD student min/med/max | M occ. 2 | SD prof. | SD student min/med/max |
|---|---|---|---|---|---|---|
| Structure | 3.77 | 0.25 | 0.30 / 0.52 / 0.61 | 3.87 | 0.17 | 0.36 / 0.45 / 0.63 |
| Explication | 3.84 | 0.17 | 0.36 / 0.49 / 0.64 | 3.83 | 0.21 | 0.39 / 0.47 / 0.71 |
| Stimulation | 3.69 | 0.31 | 0.32 / 0.61 / 0.76 | 3.62 | 0.29 | 0.22 / 0.63 / 0.78 |
| Validation | 3.65 | 0.27 | 0.29 / 0.48 / 0.66 | 3.61 | 0.22 | 0.39 / 0.55 / 0.82 |
| Instruction | 3.43 | 0.19 | 0.33 / 0.53 / 0.68 | 3.53 | 0.19 | 0.35 / 0.54 / 0.71 |
| Comprehension | 3.81 | 0.50 | 0.31 / 0.48 / 0.71 | 3.91 | 0.39 | 0.35 / 0.51 / 0.67 |
| Activation | 3.37 | 0.64 | 0.40 / 0.58 / 0.70 | 3.51 | 0.65 | 0.39 / 0.56 / 0.75 |
| **Total Instructional Skills** | **3.65** | **0.22** | **0.21 / 0.36 / 0.46** | **3.70** | **0.24** | **0.27 / 0.38 / 0.50** |

Control group, occasions 3 and 4 (13 professors; 157 and 145 completed forms):

| Dimension | M occ. 3 | SD prof. | SD student min/med/max | M occ. 4 | SD prof. | SD student min/med/max |
|---|---|---|---|---|---|---|
| Structure | 3.56 | 0.27 | 0.22 / 0.58 / 0.69 | 3.58 | 0.41 | 0.25 / 0.57 / 0.74 |
| Explication | 3.65 | 0.20 | 0.28 / 0.56 / 0.97 | 3.58 | 0.16 | 0.26 / 0.56 / 0.95 |
| Stimulation | 3.63 | 0.43 | 0.50 / 0.58 / 1.02 | 3.59 | 0.37 | 0.36 / 0.59 / 1.06 |
| Validation | 3.64 | 0.33 | 0.26 / 0.46 / 0.93 | 3.51 | 0.30 | 0.21 / 0.62 / 0.86 |
| Instruction | 3.39 | 0.26 | 0.27 / 0.52 / 0.70 | 3.48 | 0.22 | 0.23 / 0.54 / 0.84 |
| Comprehension | 3.82 | 0.28 | 0.24 / 0.51 / 0.77 | 3.84 | 0.19 | 0.31 / 0.53 / 0.78 |
| Activation | 3.66 | 0.49 | 0.32 / 0.59 / 0.79 | 3.73 | 0.49 | 0.28 / 0.56 / 0.80 |
| **Total Instructional Skills** | **3.61** | **0.20** | **0.20 / 0.42 / 0.59** | **3.60** | **0.14** | **0.20 / 0.38 / 0.74** |

Experimental group, occasions 3 and 4 (11 and 12 professors; 225 and 280 completed forms):

| Dimension | M occ. 3 | SD prof. | SD student min/med/max | M occ. 4 | SD prof. | SD student min/med/max |
|---|---|---|---|---|---|---|
| Structure | 3.89 | 0.25 | 0.35 / 0.45 / 0.59 | 3.83 | 0.28 | 0.38 / 0.50 / 0.64 |
| Explication | 3.80 | 0.23 | 0.37 / 0.51 / 0.90 | 3.85 | 0.18 | 0.37 / 0.49 / 0.65 |
| Stimulation | 3.59 | 0.34 | 0.34 / 0.52 / 0.79 | 3.64 | 0.43 | 0.39 / 0.59 / 0.74 |
| Validation | 3.62 | 0.16 | 0.41 / 0.51 / 0.83 | 3.69 | 0.20 | 0.38 / 0.50 / 0.62 |
| Instruction | 3.56 | 0.28 | 0.27 / 0.52 / 0.75 | 3.63 | 0.28 | 0.38 / 0.52 / 0.89 |
| Comprehension | 3.96 | 0.39 | 0.36 / 0.53 / 0.79 | 3.95 | 0.41 | 0.36 / 0.48 / 0.72 |
| Activation | 3.64 | 0.59 | 0.28 / 0.51 / 0.86 | 3.60 | 0.64 | 0.41 / 0.58 / 0.79 |
| **Total Instructional Skills** | **3.71** | **0.22** | **0.23 / 0.36 / 0.54** | **3.74** | **0.25** | **0.25 / 0.38 / 0.47** |

Overall ISQ ratings (dimension Total Instructional Skills) are indicated in bold

Fig. 2

a–h Mean ratings on the eight dependent variables in the experimental condition (N = 12) and control condition (N = 13) on the four measurement occasions

Based on the intercept-only multilevel regression model (Model 1), the intra-class correlation for the professor level for all dependent variables varied between 0.09 and 0.42, with a mean of 0.21. The intra-class correlation for the student level varied between 0.26 and 0.51, with a mean of 0.38. Since a mean of 21 % of the variance is due to differences between professors and a mean of 38 % of the variance is due to differences between students within professors, the use of multilevel regression modeling is indicated.

We performed deviance tests between the first four models on all eight dependent variables, to determine which model fitted the data best. Table 2 shows the deviance tests between the first four models on Total Instructional Skills.2 For all of the dependent variables, the deviance tests between the second and third multilevel regression models showed that the third model (with the explanatory variables Time and Condition and a random intercept and slope) fitted the data significantly better than the second model (without a random slope). The significance of the random slope shows that professors and students varied significantly in the change of their ratings during a course. We retained these random effects when investigating the effect of the intervention with Model 4.
Table 2

Deviance tests between the four models for Total Instructional Skills

| Model | −2 log likelihood | Deviance | df | p value |
|---|---|---|---|---|
| Model 1 (random intercept only) | 1655.1 | | | |
| Model 2 (random intercept + Time + Condition) | 1653.4 | 1.7 | 2 | 0.421 |
| Model 3 (random intercept + random slope + Time + Condition) | 1599.8 | 53.6 | 4 | <0.001 |
| Model 4 (random intercept + random slope + Time + Condition + Time * Condition) | 1591.7 | 8.1 | 1 | 0.004 |
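As a reading check, each deviance in Table 2 is the drop in −2 log likelihood relative to the preceding model; for Model 2 versus Model 3 (a worked example of the table entries):

$$ \text{Deviance} = 1653.4 - 1599.8 = 53.6, \qquad df = 4, \qquad p < 0.001, $$

where the four degrees of freedom correspond to the two added slope variances (professor and student level) and the two added intercept-slope covariances.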

Model 4, which included Time * Condition as the effect of the intervention, fitted the data significantly better than Model 3 for four of the eight dependent variables: Explication, Comprehension, Activation and Total Instructional Skills. Table 3 shows the estimates and standard errors of the explanatory variables for these dependent variables. Model 3 (without the interaction parameter for Time * Condition) was therefore the final model for the remaining variables: Structure, Stimulation, Validation, and Instruction. Table 4 shows the estimates and standard errors of the explanatory variables under Model 3 for these dependent variables. In short, compared to the control group, professors in the experimental group significantly improved their instructional skills on three specific dimensions and on their total rating score under the specified feedback and consultation protocol.
Table 3

Estimates and standard errors of the explanatory variables under Model 4 for the dependent variables Instructional Skills, Explication, Comprehension, and Activation

| Parameter | | Instructional Skills: Estimate (SE) | Explication: Estimate (SE) | Comprehension: Estimate (SE) | Activation: Estimate (SE) |
|---|---|---|---|---|---|
| Fixed part | | | | | |
| β0 | Constant | 3.710 (0.060) | 3.763 (0.055) | 3.922 (0.101) | 3.748 (0.146) |
| β1 | Time | −0.040 (0.015)* | −0.057 (0.019)* | −0.040 (0.024) | −0.026 (0.035) |
| β2 | Condition | −0.045 (0.086) | 0.074 (0.076) | −0.086 (0.144) | −0.340 (0.209) |
| β3 | Time * Condition | 0.065 (0.020)* | 0.053 (0.025)* | 0.086 (0.033)* | 0.103 (0.049)* |
| Random part: intercept variance | | | | | |
| φ0² | Level 3: Professor | 0.041 (0.013)* | 0.025 (0.010)* | 0.118 (0.036)* | 0.258 (0.077)* |
| τ0² | Level 2: Student | 0.091 (0.008)* | 0.153 (0.018)* | 0.114 (0.011)* | 0.175 (0.022)* |
| σ² | Level 1: Time | 0.047 (0.003)* | 0.141 (0.010)* | 0.158 (0.009)* | 0.175 (0.012)* |
| Random part: slope variance | | | | | |
| φ1² | Level 3: Professor | 0.001 (0.001) | 0.001 (0.001) | 0.004 (0.002)* | 0.011 (0.004)* |
| τ1² | Level 2: Student | 0.005 (0.002)* | 0.008 (0.004)* | 0.000 (0.000) | 0.005 (0.005) |
| Random part: intercept-slope covariance | | | | | |
| φ01 | Level 3: Professor | −0.002 (0.002) | −0.001 (0.002) | −0.009 (0.006) | −0.007 (0.013) |
| τ01 | Level 2: Student | 0.002 (0.003) | −0.003 (0.007) | 0.000 (0.000) | −0.001 (0.009) |

Units: level 3 = 25 professors; level 2 = 1,333 students; level 1 = 1,954 questionnaires

* p < 0.05

Table 4

Estimates and standard errors of the explanatory variables under Model 3 for the dependent variables Structure, Stimulation, Validation and Instruction

| Parameter | | Structure: Estimate (SE) | Stimulation: Estimate (SE) | Validation: Estimate (SE) | Instruction: Estimate (SE) |
|---|---|---|---|---|---|
| Fixed part | | | | | |
| β0 | Constant | 3.360 (0.069) | 3.762 (0.104) | 3.588 (0.071) | 3.427 (0.057) |
| β1 | Time | −0.001 (0.021) | −0.044 (0.017)* | −0.005 (0.016) | 0.025 (0.021) |
| β2 | Condition | 0.189 (0.096) | −0.089 (0.148) | 0.066 (0.095) | 0.052 (0.076) |
| Random part: intercept variance | | | | | |
| φ0² | Level 3: Professor | 0.051 (0.017)* | 0.122 (0.039)* | 0.060 (0.020)* | 0.033 (0.012)* |
| τ0² | Level 2: Student | 0.136 (0.017)* | 0.263 (0.025)* | 0.131 (0.017)* | 0.140 (0.019)* |
| σ² | Level 1: Time | 0.141 (0.010)* | 0.164 (0.012)* | 0.138 (0.010)* | 0.153 (0.011)* |
| Random part: slope variance | | | | | |
| φ1² | Level 3: Professor | 0.008 (0.003)* | 0.003 (0.002) | 0.003 (0.002) | 0.008 (0.003)* |
| τ1² | Level 2: Student | 0.002 (0.004) | 0.011 (0.005)* | 0.007 (0.004)* | 0.012 (0.005)* |
| Random part: intercept-slope covariance | | | | | |
| φ01 | Level 3: Professor | −0.004 (0.005) | 0.002 (0.006) | −0.006 (0.005) | −0.006 (0.005) |
| τ01 | Level 2: Student | 0.000 (0.007) | −0.014 (0.010) | 0.000 (0.007) | −0.009 (0.008) |

Units: level 3 = 25 professors; level 2 = 1,333 students; level 1 = 1,954 questionnaires

* p < 0.05

Class sizes in the control group were smaller than in the experimental group on the first measurement occasion [mean control group versus mean experimental group: t(18.8) = −2.5, p = 0.02]. Class size (mean-centered) was therefore added as a covariate to Model 4 (Model 4b) when analyzing the effects of the intervention. Class size did not influence the results found with Model 4. Moreover, the increase in AIC and BIC under Model 4b indicated that the additional parameter overfitted the data (AIC and BIC decrease when an added parameter genuinely improves fit).

With the fifth model, we specifically analyzed the effect of the first consultation and the additional effects of the second and third consultation on the dependent variable Total Instructional Skills. Table 5 shows the estimates and standard errors of the explanatory variables under Model 5.
Table 5

Estimates and standard errors of the explanatory variables under Model 5 for the dependent variable Total Instructional Skills

| Parameter | | Estimate (SE) |
|---|---|---|
| Fixed part | | |
| β0 | Constant | 3.726 (0.059) |
| β1 | Condition | −0.08 (0.083) |
| β2 | T1T2 | −0.08 (0.030)* |
| β3 | T2T3 | −0.04 (0.032) |
| β4 | T3T4 | 0.016 (0.035) |
| β5 | T1T2 * Condition | 0.123 (0.037)* |
| β6 | T2T3 * Condition | 0.063 (0.041) |
| β7 | T3T4 * Condition | 0.019 (0.044) |
| Random part: intercept variance | | |
| φ0² | Level 3: Professor | 0.037 (0.011)* |
| τ0² | Level 2: Student | 0.102 (0.006)* |
| σ² | Level 1: Time | 0.056 (0.003)* |

* p < 0.05

Results showed a significant effect of the first consultation (parameter β5) and no additional effects of the second and third consultation (parameters β6 and β7). The parameter of T1T2 (β2) indicates that the control group decreased significantly in ratings between the first and second measurement occasion. Compared to the control group, the experimental group significantly increased in ratings on the same time interval (parameter β5). The ratings of the control group decreased further between the second and third measurement occasion, but not significantly (parameter β3), and remained stable between the third and fourth measurement occasion (parameter β4). Compared to the control group, ratings of the experimental group increased after the second consultation, but not significantly (parameter β6), and remained stable after the third consultation (parameter β7).

Because there was a significant effect of SET consultation on the first time interval (T1T2), the differences in effects between targeted and non-targeted dimensions were analyzed on this interval with a sixth model, using a more conservative alpha of 0.01 for each of the seven specific dimensions. Results of these exploratory analyses showed a significant increase in ratings, compared to the control group, on four dimensions when they were targeted: Structure, Instruction, Comprehension and Activation (Structure: β = 0.22, SE = 0.07, p = 0.002; Instruction: β = 0.23, SE = 0.07, p = 0.001; Comprehension: β = 0.40, SE = 0.06, p < 0.001; Activation: β = 0.29, SE = 0.07, p < 0.001). When dimensions were not targeted, there was still an effect on one dimension: Instruction (β = 0.21, SE = 0.07, p = 0.004). Furthermore, control group ratings on Stimulation (β = −0.15, SE = 0.05, p = 0.008) decreased significantly over this time interval. For almost all dimensions, baseline ratings were lower when the dimension was targeted than when it was not, but in no case significantly so at an alpha of 0.01 (for Structure, the difference was significant at an alpha of 0.05).

Discussion

Student evaluations of teaching (SETs) collected at the end of the course often do not help improve professors' instructional skills (Kember et al. 2002). This may be due to poor timing, lack of specificity, and ineffective use of the feedback. This study addressed the effects of intermediate SETs followed by collaborative consultation on professors' instructional skills, compared to a control group. We collected student feedback on seven dimensions of instructional skills at the end of four single lectures during the course (class meetings in which lecturing was the teaching format). In so doing we ensured the feedback was optimally timed and highly specific. Professors in the experimental group met with a consultant within 2 or 3 days after each evaluated lecture to formulate an appropriate action plan for the following lectures based on the feedback. By repeating this procedure of feedback and collaborative consultation during the course, we evaluated the effects of each SET consultation on teaching skills. For the time intervals on which the intervention had a significant impact, we further investigated the effects on targeted versus non-targeted dimensions.

At baseline, professors in the experimental and control group were comparable on teaching skills. The courses were taught in the same semester, the students were at the same academic level (third and final bachelor year), and the teaching format was the same (i.e., lectures). Also, on the seven specific dependent teaching dimensions and on Total Instructional Skills, the two groups did not differ at baseline.

The professors in the experimental group showed a significant increase in Total Instructional Skills compared to the control group. More specifically, we found significant effects of the intervention on the instructional dimensions Explication, Comprehension and Activation. In our analyses of the ratings on all seven dimensions, we included significant intercept and slope variances between professors and between students within classes. The effects of the intervention were therefore significant despite differences between individual professors in their baseline ratings, despite random variation between professors in how their ratings changed over time, and despite differences between students within classes.

Time-interval analyses showed that, of the three consultations during the course, only the first consultation resulted in a significant effect on the professors’ Total Instructional Skills ratings. The ratings in the experimental group did increase (relative to the control group) after the second consultation, but this increase was not statistically significant. The third consultation did not result in an increase in ratings. Thus, only the first SET consultation had a significant impact on student ratings of their professors. Further analyses on this time interval (T1 vs. T2) showed a significant increase for four dimensions when they were targeted for improvement (Structure, Instruction, Comprehension and Activation) and for one dimension (Instruction) when dimensions were not targeted for improvement. In the control group, ratings on one dimension (Stimulation) decreased significantly during this time interval.

We note that the differences observed were small from an absolute perspective. This was due to the small effective scale; although we used a 5-point Likert scale, the actual range of professors’ mean ratings was much smaller. The baseline mean ratings of professors on Total Instructional Skills varied from 3.32 to 4.07, with a high baseline mean of 3.7 and a small standard deviation of 0.22. The absolute size of the differences between experimental and control group is therefore rather small, but relative to the standard deviation it is substantial. The relatively small range and variance are comparable to those of previous experimental studies on SET consultation (e.g., Marsh and Roche 1993; Hampton and Reiser 2004).
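As a rough illustration of this point (our back-of-the-envelope calculation from the figures above, not an effect size reported in the original analyses): the experimental-versus-control gain over the first interval was 0.123 scale points (parameter β5 in Table 5), against a baseline standard deviation of 0.22, giving

$$\frac{0.123}{0.22} \approx 0.56\ \text{SD}.$$

That is, a difference of barely a tenth of a scale point still amounts to roughly half a standard deviation of how much professors actually differed at baseline.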

Thus, the collaborative consultation approach seems effective. Considering different approaches to consultation, Penny and Coe (2004) found that mainly studies with an advisory or educational approach showed substantial effects. As interventions, those approaches are more elaborate than the collaborative intervention used in this study. Given the costs of both the current and the more extensive interventions, the approach to SET consultation used in this study seems valuable: only the first intermediate SET consultation resulted in appreciable effects, and the second and third intermediate consultations appeared superfluous.

Notably, ratings of professors in the control group decreased over time. One explanation could be that professors put their best effort into the first few lectures and fell back on routine later in the course. The results on specific dimensions show a decrease on Explication and Stimulation in the control group. Both depend on using lively examples and diverse, stimulating and clarifying ways of presenting the subject matter, which require extra preparation time. While ratings of professors in the control group decreased on Explication, ratings of professors in the experimental group increased on this dimension. The effects of the intervention are therefore visible as an increase in ratings as well as the prevention of a decrease.

Furthermore, we note that there was a difference in class sizes between the groups: professors in the experimental group taught significantly larger groups of students than professors in the control group did. Controlling for this difference did not change the results, but the increase in fit indices indicated an overfit, meaning there was not enough data to fit such a complex model. The difference between the groups may therefore still have affected the ratings. For example, it may explain why professors in the experimental group started out lower on the dimension Activation, since it is more difficult to interact with larger groups of students. Nevertheless, these professors seemed to catch up during the intervention: their Activation ratings increased significantly over time and reached the level of the control group by the third measurement occasion, while the Activation ratings of the control group remained stable over time.

Although the results of this study showed positive experimental effects of SET consultation on different teaching skills, three limitations deserve attention when interpreting these findings; they also suggest directions for further research. First, the results represent the combined effects of intermediate feedback and consultation. To determine whether the specific results are due to the student feedback or to the consultation, future studies on collaborative consultation should include separate groups receiving only one or the other. In the present study, ratings on targeted dimensions clearly increased more than ratings on non-targeted dimensions, which suggests that the collaborative consultation, rather than the feedback alone, made the difference. These results agree with Marsh and Roche's (1993) finding that targeted dimensions improved significantly more than non-targeted dimensions. In addition, previous studies have consistently found small effects of mid-term feedback alone and larger effects of feedback plus consultation (Cohen 1980; Lang and Kersting 2007; Menges and Brinko 1986; Penny and Coe 2004), again suggesting that the effects found here are due to the consultation. Nevertheless, we suggest complementing the present results with further research on this approach to SET consultation using a more complex design.

Second, the sample of twenty-five professors is relatively small. Consequently, it was not possible to investigate differences in effects due to potentially relevant teacher and course characteristics (e.g., faculty gender, rank, age, experience, prior teaching quality and class size). Junior faculty members may, for example, respond differently to the intervention than senior faculty members, who are more experienced but possibly also more set in their ways. The small professor sample is therefore a limitation of this study. Notably, previous studies have found positive effects of SET consultation with teaching assistants (e.g., Hampton and Reiser 2004, 37 TAs) as well as full-time ranked faculty (e.g., Dresel and Rindermann 2011, 12 faculty). Additionally, in their meta-analysis, Penny and Coe (2004) found no differences in the effects of SET consultation between teaching assistants and full-time faculty. These findings suggest that SET consultation is an appropriate intervention for professors of different ages and ranks, but, as noted earlier, the number of experimental studies available is limited, particularly studies with large and diverse samples (Marsh and Roche's 1993 study is one of the few). Further experimental research on this matter is therefore important. With more knowledge of the moderating effects of potentially relevant teacher and course characteristics on the current and more extensive interventions, faculty developers can target and combine the optimal interventions, corresponding to prior aims for improvement.

Finally, some planned improvements may require several lectures to implement successfully, and major changes in the course or lectures cannot always be achieved during the current course. Piccinin et al. (1999) found a delayed effect, in terms of an increase in course ratings, 1–3 years after the initial SET consultation. In such cases, the feedback and consultation may not have an immediate effect on teaching behavior, but may still affect the professor's perception of his or her teaching, goals, attitudes, self-efficacy, and teaching strategies. Related to this is the recommendation, common among researchers on faculty evaluation, to use multiple sources of data to assess teaching quality (Benton and Cashin 2012). In this study, we used students' evaluations of teaching only. Even though student ratings have proven reliable and valid in many settings (see Marsh 2007b), it would be useful for future research to include additional outcome variables to gain insight into the full impact of this intervention.

In terms of scientific relevance, the present study complements recent non-experimental findings (e.g., Dresel and Rindermann 2011; Rindermann et al. 2007) with positive experimental results for a collaborative approach to SET consultation. Additionally, the SET instrument in the current study measured specific teaching dimensions that were relevant and appropriate for intermediate evaluation and comparable with the ratings collected at subsequent measurement occasions. Other researchers have stressed the importance of the relevance, specificity and comparability of student ratings. For example, Marsh and Roche (1993) conducted one of the few studies in which no effects of mid-term collaborative SET consultation were found. They suggested that the mid-term feedback may have been less effective than end-of-term feedback because the course evaluation instrument used (SEEQ) contained items inappropriate for evaluating mid-term teaching effectiveness (e.g., items relating to assignments and examinations), which may have undermined confidence in the intervention in the mid-term group. They did find positive effects of (more appropriate) end-of-term SET consultation on end-of-term ratings collected one semester later. L'Hommedieu et al. (1990, 2007) also questioned the generalizability of mid-term SETs to end-of-term SETs. We therefore recommend that researchers in this field, as well as faculty developers, consider the relevance, specificity and comparability of the student feedback used in (research on) SET consultation.

A final important feature of this study is the use of multilevel regression analyses to take into account systematic variation between professors and between students in their baseline ratings of teaching effectiveness and in their changes over time. We stress the importance of taking such random effects into account in future research, as they significantly influenced the results for every dependent teaching variable investigated in this study.
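For readers planning such analyses, a minimal sketch of how this three-level random-effects structure can be approximated in Python with statsmodels follows. It is a simplified stand-in, not the authors' original specification: the column names and input file (rating, time, condition, student, professor, isq_ratings.csv) are hypothetical, and the student-level random slopes included in the paper's models are omitted for brevity.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per completed questionnaire,
# with columns 'rating', 'time', 'condition', 'student', 'professor'.
df = pd.read_csv("isq_ratings.csv")

# Three-level structure: questionnaires (level 1) within students
# (level 2) within professors (level 3). 'groups' carries the top
# (professor) level; the variance-component formula adds a random
# intercept per student nested within professor.
model = smf.mixedlm(
    "rating ~ time + condition",
    df,
    groups="professor",
    re_formula="~time",                        # random intercept + time slope per professor
    vc_formula={"student": "0 + C(student)"},  # nested random intercept per student
)
result = model.fit(reml=False)
print(result.summary())
```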

In summary, with regard to implications for future research, we conclude that the present results justify further research on this approach to SET consultation on a larger scale with a more complex design. The results are promising, but more experimental research in this field is necessary to corroborate these findings. The use of multilevel regression analyses in future investigations in this field is highly recommended. With regard to implications for future practice, the results of this study indicate that SET consultation is effective in university settings, when feedback is well timed, relevant and specific, and when consultation is collaborative and teacher-centered. Under these conditions, we observed that consultation does not need to be repeated often during a course in order to have a significant impact.

Footnotes

  1. The deviance test is the likelihood-ratio test used to compare models: the −2*log-likelihood of one model is compared with the −2*log-likelihood of the other. The difference follows a Chi square distribution with degrees of freedom equal to the difference in the number of parameters estimated in the models being compared (stated in symbols after these footnotes).

  2. Tables of detailed results on the seven specific dimensions are available on request.
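In symbols, the deviance test described in footnote 1 reads as follows (a standard formulation of the likelihood-ratio test, not taken verbatim from the original):

$$D = (-2\log L_0) - (-2\log L_1) \;\sim\; \chi^2_{\Delta k},$$

where $L_0$ and $L_1$ are the maximized likelihoods of the simpler and the more complex model, respectively, and $\Delta k$ is the difference in the number of estimated parameters.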

Notes

Acknowledgments

We thank Prof. Dr. Conor Dolan for valuable suggestions on the manuscript.

References

  1. Ajzen, I. (1991). The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50, 179–211.
  2. Ajzen, I., & Fishbein, M. (1980). Understanding attitudes and predicting social behavior. Englewood Cliffs, NJ: Prentice-Hall.
  3. Bandura, A. (1977). Social learning theory. New York, NY: General Learning Press.
  4. Bandura, A. (1997). Self-efficacy: The exercise of control. New York, NY: W.H. Freeman.
  5. Benton, S. L., & Cashin, W. E. (2012). Student ratings of teaching: A summary of research and literature (IDEA Paper no. 50). Manhattan, KS: The IDEA Center. http://www.theideacenter.org/sites/default/files/idea-paper_50.pdf. Accessed 12 Mar 2012.
  6. Brinko, K. T. (1990). Instructional consultation with feedback in higher education. Journal of Higher Education, 61, 65–83.
  7. Cohen, P. A. (1980). Effectiveness of student feedback for improving college instruction. Research in Higher Education, 13, 321–341.
  8. De Neve, H. M. F., & Janssen, P. J. (1982). Validity of student evaluation of instruction. Higher Education, 11, 543–552.
  9. Dresel, M., & Rindermann, H. (2011). Counseling university instructors based on student evaluations of their teaching effectiveness: A multilevel test of its effectiveness under consideration of bias and unfairness variables. Research in Higher Education, 52, 717–737.
  10. Eagly, A. H., & Chaiken, S. (1998). Attitude structure and function. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), The handbook of social psychology (4th ed., pp. 269–322). New York, NY: McGraw-Hill.
  11. Hampton, S. E., & Reiser, R. A. (2004). Effects of a theory-based feedback and consultation process on instruction and learning in college classrooms. Research in Higher Education, 45, 497–527.
  12. Hattie, J. A. C., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77, 81–112.
  13. Hox, J. J. (2002). Multilevel analysis: Techniques and applications. Mahwah, NJ: Lawrence Erlbaum Associates.
  14. Kember, D., Leung, D. Y. P., & Kwan, K. P. (2002). Does the use of student feedback questionnaires improve the overall quality of teaching? Assessment & Evaluation in Higher Education, 27, 411–425.
  15. Knapper, C., & Piccinin, S. (1999). Consultation about teaching: An overview. In C. Knapper & S. Piccinin (Eds.), Using consultants to improve teaching (New Directions for Teaching and Learning, no. 79). San Francisco, CA: Jossey-Bass.
  16. L’Hommedieu, R., Menges, R. J., & Brinko, K. T. (1990). Methodological explanations for the modest effects of feedback from student ratings. Journal of Educational Psychology, 82, 232–241.
  17. Lang, J. W. B., & Kersting, M. (2007). Regular feedback from student ratings of instruction: Do college teachers improve their ratings in the long run? Instructional Science, 35, 187–205.
  18. Lenze, L. F. (1996). Instructional development: What works? National Education Association, Office of Higher Education Update, 2, 1–4.
  19. Levinson-Rose, J., & Menges, R. J. (1981). Improving college teaching: A critical review of research. Review of Educational Research, 51, 403–434.
  20. Madden, T., Ellen, P., & Ajzen, I. (1992). A comparison of the theory of planned behavior and the theory of reasoned action. Personality and Social Psychology Bulletin, 18, 3–9.
  21. Marsh, H. W. (1984). Students’ evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility. Journal of Educational Psychology, 76, 707–754.
  22. Marsh, H. W. (1987). Students’ evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11, 253–388.
  23. Marsh, H. W. (2007a). Do university teachers become more effective with experience? A multilevel growth model of students’ evaluations of teaching over 13 years. Journal of Educational Psychology, 99, 775–790.
  24. Marsh, H. W. (2007b). Students’ evaluations of university teaching: Dimensionality, reliability, validity, potential biases and usefulness. In R. P. Perry & J. C. Smart (Eds.), The scholarship of teaching and learning in higher education: An evidence-based perspective (pp. 319–384). New York, NY: Springer.
  25. Marsh, H. W., & Hocevar, D. (1991). Students’ evaluations of teaching effectiveness: The stability of mean ratings of the same teachers over a 13-year period. Teaching and Teacher Education, 7, 303–314.
  26. Marsh, H. W., & Roche, L. A. (1993). The use of students’ evaluations and an individually structured intervention to enhance university teaching effectiveness. American Educational Research Journal, 30, 217–251.
  27. McKeachie, W. J. (1997). Student ratings: The validity of use. American Psychologist, 52, 1218–1225.
  28. McLaughlin, M. W., & Pfeifer, R. S. (1988). Teacher evaluation: Improvement, accountability, and effective learning. New York, NY: Teachers College Press.
  29. Menges, R. J., & Brinko, K. T. (1986). Effects of student evaluation feedback: A meta-analysis of higher education research. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
  30. Penny, A. R., & Coe, R. (2004). Effectiveness of consultation on student ratings feedback: A meta-analysis. Review of Educational Research, 74, 215–253.
  31. Piccinin, S., Cristi, C., & McCoy, M. (1999). The impact of individual consultation on student ratings of teaching. The International Journal for Academic Development, 4, 75–88.
  32. Prebble, T., Hargraves, H., Leach, L., Naidoo, K., Suddaby, G., & Zepke, N. (2004). Impact of student support services and academic development programmes on student outcomes in undergraduate tertiary study: A synthesis of the research. Report to the Ministry of Education, Massey University College of Education.
  33. Richardson, J. T. E. (2005). Instruments for obtaining student feedback: A review of the literature. Assessment and Evaluation in Higher Education, 30, 378–415.
  34. Rindermann, H., Kohler, J., & Meisenberg, G. (2007). Quality of instruction improved by evaluation and consultation of instructors. International Journal for Academic Development, 12, 73–85.
  35. SCO-Kohnstamm Institute. (2002). Rapportage Uvalon. Amsterdam: University of Amsterdam.
  36. SCO-Kohnstamm Institute. (2005). Jaarverslag Uvalon 2003 en 2004. Amsterdam: University of Amsterdam.
  37. Sheppard, B. H., Hartwick, J., & Warshaw, P. R. (1988). The theory of reasoned action: A meta-analysis of past research with recommendations for modifications and future research. Journal of Consumer Research, 15, 325–343.
  38. Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage.
  39. Stes, A., Min-Leliveld, M., Gijbels, D., & Van Petegem, P. (2010). The impact of instructional development in higher education: The state-of-the-art of the research. Educational Research Review, 5, 25–49.
  40. Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or a witch hunt in student ratings of instruction? In M. Theall, P. C. Abrami, & L. A. Mets (Eds.), The student ratings debate: Are they valid? How can we best use them? (New Directions for Institutional Research, no. 109, pp. 45–56). San Francisco, CA: Jossey-Bass.
  41. Van den Putte, B. (1993). On the theory of reasoned action (Doctoral dissertation, University of Amsterdam).
  42. Vorst, H. C. M., & Van Engelenburg, B. (1994). UVALON, UvA-pakket voor onderwijsevaluatie. Amsterdam, The Netherlands: Psychological Methods Department, University of Amsterdam.
  43. Weimer, M., & Lenze, L. F. (1997). Instructional interventions: A review of the literature on efforts to improve instruction. In R. P. Perry & J. C. Smart (Eds.), Effective teaching in higher education: Research and practice (pp. 205–240). New York, NY: Agathon Press.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Mariska H. Knol (1)
  • Rachna in’t Veld (1)
  • Harrie C. M. Vorst (1)
  • Jan H. van Driel (2)
  • Gideon J. Mellenbergh (1)

  1. Department of Psychological Methods, University of Amsterdam, Amsterdam, The Netherlands
  2. ICLON, Graduate School of Teaching, Leiden University, Leiden, The Netherlands
