The participating schools were recruited via a call distributed in the researchers’ network and in online teacher communities. The teachers were informed that the study involved conducting a five-lesson project on the European Union (EU) and that the researchers would focus on students’ learning in small groups. To allocate schools to conditions, each school was alternately assigned to the scaffolding or nonscaffolding condition based on the moment of confirmation: the first school that confirmed participation was allocated to the scaffolding condition, the second school to the nonscaffolding condition, the third school to the scaffolding condition, and so on. Each school had teachers from only one condition, to prevent teachers from different conditions from talking to and influencing each other.
Thirty teachers from 20 Dutch schools participated in this study: 17 teachers from 11 schools in the scaffolding condition and 13 teachers from nine schools in the nonscaffolding condition (never more than three teachers per school). Of the participating teachers, 20 were men and 10 were women. All taught social studies in the 8th grade of pre-vocational education, with an average teaching experience of 10.4 years. Each teacher participated with one class, for a total of 30 classes.
During the project lessons that all teachers taught during the experiment, students worked in small groups. The total number of groups was 184 and the average number of students per group was 4.15. A total of 768 students participated in this study, 455 students in the scaffolding condition and 313 students in the nonscaffolding condition. Of the 768 students, 385 were boys and 383 were girls.
Independent-samples t tests showed that the schools and teachers in the scaffolding and nonscaffolding conditions were comparable with regard to teachers’ years of experience (t(28) = 0.90, p = .38), teachers’ gender (t(28) = 0.51, p = .10), teachers’ subject knowledge (t(24) = 1.16, p = .26), the degree to which the classes were used to doing small-group work (t(23) = −0.87, p = .39), the track of the class (t(28) = 0.08, p = .94), class size (t(28) = −1.32, p = .20), duration of the lessons in minutes (t(28) = −1.18, p = .25), students’ age (t(728) = −0.34, p = .74), and students’ gender (t(748) = −1.65, p = .10) (see Table 1).
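As an aside for replication, baseline comparisons of this kind can be reproduced with standard software. The sketch below runs an independent-samples t test on invented teacher-experience values (hypothetical data, not the study’s data):

```python
# Illustrative baseline check: independent-samples t test on invented
# teacher-experience data (hypothetical values, not the study's data).
from scipy import stats

scaffolding = [12.0, 8.5, 14.0, 9.0, 11.5, 7.0, 13.0, 10.0]
nonscaffolding = [10.5, 9.5, 12.0, 8.0, 11.0, 9.0]

t, p = stats.ttest_ind(scaffolding, nonscaffolding)
# A non-significant p (> .05) is read as evidence of comparability.
print(round(t, 2), round(p, 2))
```

A non-significant result, as in the comparisons reported above, indicates that the conditions did not differ systematically on that variable.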
For this experimental study, we used a between-subjects design. Table 2 shows the timeline of the study.
All teachers taught the same project on the EU, for which they received instructions. The project consisted of five lessons in which the students completed several open-ended assignments in groups of four (e.g., a poster and a letter about the (dis)advantages of the EU). The teachers taught one project lesson per week and composed the groups by mixing students of different genders and ability levels. We used the first and last project lessons for the analyses (the premeasurement and postmeasurement, respectively). In the premeasurement lesson, the students made a brochure about what the EU means for young people in their everyday lives. In the postmeasurement lesson, the students worked on an assignment called ‘Which Word Out’ (Leat 1998): from a list of EU concepts, they had to select three concepts that have much in common and then leave one of the three out, giving two reasons. The students were encouraged to collaborate both by the nature of the tasks (the students needed each other) and by rules for collaboration that were introduced in all classes (e.g., ‘make sure everybody understands it’ and ‘help each other before you ask the teacher’).
Scaffolding intervention programme
We developed and piloted the scaffolding intervention programme in a previous study (Van de Pol et al. 2012); in the present study, the programme started after we had filmed the first project lesson. The programme consisted, successively, of: (1) video observation of project lesson 1, (2) one two-hour theoretical session (taught per school), and (3) video observations of project lessons 2–4, each followed by a 45-minute reflection session with the first author in which video fragments of the teachers’ own lessons were watched and reflected upon. Finally, all teachers taught project lesson 5, which was videotaped. This fifth lesson was not part of the scaffolding intervention programme; it served as the postmeasurement.
The first author, who was experienced, taught the programme. The reflection sessions took place individually (teacher plus first author) and always on the same day as the observation of the project lesson. In the theoretical session, the first author and the teachers: (a) discussed scaffolding theory and the steps of contingent teaching (Van de Pol et al. 2011), i.e., diagnostic strategies (step 1), checking the diagnosis (step 2), intervention strategies (step 3), and checking students’ learning (step 4); (b) watched and analysed video examples of scaffolding; and (c) discussed and prepared the project lessons. In the subsequent four project lessons, the teachers implemented the steps of contingent teaching cumulatively.
Support quality: contingent teaching
For the analyses, we selected all interactions a teacher had with a small group of students about the subject matter (i.e., interaction fragments). An interaction fragment started when the teacher approached a group and ended when the teacher left. Each interaction fragment thus consisted of a variable number of teacher and student turns, depending on how long the teacher stayed with a certain group. In the premeasurement and postmeasurement, respectively, the teachers in the scaffolding condition had 454 and 251 fragments and the teachers in the nonscaffolding condition had 368 and 295 fragments. For the analyses, we randomly selected and transcribed two interaction fragments from the premeasurement and two from the postmeasurement per teacher. Because the selection was random, two interaction fragments from a given teacher’s lesson could, but did not have to, involve the same group of students. This selection resulted in 108 interaction fragments consisting of 4073 turns (teacher and student turns).
The unit of analysis for measuring contingency was a teacher turn, a student turn, and the subsequent teacher turn (i.e., a three-turn-sequence; for coded examples, see Tables 3, 4, 5 and 6). To establish the contingency of each unit, we used the contingent shift framework (Van de Pol et al. 2012; based on Wood et al. 1978). If a teacher used more control after a student’s demonstration of poor understanding and less control after a student’s demonstration of good understanding, we labelled the support contingent. To apply this framework, we first coded all teacher turns and all student turns as follows.
First, we coded all teacher turns for degree of control, ranging from zero to five (see Tables 3, 4, 5 and 6 for coded examples). Control refers to the degree of regulation a teacher exercises in his or her support. Zero represented no control (the teacher is not with the group); one represented the lowest level of control (the teacher provides no new lesson content and elicits an elaborate response by asking a broad, open question); two represented low control (the teacher provides no new content and elicits an elaborate response, mostly an elaboration or explanation, by asking open questions that are slightly more detailed than level-one questions); three represented medium control (the teacher provides no new content and elicits a short response, e.g., yes/no); four represented high control (the teacher provides new content and elicits a response, e.g., by giving a hint or asking a suggestive question); and five represented the highest level of control (e.g., providing the answer). Two researchers coded twenty percent of the data and the interrater reliability was substantial (Krippendorff’s alpha = .71; Krippendorff 2004).
Second, we coded the student understanding demonstrated in each turn into one of the following categories: miscellaneous, no understanding can be determined, poor/no understanding, partial understanding, or good understanding (cf. Nathan and Kim 2009; Pino-Pasternak et al. 2010; see Tables 3, 4, 5 and 6 for examples). Two researchers coded twenty percent of the data and the interrater reliability was satisfactory (Krippendorff’s alpha = .69). The contingency score was the percentage of contingent three-turn-sequences relative to the total number of three-turn-sequences per teacher per measurement occasion. Each class thus had a single contingency score; that is, the contingency score was the same for all students of a particular class. The first author, who knew which teacher was in which condition, coded the data. We prevented bias by coding in separate rounds: first, we coded all teacher turns for degree of control; second, we coded all student turns for demonstrated understanding; only then did we apply the predetermined contingency rules to all three-turn-sequences.
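The contingent shift rule and the resulting contingency score can be summarised in executable form. The sketch below is a simplified illustration: the function names, the data layout, and the handling of partial or unclear understanding are our assumptions, not the full published framework (Van de Pol et al. 2012).

```python
# Simplified sketch of the contingent-shift scoring rule. Control codes
# run 0-5; student understanding is 'poor', 'partial', or 'good'. The
# handling of partial/unclear turns is an assumption for illustration.

def is_contingent(control_before, understanding, control_after):
    """Score one three-turn sequence (teacher, student, teacher)."""
    if understanding == 'poor':
        return control_after > control_before   # more control: contingent
    if understanding == 'good':
        return control_after < control_before   # less control: contingent
    return None  # partial/unclear: not scored in this sketch

def contingency_score(sequences):
    """Percentage of contingent sequences among the scored sequences."""
    scored = [is_contingent(*s) for s in sequences]
    scored = [s for s in scored if s is not None]
    return 100 * sum(scored) / len(scored)

fragments = [(2, 'poor', 4), (4, 'good', 2), (3, 'good', 4), (1, 'poor', 3)]
print(contingency_score(fragments))  # 75.0: three of four are contingent
```

In the third sequence the teacher increased control after good understanding, so that sequence is non-contingent; the other three follow the rule.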
Independent working time
We determined the average duration (in seconds) of independent working time per group per measurement occasion (T0 and T1). We did not treat short whole-class instructions (≤2 min) as interruptions; their duration was included in the independent working time for each group. If the teacher provided a whole-class instruction that lasted longer than 2 min, we resumed counting only after that instruction had finished, so its duration was not included in the independent working time for any group.
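The rule above can be sketched as follows; representing a lesson as a total duration plus a list of whole-class instruction durations is an assumption made purely for illustration.

```python
# Sketch of the independent-working-time rule: whole-class instructions
# of at most 2 minutes still count as independent working time; longer
# ones are excluded. The data representation is an assumption.

THRESHOLD = 120  # seconds (2 minutes)

def independent_working_time(lesson_seconds, instruction_durations):
    """Independent working time = lesson time minus long instructions."""
    excluded = sum(d for d in instruction_durations if d > THRESHOLD)
    return lesson_seconds - excluded

# A 50-minute lesson with a 90 s and a 300 s whole-class instruction:
print(independent_working_time(50 * 60, [90, 300]))  # 2700 seconds
```

Only the 300-second instruction exceeds the threshold, so only its duration is subtracted from the lesson time.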
Task effort

We measured students’ task effort in class with a questionnaire consisting of five items (cf. Boersma et al. 2009; De Bruijn et al. 2005), using a five-point Likert scale ranging from ‘I don’t agree at all’ to ‘I totally agree’. The internal consistency was high: Cronbach’s α (Cronbach 1951) was .92, well above the cut-off of .70/.80 indicated by Kline (1999). An example item is: “I worked hard on this task”.
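For transparency, Cronbach’s α can be computed directly from an item-score matrix. The scores below are invented; only the formula corresponds to the internal-consistency statistic reported here.

```python
# Cronbach's alpha from an item-score matrix (rows = students, columns =
# five Likert items). The scores are invented example data.
import numpy as np

def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of sum scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

scores = np.array([
    [5, 4, 5, 5, 4],
    [3, 3, 2, 3, 3],
    [4, 4, 4, 5, 4],
    [2, 1, 2, 2, 2],
    [5, 5, 4, 5, 5],
])
alpha = cronbach_alpha(scores)
print(round(alpha, 2))
```

With highly consistent invented responses such as these, α comes out well above the .70/.80 cut-off.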
Appreciation of support
We measured students’ appreciation of the support received with a questionnaire consisting of three items (cf. Boersma et al. 2009; De Bruijn et al. 2005), using a five-point Likert scale ranging from ‘I don’t agree at all’ to ‘I totally agree’. The internal consistency was high: Cronbach’s α was .90. An example item is: “I liked the way the teacher helped me and my group”.
Achievement: multiple choice test
We measured students’ achievement with a test that we constructed, consisting of 17 multiple-choice questions (each with four answer options). An example question is: “The main reason for the collaboration between countries after World War II was: (a) to be able to compete more with other countries, (b) to be able to transport goods, people and services across borders freely, (c) to collaborate with regard to economic and trade matters, or (d) to be able to monitor the weapons industry.” The item difficulty was sufficient, as the p-values (i.e., the percentage of students who correctly answered an item) were all between .31 and .87 (Haladyna 1999). The items also discriminated well: the mean item-total correlation (the correlation between the item score and the total test score) was .33 and the lowest was .21, above the threshold of .20 (Haladyna 1999). We used the number of questions answered correctly as the score in the analyses (minimum 0, maximum 17). The internal consistency was high: Cronbach’s α was .79.
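Item difficulty and item discrimination as used here can be computed from a binary response matrix; the sketch below uses a small invented matrix for illustration.

```python
# Item difficulty (proportion correct) and item discrimination
# (item-total correlation) for an invented binary response matrix
# (rows = students, columns = items; 1 = correct, 0 = incorrect).
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
])

difficulty = responses.mean(axis=0)   # p-value per item
totals = responses.sum(axis=1)        # total test score per student
discrimination = np.array([
    np.corrcoef(responses[:, j], totals)[0, 1]
    for j in range(responses.shape[1])
])
print(difficulty, discrimination.round(2))
```

Note that this uses the uncorrected item-total correlation, matching the description above; a corrected version would correlate each item with the total score excluding that item.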
Achievement: knowledge assignment
We additionally measured students’ achievement with a knowledge assignment, which consisted of three series of three concepts (e.g., EU, European Coal and Steel Community (ECSC), and European Economic Community (EEC)). The students were asked to leave out one concept and give one reason for leaving it out. We developed a coding scheme to code the accuracy and quality of the reasons, awarding zero, one, or two points per reason. We awarded zero points when the reason was inaccurate or based only on linguistic properties of the concepts (e.g., two of the three concepts contain the word ‘European’). We awarded one point when the reason was accurate but used only peripheral characteristics of the concepts (e.g., one concept is left out because the other two concepts are each other’s opposites). We awarded two points when the reason was accurate and focused on the meaning of the concepts (e.g., the ECSC can be left out because it focused only on regulating coal and steel production, whereas the other two (EU and EEC) had broader goals relating to the economy in general). The minimum score on the knowledge assignment was 0 and the maximum was 6. Two researchers coded over 10% of the data and the interrater reliability was substantial (Krippendorff’s alpha = .83).
For our analyses, we used IBM’s Statistical Package for the Social Sciences (SPSS) version 22.
Our predictor variables contained only seven missing values, which we handled with the expectation–maximization algorithm. For the knowledge assignment and the multiple-choice test, we coded missing questions as zero, meaning the answer was considered incorrect, as is the usual procedure in school (this concerned never more than eight percent per case). For the task-effort and appreciation-of-support questionnaires, we computed the mean scores per measurement occasion and per subscale over only the questions that were answered. If a student missed all measurement occasions, or completed only the questionnaire or only one of the knowledge tests at a single measurement occasion, we removed the case (N = 18), leaving a total of 750 students (445 in the scaffolding condition; 305 in the nonscaffolding condition).
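The questionnaire rule, taking the scale score as the mean over the items a student actually answered, corresponds to a NaN-aware mean; the item values below are invented.

```python
# Scale score = mean over answered items only (NaN = missing item).
# The five answer values are invented example data.
import numpy as np

answers = np.array([4.0, 5.0, np.nan, 4.0, np.nan])  # 5 questionnaire items
scale_score = np.nanmean(answers)  # mean over the three answered items
print(scale_score)
```

A plain mean would propagate the missing values; `nanmean` divides by the number of answered items instead.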
To check the effect of the intervention on teachers’ contingency and on the independent working time per group, we used a repeated-measures ANOVA with condition as the between-groups variable, measurement occasion as the within-groups variable, and contingency or mean independent working time as the dependent variable. If both the level of contingency and the independent working time differed systematically between conditions over measurement occasions, we would not use ‘condition’ as an independent variable in subsequent analyses, because there would then be more than one systematic difference between conditions. Instead, we would use the variables ‘contingency’ and ‘independent working time’, so as to investigate the separate effects of these variables on students’ achievement, task effort, and appreciation of support.
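In a 2 × 2 mixed design such as this manipulation check, the condition × occasion interaction is statistically equivalent to comparing pre-to-post gain scores between conditions with an independent-samples t test. The sketch below illustrates this with invented contingency scores.

```python
# Condition x occasion interaction in a 2 x 2 mixed design, tested via
# gain scores (post minus pre). All contingency scores (%) are invented.
import numpy as np
from scipy import stats

pre_scaffold = np.array([40.0, 35.0, 45.0, 38.0, 42.0])
post_scaffold = np.array([70.0, 66.0, 75.0, 68.0, 72.0])
pre_nonscaffold = np.array([41.0, 37.0, 44.0, 39.0, 40.0])
post_nonscaffold = np.array([45.0, 40.0, 48.0, 42.0, 44.0])

gain_scaffold = post_scaffold - pre_scaffold           # large gains
gain_nonscaffold = post_nonscaffold - pre_nonscaffold  # small gains
t, p = stats.ttest_ind(gain_scaffold, gain_nonscaffold)
print(round(t, 2), p < 0.05)
```

A significant difference in gains would indicate that the conditions developed differently over the measurement occasions, i.e., that the manipulation worked.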
Effects of scaffolding
To test our hypothesis about the effect of contingency on achievement, and to explore the effects of contingency on students’ task effort and appreciation of support, we used multilevel modelling, as the data had a nested structure (measurement occasions within students, within groups, within classes, within schools). To facilitate the interpretation of the regression coefficients, we transformed the scores of all continuous variables into z-scores (mean of zero, standard deviation of one). We treated measurement occasions (level 1) as nested within students (level 2), students as nested within groups (level 3), groups as nested within teachers/classes (level 4), and teachers/classes as nested within schools (level 5). Comparing null models (with no predictor variables) with varying numbers of levels for all dependent variables showed that the school level (level 5) did not contribute significantly to the variance, so we omitted it as a level. For the multiple-choice test only, the group level also did not contribute significantly to the variance, so we omitted it as a level for that outcome.
We fitted four-level models for each of the dependent variables separately. The independent variables were measurement occasion (premeasurement = 0; postmeasurement = 1), contingency, and mean independent working time. Because task effort is known to affect achievement (Fredricks et al. 2004), we included it as a covariate in a separate analysis of achievement (multiple-choice test and knowledge assignment). For each dependent variable, the model in which the intercept and the effects for teachers/classes and groups were considered random, with an unrestricted covariance structure, gave the best fit and was therefore used. We included the main effects of each of the independent variables and all interactions (i.e., the two-way interactions between measurement occasion and contingency, between measurement occasion and independent working time, and between contingency and independent working time, and the three-way interaction between measurement occasion, contingency, and independent working time). To test our hypothesis regarding achievement, we were specifically interested in the interaction between occasion and contingency. To check whether differences in independent working time played a role in whether contingency affected achievement, we were additionally interested in the three-way interaction between occasion, contingency, and independent working time.
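As an illustration of the modelling approach only (not the actual four-level analysis, which included additional levels and random effects), the sketch below fits a two-level mixed model, occasions nested within students, with an occasion × contingency interaction on simulated data; all variable names and values are invented.

```python
# Much-simplified two-level sketch (occasions within students) of the
# modelling approach; the reported analysis had four levels and random
# teacher/class and group effects. All names and data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_students = 40
df = pd.DataFrame({
    'student': np.repeat(np.arange(n_students), 2),
    'occasion': np.tile([0, 1], n_students),  # 0 = pre, 1 = post
    'contingency': np.repeat(rng.normal(size=n_students), 2),
})
# Simulate achievement with an occasion x contingency interaction (0.4),
# a student-level random intercept, and residual noise.
student_effect = np.repeat(rng.normal(scale=0.3, size=n_students), 2)
df['achievement'] = (0.3 * df['occasion']
                     + 0.4 * df['occasion'] * df['contingency']
                     + student_effect
                     + rng.normal(scale=0.5, size=len(df)))

model = smf.mixedlm('achievement ~ occasion * contingency',
                    data=df, groups=df['student'])
result = model.fit()
print(result.params['occasion:contingency'])  # simulated effect was 0.4
```

The coefficient of interest is the occasion × contingency interaction, mirroring the hypothesis test described above; the random intercept per student absorbs the within-student dependence of the two measurement occasions.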
To explore the effects of contingency on students’ task effort and appreciation of support, we were again primarily interested in the interaction effect between occasion and contingency. Second, we checked the role of independent working time by examining the three-way interaction between occasion, contingency, and independent working time.
As an indication of effect size, we report partial eta squared (ηp²) for the manipulation checks of contingency and independent working time, and the explained variance for the multilevel analyses (the squared correlation between the students’ true scores and the estimated scores). We report effect sizes only for significant effects.