Pre-service teachers’ evidence-based reasoning during pedagogical problem-solving: better together?

This study investigates whether collaboration and the level of heterogeneity between collaborating partners’ problem-solving scripts influence the extent to which pre-service teachers engage in evidence-based reasoning when analyzing and solving pedagogical problem cases. We operationalized evidence-based reasoning through its content and process dimensions: (a) to what extent pre-service teachers refer to scientific theories or evidence of learning and instruction (content level) and (b) to what extent they engage in epistemic processes of scientific reasoning (process level) when solving pedagogical problems. Seventy-six pre-service teachers analyzed and solved a problem about an underachieving student either individually or in dyads. Compared with individuals, dyads of pre-service teachers referred less to scientific content, but engaged more in hypothesizing and evidence evaluation and less in generating solutions. Greater dyadic heterogeneity was associated with less engagement in generating solutions. Thus, collaboration may be a useful means for engaging pre-service teachers in analyzing pedagogical problems in a more reflective and evidence-based manner, but pre-service teachers may still need additional scaffolding to ground their analysis in scientific theories and evidence. Furthermore, groups that are heterogeneous with regard to the collaborating partners’ problem-solving scripts may require further instructional support to discuss potential solutions to the problem.


Introduction
Many voices, both from educational policy-making and from educational research, emphasize a need for evidence-based practice in teaching (EC 2013;Petty 2009). This means that teachers should be able to ground their decisions and the way they handle pedagogical problems on scientific evidence (Hetmanek et al. 2015;Spencer et al. 2012). For example, when noticing that a particular student seems to lack study motivation in a lesson, a teacher might draw on theory and research conducted in the context of self-determination theory to face this problem (e.g., she might decide to provide the student autonomy with respect to the kind of activity she wants to engage in to address her need for autonomy; Ryan and Deci 2000).
Developing skills related to evidence-based teaching is no doubt an ongoing task for teachers on their course towards professionalism. Yet, we argue that pre-service teacher education is of particular importance in this context: During university education, pre-service teachers can be systematically exposed to typical (often simulated) problems from classroom practice and be scaffolded to engage in evidence-based teaching and reasoning activities that eventually might become part of their teaching repertoire in later phases of their career. Nevertheless, empirical research shows that pre-service teachers often have difficulties with the application of scientific knowledge (theories and evidence) to problems from educational practice (Klein et al. 2015; Yeh and Santagata 2015). Furthermore, they often do not succeed in reasoning about the pedagogical problems they encounter in a systematic, hypothetico-deductive manner (see Kiemer and Kollar 2018): Very often, they have difficulties in identifying the problems, and even if they do identify them, they are often not able to explain and solve those problems in a systematic way (Sherin and Van Es 2009).
One way to help pre-service teachers develop skills for evidence-based teaching is to have them reflect on cases from educational practice (e.g., Piwowar et al. 2018). From research on collaborative learning, we may speculate that this might be especially effective when it is done in collaboration with a peer (e.g., Baeten and Simons 2014; Chi and Wylie 2014). Yet, even though there is a lot of research on the benefits of collaborative learning in other domains (for an overview, see Springer et al. 1999), there is not much evidence on whether collaborative learning helps the development of evidence-based teaching skills. Further, it is unclear how exactly collaboration should be organized to be effective in this context. One issue here concerns the effects of different group compositions: While some research suggests that heterogeneous groups typically outperform homogeneous groups, other findings seem to suggest the opposite (Bowers et al. 2000; Rummel and Spada 2005; Wiley et al. 2013). Also, it is an open question with respect to which characteristics groups should be homo- or heterogeneous. One dimension that has hardly been investigated in the past is the homo-/heterogeneity of group members' procedural knowledge on how to approach a given task. In the context of this article, such knowledge will be termed "problem-solving scripts" (Kiemer and Kollar 2018).
This article therefore has two aims: First, we are interested in how groups of pre-service teachers (in our case dyads) compare with individuals when it comes to how they solve authentic pedagogical problem cases. We are interested in the effects of collaboration both on (a) the extent to which they apply scientific evidence and (b) the extent to which they approach such problems in a systematic, hypothetico-deductive manner. Second, when such cases are solved collaboratively, we look at the effects of the homogeneity or heterogeneity of the collaborating pre-service teachers' problem-solving scripts (i.e., the procedural knowledge on problem-solving they apply during problem-solving).

Evidence-based teaching as a venue for scientific reasoning
In many professions, the ability to solve domain-specific problems in accordance with scientific evidence has been labeled "evidence-based practice" (e.g., Spencer et al. 2012). When conceptualizing evidence-based practice in the teaching domain, and when conceptualizing the underlying cognitive processes, we propose to heuristically differentiate between two different, but related, kinds of reasoning processes: (a) the application of scientific knowledge to the problem (content level) and (b) the reasoning about the problem in a systematic, scientific manner (process level). At the content level, when pre-service teachers solve authentic problems from their future practice, they need to be able to apply theoretical and empirical knowledge from their domain (Gruber et al. 2000). In this sense, teachers are expected to refer to relevant psychological and educational theories on learning, motivation, instruction, etc. and related research findings when they are confronted with a pedagogical problem (e.g., how best to scaffold an underperforming student; Csanadi et al. 2015;Voss et al. 2011).
At the process level, evidence-based reasoning can be viewed as an inquiry process (Klahr and Dunbar 1988) that is characterized by an engagement in certain epistemic processes while analyzing and solving a problem . Based on , we argue that when being confronted with problems from their actual practice, pre-service teachers should engage in a set of eight epistemic processes that are similar to those that scientists engage in when solving a research problem. They (1) need to identify the problem itself (problem identification), (2) ask questions or make statements that guide their further exploration on the problem (questioning), (3) set up candidate explanations for the problem (hypothesis generation), (4) take into account or generate further information necessary to understand or solve the problem (evidence generation), (5) evaluate the information in the context of their hypotheses (evidence evaluation), (6) plan interventions/solutions (constructing artefacts), (7) engage in discussions with others to re-evaluate their thoughts (communicating and scrutinizing), and (8) sum up their process to arrive at well-warranted conclusions on how to explain and/or solve the problem (drawing conclusions).
Nevertheless, as prior research has shown, pre-service teachers seem to have difficulties on both the content and process levels of evidence-based reasoning (Csanadi et al. 2015;Stark et al. 2009;Yeh and Santagata 2015). With respect to the content level, Lockhorst et al. (2010), for example, found that teachers indicated a lack of scientific knowledge to assess pupils' learning processes. Moreover, pre-service teachers seem to rarely refer to scientific sources (Csanadi et al. 2015), and they show difficulties with applying their knowledge to solve pedagogical problems (McNeill and Knight 2013). In rather rare cases in which they use scientific theories and models, they often do so in a superficial manner (Sampson and Blanchard 2012;Stark et al. 2009). With respect to the process level, van de Pol et al. (2011) showed that teachers often fail to collect enough evidence to properly make decisions on how to scaffold their pupils. Similarly, when pre-service teachers assess learning processes, for example, by watching videos of teacher-student interactions in the classroom, they seem to show relatively lower engagement in generating evidence-based hypotheses on their own compared with their performance after a methods course (Yeh and Santagata 2015).
Taken together, pre-service teachers show difficulties with respect to both dimensions of scientific reasoning, i.e., referring to scientific theories and evidence and performing epistemic processes, when it comes to solving pedagogical problems from their future professional practice. Thus, ways to help pre-service teachers build up skills for evidence-based teaching are urgently needed.
Reasoning about authentic problem cases: Better together?
One potential way to foster pre-service teachers' evidence-based reasoning is to implement more opportunities for collaborative reflection on authentic problem cases in teacher education (e.g., Kolodner 2006). Through case-based reasoning (CBR), students may gradually develop new strategies to understand and solve new problems (e.g., Tawfik and Kolodner 2016). When applying these strategies to authentic problem cases, they can test and practice these newly acquired strategies in simulated real-world situations.
Very often, CBR is done in groups (Kolodner et al. 2003). From a sociocognitive perspective (Chi 2009), collaborating partners may mutually stimulate or "scaffold" each other by articulating and building on the knowledge they individually bring to the discussion (Asterhan and Schwarz 2009; Chi and Wylie 2014). The potential of collaboration for pre-service teachers has particularly been put forward by proponents of co-teaching (e.g., Rytivaara and Kershner 2012). This research argues that the engagement in collaborative discourse and the exchange of different perspectives and scientific knowledge may contribute to teachers' professional growth towards becoming more reflective and evidence-based practitioners (Baeten and Simons 2014; Birrell and Bullough 2005; Korthagen and Kessels 1999). In one study, King (2006) surveyed 161 pre-service teachers with respect to their views regarding the potentials and problems of "paired placement," i.e., of sending pre-service teachers during initial teacher education to schools in pairs rather than individually. Among other findings, more than 80% of the pre-service teachers preferred being in a paired placement over being sent to school individually, and more than 70% of them stated that they felt that they learned a lot from attending their peer's lessons. A more recent study by Simons and Baeten (2016) further found that teacher mentors also attribute a high potential to co-teaching for fostering pre-service teachers' professional growth.
Unfortunately though, research on how exactly pre-service teachers collaborate while solving authentic problem cases seems to be very scarce. Yet, research from other areas and with different populations has produced a wealth of evidence demonstrating the strong potential of collaborative vs. individual learning for knowledge acquisition and problem-solving (for an overview, see Springer et al. 1999). One study, for example, observed children working with a computer simulation either alone or in dyads. They had the opportunity to use different control keys to steer a "spaceship" while their main task was to find out the function of a "mystery key." This task gave the students the opportunity to develop hypotheses about the function of the mystery key as well as to test their assumptions. Dyads engaged significantly more in hypothesizing than individuals did and produced proportionally more interpretive vs. descriptive talk than individuals. In another study (Okada and Simon 1997), male science major undergraduates participated in an online simulation environment and were asked to conduct molecular genetics experiments either as individuals or as pairs. Results showed that pairs of students engaged more frequently in hypothesizing and in evidence evaluation (planned experiments to test the initial hypothesis) than individuals. As the authors note, one reason for the heightened engagement in explanatory behavior that the dyads showed may be the need to communicate in a more explicit manner (Okada and Simon 1997). Such discussions can also be part of a coordination attempt for meaning-making and knowledge co-construction (Roschelle and Teasley 1995; Weinberger et al. 2007), which has been described as a relevant aspect of scientific reasoning as well.
Despite these studies showing positive effects of collaboration on different epistemic processes, other studies demonstrate that collaboration does not always have positive effects.
In co-teaching, for example, planning the lesson with a partner requires coordination between the teachers (Jang 2008;Nokes et al. 2008). This might lead to difficulties, especially in case of disagreements between teaching partners (Bullough et al. 2002), as collaboration might become time-consuming (Jang 2008) and contribute to a higher perceived workload (Nokes et al. 2008). These findings resonate with evidence from further research on collaborative learning that has been conducted in other contexts. For example, Strijbos et al. (2004) as well as Weinberger et al. (2010) have shown that unclarified roles, i.e., who should do what and when during collaboration, may slow down or hinder the development of mutual understanding and constructive communication processes (Rummel and Spada 2005). Sociopsychological research further shows that groups often rely excessively on discussing what everybody in the group already knows anyway ("shared knowledge"), while knowledge and ideas that only single group members possess ("unshared knowledge") are often discussed to a much lesser extent (e.g., Wittenbaum and Stasser 1996).
Given these equivocal results on the effects of collaboration on problem-solving, it seems that further variables might have an impact on whether groups benefit from collaboration or not. In their review on the effects of co-teaching, Baeten and Simons (2014), for example, mention that the successfulness of collaboration between pre-service teachers may be threatened when there is a lack of compatibility between teaching partners (e.g., different conceptions on teaching). This implies that group composition (i.e., who collaborates with whom) may be a factor that may impact the successfulness of collaboration. In the following, we argue that in particular pre-service teachers' homo-vs. heterogeneity of their procedural knowledge on how to tackle problematic classroom situations (i.e., their problem-solving scripts) might be crucial in this regard.
Group composition as a factor influencing the effectiveness of collaborative learning: homo-/heterogeneity of collaborating teachers' problem-solving scripts
The question whether group members should be similar (i.e., homogeneous) or dissimilar (i.e., heterogeneous) to each other (e.g., Bowers et al. 2000) in order to promote student engagement in high-level knowledge construction processes has already been addressed in pre-service teacher education. For example, studies in the context of pre-service teacher professionalization emphasize the importance of heterogeneous "paired placements" (Nokes et al. 2008) in order to foster a more reflective integration between scientific knowledge and practice (e.g., Korthagen and Kessels 1999). Further arguments favoring heterogeneous over homogeneous group composition can be derived from the broader research on collaborative learning. As Paulus (2000) described, in heterogeneous groups, reasoners may stimulate each other by bringing new perspectives to a problem-solving task. A meta-analysis by Bowers et al. (2000) concludes that while homogeneous teams tend to perform better on low-complexity tasks that are well-defined (e.g., solving a puzzle), heterogeneous teams perform better on rather complex problem-solving tasks (e.g., business games). Similar findings have been reported by Canham et al. (2012): They asked university students to solve statistical probability problems either in homogeneous or in heterogeneous dyadic groups. Homo-/heterogeneity of the dyads was varied through the type of training that group members received before collaboration. In homogeneous dyads, both members received the same type of training, while in the heterogeneous condition, the two members of each dyad received different types of training. After their training, dyads solved both problems that they had practiced during their training (familiar problems) and novel ones (transfer problems). Results showed that although homogeneous dyads performed better on solving familiar problems, heterogeneous dyads performed better on transfer problems.
Yet, even though evidence for a superiority of heterogeneous over homogeneous dyads, especially with regard to solving more complex problems, seems to be strong, working in heterogeneous groups may also come with costs. For example, heterogeneous problem-solving partners may need more time and effort to reach a common understanding of the given problem and coordinate their strategies on how to solve that problem (Bullough et al. 2002; Rummel and Spada 2005). Taken together, findings indicate that, in comparison with homogeneous pairings, heterogeneous learning partners may need more time to coordinate their ideas with each other, but they may be more reflective and evidence-based when solving complex problems.
Another open question is with respect to which criterion groups should or should not be heterogeneous. While previous studies focused, for example, on homo-/heterogeneity regarding expertise (Wiley and Jolly 2003; Wu et al. 2015) or demographic background (Curşeu and Pluut 2013), the present article is particularly interested in homo-/heterogeneity in terms of the partners' "problem-solving scripts." We define problem-solving scripts as individuals' knowledge and expectations on how to reason about problems from professional practice and assume that these scripts guide pre-service teachers' understanding and behavior in problem-solving situations (Schank 1999). Evidence for the power of scripts during professional problem-solving comes especially from the medical field. For example, Charlin et al. (2007) showed that physicians possess clinical reasoning scripts, e.g., "illness scripts," that guide their diagnostic behavior. As an example, a physician meeting a pale patient may notice a problem, engage in a systematic evidence generation procedure (e.g., listening to the symptoms, conducting further investigations), and generate and revise plausible hypotheses (diagnoses) (Charlin et al. 2007). Similar problem-solving processes might also occur among pre-service teachers when analyzing authentic problem cases: For example, in the description of a hypothetical classroom situation, one pre-service teacher might detect that one student underperforms in a given task. This pre-service teacher may then generate hypotheses, i.e., explanations of the problem (e.g., that the student's overall ability is too low or that the teacher's explanations were too complicated); evaluate the evidence (e.g., by looking at the student's grades, she might decide that the student's ability may not play a role); and generate solutions (e.g., that the teacher in the scenario should improve her explanations so that students can comprehend them better).
Yet, another pre-service teacher may hold a somewhat different problem-solving script. For example, she might start by mentally going through past experiences and then quickly jump to conclusions without further evaluating the available information. If these two hypothetical pre-service teachers were to work together when reflecting on an authentic problem case from teaching practice, their problem-solving scripts would have to be conceptualized as rather heterogeneous. Whether this heterogeneity would stimulate or rather hinder effective collaboration in the pair is one open question that we aim to answer in the empirical study we describe in the following.

Research questions
In our empirical study, we were interested in whether and how pre-service teachers engage in scientific reasoning while reflecting on authentic problem cases from their future professional practice. More specifically, we asked:
RQ1: Do dyads of pre-service teachers differ from individual pre-service teachers in the extent to which they (a) refer to scientific theories and evidence and (b) engage in different epistemic processes of scientific reasoning when they reflect on authentic problem cases from teaching practice?
Based on earlier studies from co-teaching (Baeten and Simons 2014) as well as from psychologically oriented research on collaborative learning , we hypothesize that dyadic reasoners might be more reflective, i.e., engage in epistemic processes of scientific reasoning (Okada and Simon 1997) to a larger extent than pre-service teachers who engage in individual reasoning. Yet, due to possible coordination problems in groups (Strijbos et al. 2004) or biases in information sharing (Wittenbaum and Stasser 1996), we do not assume this difference to be particularly large.
RQ2: Within dyads, does the degree of heterogeneity regarding group members' problem-solving scripts affect the extent to which dyads of pre-service teachers (a) refer to scientific theories and evidence and (b) engage in different epistemic processes of scientific reasoning when reflecting on authentic problem cases?
First, in accordance with findings that show that heterogeneous partners may mutually stimulate each other in complex problem-solving tasks (Bowers et al. 2000; Wiley et al. 2013), we assume that the level of dyadic heterogeneity with respect to its members' problem-solving scripts would stimulate (Bowers et al. 2000; Paulus 2000) a reflective dialogue. In other words, the more heterogeneous the members of a dyad are, the more they should engage in epistemic processes of scientific reasoning. Yet, as high levels of heterogeneity might also increase coordination demands (Bullough et al. 2002; Rummel and Spada 2005), we do not expect this effect to be very large.

Participants and design
Seventy-six teacher education students (59 female, M_age = 21.22, SD = 3.98) from a German university participated in the study. They were in their first (N = 41), second (N = 26), third (N = 4), fourth (N = 2), or fifth (N = 2) semester (N = 1 missing). They received course credit for their participation. To answer research question 1, we varied the independent variable "reasoning setting" (individual setting vs. dyadic setting): Each participant was randomly assigned to either an individual (16 students, 13 female, M_age = 22.31, SD = 6.73) or a dyadic (60 students, i.e., 30 dyads, 46 female, M_age = 20.93, SD = 2.85) condition. To answer research question 2, we calculated an index to measure each dyad's degree of heterogeneity based on a test in which students individually described how they approach authentic problems from pedagogical practice (see below). The resulting score on the heterogeneity index was then used as a predictor for the use of scientific theories and evidence as well as for the dyads' engagement in the different epistemic processes of scientific reasoning. Thus, homo- vs. heterogeneity was not experimentally manipulated, but rather computed post hoc.

Procedure
The procedure of the study consisted of four steps. Regardless of the condition (dyadic vs. individual), in the first three steps, every student participated individually. In the last (problem-solving) step, students participated either as dyads or individually depending on the condition they were assigned to. First, students filled in a computer-based questionnaire on demographic variables. After that, they were given a computer-based card-sorting task to measure their problem-solving scripts (see below). Then, they were given 5 minutes to read five printed-out presentation slides that included scientific content information originating from their introductory psychology class and short descriptions of theories and concepts: on short- and long-term memory (Atkinson and Shiffrin 1968), the levels of encoding (Craik and Lockhart 1972), and a classification of learning strategies (Wild 2006). After that, participants were presented with an authentic problem from their future professional practice, which was the following: "You are a teacher in a school. One of your students receives low grades in comparison to others. The student looks motivated and it seems she understands the content. You know from the parents that she studies diligently at home. You as a teacher, please find possible reasons and a solution to the problem." Participants had 10 minutes to deal with the problem. During these 10 minutes, students in the individual condition were asked to think aloud while solving the problem, and dyads were asked to orally discuss the problem. All think-aloud data and collaborative discussions were audio-recorded and transcribed for further analysis (see below). Data from this problem-solving process were used to measure students' use of scientific theories and evidence as well as their engagement in the epistemic processes of scientific reasoning (see below). Finally, students were thanked for their participation and debriefed. The whole session took about one hour for each individual or dyad.

Independent variables
Reasoning setting. As mentioned, reasoning setting was varied by randomly assigning participants to the problem-solving phase either as individuals or dyads. All dyads were randomly established.
Degree of homo-/heterogeneity of problem-solving scripts within dyads (dyadic heterogeneity). The degree of homo-/heterogeneity with regard to the students' problem-solving scripts in the dyadic reasoning setting was not manipulated experimentally, but rather assessed post hoc on the basis of the results of the card-sorting task that preceded the problem-solving phase and during which students participated individually. First, they were presented with the practice-related problem case described above. Then, they were asked to use a set of prefabricated cards available on an MS PowerPoint slide to indicate what (epistemic) processes they would perform while solving the presented problem. The eight epistemic processes of scientific reasoning and five additionally selected "distractor" processes (e.g., "giving feedback," "improvising") were written on the cards that were presented to the participants. Besides that, five blank cards were provided to give participants the opportunity to add further processes if they wanted to. From the resulting process sequences, participants' problem-solving scripts were coded in the following way: First, we counted those cards representing one of the eight epistemic processes that were selected by both dyadic members. This number represented their shared knowledge component index (SKCI) on scientific reasoning. Then, we calculated a disagreement on position index (DPI) between dyadic members by counting how many of the epistemic processes they agreed on (SKCI) would need to be switched in position so that both members' selections show the same sequence of the shared epistemic processes. We also calculated a pooled knowledge component index (PKCI) on scientific inquiry by counting the epistemic processes that at least one dyadic member had selected.
Finally, a homogeneity index was calculated as (SKCI − DPI)/PKCI to account for agreements while, at the same time, controlling for disagreements between the two members of a group. Larger values on this index thus indicated more dyadic homogeneity, while lower values indicated more dyadic heterogeneity.
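The index can be illustrated with a minimal sketch for a single dyad. Several points are assumptions and not taken from the study: each member's card sort is represented as an ordered list of process labels, each card is used at most once, and the DPI is operationalized as the number of shared processes that fall outside the longest common subsequence of the two orderings (i.e., the smallest number of shared cards that would have to be moved for the two sequences to match). The text does not specify the exact counting rule for the DPI, and the function names are hypothetical.

```python
from typing import List, Sequence

# The eight epistemic processes named in the text; the labels are illustrative.
EPISTEMIC = {
    "problem identification", "questioning", "hypothesis generation",
    "evidence generation", "evidence evaluation", "constructing artefacts",
    "communicating and scrutinizing", "drawing conclusions",
}


def _lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def _epistemic_only(cards: Sequence[str]) -> List[str]:
    """Drop distractor and free-text cards; keep order, remove repeats."""
    return list(dict.fromkeys(c for c in cards if c in EPISTEMIC))


def homogeneity_index(cards_a: Sequence[str], cards_b: Sequence[str]) -> float:
    """(SKCI - DPI) / PKCI for one dyad, under the assumptions stated above."""
    seq_a, seq_b = _epistemic_only(cards_a), _epistemic_only(cards_b)
    shared = set(seq_a) & set(seq_b)
    pooled = set(seq_a) | set(seq_b)
    if not pooled:
        raise ValueError("neither member selected an epistemic process")
    skci, pkci = len(shared), len(pooled)
    order_a = [c for c in seq_a if c in shared]
    order_b = [c for c in seq_b if c in shared]
    # Assumed DPI: shared cards that are not part of a common ordering.
    dpi = skci - _lcs_length(order_a, order_b)
    return (skci - dpi) / pkci


member_a = ["problem identification", "hypothesis generation",
            "evidence evaluation", "giving feedback", "drawing conclusions"]
member_b = ["problem identification", "evidence evaluation",
            "hypothesis generation", "questioning"]
print(homogeneity_index(member_a, member_b))  # SKCI = 3, DPI = 1, PKCI = 5 -> 0.4
```

In this made-up example, the two members share three epistemic processes, one of which is out of order, and together they selected five, so the index is (3 − 1)/5 = 0.4. A different operationalization of the DPI (e.g., counting pairwise swaps) would give different values.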

Dependent variables
Dependent variables were collected during the problem-solving phase that had students (dyads or individuals) solve the aforementioned authentic problem from their future professional practice either by thinking aloud (in the individual reasoning setting) or by discussing it (in the dyadic reasoning setting; see above).
All verbal data were transcribed. After that, we segmented the data into syntactical proposition-sized units (Chi 1997). Ten percent of the data were independently segmented by two researchers after a training on segmentation. Reliability was calculated as the proportion of agreement according to Strijbos et al. (2006). The estimation of the reliability of segmentation through one indicator (e.g., Cohen's κ) may lead to imprecise measurements due to the different segmentation strategies of different segmenters (Strijbos et al. 2006; Strijbos and Stahl 2007). Thus, in order to ensure segmentation reliability, we calculated reliability from both segmenters' perspectives and set the criterion to 80% agreement (Strijbos et al. 2006). For 85.09% of the segment boundaries indicated by segmenter 1, segmenter 2 also indicated a segment boundary. Conversely, for 79.73% of the segment boundaries indicated by segmenter 2, segmenter 1 also indicated a segment boundary. These values indicated good reliability of the segmentation. In the next step, one of the segmenters segmented the remaining data.
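The two-perspective agreement check can be sketched as follows. This is only an illustration under stated assumptions: each segmenter's decisions are represented as a set of boundary positions (for example, word indices in the transcript), the positions below are invented, and the function name is hypothetical.

```python
from typing import Set


def boundary_agreement(reference: Set[int], other: Set[int]) -> float:
    """Share of the reference segmenter's boundaries that the other segmenter also set."""
    if not reference:
        raise ValueError("the reference segmenter marked no boundaries")
    return len(reference & other) / len(reference)


# Invented boundary positions for two segmenters on the same transcript.
segmenter_1 = {4, 9, 15, 22, 30}
segmenter_2 = {4, 9, 18, 22, 30, 36}

from_1 = boundary_agreement(segmenter_1, segmenter_2)  # perspective of segmenter 1
from_2 = boundary_agreement(segmenter_2, segmenter_1)  # perspective of segmenter 2

CRITERION = 0.80  # agreement criterion named in the text
print(f"{from_1:.2%} {from_2:.2%}", from_1 >= CRITERION, from_2 >= CRITERION)
```

Because the two segmenters may set different numbers of boundaries, the two proportions generally differ, which is why agreement is reported from both perspectives.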
The content level: reference to scientific theories and evidence. We developed a coding scheme to capture for each segment whether or not participants used scientific theories and/or evidence. Segments were coded as "use of scientific theories and evidence" if the speaker referred to scientific theories, concepts, or methods. Specifically, based on the frequently occurring topics, we created five content categories: "learning strategy," "anxiety," "motivation," "other scientific evidence," and "no scientific evidence." The code "learning strategy" was applied when participants mentioned ways in which the student in the case description may or should process the learning material ("elaborated strategy"; "recollection difficulties"). We coded a segment in the "anxiety" category when participants referred to test anxiety or emotional pressure that the student in the case description may experience. The code "motivation" was used when participants referred to motivational constructs such as "self-concept" or "intrinsic/extrinsic motivation" in their reflection. The code "other scientific evidence" was used when participants referred to scientific evidence from other theories and lines of research that was not included in the preparation slides (such as "self-fulfilling prophecy"; "mobbing"; "mind map"). If the above codes did not apply, the segment was coded as "no scientific evidence." For each transcript (individual or dyadic), we merged the first four categories to calculate participants' overall engagement in the use of scientific theories and evidence. Then, we divided this value by the total number of propositions. The reason for this was that preliminary analysis had shown that the two reasoning settings (dyads vs. individuals) were not equal regarding the total amount of talk, and consequently, working with raw sum scores would have biased our analyses. Five percent of the segments were coded by two independent coders with an agreement of Cohen's κ = .82.
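A minimal sketch of the two computations described here, the proportion of segments with scientific content per transcript and Cohen's κ for two coders, might look as follows. The segment codes below are invented for illustration, the function names are hypothetical, and κ is computed with the standard formula rather than with whatever software the authors used.

```python
from collections import Counter
from typing import Sequence

# The four categories that were merged into "use of scientific theories and evidence".
SCIENTIFIC = {"learning strategy", "anxiety", "motivation", "other scientific evidence"}


def scientific_proportion(codes: Sequence[str]) -> float:
    """Segments with scientific content divided by all segments of one transcript."""
    if not codes:
        raise ValueError("transcript has no coded segments")
    return sum(code in SCIENTIFIC for code in codes) / len(codes)


def cohens_kappa(coder_1: Sequence[str], coder_2: Sequence[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two coders."""
    if len(coder_1) != len(coder_2) or not coder_1:
        raise ValueError("both coders must code the same non-empty set of segments")
    n = len(coder_1)
    observed = sum(a == b for a, b in zip(coder_1, coder_2)) / n
    freq_1, freq_2 = Counter(coder_1), Counter(coder_2)
    expected = sum(freq_1[c] * freq_2[c] for c in freq_1) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)


transcript = ["motivation", "no scientific evidence", "anxiety", "no scientific evidence"]
print(scientific_proportion(transcript))  # 2 of 4 segments -> 0.5
```

Dividing by the number of segments makes transcripts of different lengths comparable, which matters here because dyads and individuals differed in the total amount of talk.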
The process level: engagement in epistemic processes
A second coding scheme (see Table 1) was developed to capture the eight epistemic processes of scientific reasoning proposed by Fischer et al. (2014). We also added a code "non-epistemic process" that was applied when the participants engaged in an activity that did not fall into one of the eight categories proposed by Fischer et al. (2014). Since it proved impossible to reliably differentiate between the two activities "evidence generation" and "evidence evaluation," we merged these two categories into one: evidence evaluation.

Table 1 Coding scheme for the epistemic processes, with example segments
Problem identification (PI). Example: "So it is about a student, // who has low grades"
Questioning (Q). A question or statement orienting further inquiry. Example: "Ok, so what is the reason for that?"
Hypothesis generation (HG). Any explanation of the problem case. Example: "So if the reason is her learning method"
Evidence generation (EG) (later merged with EE). Referring to case information ("She studies diligently at home"), to scientific evidence ("There are different learning strategies…"), to anecdotal evidence ("I know someone who has exam nerves"), to lack of information ("We also do not know her age."), or planning further information collection ([to find out] "how much time she needs to do her homework.").
Evidence evaluation (EE) (later merged with EG). Evaluation of evidence (to support/falsify HG or GS). Example: "and then you can even exclude the problem of exam nerves"
Generating solutions (GS). Planning an intervention, how to solve the problem. Example: "You should discourage her from using surface strategies"
Communicating and scrutinizing (CS). Planning to engage others in the inquiry process. Example: "You can also talk to the parents"
Drawing conclusions (DC). Concluding the outcomes of the earlier steps of inquiry. Example: "For me these would be the most important points to understand at all what her problem is."
Non-epistemic (NE). Propositions that cannot be coded under the other codes. Example: "Ok, have you read it through?"

After the coding scheme had been developed and trained, two independent coders coded 10% of the data for the identification of epistemic processes. Inter-rater reliability was sufficient (Cohen's κ = .68). After coding each segment, the numbers of segments that fell in the same category were summed for each epistemic process: problem identification (PI), questioning (Q), hypothesis generation (HG), generating solutions (GS), evidence evaluation (EE), drawing conclusions (DC), communicating and scrutinizing (CS), and non-epistemic processes (NE). For the same reasons reported above (working with nonbiased proportional scores), the resulting sum score for each process was divided by the total amount of talk. Finally, as the data screening (see below) revealed that three epistemic processes (problem identification, questioning, and drawing conclusions) had relatively large non-coded ratios, we dummy coded the corresponding variables in each protocol by assigning 0 when there was no proposition coded under the given epistemic process and assigning 1 when there was at least one proposition coded under that epistemic process.

Statistical analyses
To answer the question whether dyads differ from individuals in their use of scientific theories and evidence and in their engagement in epistemic processes (RQ1), ANOVAs and a MANOVA were conducted.
To answer the question whether dyadic heterogeneity has an effect on pre-service teachers' use of scientific theories and evidence and their engagement in epistemic processes (RQ2), we analyzed only data from the dyadic reasoning setting and conducted linear regressions with "dyadic heterogeneity" as predictor and "use of scientific theories and evidence" as well as "engagement in SR activities" as criterion variables.
For all analyses, the unit of analysis was the segments of each transcript, regardless of whether the transcript came from an individual or a dyad. That is, segments from individual think-aloud transcripts were compared with segments from dyadic discussions.
For the analyses, the alpha level was set to .05. For the regressions testing the effect of dyadic heterogeneity on the engagement in epistemic processes (RQ2b), we set alpha to a more stringent .01 in order to control for the increased type I error rate due to multiple simultaneously conducted analyses (e.g., Chen et al. 2017). Although some form of correction (e.g., Bonferroni correction) is often recommended in such cases to account for the inflated alpha, other studies point out that such measures can be criticized as being too conservative (as they decrease the probability of type I error, but simultaneously increase the probability of type II error; Sinclair et al. 2013). For example, Streiner and Norman (2011) note that alpha corrections may be less appropriate in case of (a) studies with an exploratory focus or (b) hypotheses that are stated a priori. Since both apply to the present study, we decided to use an alpha of .01 instead of applying Bonferroni corrections.

Preliminary analyses
A preliminary analysis tested whether the two reasoning settings differed in the total amount of talk. Results showed that dyads talked significantly more (M = 138.50, SD = 40.16) than individuals did (M = 88.19, SD = 27.74), F(1, 46) = 19.93, p < .001, partial η² = .31. Therefore, instead of using raw scores, we computed proportions of each code by dividing raw frequencies by the overall number of segments and used those proportions in the subsequent analyses.
With respect to the reference to scientific theories and evidence (RQ1a), the data from dyads showed a positively skewed (z = 3.87) as well as a leptokurtic (z = 4.80) deviation from normality. Data screening indicated one outlier case (z = 3.50), which also accounted for the skewness in the data. Thus, we removed this case from further analysis. Furthermore, Levene's test indicated unequal variances between individuals and dyads (p < .05). To answer RQ1a, we therefore used Welch's F-test because of its higher robustness to violations of homogeneity of variances in the case of unequal sample sizes (Field 2009, pp. 379-380;Grissom 2000).
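For readers unfamiliar with Welch's F-test, the two-group case can be sketched with the standard library alone. With two groups, Welch's F equals the squared Welch t statistic, with 1 numerator degree of freedom and Welch-Satterthwaite error degrees of freedom. The function name and the proportions below are hypothetical and are not the study's data.

```python
from statistics import mean, variance


def welch_f(group_a, group_b):
    """Welch's F and error df for two groups, without assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error contributed by each group (sample variance / n).
    se_a = variance(group_a) / n_a
    se_b = variance(group_b) / n_b
    if se_a + se_b == 0:
        raise ValueError("both groups have zero variance")
    f_value = (mean(group_a) - mean(group_b)) ** 2 / (se_a + se_b)
    # Welch-Satterthwaite approximation of the error degrees of freedom.
    df_error = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return f_value, df_error


# Hypothetical proportions of scientific-content segments per transcript.
individuals = [0.10, 0.20, 0.30, 0.40]
dyads = [0.05, 0.10, 0.15]

f_value, df_error = welch_f(individuals, dyads)
```

The sketch stops at the test statistic. Obtaining a p-value additionally requires the F distribution, for example from a statistics package.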
With respect to the engagement in epistemic processes (RQ1b and RQ2b), the frequency distributions of three epistemic processes suggested very high frequencies of missing values for the variables drawing conclusions (67.39%), questioning (58.70%), and problem identification (39.13%). Consequently, we decided to exclude these variables from further parametric analyses. Instead, we dummy coded them and conducted chi-square tests to test the odds of engaging vs. not engaging in each process as a function of the reasoning context (collaborative vs. individual). Furthermore, as Levene's test suggested unequal variances for solution generation, Welch's F-test was calculated as a follow-up test for solution generation. Considering the assumption checks, we decided to conduct a MANOVA with a robust statistic, namely Pillai's trace (Meyers et al. 2006, p. 378). Finally, for the dummy-coded variables problem identification, questioning, and drawing conclusions, nonparametric (logistic) regressions were conducted.
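The proportional scores and the dummy coding used for the sparse processes can be sketched as follows. This is an illustrative reconstruction, assuming each transcript is stored as a list of process codes. The abbreviations follow the coding scheme, while the function names and the example transcript are hypothetical.

```python
from collections import Counter

# Codes retained after merging evidence generation (EG) into evidence evaluation (EE).
PROCESSES = ("PI", "Q", "HG", "EE", "GS", "CS", "DC", "NE")


def merge_evidence_codes(codes):
    """Recode EG as EE, mirroring the merge of the two evidence categories."""
    return ["EE" if code == "EG" else code for code in codes]


def process_proportions(codes):
    """Frequency of each process divided by the total number of segments."""
    if not codes:
        raise ValueError("transcript has no coded segments")
    counts = Counter(codes)
    return {process: counts[process] / len(codes) for process in PROCESSES}


def dummy_code(codes, process):
    """1 if the process occurs at least once in the transcript, otherwise 0."""
    return int(process in codes)


# Hypothetical coded transcript, for illustration only.
transcript = merge_evidence_codes(["PI", "HG", "EG", "EE", "GS", "NE", "HG", "EG"])
proportions = process_proportions(transcript)
engaged_in_dc = dummy_code(transcript, "DC")
engaged_in_pi = dummy_code(transcript, "PI")
```

The dummy variables discard how often a process occurred, which is why they were analyzed with chi-square tests and logistic regressions instead of the parametric tests used for the proportions.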

RQ1: Effects of reasoning setting on scientific reasoning
To test the effect of dyadic vs. individual reasoning on scientific content use (RQ1a), Welch's F-test was conducted with reasoning setting as the independent variable and reference to scientific theories and content as the dependent variable in the analysis. Reasoning setting showed a significant effect on the use of scientific theories and evidence (see Table 2 for details). Namely, individuals referred proportionally more to scientific theories and evidence than dyads did.
Table 2 Effects of reasoning setting on use of scientific theories and evidence as well as on the epistemic processes hypothesis generation, evidence evaluation, solution generation, communicating and scrutinizing, and non-epistemic propositions (ANOVAs)
To answer the question whether dyads differed from individuals on their engagement in epistemic processes (RQ1b), we conducted a MANOVA with the four epistemic processes eligible for parametric analysis and non-epistemic propositions as dependent variables and reasoning setting (individual vs. dyadic) as independent variable. Reasoning setting had a significant strong multivariate effect on the engagement in epistemic processes (Pillai's trace = .40, F(5, 40) = 5.26, p < .001, partial η² = .40). Follow-up ANOVAs as well as Welch's F-test for solution generation revealed significant effects of reasoning setting on the engagement in hypothesis generation, in evidence evaluation, in solution generation, and in the production of non-epistemic propositions (see Table 2 for the results). As seen in Table 2, dyads engaged significantly more in hypothesis generation and in evidence evaluation than individuals did. Also, they produced more non-epistemic propositions than individuals. On the other hand, individuals engaged more in solution generation than dyads did.
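As a plausibility check, the reported multivariate effect size can be recovered from the F statistic with the standard conversion from F to partial η². The sketch assumes hypothesis and error degrees of freedom of 5 and 40, as reported for the MANOVA:

```latex
\eta_p^2 = \frac{df_{h}\,F}{df_{h}\,F + df_{e}}
         = \frac{5 \times 5.26}{5 \times 5.26 + 40}
         = \frac{26.3}{66.3} \approx .40
```

With a two-level factor, the multivariate partial η² coincides with Pillai's trace, which is consistent with both values being reported as .40.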
Finally, a nonparametric chi-square test indicated a significant relationship between reasoning setting and engagement in drawing conclusions: the odds of engaging in drawing conclusions were 5.35 times higher for dyads than for individuals. No further significant effect was found.
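How an odds ratio and a chi-square statistic follow from a 2 × 2 table of dummy-coded engagement can be sketched as below. The cell counts are hypothetical and do not reproduce the reported odds ratio of 5.35. The chi-square function is the uncorrected Pearson statistic, which may differ from the variant used in the study.

```python
def odds_ratio(dyads_yes, dyads_no, individuals_yes, individuals_no):
    """Odds of engaging in a process for dyads relative to individuals."""
    if dyads_no == 0 or individuals_yes == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (dyads_yes * individuals_no) / (dyads_no * individuals_yes)


def pearson_chi_square(dyads_yes, dyads_no, individuals_yes, individuals_no):
    """Uncorrected Pearson chi-square statistic for a 2 x 2 table (df = 1)."""
    a, b, c, d = dyads_yes, dyads_no, individuals_yes, individuals_no
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (a * d - b * c) ** 2 / margins


# Hypothetical counts of transcripts with and without "drawing conclusions".
ratio = odds_ratio(6, 4, 3, 9)
chi_square = pearson_chi_square(6, 4, 3, 9)
```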
Tables 2 and 3 present all the results reported above:

RQ2: Effects of dyadic heterogeneity on scientific reasoning
To answer the question if dyadic heterogeneity has an effect on the extent to which pre-service teachers refer to scientific theories and evidence (RQ2a), we conducted a linear regression analysis with dyadic heterogeneity as the predictor and reference to scientific content as the criterion variable. Dyadic heterogeneity did not predict the reference to scientific theories and evidence (see Table 4 for details).
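A regression with a single predictor, as used here, reduces to ordinary least squares with one slope and one intercept. The sketch below shows the computation under that assumption. The heterogeneity values and proportions are hypothetical, and the function name is introduced only for illustration.

```python
from statistics import mean


def simple_ols(x, y):
    """Slope, intercept, and R squared for a one-predictor least-squares fit."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least two")
    mean_x, mean_y = mean(x), mean(y)
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("predictor and criterion must both vary")
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = ss_xy / ss_x
    intercept = mean_y - slope * mean_x
    r_squared = ss_xy ** 2 / (ss_x * ss_y)
    return slope, intercept, r_squared


# Hypothetical dyads: heterogeneity index and proportion of scientific content.
heterogeneity = [0, 1, 2, 3]
scientific_content = [0.5, 0.4, 0.3, 0.2]

slope, intercept, r_squared = simple_ols(heterogeneity, scientific_content)
```

A negative slope in such a model would mean that more heterogeneous dyads score lower on the criterion. Whether the slope differs reliably from zero is a separate significance test that is not shown here.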
To answer the question whether dyadic heterogeneity had an effect on the engagement in epistemic processes (RQ2b), we computed separate linear regression analyses that included dyadic heterogeneity as a predictor, and solution generation, hypothesis generation, evidence evaluation, and communicating and scrutinizing separately as criterion variables (see Table 4 for the results). The regression models revealed that dyadic heterogeneity had a marginally significant effect on solution generation: The more heterogeneous dyads were, the less they engaged in generating solutions. Moreover, dyadic heterogeneity negatively, yet non-significantly, predicted the following epistemic processes: hypothesis generation, evidence evaluation, and communicating and scrutinizing. Finally, the outcomes of logistic regressions (see Table 5) suggested no effect of dyadic heterogeneity on problem identification, on questioning, or on drawing conclusions.

General discussion
The first aim of our study was to answer the question whether collaborative case-based reasoning is a useful means to engage pre-service teachers in evidence-based reasoning. Furthermore, we were interested in the effects of homo-/heterogeneous problem-solving scripts of collaborating partners on our participants' evidence-based reasoning. With respect to evidence-based reasoning, we differentiated between (a) a content level (i.e., the extent to which participants referred to scientific theories and evidence during reasoning) and (b) a process level (i.e., the extent to which students engaged in the epistemic activities proposed by Fischer et al. 2014, during reasoning).
Regarding the content aspect of evidence-based reasoning, we found that dyads referred significantly less often to scientific theory and evidence than individuals. Based on research on the information pooling paradigm (Wittenbaum and Stasser 1996), one possible explanation for this finding might be that dyads, even though two students should typically possess more knowledge than an individual student, might have focused their discussion on scientific knowledge they both possessed and might have been hesitant to discuss unshared knowledge (e.g., knowledge about a theory that only one learner possessed) with each other. There are several possible reasons for this, including social anxiety, social loafing, and production blocking (i.e., listening to a partner's explanation for a problem may undermine one's own thinking about the problem; see Paulus 2000). Research on pre-service teacher education further reports that pre-service teachers often are not well prepared for collaborating with each other (Baeten and Simons 2014;Shin et al. 2016). This might have increased coordination demands, which in turn might have consumed time that dyads could then not use for discussing what scientific knowledge to apply to the case (Strijbos et al. 2004).
In contrast to the results at the content level, regarding the process aspect of evidence-based reasoning, we found that dyads of pre-service teachers engaged more in hypothesis generation (explaining the problem) and evidence evaluation than individuals. Moreover, dyads were also more likely to draw final conclusions and to sum up their reasoning processes. These findings are consistent with prior research on scientific reasoning (e.g., Okada and Simon 1997;  1995) and with the many pleas for co-teaching (Baeten and Simons 2014;Nokes et al. 2008): Those studies indicate that when students solve problems together, they often want to share their understanding and thus articulate their point of view towards each other. Also, they often "scaffold" each other for such articulation by asking their reasoning partner for clarification and further elaboration on the expressed ideas. Similarly, in the context of co-teaching, researchers conclude that collaborative discussion of problem cases and teaching practices may facilitate the exchange of individual knowledge and the development of more positive attitudes towards a more reflective and evidence-based practice (Baeten and Simons 2014;Birrell and Bullough 2005;Nokes et al. 2008). As our results show, however, this reflective discussion might come at the cost of engaging less in generating solutions to the problem. In other words, our findings indicate that there might be a trade-off between more explanatory and more solution-oriented processes during collaborative problem-solving: It appears that the lower engagement of dyads in solution generation can be explained by their enhanced engagement in hypothesis generation and evidence evaluation.
One explanation for this result might be that the epistemic processes of giving explanations, referring to evidence, and summing up the discussion reflect the epistemic need to coordinate between reasoning partners to develop a better (shared) understanding of the problem (Rummel and Spada 2005). As Okada and Simon (1997) argue, in a collaborative situation, reasoners often need to be more explicit in communicating their views to make their partner understand and/or accept their ideas. As a result, an extensive engagement in these explanatory processes might then prepare the choice of an adequate solution, which then would not need to be extensively debated anymore within the group.
Based on earlier findings (Bowers et al. 2000), we further assumed that collaborative problem-solving might benefit from the heterogeneity of the problem-solving scripts of teachers who collaborated in dyads. In contrast to this assumption, our results showed that the content aspect of evidence-based reasoning was not affected by group composition in that respect, i.e., homo- and heterogeneous dyads did not differ in the extent to which they referred to scientific theories and evidence. Again, the fact that heterogeneous dyads did not exploit the potential of their dissimilarity may have to do either with factors such as social or cognitive inhibition (Paulus 2000) or with the increased coordination demands, which affect heterogeneous dyads more than homogeneous dyads, as the latter have less reason to extensively discuss how to approach the task.
As for the process aspect of scientific reasoning, however, we did find (albeit small) effects of heterogeneity: The more collaborative partners differed from each other regarding their problem-solving strategy, the less they discussed possible solutions to alleviate the problem. Furthermore, although dyadic heterogeneity did not have a significant impact on other epistemic processes, in tendency, the more heterogeneous the dyads are, the more they seem to engage in "explanatory behavior." This tendency is in accordance with previous findings that dyadic heterogeneity might be beneficial in the case of complex problem-solving tasks (Bowers et al. 2000;Wiley et al. 2013), and it might explain why more heterogeneous dyads focused less on solutions, as they spent more "effort" on developing explanations for the problem. This can be regarded as evidence that heterogeneity of the problem-solving scripts of pre-service teachers can be expected to result in more reflective (explanatory and evaluative) practices. It might then be speculated whether co-teaching and other collaborative approaches to pre-service teacher education would be especially effective when pre-service teachers are grouped together based on the heterogeneity of their problem-solving scripts.

Limitations and conclusions
Even though our study revealed interesting results on the comparison of individual vs. dyadic problem-solving, as well as regarding the effects of dyadic heterogeneity, our findings are limited in several ways: First, since the presented empirical study implemented only one problem case for the problem-solving task, the results might not be generalizable to pre-service teachers' reasoning on other topics. Second, our sample was recruited from just one university and thus may have had specific characteristics that differentiate it from samples drawn from other pre-service teacher education programs. Third, in contrast to the approach taken by Canham et al. (2012), we did not experimentally manipulate the homo-/heterogeneity of our dyads. Instead, we used the natural variation of the problem-solving scripts and computed a post hoc heterogeneity index. While we would argue that our approach increased the external validity of our study, it may have led to an underestimation of the effects of dyads' heterogeneity. It thus would be very interesting to run a subsequent study in which homo-/heterogeneity is experimentally manipulated (and possibly the degree of heterogeneity is increased compared with our study) to see whether effects of heterogeneity would increase as a result. Fourth, since the present research represents a lab study without a corresponding field context, ecological validity may be limited. Therefore, future studies could consider addressing more authentic, and thus ecologically more valid, problem-solving scenarios (e.g., in a classroom), and also use extended samples coming from several universities. Also, it would be interesting to investigate whether our results generalize to in-service teachers, as co-teaching and similar approaches are also often used with this population.
Finally, another limitation of the present study is the lack of a prior scientific content knowledge measure. As a result, we only know the extent to which our dyads were homo-/heterogeneous with respect to their problem-solving scripts, but not with respect to their prior content knowledge. Subsequent studies should thus also measure prior scientific content knowledge to disentangle the possible differential effects of within-dyad homo-/heterogeneity regarding both the content and the process dimension of scientific reasoning.
After discussing our results, the question remains under what circumstances and with respect to what goals collaboration can be a useful context to promote pre-service teachers' evidence-based reasoning when solving pedagogical problems. Our findings paint quite a differentiated picture in this respect. They suggest that having pre-service teachers collaborate during reasoning about pedagogical problems is a good decision when the goal is to foster an analytical focus during reasoning. To reach this goal, our results further suggest forming groups that are heterogeneous with respect to their problem-solving scripts. However, if the goal is to promote students' references to scientific theories and evidence, our results seem to imply that having pre-service teachers collaborate is perhaps not such a good idea, at least if groups are not further scaffolded on how to apply scientific theories and evidence to case information. Approaches that may be effective in this respect have, for example, been described by Wenglein et al. (2016) and Raes et al. (2012). For future research, an important task is to accumulate evidence on how to combine different scaffolds to effectively support pre-service teachers in both the application of scientific knowledge and a systematic engagement in scientific reasoning processes.