Systematic design of a learning environment for domain-specific and domain-general critical thinking skills

Identifying effective instructional approaches that stimulate students’ critical thinking (CT) has been the focus of a large body of empirical research. However, there is little agreement on the instructional principles and procedures that are theoretically sound and empirically valid for developing both domain-specific and domain-general CT skills. The purpose of this study was to examine the effectiveness of systematically designed subject matter instruction in stimulating the development of domain-specific and domain-general CT skills, and to investigate the relationship between the two. The study employed a pretest–posttest quasi-experimental design with two conditions: 45 students participated in an experimental condition and 44 students in a control condition. A learning environment, in the context of a freshman physics course, was designed according to the First Principles of Instruction model. The experimental condition followed the designed learning environment, while the control condition followed regular subject matter instruction that was not designed according to the First Principles of Instruction model. The experimental condition scored significantly higher than the control condition on a domain-specific CT test. The results also showed that better performance on a domain-specific CT test explained a significant proportion of the variance on a domain-general CT test. However, the experimental learning environment did not result in a significantly greater pretest–posttest improvement in the acquisition of domain-general CT skills compared to the control learning environment. Instructional design principles that may contribute to the present understanding of the integration of CT skills within regular subject matter instruction are discussed.

Introduction
promotes the development of CT in specific domains of science and arts. In addition, evidence on the effectiveness of embedding CT skills in developing domain-general CT has been inconsistent. Some studies found that explicit CT instruction within subject matter domains is an effective way of developing domain-general CT skills (e.g., Bensley and Spero 2014;Dwyer et al. 2012;Solon 2007), whereas several others reported an insignificant effect (e.g., Anderson et al. 2001;McLean and Miller 2010;Toy and Ok 2012). Furthermore, it is unclear whether instructional intervention that aims to promote domain-general CT skills also improves students' ability to solve domain-specific CT tasks, and vice versa (Fischer et al. 2014;Siegel 1988). In view of the dearth and inconsistency of the existing empirical evidence, determining the features of instructional interventions that contribute to developing domain-specific and domain-general CT remains an important challenge in CT research.
Recent developments in cognitive psychology have influenced instructional design in various ways (Elen 1995;Jonassen 1991;Merrill 2002;van Merrienboer 1997). One of the influences has been on the conception of learning and instruction. Cognitive psychologists view learning as an active, cumulative, constructive, goal-oriented, self-regulated, and situated process of knowledge and meaning building (e.g., Elen 1995;Shuell 1986;van Merrienboer 1997). Instruction is viewed as a set of activities that aim to support and enable learning, which means helping and guiding students to actively process information, monitoring their performance, and providing feedback with respect to the appropriateness of students' learning activities (Elen 1995;Merrill 2013). These moderate constructivist views on learning and instruction (Elen 1995) emphasize that learning and understanding go hand in hand (e.g., Shuell 1986). Echoing this view, Perkins and Unger (1999) argued that understanding a subject matter domain is a matter of being able to think critically and act competently with one's knowledge of the subject matter. This implies that meaningful subject matter learning in any domain inherently involves the development of relevant CT skills. It follows that the development of CT is essentially an implicit goal in all subject matter learning.
Despite the theoretical claim that subject matter instruction in any domain can stimulate the development of CT (Perkins and Salomon 1989;Resnick et al. 2010;Resnick 1987;Smith 2002;van Merrienboer 1997), the potential impact of the design of subject matter instruction has been overlooked in existing CT research. The development of CT is largely explored through loosely defined instructional interventions that consist of teaching general CT skills within less optimally designed subject matter instruction (Tiruneh et al. 2014). Research attempts to embed CT skills within subject matter instruction have not systematically built on instructional design research, and the link between the acquisition of domain-specific and domain-general CT skills appears to be vague. In sum, although it is unclear to what extent systematically designed subject matter instruction in itself promotes the development of domain-specific and domain-general CT skills, strong impact on the development of domain-specific CT skills is to be expected since they are an integral part of the domain-specific expertise that instruction aspires toward.
Drawing on past research on cognitive development (Glaser 1984;Perkins and Salomon 1989), we explored the question of whether systematically designed subject matter instruction may facilitate the acquisition of domain-specific and, to some extent, domain-general CT skills. The aim of this paper is therefore to examine the effectiveness of systematically designed subject matter instruction in promoting the development of domain-specific and domain-general CT skills, and to investigate the relationship between the two.

Teaching CT in higher education: state-of-the-art
What is CT?
Existing literature suggests widespread disagreement among educators and researchers with regard to the definition of CT and what is to be accomplished in teaching it. Ennis (1993) defines CT as logical and reflective thought that focuses on a decision on what to believe or do. Halpern (1998, 2014) defines CT as the use of thinking strategies that increase the probability of a desirable outcome. Together with her definition, Halpern identified five major categories of CT skills: verbal reasoning, argument analysis, hypothesis testing, likelihood and uncertainty analysis, and decision-making and problem-solving. Halpern argues that the use of CT skills in solving various cognitive tasks can increase the probability of 'a desirable outcome' (Halpern 1998, p. 450). McPeck (1990) defines CT as the appropriate use of reflective skepticism within the problem area under consideration, and he closely relates the problem areas to particular subject matter domains.
Some researchers (Facione 1990a;Halpern 1998;Norris 1989;Perkins et al. 1993) have moreover argued that in addition to mastery of a set of cognitive skills, a more meaningful and comprehensive understanding of CT must include CT dispositions. The latter refers to a person's inclination to use CT skills appropriately without prompting, and with conscious intent in a variety of settings, for instance, when faced with problems to solve, ideas to evaluate, or decisions to make (Ennis 1993;Halpern 1998). Researchers have arrived at a list of CT dispositions that in the main includes open-mindedness, inquisitiveness, systematicity, analyticity, truth-seeking, self-confidence, and maturity (Facione 1990a). Halpern (1998) also notes that a critical thinker demonstrates the following dispositions: (a) willingness to engage and persist in a complex task, (b) habitual use of plans and the suppression of impulsive activity, (c) flexibility or open-mindedness, (d) willingness to abandon non-productive strategies in an attempt to self-correct, and (e) awareness of the social realities that need to be overcome (such as the need to seek consensus or compromise) so that thoughts can become actions. (p. 452).
We used Halpern's (2014) classification of CT skills for the purposes of this study. After synthesizing the various conceptions of CT (e.g., Bailin et al. 1999;Ennis 1989;Halpern 2014;McPeck 1990;Resnick et al. 2010;Smith 2002), we defined CT as the proficiency a person demonstrates in using thinking strategies to accomplish a task in a reasonable manner. The thinking task in question may require specific subject matter expertise for it to be reasonably performed, and we call such proficiency domain-specific CT. On the other hand, the thinking task in question may not require specific subject matter expertise, but rather knowledge of everyday life. We refer to such proficiency as domain-general CT.

Specificity and generality of CT and its implications for instruction
The question of whether CT is a set of general skills that can be applied across domains or whether it is by and large specific to a particular domain has been the subject of heated debate (e.g., Bailin et al. 1999;Davies 2013;Ennis 1989;McPeck 1990;Moore 2011;Norris 1989;Paul 1985). This disagreement has had major implications for approaches to integrate CT in higher education curricula. Generalists (Davies 2013;De Bono 1991;Ennis 1989;Halpern 1998;Kuhn 1999) claim that a set of CT skills exists that are general and applicable across a wide variety of domains. They contend that this set of general CT skills can be taught either as a specific curriculum subject (i.e., a stand-alone course), or be integrated explicitly into regular courses. On the other hand, specifists (McPeck 1990;Moore 2004, 2011) argue that thinking is highly dependent on specific domain knowledge and that CT teaching should therefore always be pursued within the context of a specific domain. McPeck (1990) has strongly argued against the notion of general CT skills on the basis that the thinking skills required in one domain are different from those required in another. This specifist position implies that each domain will need to identify its own distinctive thinking skills, and students will learn those domain-specific CT skills while building up knowledge of that particular domain.
However, it seems that the generality versus specificity debate has recently shifted towards a synthesis of the two views (Davies 2013;Robinson 2011;Smith 2002). First, although the related content and issues differ from one domain to the next, a set of CT skills that are applicable across a wide variety of domains exists. Second, the ability to think critically on a particular task is understood to be highly dependent on knowledge of the task at hand as well as knowledge of relevant CT skills. This implies that effective CT instructional approaches need to target students' in-depth understanding of a domain and that of the relevant CT skills.

CT assessment
In tandem with the absence of a consistent CT definition, one of the main challenges in CT research has been the lack of uniform CT tests. Researchers have employed various kinds of CT tests that use a broad range of formats, scope, and psychometric characteristics to measure CT outcomes (for reviews, see Ennis 1993;McMillan 1987;Tiruneh et al. 2014). Some of the available standardized domain-general CT tests include the Cornell Critical Thinking Test (CCTT), the California Critical Thinking Skills Test (CCTST: Facione 1990b), the Watson-Glaser Critical Thinking Appraisal (WGCTA: Watson and Glaser 2002), the Ennis-Weir CT Essay Test (Ennis and Weir 1985) and the Halpern Critical Thinking Assessment (HCTA: Halpern 2010). These domain-general CT tests use content from a variety of real-life situations with which test takers are assumed to already be familiar.
Except for the Ennis-Weir CT Essay test and HCTA, all the above-mentioned tests use forced-choice format items, which have been criticized for not efficiently measuring significant CT features such as drawing warranted conclusions, analyzing arguments, making decisions and systematically solving problems (Norris 1989). The HCTA is the only standardized measure of domain-general CT proficiency that uses two different types of item formats: forced-choice and constructed-response formats. Halpern claims that the constructed-response format of the HCTA measures CT dispositions (Halpern 2013).
A couple of domain-specific CT tests also exist in the science domain. Lawson's Classroom Test of Scientific Reasoning (CTSR) is the most commonly administered test in the domain of science focused on measuring general scientific reasoning skills (Lawson 1978, 2004). It is a multiple-choice test that measures scientific reasoning skills that include probabilistic reasoning, combinatorial reasoning, proportional reasoning and controlling of variables in the context of scientific domains (Lawson 1978). Respondents do not necessarily need to have expertise in a specific science domain; rather, the test focuses on general science-related issues that students can reasonably be presumed to have acquired in specific science subjects. The test mainly targets junior and senior high school students, but it is also used to assess scientific reasoning skills among college science freshmen (Lawson 1978, 2004). The other domain-specific CT test is the biology critical thinking exam (McMurray 1991). It is a multiple-choice test with 52 questions that aims to measure university students' CT skills in biology. The Critical Thinking in Electricity and Magnetism test (CTEM) is a domain-specific CT test that was recently developed and that aims to measure students' ability to draw valid inferences, analyze arguments, solve problems, make predictions, and analyze probabilities and assumptions with respect to thinking tasks that are specific to a freshman physics course (De Cock et al. 2015). The CTEM test consists of 20 items, two of which are forced-choice; the remaining are constructed-response format items. The items were designed to mirror the five CT structural components identified in the HCTA (Halpern 2010), and target the content of an introductory electricity and magnetism course. The CTEM test was validated to prompt students' ability to demonstrate the aforementioned domain-specific CT skills.
Despite the existence of a few domain-specific CT tests, the assessment of CT has thus far mainly focused on domain-general CT skills. CT has mainly been linked with everyday problem solving, and there is a general lack of experience among researchers and educators when it comes to testing for domain-specific CT skills. As discussed in the previous section, the embedded approach aims to teach desired CT skills as part of subject matter instruction. This approach is expected to result in the acquisition of both domain-specific and domain-general CT skills. Standardized tests that measure students' ability to think critically on issues and problems that are specific to a subject matter domain, however, were hardly ever administered in the various studies that adopted an embedded approach (for review, see Tiruneh et al. 2014).

Embedding CT within regular courses: instructional approaches
Ennis (1989) divided the various approaches to embedding CT within subject matter domains into two types: Infusion and Immersion. In the Infusion approach, students are explicitly trained on how to apply CT skills as part of a specific subject matter domain instruction. Students are explicitly introduced to the desired CT skills and extensively engaged in domain-specific classroom activities that call for the application of the desired CT skills. The Immersion approach, however, aims to help students acquire the desired CT skills as they construct knowledge and skills of a subject matter domain, without explicit instruction about desired CT skills. The main assumption behind this approach is that proficiency in CT is by definition targeted in meaningful subject matter learning; it follows that students can learn relevant and transferrable CT skills when immersed in well-designed subject matter instruction (e.g., McPeck 1990). Given the limited empirical evidence on the effectiveness of well-designed subject matter instruction on the development of domain-specific and domain-general CT skills, the effect of an Immersion-based instructional intervention is the focus of the present study.

The present study
The central question in CT instruction appears to be identifying theoretically sound and empirically valid instructional design principles that foster the development of the desired CT skills (Perkins and Salomon 1989;van Merriënboer 2013). There are a few instructional design models that offer specific guidelines to develop learning environments that enable students to acquire complex cognitive skills. The First Principles of Instruction model is one of the instructional design models that offer explicit guidelines to designing learning environments that can promote the active and constructive acquisition of higher-order learning outcomes (Merrill 2002, 2013). The model is a synthesis of the various instructional design models that emerged from research on the acquisition of subject matter knowledge and skills. Merrill systematically reviewed the different instructional design principles that claim to be empirically valid and abstracted five interrelated prescriptive instructional design principles: activation, demonstration, application, integration and problem-centeredness. This model emphasizes that subject matter instruction designed on the basis of those principles can result in effective, efficient and engaging learning that leads to students' acquisition of knowledge and skills that are necessary to complete complex real-world tasks (Merrill 2013).
Because of its comprehensiveness and strong theoretical foundation, the First Principles of Instruction model was chosen to guide the design of the learning environment for this study. No previous study, to our knowledge, has tested the efficacy of this model in designing instructional interventions that target the development of CT skills. A brief explanation of the First Principles of Instruction model and its implications for designing subject matter instruction is offered in the next section. A learning environment in the context of a freshman physics course was designed based on the model. The following research questions are addressed: (a) What is the effect of systematically designed subject matter instruction on the development of domain-specific CT skills? (b) What is the effect of systematically designed subject matter instruction on the development of domain-general CT skills? and (c) What is the relationship between performance on domain-specific and domain-general CT tests? In line with existing theoretical literature (e.g., Perkins and Unger 1999;Resnick et al. 2010), we hypothesized that subject matter instruction systematically designed according to the First Principles of Instruction model would produce a significantly higher acquisition of domain-specific and domain-general CT skills than regular subject matter instruction.

Method
Participants
The study participants were first-year students with physics majors at two universities in northwest Ethiopia. Students at one of the universities formed the experimental group (n = 45), while those at the other university constituted the control group (n = 44). The experimental group was comprised of 24 women and 21 men between the ages of 19 and 23 years (M = 20.09, SD = .93), while the control group consisted of 23 women and 21 men between the ages of 19 and 24 years (M = 20.32, SD = .98).

Design and development of the Immersion-based instructional intervention
The intervention focused on a freshman introductory physics course, namely introductory electricity and magnetism (E&M). At both universities, this course was taught based on a harmonized national curriculum, with the same content and credit hours. The targeted course was taught during the second semester of the 2013/2014 academic year. The intervention focused only on the first five chapters of the course: electric field, electric flux, electric potential energy, capacitor and capacitance, and direct current circuits (as specified in the course textbooks of the two universities).
In recognition of the complex and multidimensional nature of CT, an effort was first made to acquire a clearer understanding of the desired CT outcomes learners ought to demonstrate after the intervention. The CT skills that were the focus of our intervention were reasoning, argument analysis, hypothesis testing, likelihood and uncertainty analysis, and problem solving and decision-making. The targeted CT skills were split into sub-skills before the instructional intervention was designed. A more precise description of each of the domain-specific and domain-general CT outcomes was subsequently developed with respect to the post-intervention performance (see Table 1). Such an in-depth analysis of the CT outcomes that we wished our students to demonstrate helped us decide on the specific and relevant instructional strategies that should be targeted while the learning environment is designed and implemented.
After the desired CT outcomes were identified, the next important phase was designing a learning environment based on the First Principles of Instruction model. Table 2 offers a brief description of the principles, the implications for instructional design, and brief examples of what happened in the actual design and implementation phase of the learning environments. Two regular course instructors from the experimental university, two physics professors, one instructional psychology professor and one doctoral candidate collaborated in designing the experimental learning environment. Efforts were made to embrace the desired CT skills as part of the regular domain-specific classroom activities during this design process.

Implementation of the experimental and control interventions
Students in both the experimental and control conditions learned the same five chapters. The lessons were taught by regular instructors at the two universities. Two instructors (one as a main instructor and the other as an assistant for the tutorial sessions only) participated in the study at each university. In order to control for the teacher effect, we involved instructors from the two universities who had the same education level (all MSc in Physics) and similar years of teaching experience.

Training the experimental instructors
The two regular instructors received adequate training to be able to teach the experimental class. Their collaboration began during the design phase of the intervention, and they were both fully informed on the purpose of the intervention and what was required of them in implementing the designed lessons. For example, we initially asked them to comment on a draft version of the lessons designed for chapter one and both instructors provided useful feedback. Their involvement and feedback continued throughout the design process of the five chapters. On a number of occasions, they reported that some of the activities and questions in the draft versions were unclear or less relevant for the targeted students. A number of modifications were accordingly made.
Moreover, to facilitate implementation of the lessons as designed and provide the necessary theoretical knowledge base, the first author and the two experimental instructors participated in 5 h of face-to-face discussions over a period of 3 days. The instructors were briefed on the overall goal of the instructional intervention as well as the specific designed lesson activities of the full five chapters.

Experimental condition
The developed lessons were taught during the regular lecture hours. Students were divided into 10 groups of 4 or 5 students. Efforts were made to have groups that were evenly spread in terms of gender and academic performance (with the latter based on students' GPA in the first semester). Students received guidance in performing both the individual and group activities that had been designed. At the beginning of each chapter, students were assigned contextually relevant E&M problems that required them to collaborate to find solutions. Throughout the intervention, students were made to observe well-scripted instructor demonstrations that modeled the important procedures and reasoning involved in solving various E&M problems. The demonstrations were followed by extensive opportunities for the students to practice solving E&M problems both individually and in small groups for a substantial amount of time. A number of activities that encouraged students to activate prior knowledge and communicate their ideas to both their group and the entire class were carefully designed and implemented. Both peer and instructor feedback was provided as needed. Overall, students were carefully assisted in developing an in-depth understanding of the subject matter domain, and they were coached and supported in the acquisition of the CT outcomes through the various domain-specific instructional activities. The first author monitored overall implementation of the intervention, which lasted 8 weeks. Three lessons of 2 h each were taught every week. See Table 2 for a brief overview of the activities designed and implemented in the experimental class.

Table 1 Description of desired domain-specific and domain-general CT outcomes
Columns: CT skill; Domain-specific CT outcome (In the context of E&M, the student will be able to:); Domain-general CT outcome (In the context of everyday situations, the student will be able to:)
Identify key parts of an argument on issues related to E&M
Judge the credibility of an information source
Infer a correct statement from a given data set
Criticize the validity of generalizations drawn from the results of an experiment
Identify relevant information that is missing from an argument
Identify key parts of an argument: e.g., given a conclusion, identify the reason(s) that support the conclusion
Provide an opinion, a reason and conclusion on issues related to daily life
Infer a correct statement from a given data set
Criticize the validity of generalizations

Table 2
Principle: Learning is promoted when learners acquire knowledge and skills in the context of real-world problems.
Implication for instructional design: Problems need to be comprehensive, challenging and representative of the problems learners will encounter in real life.
Experimental condition: For each chapter, relatively complex, meaningful and comprehensive problems were carefully designed by seeing each chapter as a mini-course (based on the suggestion by Merrill 2013). An attempt was made to keep the tasks relevant to the lives of students (see Fig. 2 for a sample whole-task) and thus make them more motivating. A whole-task for a particular chapter was given one or two days before instruction began; students were subsequently asked to answer the questions in the whole-task by referring to the course textbook or consulting experts (or senior students with physics majors).
Control condition: Instruction was primarily 'topic-centered'. At the beginning of a new chapter, the instructor presented information related to that chapter (or subtopic). Students were sometimes shown solutions to one or two textbook problems related to the newly presented information. At the end of the lesson, students were given selected textbook problems as homework assignments. Overall, the lessons were not designed to echo real-world problems. Comprehensive problems with real-world significance that might prompt students' CT skills were not introduced at the beginning of a chapter.
Experimental condition: Relevant and challenging E&M tasks were designed that created multiple opportunities for the students to engage in applying newly presented information. When students were engaged in solving problems, activities that facilitated instructor coaching and guidance were clearly described and implemented. For example, the instructors provided partial solutions, halted at each group and observed students' discussions, provided hints as needed, acted as group members and asked thought-provoking questions, encouraged students to formulate questions using specific verbal prompts, and facilitated discussion among group members.
Control condition: Students mostly listened to the instructor and took notes. They were not engaged in applying the newly presented information to solve new and meaningful E&M problems; rather the instructor gave them homework assignments to practice solving the traditional end-of-chapter problems. Moreover, there was no dedicated time for students to practice solving as many practical and comprehensive questions as possible during the lessons. Even when they were asked questions, the questions focused on recalling information and did not invite further elaboration and explanations from the students. Group activities took place during some of the sessions. However, the activities for small group activities were not adequately and purposely designed. The instructor did not adequately coach the group activities and feedback was limited.
Experimental condition: Activities that encourage students to present their solutions either to group members or full class were designed, and both peer and instructor feedback was offered. At the end of each chapter, a two-hour tutorial session was organized. The sessions mainly focused on revising the main topics of each chapter by asking students to prepare a summary (e.g., by using concept maps) of the facts and concepts discussed in the chapter, and solving a few E&M problems. Students were required to attempt to solve all the problems in advance. During these sessions, students were asked to discuss their solutions in their respective groups, and the tutors acted as coaches during the group activities. Representatives from at least two groups were asked to present solutions to a particular question in front of the full class. Students in other groups were encouraged to ask questions, and the student presenters were asked to defend their solutions when challenged by their classmates or the instructors.
Control condition: Students usually did not have the opportunity to present and defend their solutions to the full class. Interaction between the students during the lessons was very limited: they did not engage in exchanging ideas and explaining solutions to problems between themselves or to the instructor. At the end of each chapter, a two-hour tutorial session was arranged so that students could solve exercises in groups. The regular instructor and his assistant provided assistance to the students during the tutorial sessions. In most cases, however, the tutorial questions did not encourage students to apply what they had learned to solve new and meaningful problems. The questions usually promoted retention of information.
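The exact procedure used to form the 10 experimental groups is not reported in this section. As an illustration only, the sketch below shows one simple way to spread gender and first-semester GPA evenly across groups: sort students by gender and then by GPA, and deal them into groups in turn. The function name `assign_groups`, the tuple layout, and the randomly generated GPA values are all assumptions made for the example; only the cohort size (24 women, 21 men) and the number of groups come from the text.

```python
import random
from collections import Counter


def assign_groups(students, n_groups):
    """Deal students into n_groups so that gender and GPA are spread evenly.

    students: list of (student_id, gender, gpa) tuples (hypothetical format).
    """
    # Sort by gender first, then by GPA (high to low) within each gender.
    ordered = sorted(students, key=lambda s: (s[1], -s[2]))
    groups = [[] for _ in range(n_groups)]
    # Round-robin dealing: consecutive students land in different groups.
    for i, student in enumerate(ordered):
        groups[i % n_groups].append(student)
    return groups


# Synthetic roster with the reported composition (24 women, 21 men).
# The GPA values are made up; they are not data from the study.
rng = random.Random(0)
roster = [(i, "F" if i < 24 else "M", round(rng.uniform(2.0, 4.0), 2))
          for i in range(45)]

groups = assign_groups(roster, n_groups=10)
sizes = sorted(len(g) for g in groups)
women_per_group = [Counter(s[1] for s in g)["F"] for g in groups]
print(sizes)
print(women_per_group)
```

With 45 students and 10 groups, this dealing scheme yields five groups of 4 and five groups of 5, which matches the group sizes reported above, and each group receives 2 or 3 women. A serpentine (back-and-forth) deal would balance GPA slightly better than the plain round-robin used here.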

Control condition
Students in the control condition followed the regular subject matter instruction. Two instructors (one main instructor and one assistant for only the tutorial sessions) from the control university were responsible for designing and implementing the lessons. The lesson durations for this group were the same as for the experimental group: a total of 8 weeks with 3 lessons of 2 h each per week. This group was similar to the experimental group in terms of previous and parallel course enrollment during the intervention. However, the E&M lessons for this group were not designed according to the First Principles of Instruction model, and we will refer to the instructional method in the control class as 'regular' E&M instruction. See Table 2 for a detailed comparison of the control and experimental learning environments. To obtain an overview of the instructional processes, the first author observed one of the control group's lessons. In addition, interviews were conducted with the E&M instructor on three separate occasions (at the beginning of the semester, a month after the semester, and at the posttest) to acquire additional information on the various classroom activities. A brief description of the instructional activities that took place in the control group is offered below. At the beginning of each chapter, the main instructor gave a brief overview of the general learning outcomes. He immediately proceeded by discussing the first subtopic of a chapter and asked oral questions between presentations that encouraged students to engage in discussions. However, students were not pushed to give more detailed explanations of their responses. In most cases, the instructor himself offered the explanations. He usually showed the solutions to one or two problems after a brief discussion of a particular topic. In most cases, students took notes and wrote down the solutions.
Towards the end of the lesson, students were usually handed homework that was to be solved by the next lesson. The students, however, did not receive comprehensive and contextually relevant E&M tasks at the beginning of each chapter. The E&M problems solved by the teacher during class and those given as homework assignments were traditional end-of-chapter problems that focused on computation and gave students limited opportunities to engage in thoughtful discussions (see Fig. 1 for a comparison of E&M problems for the control and experimental conditions).

Instruments
The effects of an instructional intervention on the development of CT skills should be measured by using valid and reliable CT measures that are sensitive enough to capture the changes of targeted CT outcomes (Ennis 1993; Halpern 1993; McMillan 1987). The CTEM test was administered in order to measure students' acquisition of the desired domain-specific CT outcomes. The HCTA (Halpern 2010) was administered to measure the acquisition of domain-general CT outcomes. A pilot study was conducted to examine the applicability of the HCTA for use with the present participants. The test consists of 25 scenarios (5 scenarios for each domain-general CT skill targeted in the study) covering a variety of everyday health, education, politics and social policy issues. Each scenario is followed by questions that require respondents to provide a constructed response and to subsequently select the best option from a short list of alternatives (forced-choice items). Based on the findings of the pilot study, 5 scenarios (1 from each CT category) that were somewhat confusing and reduced the test's overall internal consistency in this particular context were omitted. As a result, 20 constructed-response and 20 forced-choice items were ultimately administered. Both the CTEM and HCTA focus on similar CT components, with the exception that the CTEM items focus on E&M tasks, while the HCTA items focus on thinking tasks drawn from everyday life that do not require specific subject matter expertise (see Fig. 3 for sample CTEM and HCTA items).

Sample E&M problem in the control class
A parallel plate capacitor has a square plate of side 10 cm, and separation 4 mm. A dielectric slab of dielectric constant k = 2 has the same area as the plates but has a thickness of 3 mm. What is the capacitance (a) without the dielectric, and (b) with the dielectric?

Sample E&M problem in the experimental class
You may be aware that many of the standard computer keyboard buttons are constructed of capacitors. The keys are spring-loaded and each key has a tiny plate attached to the bottom of it. When you press a key, it moves this plate closer to the plate below it. So, basically, when a key is pushed down, the soft insulator between the movable plate and the fixed plate is compressed, changing the capacitance. This change in capacitance helps the computer to recognize which key is pressed. Let us assume that the separation between the plates is initially 5.00 mm, but is reduced to 0.150 mm when a key is pressed. The plate area is and the capacitor is filled with a material whose dielectric constant is 3.50. Determine the change in capacitance detected by the computer. Explain the relationship among plate area, dielectric material, and capacitance.

Fig. 1 Sample E&M problems for the control and experimental condition

You have probably heard about the risk of lightning strike on human life. During a stormy day, your parents may have advised you not to walk outside on the street so that you may not encounter an electric shock from lightning strikes. On the other hand, you know that flying in an airplane during stormy weather is completely safe with respect to the electric shock that you might encounter due to lightning. You know that the skin of most aircraft consists primarily of aluminum, which conducts electricity very well. In addition, you have been hearing that airplanes experience lightning strikes during flight, but apparently the electric shock from the lightning is not felt by passengers inside the plane. Although the external part of the aircraft conducts electricity, you have come to know that the electric current from the lightning strike remains on the exterior of the aircraft. Why do you think the presence of this electric charge on the external body of the airplane is not felt within the airplane? What explanations do you give for this phenomenon?

Fig. 2 Sample whole-task for chapter three

We computed the internal consistencies (Cronbach's alpha) of the administered tests in the present study: .74 for the CTEM, .76 for the HCTA constructed-response, .73 for the HCTA forced-choice and .77 for the HCTA overall test. Although a desirable value for internal consistency may vary as a function of the nature of the construct being measured, Cronbach's alpha values between .70 and .80 are considered acceptable (Cohen et al. 2007). Prior physics knowledge of the participants (physics scores from the Ethiopian Higher Education Entrance Examination) was collected from the student records offices of the two universities.
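For readers unfamiliar with the internal consistency coefficient reported above, the following is a minimal sketch of how Cronbach's alpha can be computed from an item-score matrix. The function name `cronbach_alpha` and the small score matrix are hypothetical illustrations, not the study's data or analysis script.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    n_items = len(scores[0])
    # Transpose so that each tuple holds all respondents' scores on one item.
    item_columns = list(zip(*scores))
    item_variances = [variance(column) for column in item_columns]
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores: 3 respondents, 2 items (illustration only).
example_scores = [[2, 3], [4, 4], [5, 6]]
alpha = cronbach_alpha(example_scores)
print(round(alpha, 3))
```

Values between .70 and .80, such as those reported for the CTEM and HCTA, would fall in the range the text describes as acceptable.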

Procedure
The CTEM was administered as a posttest-only test a week after the end of the intervention. Because the CTEM items require prior knowledge of E&M, we felt it was reasonable to administer the test only at the end of the intervention. The HCTA test, on the other hand, was administered to both the experimental and control groups as a pretest immediately before the beginning of the intervention and as a posttest a week after the end of the intervention. For practical reasons, the paper version of the HCTA test was administered, since computer-based administration of the HCTA was not possible. Participants were required to first answer all the constructed-response format items and then the forced-choice format items. Administration of the CTEM test lasted between 60 and 75 min, and the HCTA (both formats) between 70 and 90 min.

Sample CTEM item:
Hanna does the following experiment: she brings a positively charged rod close to a metal can. Doing the experiment shows that the can is attracted to the rod.
Hanna is puzzled by the result of her experiment. She expected that the negative electrons would be attracted to the rod while the positive nuclei would be repelled, and that the opposite forces would cancel out, which would mean that the can remains at rest.
How can you make Hanna's argument consistent with the experiment? Give an explanation.

Sample HCTA item:
After a televised debate on capital punishment, viewers were encouraged to log on to the station's web site and vote online to indicate if they were "for" or "opposed to" capital punishment. Within the first hour, almost 1000 people "voted" at the website, with close to half voting for each position. The news anchor for this station announced the results the next day. He concluded that the people in this state were evenly divided on the issue of capital punishment.
Given these data, do you agree with the announcer's conclusion? Yes / No
Provide two suggestions for improving this study:
First suggestion: ________________
Second suggestion: ________________

Approximately 90 % of the experimental lessons were observed, and the experimental instructor was consulted after each lesson to reflect on challenges that surfaced as well as any other aspects that might improve implementation of the lessons as designed. Post-lesson discussions focused on such issues as usage of instructional time, giving of support and feedback to groups within the allocated instructional time, oral questions used to prompt students to further elaborate on their answers, and overall evaluation of the implementation of the lesson in relation to the design. Instructors registered class attendance for each session in both the experimental and control conditions. Eighty-five percent of the experimental group students and approximately 80 % of the control group students attended more than 90 % of the sessions. There were two dropouts in the experimental group and one dropout in the control group. The pretest data of those three students were omitted from the results. This means that our analysis of the data from the two groups is based on 45 students for the experimental group and 44 students for the control group.

Screening of the data
The CTEM and HCTA scores were screened for accuracy of data entry, missing values and the assumptions of normality and homogeneity of variances. A separate overview of the experimental and control students' scores for each CTEM and HCTA item showed random missing data for a few items. However, the proportion of missing values per item was very limited (< 5 %) and scattered over each of the 20 CTEM and HCTA items. Mean substitution was therefore used to estimate the missing data. The mean scores for each separate item for the experimental and control groups were calculated and the handful of missing values were substituted with the respective group mean scores. Outliers were also separately sought in the experimental and control groups. Visual inspection of boxplots and inspection of the z scores for each of the CTEM and HCTA variables showed that there were no potential outliers.
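The group-wise mean substitution described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function name `substitute_group_means`, the use of `None` to mark a missing value, and the tiny data set are all hypothetical and do not come from the study.

```python
from statistics import mean


def substitute_group_means(scores_by_group):
    """Replace missing item scores (None) with that item's mean in the same group.

    scores_by_group maps a group label to a list of respondents,
    each respondent being a list of item scores.
    """
    filled = {}
    for group, rows in scores_by_group.items():
        n_items = len(rows[0])
        # Item means are computed from observed values within this group only.
        item_means = [
            mean(row[j] for row in rows if row[j] is not None)
            for j in range(n_items)
        ]
        filled[group] = [
            [item_means[j] if value is None else value for j, value in enumerate(row)]
            for row in rows
        ]
    return filled


# Hypothetical data (illustration only): two groups, two items.
raw = {
    "experimental": [[1, None], [3, 4], [2, 6]],
    "control": [[None, 2], [4, 2]],
}
completed = substitute_group_means(raw)
print(completed["experimental"][0])  # missing item filled with the group's item mean
```

Because each group's item mean is used, the substitution leaves the group means per item unchanged, although it slightly reduces within-group variance.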
Moreover, tests of assumptions for normality and homogeneity of variances were conducted through examination of the standardized residuals for the CTEM and HCTA scores. For the CTEM, a Shapiro-Wilk test (p > .05) and a visual inspection of the histograms, the Q-Q plot and boxplot suggested that the scores from the two groups were approximately normally distributed. Using the standardized residuals, the assumption of homogeneity of variances was tested and satisfied based on Levene's F test, F(1, 87) = 1.57, p = .11. For the HCTA scores, a Shapiro-Wilk test (p > .05) and a visual inspection of the histograms and boxplot showed that the HCTA pretest and posttest scores were also approximately normally distributed for both the experimental and control groups. Furthermore, the assumptions of homogeneity of variances were tested and satisfied based on Levene's F test for the pretest (F(1, 87) = .16, p = .69) and posttest scores (F(1, 87) = 1.36, p = .25).

Domain-specific CT performance: CTEM
Initial comparison of prior physics knowledge revealed no significant differences between the experimental and control group, t(87) = .15, p = .88. An independent-samples t test was therefore conducted to compare the performance of the two groups on the domain-specific CT test. The results indicated that the CTEM mean score for the experimental group was significantly higher than that of the control group, t(87) = 7.15, p < .001, d = 1.55. The effect size for this analysis was found to exceed Cohen's (1988) convention for a large effect (d = .80).
An analysis of covariance (ANCOVA) was conducted to examine whether the statistically significant mean score differences could be maintained after controlling for physics prior knowledge. The ANCOVA results showed that the CTEM mean score of the experimental group was significantly higher than that of the control group, F(1, 86) = 52.56, p < .001, η² = .379. The results indicated that the intervention accounted for 37.9 % of the variance in the acquisition of domain-specific CT skills. Post-hoc power analysis by using G*Power (Faul et al. 2007) indicated that the power to detect the effect size observed in the present study (d = 1.55, p < .001) was > .99. The a priori power analysis indicated that a total sample size of 84 would be sufficient to detect a large effect (d = .8; Cohen 1988) with a power of .95 (p = .05), and a total sample size of 210 would be sufficient to detect a medium effect (d = .5; Cohen 1988) with a power of .95 (p = .05). See Table 3 for descriptive statistics of the CTEM test.
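As a consistency check on the reported effect size, an eta squared of the partial kind can be recovered from an F statistic and its degrees of freedom. The sketch below assumes that the reported value is a partial eta squared; the function name `partial_eta_squared` is our own and is not part of any statistics package.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F ratio: (F * df1) / (F * df1 + df2)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# ANCOVA result reported in the text: F(1, 86) = 52.56.
eta_sq = partial_eta_squared(52.56, 1, 86)
print(round(eta_sq, 3))
```

Under that assumption, the value obtained from F(1, 86) = 52.56 rounds to the .379 reported in the text.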

Domain-general CT performance: HCTA
In order to examine the effect of the instructional intervention on students' domain-general CT performance, a 2 (groups: experimental and control) × 2 (testing time: pretest and posttest) mixed design ANOVA was conducted. The results of the mixed design ANOVA revealed that the two groups together demonstrated a statistically significant improvement on the HCTA mean scores across the two time points, F(1, 87) = 4.61, p = .035, η² = .05. The effect size value (η² = .05) suggested a small practical significance. However, there was no significant interaction between the intervention type (experimental-control) and the testing time (pretest-posttest), F(1, 87) = .14, p = .71. In other words, the HCTA mean score for the experimental group did not show a significant pretest-posttest improvement compared to the control group. This indicates that the experimental learning environment did not result in a significantly greater pretest-posttest improvement in the acquisition of domain-general CT skills compared to the control learning environment. The descriptive statistics of the HCTA scores are shown in Table 3.
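In a 2 × 2 mixed design with two groups and two time points, the group-by-time interaction is equivalent to an independent-samples t test on the pretest-posttest gain scores: the interaction F equals t squared, with error degrees of freedom n1 + n2 − 2 (here 45 + 44 − 2 = 87). The following sketch illustrates that equivalence with a pooled-variance t test; the function name `gain_score_t` and the scores are hypothetical and are not the study's data.

```python
from math import sqrt
from statistics import mean, variance


def gain_score_t(pre_a, post_a, pre_b, post_b):
    """Pooled-variance t test comparing pretest-posttest gains of two groups.

    Returns (t, df). In a 2 x 2 mixed ANOVA, t ** 2 equals the F ratio of
    the group-by-time interaction, with df as the error degrees of freedom.
    """
    gains_a = [post - pre for pre, post in zip(pre_a, post_a)]
    gains_b = [post - pre for pre, post in zip(pre_b, post_b)]
    n_a, n_b = len(gains_a), len(gains_b)
    df = n_a + n_b - 2
    pooled_variance = ((n_a - 1) * variance(gains_a) + (n_b - 1) * variance(gains_b)) / df
    standard_error = sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    t = (mean(gains_a) - mean(gains_b)) / standard_error
    return t, df


# Hypothetical pretest and posttest scores (illustration only).
t, df = gain_score_t([10, 12, 14], [13, 14, 18], [11, 13, 15], [12, 13, 17])
print(round(t ** 2, 2), df)  # interaction F and its error degrees of freedom
```

A non-significant interaction, as reported for the HCTA, therefore means that the average gains of the two groups did not differ reliably.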

Relationship between domain-specific and domain-general CT performances
Calculation of the Pearson's correlation coefficient showed a significant positive relationship between pretest HCTA and posttest HCTA scores (r = .29, p = .006). Moreover, the CTEM scores significantly correlated with the posttest HCTA scores (r = .38, p = .01). These findings show that when both groups are taken together, those students who scored higher on the pretest HCTA also tended to score higher on the posttest HCTA. Post-intervention comparison similarly indicated that those who scored higher on the CTEM test also tended to score higher on the posttest HCTA. A linear regression analysis also revealed that the CTEM test explained a significant proportion of the variance on posttest HCTA performance, F(1, 87) = 14.7, p = .05, R² = .145. The result shows that CTEM performance was a significant predictor, accounting for 14.5 % of the variance in posttest HCTA scores. Post-hoc power analysis using G*Power (Faul et al. 2007) indicated that the power to detect the observed effect at the .05 level was .94 for the regression in prediction of the posttest HCTA performance.
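With a single predictor, the regression R² is the square of the Pearson correlation, and the F statistic follows from R² and the error degrees of freedom. The sketch below, with a function name of our own choosing, shows how the reported correlation and F value relate; it is an illustration of the standard formulas, not the authors' analysis script.

```python
def simple_regression_f(r, n):
    """For a one-predictor regression: R^2 = r^2 and F = R^2 / (1 - R^2) * (n - 2).

    Returns (r_squared, f_value, df_error).
    """
    r_squared = r * r
    df_error = n - 2
    f_value = r_squared / (1 - r_squared) * df_error
    return r_squared, f_value, df_error


# Values reported in the text: r = .38 between CTEM and posttest HCTA, n = 89.
r_squared, f_value, df_error = simple_regression_f(0.38, 89)
print(round(r_squared, 3), round(f_value, 1), df_error)
```

Starting from the rounded r = .38, the F value rounds to the reported 14.7 on 87 error degrees of freedom; the R² obtained this way is .144, which differs from the reported .145 only in the third decimal, presumably because r was rounded.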

Discussion
In this study, we argued that the design of CT instructional interventions should be supported by the principles of instructional design research.
To that end, we tested an alternative method to address the challenge of CT development through the systematic design of subject matter instruction rather than explicit instruction on general CT skills. A regular physics course was systematically designed in accordance with the First Principles of Instruction model. We hypothesized that E&M instruction systematically designed in line with the First Principles of Instruction model would produce higher acquisition of domain-specific and domain-general CT skills than regular E&M instruction. Implementation of the lessons for the experimental condition was carefully monitored, and sufficient information was gathered with respect to the implementation of the lessons in the control condition. With regard to the first research question, we found that a systematically designed E&M instruction that implicitly targeted CT skills in various domain-specific classroom activities resulted in higher acquisition of domain-specific CT skills compared to regular E&M instruction. We focused on the systematic design of subject matter instruction (supported by valid principles of instructional design research) as previous CT intervention studies did not systematically explore how subject matter instruction in itself may stimulate learning of domain-specific CT skills. The instructional interventions designed and implemented as part of several previous Immersion-oriented CT empirical studies (e.g., Barnett and Francis 2012; Garside 1996; Renaud and Murray 2008; Stark 2012; Wheeler and Collins 2003) appear to show significant limitations. The interventions focused mainly on a specific component of the learning environment (e.g., small group discussion only), and only minimally emphasized other important learning environment components such as the types of learning tasks/problems designed for discussion (e.g., are the learning tasks challenging enough to provoke discussion among students? Are the tasks authentic/contextually relevant?).
They also paid scant attention to the adequacy of support, feedback and coaching offered during full-class and small group discussions. In most previous CT studies, the desired CT outcomes learners were expected to demonstrate after instruction were moreover barely described or articulated during the design phase. It is next to impossible to evaluate the extent to which the various designed tasks and instructional activities were relevant in stimulating the acquisition of the desired CT outcomes.
For the present study, efforts were made to design a learning environment that addressed the limitations of previous studies. First, the desired domain-specific and domain-general CT outcomes were operationalized and described. A learning environment that could stimulate the acquisition of the desired CT outcomes was subsequently systematically designed. In accordance with the theoretical claim that meaningful subject matter learning inherently involves development of relevant CT skills (e.g., Glaser 1984; Resnick 1987), the E&M instruction was systematically designed in such a way that it provided students with the opportunity to engage in a number of domain-specific classroom activities. It is important to point out that previous studies already implemented one or two of the instructional strategies implemented in the present study. For example, the discussion method of teaching (e.g., Wheeler and Collins 2003), and teacher modeling (e.g., Anderson et al. 2001) are among the most commonly employed instructional strategies in previous Immersion-oriented CT studies. However, for this study, we designed a comprehensive intervention that integrates most of the empirically validated instructional design principles. The findings with regard to domain-specific CT skills suggest that systematic design of subject matter instruction based on a combination of empirically valid instructional principles promotes the acquisition of domain-specific CT skills. CT development, this study argues, involves both domain-specific and domain-general dimensions. It demonstrates that acquisition of domain-specific CT skills can be improved through systematic design of subject matter instruction without explicit teaching of general CT skills. This finding is consistent with the result of a recent meta-analysis of strategies for teaching CT (Abrami et al. 2015) as well as previous theoretical claims (e.g., Glaser 1984; McPeck 1990; Resnick et al. 2010; Resnick 1987) that underlined the importance of learning environments systematically designed in accordance with relevant instructional principles.
For the second research question, however, the findings showed that the experimental learning environment did not result in a statistically significant improvement for domain-general CT skills compared to the control learning environment. Gains in domain-specific CT proficiency found in the experimental condition were not accompanied by gains in domain-general CT proficiency. The two groups together demonstrated improvement in the acquisition of domain-general CT skills between the pretest and posttest scores. The same test was administered both prior to and after the intervention, and the observed pretest-posttest improvement might simply be a test-retest effect.
On the other hand, we found that domain-specific CT proficiency significantly predicted posttest domain-general CT proficiency. This suggests that when a domain-general CT test that presumably required similar thinking skills was administered to the participants, performance on a domain-specific CT test was a significant predictor of performance on a domain-general CT test. To a degree, this reveals a tendency to transfer the acquired domain-specific CT skills in solving domain-general CT tasks. This finding is consistent with previous psychology studies in which higher performance on a psychological CT test also predicted higher performance on a domain-general CT test (e.g., Williams et al. 2004).
A number of reasons may explain why the designed learning environment did not have a significant effect on the acquisition of domain-general CT skills. The absence of an explicit focus on the desired CT skills during the E&M instruction may have kept students from abstracting the domain-specific CT skills and applying them in solving domain-general tasks. This suggests that a great emphasis on systematic development of domain-specific knowledge alone may not be sufficient to facilitate transfer of domain-specific CT skills to everyday problems. Perhaps a worthwhile approach to CT instruction may be to explicitly emphasize desired CT skills within specific subject matter instruction. Proponents of the embedded approach often claim that explicitly teaching CT skills within subject matter instruction is the best way to stimulate development of transferable CT skills (Davies 2013; Halpern and Hakel 2002; Halpern 1998). For example, some generalists have argued that students must be aware that they are being taught CT skills during specific subject matter instruction and that they will be expected to use those skills to solve everyday problems or issues they will come across. However, the main criticism that has been directed at generalists is that they largely see CT as everyday problem-solving that is detached from domain-specific CT proficiency (see Bailin et al. 1999; Resnick et al. 2010; Smith 2002). To date, there is no agreement on how specific subject matter instruction can be optimally designed to develop both domain-specific and domain-general CT skills. An important area for future studies would therefore be to evaluate the effectiveness of explicit teaching of CT skills within well-designed subject matter instruction to develop both domain-specific and domain-general CT skills.
It could prove interesting to compare an Immersion-based learning environment with an Infusion-based learning environment in which CT skills are explicitly trained within systematically designed subject matter instruction.
Another possible explanation for the insignificant effect on the acquisition of domain-general CT skills may relate to the longstanding debate around the specificity and generality of CT skills. As noted in our above analysis of existing CT literature, generalists (e.g., De Bono 1991; Ennis 1989; Siegel 1988) view CT skills as applicable across domains, whereas specifists (e.g., McPeck 1990) argue against the existence of general CT skills on the grounds that thinking always amounts to thinking about something and that specific knowledge of a subject matter is necessary for CT. In this study, students in the experimental condition were intensively engaged in acquiring deeper understanding of E&M through an implicit emphasis on the desired CT outcomes. These students performed significantly better than the control group students on domain-specific CT tasks. However, the acquired domain-specific CT proficiency did not transfer when the same students were confronted with domain-general CT tasks (viz., the HCTA). Following the specifist view, it could be argued that the study participants perhaps lacked adequate knowledge of the content used in preparing the HCTA test. This reinforces the notion that the ability to think critically is mainly content dependent (e.g., Bailin et al. 1999; McPeck 1990; Smith 2002). The findings revealed that, compared to the control group, the experimental group students were able to demonstrate proficiency in using CT skills for E&M-specific thinking tasks. However, those CT skills were not applicable when they were presented with domain-general CT tasks. Students' failure to transfer the acquired domain-specific CT skills may therefore spring from the HCTA itself. An important area for future study would therefore be to evaluate the effectiveness of CT-embedded instructional approaches through administration of at least two domain-general CT tests that were designed based on different everyday content yet focused on similar CT skills.
A third possible explanation for the unimproved domain-general CT skills may relate to the brief duration of the intervention: 8 weeks and with a focus on just 50 % of the E&M course content. Perhaps the intervention was too short to produce a substantial change in participants' modes of thought, which made it impossible for them to transfer the acquired domain-specific CT skills to other domains than the E&M problems. Moreover, the experimental group students were also simultaneously following other courses in which subject matter instruction appeared to be less systematically designed. This may have resulted in limited opportunities for students to extensively practice the desired CT skills in other subject matter domains, and hence hindered their transfer. An important implication of this finding is that transfer of domain-specific CT skills to everyday problems may not automatically occur during a brief instructional intervention, but may instead require a conscious and systematic design of all subject matter instruction toward CT.

Study limitations
The findings of this study are based on a comparison of two intact classrooms at different universities taught by different instructors. Although the initial plan was to use two intact groups at the same university, the number of first-year students majoring in physics at the targeted university was very limited, with just one intact group. To minimize the effects of having two different instructors and institutions, efforts were made to recruit instructors from the two universities with similar education levels and equivalent years of teaching experience. Efforts were also made to closely monitor the implementation of the lessons at both the experimental and control universities. However, it is important to interpret the findings from the present study by taking into consideration the limitations that sprang from having different institutions and instructors. Moreover, random assignment of the two intact groups into an experimental and control condition was not feasible. The first author is affiliated with one of the two universities. Since we expected to intensively collaborate with the regular instructors and wanted to make the close follow-up more convenient, the group at the affiliated university was purposely assigned to the experimental condition.

Conclusion
This study explored the effectiveness of systematically designed subject matter instruction on the development of domain-specific and domain-general CT skills. It demonstrated that a typical freshman course systematically designed based on the First Principles of Instruction model, with an implicit focus on the desired CT outcomes as an integral part of the domain-specific classroom activities, can stimulate the development of domain-specific CT skills. This finding suggests that systematic design of subject matter instruction needs to be made an important component of teaching and learning in undergraduate education if students are to demonstrate domain-specific CT proficiency. Although this study's instructional intervention failed to provide evidence of the transfer of the acquired domain-specific CT skills to everyday problems, this does not mean that domain-general CT skills cannot be systematically taught. Our hope is that the present study will encourage researchers and instructional designers to pay attention to systematic design of subject matter instruction as a valuable approach to addressing the challenges of CT development. The following observations with regard to CT research in undergraduate education were particularly important. First, we showed that both the domain-specific and domain-general CT outcomes that we wish students to demonstrate need to be identified and precisely articulated before any attempts at teaching CT. Second, through a systematic design of regular subject matter instruction, useful empirical evidence was presented that supports the longstanding theoretical claim that meaningful subject matter learning in a domain can result in the development of domain-specific CT skills.
Third, following the argument that embedding CT within subject matter domains should result in the acquisition of both domain-specific and domain-general CT skills, CTEM and HCTA tests were administered respectively to evaluate the effectiveness of the designed instructional intervention. Accordingly, empirical evidence that establishes the relationship between acquisition of domain-specific and domain-general CT skills, a barely examined research question, was validated. Our starting point was that instructional interventions for CT are not sufficiently supported by the principles of instructional design research. Through this study, we hope to have demonstrated how the two largely detached fields of CT and instructional design research can systematically be integrated. We moreover argued that the instructional principles behind various instructional design models are not sufficiently attuned to specific instructional settings. In this study, we hope to have shown how those empirically valid instructional design principles can be translated into usable instructional design prescriptions that are also relevant to CT.
Dawit Tibebu Tiruneh is a doctoral student at the Faculty of Psychology and Educational Sciences of the KU Leuven, Belgium, and instructor at Bahir Dar University, Ethiopia. His main research interest is the design and development of learning environments for critical thinking.
Ataklti G. Weldeslassie received his Ph.D. in Physics from KU Leuven and is currently a member of the teaching staff at the Science, Engineering and Technology Group, campus Group T of the KU Leuven, Belgium. His main research interest is in physics education.
Abrham Kassa is a lecturer in the department of Physics at Bahir Dar University, Ethiopia. His main research interest is in the teaching and learning of physics in higher education.
Zinaye Tefera is a lecturer in the department of Physics at Bahir Dar University, Ethiopia. His primary research interest is inquiry learning and the design of instructional strategies for promoting understanding in physics.
Mieke De Cock is associate professor in the department of Physics and Astronomy of the KU Leuven, Belgium, where she is responsible for the physics teacher training program. Her research focusses on conceptual understanding in physics, student use of mathematics in physics and integrated STEM education. She is teaching both introductory physics courses and teacher training courses.
Jan Elen is a professor of educational technology and teacher education at the Faculty of Psychology and Educational Sciences of the KU Leuven, Belgium. His main research interest is in the field of instructional design. He teaches both introductory and advanced courses in instructional psychology and educational technology. He is the senior editor of Instructional Science.