Introduction

An important yet nebulous aspiration of higher education involves cultivating future leaders for a complex and unknown tomorrow. Writing about the future, Paul noted, “Governmental, economic, social, and environmental problems will become increasingly complex and interdependent… The forces to be understood and controlled will be corporate, national, trans-national, cultural, religious, economic, and environmental, all intricately intertwined” (1993, p. 13). More recently, Wheatley depicted the world as an “interconnected planet of uncertainty and volatility” where changes in one area can dramatically and surprisingly impact changes in an interconnected area (2005, p. 114). As problems continue to grow in complexity and connectivity, individuals in search of innovative ideas and solutions are being asked to exhibit systems thinking capabilities characterized by an ability to see the world as a complex interconnected system where different parts can influence each other and the interrelationships determine system outcomes (Senge, 2006; Sterman, 2000).

To prepare for such complex challenges, governmental bodies, industry, and funding agencies have all pressed colleges and universities to offer experiences that prepare students to tackle broad problems. Students should be able to integrate and connect ideas and perspectives across multiple disciplines and content areas to discover and invent new solutions (e.g., National Academy of Engineering, 2004; National Academy of Sciences, 2004; National Institutes of Health, 2006; National Research Council, 2012). The National Research Council’s (2012) Education for Life and Work: Developing Transferable Knowledge and Skills in the 21st Century identified “systems thinking” as an area of strong overlap between discipline-based standards and deeper learning/twenty-first century skills (i.e., knowledge or skills that can be transferred or applied in new situations). Systems thinking has also been identified as a broader core competency in sustainability research and problem solving (Wiek et al., 2011) and sustainability literacy (ACPA, 2008; Connell et al., 2012; Dale & Newman, 2005; Svanström et al., 2008).

Despite widespread consensus around the importance of this competency for graduates, colleges and universities have several challenges to overcome to support its development. First, discipline-based organizational structures that typically organize curricula (Warburton, 2003) can hinder student development of systems thinking and interdisciplinary problem solving skills (Svanström et al., 2008; Warburton, 2003). Second, systems thinking is a challenging competency to assess, and in the absence of other validated measures, programs have traditionally relied on students’ self-assessments as a form of evidence. Our paper addresses this latter challenge and problematizes the reliance on self-report assessments for systems thinking. We present comparisons of engineering students’ self-assessments of systems thinking and several related competencies (i.e., critical thinking, interdisciplinary skills, contextual competence) and their performance on two newly developed scenario-based assessments of systems thinking to address the following research question: Are students’ scores on scenario-based assessments of systems thinking related to their scores on self-report assessments of related competencies? Our findings demonstrate that the self-assessments do, indeed, relate to one another, but we do not see a relationship with performance on either scenario-based assessment. These results raise important questions about what information is obtained through each assessment method.

Self-Report Assessments

Self-report assessments are frequently used in educational research and assessment, particularly in the context of co-curricular student activities (e.g., service learning, study abroad) and national surveys of student outcomes (e.g., National Survey of Student Engagement, NSSE). Prior research on self-report assessments has argued both for and against their use in educational research and assessment (Bowman & Hill, 2011; Chan, 2009; Miller, 2012; Porter, 2011). Many of these arguments are based on surveys that ask students to self-report learning gains (i.e., how much they have learned between time A and time B) or changes in attitudes over time. For example, Bowman and Hill (2011) explored social desirability bias (i.e., the tendency of students to respond in a way that would be viewed favorably by others) and the halo effect (i.e., where a general positive or negative impression of an experience influences responses on specific items) in NSSE respondents and found that these issues were significant in first-year college students but negligible in later years. Similarly, studies of the Wabash data set, a national data set that combines institutional data (e.g., enrollment and test scores) with student survey data, revealed differences between student self-reported gains and longitudinal measures of development, including biases that varied across institutional type and student characteristics (Bowman, 2010, 2011). Based on these results, Bowman (2011) argues that students with more reason to reflect on their learning (e.g., if they are concerned about their performance) may report their learning gains more accurately, and suggests that, although the use of self-reported learning gains in research is questionable, the concern may be partially overcome by accounting for known biases.

Porter (2011) takes a stronger stand, arguing that self-reported learning gains are not valid, using NSSE as an example. Building on research on instrument development, memory, and recall, Porter claims that students cannot be expected to accurately estimate how their learning has changed over time. In other studies, Porter has shown that students struggle to accurately report more concrete pieces of information, such as the books used in their courses and their performance in those courses (Porter, 2011; Rosen et al., 2017). Further, Porter (2013) has suggested a theory to explain how students approach survey questions that ask them to self-report learning gains. This theory builds on the idea that answering such a question accurately requires a complex seven-step process, and thus students may actually treat these questions as attitudinal rather than factual (i.e., “how do I feel about my learning” rather than an objective assessment of learning gains). Although these perspectives provide helpful insights into the challenges of using self-report assessments to measure student learning gains (i.e., change over time), measuring gains over time is not the same as measuring attitudes or competence at a particular moment in time, which is the focus of the current study.

Taking a broader view of self-report surveys beyond those focused on learning gains, Chan (2009) suggests that the traditional concerns about self-report assessments are too broad (e.g., social desirability bias, common methods variance)—asserting that these critiques are valid in specific cases but not for every study. In many cases, self-report assessments may be as valid as other measures and, occasionally, may be even more so (e.g., when studying individual perspectives or attitudes; Chan, 2009). Pike (2011) presents a more nuanced view of self-report assessments, suggesting that they can be used effectively in educational research when they are supported by intentional use of theory in their design and when interpreting results. Most validation studies of self-report measures focus on assessing criterion validity; that is, they compare the self-report results to other outcomes that are expected to align with the construct being measured. In some cases, this expected alignment is clear (e.g., self-reported learning in a course could be compared to final grades). However, when the construct of interest is a more abstract concept (e.g., critical thinking), theory is necessary to support the researcher’s selection of the comparison measures that are used in the validation study (Pike, 2011).

Research on self-assessment of competence suggests that experts can more accurately assess their competence level than novices (Kruger & Dunning, 1999). Several explanations for this effect have been suggested, including (but not limited to) the following: (a) greater expertise produces enhanced metacognitive skills, which allow people to judge their level of competence more accurately (Dunning & Kruger, 2002; Ehrlinger et al., 2008; Kruger & Dunning, 1999); (b) different perceptions of difficulty in a task lead to different assessments (Burson et al., 2006); and (c) lacking self-confidence in a task results in more arbitrary assessments (Händel & Dresel, 2018). This self-assessment pattern has been identified across a variety of domains and populations, including college students assessing their performance on exams (Händel & Dresel, 2018). Because students are novices in their fields, they may be vulnerable to inaccurate self-assessment, especially in competence areas that could be expected to develop over the course of a career. Finding alternatives to students’ self-assessment of such competences is therefore particularly important. Indeed, initial studies of self-report surveys of students’ competence have revealed biases in their results. For example, Anderson et al. (2017) compared self-reports to situational judgment tests (SJTs, whereby respondents are provided a problem or scenario and asked how they would respond) and discrete-choice experiments (DCEs, whereby respondents are asked to state their preferences or choices across a set of options), finding that both SJTs and DCEs may mitigate some of the bias issues in self-report surveys assessing interpersonal and intrapersonal skills.

The purpose of our study is to compare the results of self-report assessments and scenario-based assessments of a more abstract concept, systems thinking. In alignment with Pike’s (2011) suggestions, we provide theoretical support for the scenario-based assessments we use. Through our analysis, we make assertions about the validity of self-report surveys for assessing this competency and contribute to the ongoing discussion about when and how to use self-report assessments in educational research and assessment.

Theoretical Perspectives Framing Systems Thinking Assessments

Systems thinking is the ability to see the world as a complex interconnected system where different parts can influence each other and the interrelationships determine system outcomes (Senge, 2006; Sterman, 2000). Such a perspective results in seeing multiple stakeholders, focusing on trends rather than single events, and considering unintended consequences of actions intended to improve a system outcome. Because systems thinking is a concept that has been developed across disciplines, there are a variety of definitions and conceptualizations (Mahmoudi et al., 2019). Our own orientation as systems thinking researchers is that we embrace the methodological pluralism that gives rise to myriad definitions and tools. We believe not only that this diversity of approaches is beneficial but also that it is to be expected, largely because of the nature of the problems systems thinking aims to solve—namely, ill-structured and so-called wicked problems. Such problems are often characterized by their ambiguity and socio-technical complexity, where no clear single solution exists and where problem-solvers must make sense of insufficient and/or overwhelmingly extensive information to scope out and implement solutions, which may in turn give rise to new problems.

Aligned with this framing, we believe systems thinking to be a metacognitive competency, marked by the ability to critically and flexibly reason through complexity and multiple dimensions in any decision-making or problem-solving context. Although we believe that some skills transfer across domains within systems thinking, specific knowledge domains (e.g., the field of systems engineering) still warrant their own catered definitions and approaches. Further, we acknowledge that this methodological pluralism leads to significant and, at times, confusing discrepancies between definitions of systems thinking as well as overlap with other types of thinking discussed in the literature, such as critical thinking, creativity, and design thinking. Our understanding of systems thinking as theoretically related to these other constructs informed our selection of the assessment tools that we used, such as assessments of critical thinking and contextual competence. Although these assessments do not all use the terminology of systems thinking, the construct definitions and items indicate similar concepts.

Considering the varied definitions of systems thinking and associated tools, a recent systematic literature review by Dugan et al. (2022) mapped out a range of assessments being used in engineering education, identified 27 assessments in total, and categorized both assessment types and formats. A majority of the identified assessments (19/27) were behavior-based, which involves assessing knowledge or skills from participants’ responses or artifacts, while the other types included preference-based, self-reported, and cognitive activation. The specific formats of the assessments included mapping (e.g., concept mapping), scenario-based responses, oral, fill-in-the-blank, multiple-choice, virtual reality, cognitive activation via functional near-infrared spectroscopy (fNIRS), and open-ended responses (i.e., participant responses that are not based on prepopulated language or options).

In this paper, we employ two behavior-based, scenario-based assessments that measure different aspects of students’ understanding of systems and then compare those measures to several existing self-report assessments of systems thinking and other competencies that the literature suggests are related. Each of these scenario-based assessments is based on a theoretical framework of systems thinking, as described in the following sections. As outlined above, we believe in methodological pluralism and that there is great value in a wide range of assessments for systems thinking. However, of the types and formats identified by Dugan et al. (2022), we find value in interrogating the relationships between behavior-based and self-reported assessments given that these two modalities most naturally mimic the teaching and learning environments of university education. Specifically, courses regularly use behavior- or performance-based metrics to assess learning, and co-curricular and/or program assessments in universities often use self-report instruments as a quick, convenient way to try to capture pre/post change in self-reported attitudes, values, beliefs, or behaviors. Further, in a comparison study of cognitive activation and self-reported assessments, Hu and Shealy (2018) found no correlation and highlighted that further efforts to study relationships between self-reported assessments and other types of assessment would be fruitful.

Theoretical Framework 1: Dimensions of Systems Thinking

Grohs et al. (2018) describe a three-dimensional framework of systems thinking (Fig. 1). The problem dimension considers both technical elements and contexts in analyzing a complex problem, including assumptions, goals, and constraints. The perspective dimension considers multiple perspectives or “frames” of the problem across potential stakeholders. The time dimension considers the history of a situation as well as potential short- and long-term unexpected consequences of each possible action.

Fig. 1 Dimensions of Systems Thinking framework (reproduced from Grohs et al., 2018)

This framework was developed based on the systems thinking research literature and is an example of a framework that treats systems thinking as a general perspective. Grohs et al. (2018) developed a scenario-based assessment of systems thinking from this framework, which presents participants with a short scenario and asks them to respond to six questions related to problem identification, information needs, stakeholder awareness, goals, unintended consequences, and implementation challenges. A rubric is used to score these six responses individually and also to rate the alignment across responses (Grohs et al., 2018).

Theoretical Framework 2: Systems as Webs of Interconnections

Another approach to understanding systems thinking is to consider the basic criteria that differentiate a systems approach from its alternatives. From this perspective, systems thinking focuses on understanding systems as a whole and recognizing interconnections between different system parts (Meadows, 2008; Senge, 2006). When thinking about systems this way, one can see that problems are often related and changing over time (Ackoff, 1971, 1994; Senge, 2006) and uncover circular chains of causes and effects (feedback loops) within the system (Senge, 2006; Sterman, 2000). It is this foundational understanding that can help individuals move from a focus on individual events to changing patterns over time, and from blaming individuals or external enemies to seeing ourselves as part of the system potentially contributing to the problem (Meadows, 2008; Senge, 2006). By seeing interconnections, people recognize that for every action, there will be a reaction from the system. In contrast, a non-systems thinker sees events linearly, X causing Y, and looks for easy and fast solutions to fix symptoms of problems (Forrester, 1971). Such simplistic solutions often fail due to the system’s resistance and result in unintended consequences (Forrester, 1971; Senge, 2006). The recognition of interconnections and feedback structures is the foundation of many systems thinking schools of thought, such as system dynamics (Ghaffarzadegan & Larson, 2018; Randers, 2019; Richardson, 2011; Sterman, 2018).

Using this framework, we can compare and contrast individuals’ mental models to understand whether they see a complex socio-environmental problem as caused by a single factor/player or recognize how the actions of different players are related. To that end, a scenario-based assessment based on a real-world complex case was designed and used to assess individuals’ evaluation of the problem (Davis et al., 2020). Similar to the assessment described in the previous section, participants are presented with a scenario describing the situation and are asked to explain what went wrong. A scoring process is then used to identify the number of variables, causal links, and feedback loops described in the response.

Methods and Results

This paper describes data collected from two related studies, each built on one of the previously described theoretical frameworks of systems thinking. For both studies, participants were students enrolled in a spring semester first-year engineering course called Global Engineering Practice. Because the university has a common first-year engineering program, all participants were General Engineering majors at the time of data collection, typically in their first year in college following high school (with a small number of older transfer students); those students then joined different engineering departments the following year (e.g., mechanical engineering, civil engineering, etc.). The demographic composition of engineering students who self-select into this class tends to be more diverse than that of the College of Engineering, particularly with respect to gender: about half of the class enrollment was women, compared to about 20–25% of the college more broadly. Students from racially minoritized groups were slightly more represented in the class relative to the college. All students included in the sample consented to participate in this study, which was approved by the university IRB office. We present the Methods and Results of Study 1 first, followed by the Methods and Results of Study 2, since we followed a sequential approach to data collection and analysis.

Study 1 Data Collection and Analysis (Dimensions of Systems Thinking Framework)

The data for this study were taken from the 2017 (n = 123) and 2018 (n = 140) iterations of the course for a total of 263 participants. An instrument was administered to all students in class that included a scenario-based assessment of systems thinking aligned with the Dimensions of Systems Thinking framework. This assessment presents students with a one-paragraph description of a complex situation facing the Village of Abeesee. Students then complete six open-ended questions aligned with the dimensions of the framework. This scenario is scored using a rubric where students are rated between 0 (irrelevant response) and 3 (strong response) on each of the six questions and also on the extent to which their responses align logically across questions. All assessments were scored by a single researcher who had undergone training from the developers of the instrument. This researcher discussed multiple subsets of the scored data with the instrument developers as a formal peer audit process. Strong responses are characterized by a holistic framing of the problem, including both technical and contextual details, an acknowledgement of short-term and long-term considerations, and the inclusion of a variety of stakeholders. Intermediate responses include only some of these aspects and are typically limited in their analysis of the scenario. For a detailed description of the scenario, the rubric, and their development process, see Grohs et al. (2018). The text of the scenario is included in Appendix 1.

This scoring process yields seven variables for analysis purposes (i.e., problem identification, information needs, stakeholder awareness, goals, unintended consequences, implementation challenges, and alignment; an overall score is not calculated). Students also completed four self-report assessments related to systems thinking (Lattuca et al., 2013; Moore et al., 2010; Ro et al., 2015; Sosu, 2013). The Systems Thinking Scale (Moore et al., 2010) was chosen because of its explicit focus on systems thinking. The other chosen assessments measure what we expect to be related constructs given the literature on complex ill-structured problem solving (described earlier in the "Theoretical Perspectives Framing Systems Thinking Assessments" section). Specifically, Jonassen (2010) highlights causal reasoning, analogical reasoning, and epistemological beliefs as cognitive skills that describe individual differences in ability to solve ill-structured problems. Thus, we would expect that there may be relationships between some of these assessments and behavioral measures of systems thinking ability. These instruments and their scales/subscales are shown in Table 1. The items for each scale are included in Appendix 2.

Table 1 Self-report assessments used for comparison in Study 1

We conducted both correlation and regression analyses. First, we calculated a correlation matrix comparing the scenario assessment scores with the total scores and scale and/or sub-scale scores for each of the self-report assessments. Because we made multiple comparisons, we adjusted p values for family-wise error rate using the Holm correction, choosing this option because it provides a balance between reducing the risk of Type I errors and maintaining statistical power (Field et al., 2012). Although much of our data are Likert-scale survey responses, we used the Pearson correlation method as this has been shown to be robust for use with both ordinal and non-normal data (Norman, 2010). Next, we conducted multiple regression analyses with the scenario scores as the dependent variable and the systems thinking self-report scores as the independent variables. For each regression analysis, we checked for multicollinearity using the variance inflation factor (VIF) and for independent errors using the Durbin-Watson test. All VIF values were under 2 and Durbin-Watson values were between 1.75 and 2.25, both of which are well within the recommended values (Field et al., 2012).
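To make these analytic steps concrete, the following is a minimal sketch of how such a pipeline could be implemented in Python with pandas, SciPy, and statsmodels; the file name and column names are illustrative placeholders rather than our actual variable names, and the sketch is not the exact script used in the study.

```python
# Minimal sketch of the Study 1 analysis (illustrative file and column names).
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("study1_scores.csv")  # hypothetical file of scenario + self-report scores

scenario_cols = ["problem_id", "info_needs", "stakeholders", "goals",
                 "unintended", "implementation", "alignment"]   # rubric scores (0-3)
selfreport_cols = ["systems_thinking", "critical_thinking",
                   "interdisciplinary", "contextual"]           # scale totals

# Pairwise Pearson correlations, with a Holm correction over the family of comparisons
pairs, rvals, pvals = [], [], []
for s in scenario_cols:
    for q in selfreport_cols:
        r, p = stats.pearsonr(df[s], df[q])
        pairs.append((s, q)); rvals.append(r); pvals.append(p)
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="holm")

# One regression per scenario dimension, with collinearity and autocorrelation checks
X = sm.add_constant(df[selfreport_cols])
for s in scenario_cols:
    fit = sm.OLS(df[s], X).fit()
    vif = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
    dw = durbin_watson(fit.resid)
    print(s, round(fit.rsquared_adj, 3), [round(v, 2) for v in vif], round(dw, 2))
```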

Results for Study 1

The correlation matrix comparing the scores for the seven dimensions of the Dimensions of Systems Thinking scenario assessment with the scale and sub-scale scores for the four self-report assessments is shown in Fig. 2 (the table of p values for this matrix is in Appendix 3). There are few significant correlations among the scores on the different questions of the scenario assessment: a medium correlation between unintended consequences and implementation challenges and a weak correlation between goals and alignment (where a weak correlation is 0.1 < r < 0.3, medium is 0.3 < r < 0.5, and strong is r > 0.5; Field et al., 2012). There is no specific expectation from the Grohs et al. (2018) instrument that strong correlations would be seen across these constructs in a mixed population of respondents. For an expert respondent, it would make sense that all constructs would be similarly high, but it is not clear that any prescribed patterns should exist for novice respondents. The relationship observed here between unintended consequences and implementation challenges could suggest a pattern worth investigating, but it could also be explained by the fact that those two constructs are scored from respondent text to the same prompt. One takeaway from our analysis is that, for undergraduate student respondents, we do not see strong correlations between constructs, which may suggest that the constructs are indeed measuring different things. Even more striking is the complete lack of significant correlations between the scenario assessment scores and the self-report scales, even though the self-report scales correlate strongly with one another (particularly the sub-scales within the Critical Thinking Disposition Scale). The Contextual Competence Scale was the most differentiated from the others, with only medium correlations across the board. Overall, we observe that students’ scores on the self-report scales align with each other but not with the Dimensions of Systems Thinking scenario assessment scores.

Fig. 2 Study 1 correlation matrix

Multiple linear regression was conducted with each of the seven dimensions as the dependent variable and the scores for the four self-report assessments as independent variables. As shown in Table 2, only the regression for the problem identification dimension revealed any significant relationships. (Three of the dimensions are shown as examples because the results were similar across dimensions; the results for the remaining dimensions are included in Appendix 4.)

Table 2 Regression results for three of the dimensions of systems thinking

Our main finding from these analyses is that the overall models are not statistically significant for any of the seven dimensions and the adjusted R-squared values are all quite small, indicating that the models explain almost none of the variation in students’ scores on the Dimensions of Systems Thinking assessment. Although one significant predictor was identified, the overall model performance clearly indicates that students’ scores on the self-report assessments do not predict their scores on this scenario-based assessment.

Study 2 Data Collection and Analysis (Systems as a Web of Interconnections)

The data for this study were taken from the 2019 (n = 155) iteration of the course. This study followed a similar structure to Study 1 except that the scenario assessment used was based on the Systems as a Web of Interconnections framework described earlier. The Lake Urmia Vignette (LUV) provides a four-paragraph description of a lake that has dried up over time and related economic, environmental, social, and political events and outcomes connected to the lake (Davis et al., 2020). Students respond to a single question: “Describe the problems facing Lake Urmia in detail and explain why the lake shrank over the years.” Most students write about a paragraph in response to this question, and these responses are analyzed to identify constructs related to the Systems as a Web of Interconnections framework. First, students receive points for each unique variable they identify as part of the Lake Urmia system (e.g., local population). Second, they receive points for connecting these variables together through causal links (e.g., the population uses lake water for irrigation). Finally, students receive points for identifying feedback loops where the causal relationships connect to each other (e.g., the population uses lake water for irrigation, which increases the available food, resulting in an increase in the population). Each student’s points are totaled to calculate their overall score on the scenario. In our scoring process, each response was scored by two independent raters who then compared their results and discussed until scores were agreed upon. For more information about the development and scoring of the LUV scenario, see Davis et al. (2020). The text of the scenario is included in Appendix 5.
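Although our scoring was done entirely by hand by two independent raters, the tallying logic can be illustrated with a short sketch: once a response has been coded as a directed graph of causal claims, the variable, link, and loop counts follow directly. The variable names, the example edges, and the equal weighting of the three counts are illustrative assumptions, not the published scoring protocol.

```python
# Sketch: tallying a rater-coded causal map into LUV-style counts (hypothetical coding).
import networkx as nx

# Example coding of one student response: each directed edge is a claimed causal link.
links = [
    ("irrigation", "lake_water_level"),   # irrigation draws down the lake
    ("lake_water_level", "agriculture"),
    ("agriculture", "local_population"),
    ("local_population", "irrigation"),   # closes a feedback loop
    ("drought", "lake_water_level"),
]

graph = nx.DiGraph(links)
n_variables = graph.number_of_nodes()         # unique variables mentioned
n_links = graph.number_of_edges()             # causal links
n_loops = len(list(nx.simple_cycles(graph)))  # feedback loops (cycles in the graph)

# Equal weighting assumed purely for illustration.
total_score = n_variables + n_links + n_loops
print(n_variables, n_links, n_loops, total_score)
```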

In addition to the LUV scenario, one of the systems thinking self-report assessments and a social desirability scale (Steenkamp et al., 2010) were included in the survey used in Study 2. These instruments and their subscales are shown in Table 3. The items for each scale are included in Appendix 6.

Table 3 Self-report assessments used for comparison in Study 2

The changes to the included self-report assessments from Study 1 to Study 2 were informed by the results of Study 1. Because the various self-report assessments were strongly correlated with each other in Study 1, we decided only one was needed in Study 2. The results of Study 1 also suggested that perhaps a construct like social desirability could influence students’ responses on the self-report assessments; that is, students may choose answers based on what they wish to be true rather than what they believe to be true. To explore this possibility, we included the Balanced Inventory of Desirable Responding (BIDR) in Study 2. If scores on the BIDR significantly correlated with the self-report assessment scores, this relationship would suggest that there may be a social desirability bias in students’ responses.

Lastly, to explore other possible skills that might relate to strong responses on the scenario-based assessment, we collected a few more variables. First, we asked students to self-rate their math ability relative to the average engineering student because both math and systems thinking involve complex thinking. Second, we had students respond to a basic question about feedback loops to determine their familiarity with concepts from the Systems as a Web of Interconnections framework. Third, we counted the number of words in students’ responses, as it is possible that students who wrote more would achieve better scores based on the scoring system used in this study.

We followed an analysis approach similar to that of Study 1. In addition, in Study 2, we conducted the same initial regression analysis and then two additional analyses: (1) adding the social desirability scales and (2) adding the background knowledge and word count variables as independent variables. As discussed previously, we used the Holm correction in the correlation analysis to adjust for multiple comparisons (Field et al., 2012). For each regression analysis, we checked for multicollinearity using the variance inflation factor (VIF) and for independent errors using the Durbin-Watson test. All VIF values were under 2 and Durbin-Watson values were between 1.75 and 2.25, both of which are well within the recommended values (Field et al., 2012). We also included demographic variables in the regression models in an attempt to account for known differences in engineering student responses to these kinds of measures as demonstrated in prior research (e.g., Knight, 2014). We used demographic data that were collected by the institution, which at the time of data collection used male or female gender categories. We support changes in practices to how this demographic information is collected in the future to recognize that gender is a non-binary social construct.
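The blockwise structure of these regressions could be sketched as follows; the file and column names are illustrative, gender is assumed to be coded as a 0/1 indicator for the purposes of the sketch, and this is not the exact script used in the study.

```python
# Sketch of the three nested (blockwise) regressions in Study 2 (illustrative columns).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("study2_scores.csv")  # hypothetical file; gender assumed coded 0/1

blocks = {
    "model1_demographics": ["age", "gender"],
    "model2_selfreport":   ["age", "gender", "critical_thinking", "social_desirability"],
    "model3_background":   ["age", "gender", "critical_thinking", "social_desirability",
                            "math_self_rating", "feedback_question", "word_count"],
}

for name, cols in blocks.items():
    X = sm.add_constant(df[cols])
    fit = sm.OLS(df["luv_total"], X).fit()      # repeat with each LUV sub-score as needed
    vif = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
    dw = durbin_watson(fit.resid)
    print(name, round(fit.rsquared_adj, 3), round(fit.f_pvalue, 4),
          max(round(v, 2) for v in vif), round(dw, 2))
```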

Results for Study 2

The results from Study 1 led to our decision to include only one self-report assessment in Study 2 and to add the social desirability scale to explore one possible explanation for the lack of alignment between the Dimensions of Systems Thinking scenario and self-report results. Another possible explanation is that the scenario and self-report assessments are not assessing the same, or hypothesized related, constructs. We therefore used a different theoretical framework and scenario assessment (the LUV scenario) in Study 2 to see whether we would observe a different result.

The correlation matrix showing the relationships between the LUV scenario scores, a self-report assessment, a social desirability scale, and the background knowledge and word count variables is shown in Fig. 3 (the table of critical p values for this matrix can be found in Appendix 7). There are few significant correlations between the LUV scenario scores and the other variables. The only variable that is significantly related to the LUV scores is the word count of the students’ responses. Students who wrote longer responses also identified more variables and more causal links (although word count is not correlated with the number of loops they identified). There are also significant correlations between the sub-scores for the LUV scenario, but once again, loops are less strongly related to the other sub-scores. Although the critical thinking disposition scores do not relate to the LUV scores, it is notable that they are somewhat related to students’ self-rated math competence. There are also significant correlations between the subscales for both critical thinking disposition and social desirability.

Fig. 3 Study 2 correlation matrix

Multiple linear regression analyses were conducted with each of the LUV scenario sub-scores and total score as the dependent variables. Three regressions were run for each of the four dependent variables: (1) demographic variables (age and gender); (2) adding the self-report and social desirability instruments; and (3) adding the background knowledge and word count variables. The background knowledge variables were the only significant predictors across all analyses, although this finding varied somewhat between the LUV sub-scores. Table 4 shows the results of these analyses for the LUV total score variable.

Table 4 Regression results for LUV total score

Because the results were similar for the sub-scores (variable, causal link, and causal loop identification), we do not show those results here (see Appendix 8). The most notable difference in results across the dependent variables was that the regression 3 model was only significant at the p < 0.05 level for the causal loop variable, and the adjusted R-squared was negligible.

The regression analyses revealed that word count was a primary factor in students’ scores, but even after accounting for this relationship, students’ scores on a basic feedback problem were still a significant predictor of their total LUV score. In the multiple regressions of the LUV sub-scores shown in Appendix 8, students’ self-rated math competence barely meets the p < 0.05 criterion as an additional significant predictor for the number-of-variables sub-score, and students’ scores on the feedback problem relate to both their identification of variables and causal links (but not loops). It is promising that the feedback problem score is a relevant variable because it suggests that the LUV scenario is capturing some understanding of concepts related to the Systems as a Web of Interconnections framework. However, it remains unclear what factors may influence students’ ability to identify loops within the LUV scenario.

Discussion

Our study explored the relationship between students’ scores on self-report assessments of constructs expected to be related to systems thinking and scenario-based assessments of systems thinking ability. Following Pike’s (2011) suggestion to use theory to inform analysis of self-report assessments, we used scenario-based assessments that were developed based on two different theoretical frameworks of systems thinking. Through two sequential studies following the same research approach (with different scenarios), our results revealed no significant relationships between students’ performance on these scenarios and their scores on the self-report assessments. These findings remained consistent in both correlation and regression analyses. In Study 2, we found that students’ performance on a feedback loop problem and the word count of their scenario response significantly related to their scenario assessment scores. These variables accounted for a large portion of the variation in the scenario assessment scores, whereas the self-report assessment scores were not significant predictors. These results could indicate that the scenarios assess a different construct than the self-report assessments, or that they are assessing the same construct at a different level of granularity. In either case, these two forms of assessment do not appear to be in alignment with each other despite their theoretical linkages related to students’ systems thinking ability.

This study contributes to the ongoing discussion about the effectiveness of self-report measures in educational research. We build on prior work suggesting that self-report measures may be reasonable in some contexts and for some constructs but not for others (Chan, 2009). Previous discussions have focused on self-reporting learning gains and attitudes, suggesting that the former is not effective, whereas the latter may be best assessed through self-reports (Chan, 2009; Porter, 2011). In this study, we explored competence as another type of learning outcome that is often assessed using self-report assessments, but which has been explored less thoroughly in the literature. Such instruments are common in educational research beyond the systems thinking and problem-solving space that we focused on in this study. For example, outcomes like intercultural competence (e.g., Braskamp et al., 2014; Hammer et al., 2003), leadership (e.g., Novoselich & Knight, 2017), and civic attitudes and skills associated with community engagement (e.g., Kirk & Grohs, 2016; Moely et al., 2002; Reeb et al., 2010) are also frequently assessed using this approach. Other authors have pointed out the lack of evidence for construct validity for many of these instruments (Lattuca et al., 2013). Some prior work has revealed that students with more experience with the competencies in question actually decline on the self-report assessments, suggesting that as they become more familiar with the subject, students realize how little they actually know. This aligns with the more general findings that experts can more accurately assess their competence than novices (Ehrlinger et al., 2008; Kruger & Dunning, 1999). Prior studies of intercultural competence have revealed similar results to the current study, for example, comparing self-report scores to both scenario-based assessments and qualitative analysis of student journals and finding that the self-report scores did not correlate with these more direct forms of assessment (Davis, 2020). One study even found significant negative correlations between scores on a scenario-based assessment and scores on self-report assessments for practicing engineers (Jesiek et al., 2020). In conjunction with this prior work, our study has the potential to inform the use of self-report assessments for both assessment and research purposes.

Limitations

One limitation of this study is that we do not have data comparing students’ performance on the two scenario-based assessments, so we can make no claims about whether these two instruments are assessing the same aspects of systems thinking. Other recent research has begun to make such comparisons (e.g., Joshi et al., 2022), but more work is needed in this direction. A related second limitation is that, with the exception of one of the assessments, most of the self-report assessments used in this study do not purport to measure systems thinking but rather related constructs. Thus, an alternate explanation for the lack of observed relationships between the self-report assessments and the scenario-based assessments could be that the scenario-based measures do not have enough validity evidence beyond their original publications or that systems thinking ability does not have a relationship with ill-structured problem-solving cognitive skills as hypothesized by Jonassen (2010). Further, our study used selected self-report assessments, but countless other tools could have been compared. The recent work of Dugan et al. (2022) systematically identifying a range of systems thinking assessment tools can inform future investigations exploring relationships between tools aiming to measure the same or similar constructs.

A third limitation is the sample used in our studies, which includes only first-year engineering students. As discussed in the literature review, more novice systems thinkers may be more inclined to overestimate their abilities on self-report assessments, so our data from first-year students could include greater inflation of their abilities than if we had included more advanced students. On the other hand, our student sample may not differ much from samples of other students when compared to far more advanced systems thinkers, such as professional engineers (Mazzurco & Daniel, 2020; Mosyjowski et al., 2021). Our sample for both studies was also more diverse than the college of engineering in which it was situated and engineering programs broadly, especially in terms of gender. We have no reason to believe that systems thinking competence is related to gender, and gender was not a significant predictor in any of our regression models in Study 2. Nevertheless, it may be important to consider this aspect of our sample when comparing our findings to other contexts. Finally, our sample includes only engineering students, who represent only a small subset of the students with whom these assessments can be used. Although we have no indication that engineering students would be better or worse at self-reporting their own abilities than other students, future research should expand on this work to explore whether there are differences across disciplines.

Broader Implications

This study suggests that further research is needed to understand self-report assessments of competence. In this study, we compared the self-report assessments to scenario-based assessments, but more expansive assessments could be pursued to further explore these results, such as having students complete a more in-depth systems thinking activity or project. A second need is for similar studies to be conducted with other self-report assessments of competence, such as those for intercultural competence, critical thinking, or creative thinking. Anderson et al. (2017) provide one example of a study that reveals weaknesses with self-report assessments for both global citizenship and creative thinking. However, further research is needed both to support claims of bias in self-report assessments and to determine whether these biases are constant or variable across self-report assessments for different competencies. Third, researchers should pursue the development and validation of alternative assessment approaches (e.g., scenario-based assessments, situational judgment tests) for competencies such as systems thinking. This paper builds on the development of two scenario-based assessments that require further validation through broader use and research. Such assessments can also serve as instructional tools in addition to assessment methods, providing benefits that self-report assessments lack.

Beyond assessment, however, our study and the other literature exploring this topic suggest that educators’ attempts at assessment could be informed by a better understanding of competence itself. One understanding of competence presented by Lucia and Lepsinger (1999) suggests that it is made up of a combination of inherent aptitudes and characteristics, learnable skills and knowledge, and manifested behaviors. Although self-report assessments could be used to assess some aspects of this definition (e.g., knowledge), they do not provide the ability to assess others (e.g., behaviors). This framework for understanding competence also suggests that certain aspects of competence may be context-specific, whereas others are applicable across contexts (Lucia & Lepsinger, 1999). Future research that explores the nature of competence and competence development may also be needed before we can understand what we are assessing using different approaches and ensure that we are interpreting our various assessments accurately.

Conclusion

This study explored the relationships between students’ performance on self-report assessments and scenario-based assessments of systems thinking, finding that there were no significant relationships between the two assessment techniques. These results call into question the extensive use of self-report assessments as a method to assess systems thinking and other related competencies in educational research and evaluation. Future work should explore these findings further and support the development of alternative formats for assessing competence.