A complex adaptive system approach to evaluation: application to a pay-for-performance program in the USA


Evaluators frequently confront situations in which local programs struggle to meet the expectations and requirements specified by the external program funder. How can evaluators meaningfully evaluate programs (for both the funder and grantee) in situations in which the external program logic clashes with local complexities? This paper discusses complex adaptive system (CAS) evaluations as one method that addresses this question. To exemplify a CAS evaluation approach, we use the case of a pay-for-performance program, the Teacher Incentive Fund (TIF) program, a United States federal program implemented in numerous jurisdictions. Evaluation findings generated through a complex adaptive system approach have the potential to inform policy as well as assist the local program with ongoing improvements.

Evaluators frequently confront situations in which local implementers struggle to meet the expectations and requirements specified by an external program funder (Patton 2011; Liket et al. 2014). Especially when the program funder is a government agency, programs are often driven by an overarching policy agenda, and program grant requirements are often based on this overarching agenda (Ostrower 2004). Irrespective of local conditions, the resulting program logic is often simple, linear, and at odds with the local program context (Rogers 2008). In this paper, we ask how evaluators can meaningfully evaluate programs (for both the funder and grantee) in situations in which the overarching program logic fails to capture the most salient and useful aspects of the local implementation reality. This paper investigates the complex adaptive system (CAS) approach as one productive way for evaluators to combine the funder’s program logic with the complexity of local program implementation (Patton 2011). We show how a complex adaptive system approach may promote ongoing improvements to the local program while also communicating important program outcomes to external funders, thus yielding evaluation findings relevant for both parties.

Towards this end, the paper first introduces the complex adaptive system approach to evaluation and then applies this approach to a local Teacher Incentive Fund (TIF) program that was funded by the U.S. federal Department of Education. Since all federal TIF grants included money for program evaluation, the TIF program spawned a wide variety of evaluation studies. Comparing these studies with each other enables us to discern the unique strengths of the CAS approach to evaluation.

The complex adaptive system approach to program evaluation

“An evaluation program must match the dynamics of the system to which it is applied,” (Eoyang and Berkas 1999, p. 313). Evaluations that follow program logic models are useful, but they tend to show a linear causal path between program inputs and outcomes (Rogers 2008). Evaluators anticipate implementation problems for most programs, but a linear program logic assumes that implementation problems are mere variations of the main logic and that local implementers are capable of adjustments consistent with the logic (Barnes et al. 2004).

By contrast, a CAS evaluation approach (Eoyang and Berkas 1999) assumes that program implementation unfolds in “a dynamic network of many interacting parts, continuously acting and reacting. The results of these interactions are dynamic, emergent, uncertain, and unpredictable,” (Patton 2011, p. 253), to the point that the program itself may undergo substantial adaptations at the local level. A program being implemented by an organization may be considered a complex adaptive system if outcomes vary for different participants, are not predictable in advance, or involve unforeseen processes that connect to intended outcomes. Thus, according to Morell (2010), a complex adaptive system approach to evaluation addresses unintended consequences, irreproducible effects, lack of program fidelity in implementation, multiple paths to the same outcomes, and difficulty in specifying treatments. In other words, a CAS approach allows evaluators to capture the dynamic and unanticipated aspects of a program that are essential for ongoing programmatic improvement.

Eoyang and Berkas (1999) enumerate CAS evaluation principles that distinguish this approach from others. CAS evaluations, they claim:

  • Capture an emerging model of causal relationships.

  • Capture the expected as well as the unexpected.

  • Take into consideration the possibility of shifts in goals and outcomes.

  • Trace and use evaluation to reinforce feedback mechanisms.

  • Describe the program in developmental stages over time.

  • Track patterns and pattern changes rather than behaviors of specific sets of actors.

  • Revise and re-evaluate the evaluation design at each stage of the program.

  • Collect a variety of data in order to triangulate.

  • Make information about the evaluation open and accessible to all stakeholders.

Several evaluation models overlap with the qualities of a CAS evaluation, for example the CIPP model (Stufflebeam 1983), developmental evaluation (Patton 2011), appreciative inquiry (Cooperrider and Srivastva 1987), and empowerment evaluation (Fetterman et al. 1996). They all have in common that they prioritize understanding the local program context, work closely with stakeholders, are open to the unexpected and unpredictable quality of intervention or implementation processes, and utilize multiple data sources.

Empowerment evaluation focuses on local actors and uses evaluation techniques to help them become self-determined actors. A funder’s program logic is of secondary concern here, while in a CAS evaluation, the program as a whole is grasped as intervening in a complex social system. “Developmental evaluation applies …complexity concepts to evaluation to enhance innovation and use” by local actors (Patton 2011, p. 256). There is quite a bit of overlap between this approach and CAS. Both try to capture the uncertainty of outcomes; however, in a developmental approach, the evaluator works in an environment in which programmatic outcomes have yet to be established. The work of the developmental evaluator is to use evaluative techniques to assist the local actors with developing appropriate outcomes. In a CAS evaluation, the evaluator may be working in a context in which established programmatic outcomes may be uncertain or inappropriate for a local program, but programmatic outcomes have been established. A CAS approach allows the evaluator to understand how these outcomes may or may not be appropriate, as well as discover new outcomes more relevant to the local context.

While developmental and empowerment evaluations privilege the perspective of local actors in pursuit of outcomes they value, CAS evaluations, in our interpretation, sit midway between funders or sponsors and local implementers. From this position, they can capture the discrepancies between externally set outcomes and local realities, thus revealing the unexpected in patterned changes within systems over time. Being mindful of both funders’ logics and local logics, and teasing out the systemic nature of the changing implementation process over time, is useful for both stakeholders. While local actors are often engulfed in the day-to-day, or micro, challenges of pursuing their goals, funders, especially governments, may be interested in a systemic, or macro, perspective to improve program design.

An implementation process of an intervention is complex when it is beset with recursive relationships between variables and unpredictable feedback loops and when outcomes are emergent rather than fixed. When highly complicated interventions are implemented in complex systems of human action, such as schools, simple program logics may fail to capture what may make an intervention useful for either program funders or users, and its usefulness fundamentally plays out in particular local contexts. Specification of an upfront program logic is therefore difficult (Eoyang and Berkas 1999; Rogers 2008).

Rogers (2008) introduces a useful distinction between complexity, complication, and simplicity with respect to evaluations of human service organizations. Simplicity reigns when the organizations in which interventions are introduced are oriented towards clear goals that can be accomplished within fixed and predictable production processes, and when the interventions themselves consist of few components addressing few stakeholders and few organizational levels. Evaluations become more complicated when the links between means and ends in the organization are uncertain, when multiple causal strands or pathways may lead to the desired outcomes, when multiple stakeholders are involved, and when the intervention itself involves many components that interact with specific local contexts. As we already mentioned above, in complex systems, ways of ordering or specifying the relationship between the intervention and the receiving organization move even beyond this level of complication as goals are multiple, ambiguous, or even contradictory, causal links are dependent on feedback loops emanating from unpredictable sources or dynamics, and outcomes are emergent. This poses an enormous challenge for evaluation. CAS evaluations need to reflect the complex dynamic of the social system to be evaluated. One could say that CAS evaluations are highly complicated in order to capture the complexity of the unfolding change dynamic. But at the same time, some intervention with respect to some outcome needs to be evaluated, and for this purpose, certain simplifications are necessary. One way to simplify is the use of causal diagrams (Footnote 1) that can capture dynamic causal relationships throughout the program’s life (Eoyang and Berkas 1999). Causal diagrams reveal the major process variables and multiple desired outcomes associated with a complex program. The diagrams display processes and outcomes in a visual way (Hawe et al. 2009). How causal diagrams can help make sense of shifting dynamics of systems of human action is the focus of this paper.
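As a purely illustrative aside, a causal diagram can be encoded as a directed graph and searched for feedback loops, which is one way to see the difference between a linear program logic and a system with recursive relationships. The sketch below is not the authors' method. The node names loosely paraphrase the federal TIF program logic (evaluation, bonus, motivation, practice, achievement), and the extra edge from practice back to evaluation is a hypothetical feedback link added only for demonstration.

```python
from collections import defaultdict


def find_feedback_loops(edges):
    """Return every simple directed cycle once, starting at its smallest node."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    nodes = sorted(set(graph) | {dst for _, dst in edges})
    loops = []

    def walk(start, node, path):
        for nxt in graph[node]:
            if nxt == start:
                # Returned to the starting node: a feedback loop is closed.
                loops.append(path[:])
            elif nxt > start and nxt not in path:
                # Only visit nodes ordered after the start so that each
                # cycle is reported exactly once.
                walk(start, nxt, path + [nxt])

    for start in nodes:
        walk(start, start, [start])
    return loops


# Linear logic (illustrative node names, paraphrasing the funder's logic).
linear_logic = [
    ("evaluation", "bonus"),
    ("bonus", "motivation"),
    ("motivation", "practice"),
    ("practice", "achievement"),
]

# Hypothetical feedback edge: changed practice shapes later evaluations.
with_feedback = linear_logic + [("practice", "evaluation")]

print(find_feedback_loops(linear_logic))   # no loops in the linear logic
print(find_feedback_loops(with_feedback))  # one loop through four nodes
```

In this toy encoding, the linear logic yields no loops, while the single added edge produces a loop running through bonus, motivation, practice, and evaluation. A real causal diagram would of course be derived from field data rather than stipulated.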

The US Teacher Incentive Fund grant—a complex program of teacher performance improvement

In 2006, the US Department of Education created a federally funded program aimed at reforming teacher and principal compensation systems. This program, named the Teacher Incentive Fund (TIF), provided districts and schools with monetary support to create innovative teacher evaluation systems that were to be linked to bonus pay for high performance. The program was initiated under the Republican Bush administration, during the heyday of high-stakes accountability systems under the No Child Left Behind (NCLB) law, but continued with the same intensity under the Democratic Obama administration. The TIF program survived the repeal of the No Child Left Behind law. In fact, teacher evaluations, career ladders, compensation reform, and similar measures were strongly encouraged by the Obama Department of Education as ways for state governments to opt out of increasingly dysfunctional NCLB regulations before the US Congress finally managed to repeal the law. The last TIF awards of 70 million dollars were made in 2016 to 13 recipients for 5 years (https://innovation.ed.gov/what-we-do/teacher-quality/teacher-incentive-fund/). Accountability, including the NCLB law itself, teacher evaluations, compensation reform, and the like have been planks in a bi-partisan agenda of reforming public management along neo-liberal lines. This policy agenda stipulated that teacher evaluation systems should be both formative and summative, grounded in evidence-based performance measures, and used for high-stakes decisions as well as for improving feedback and professional development (Weisberg et al. 2009).

The US Department of Education awarded the TIF grant to multiple districts and local entities that varied in size, existing resources, and students served. Grantees fulfilled government requirements by creating a teacher evaluation system that had to incorporate multiple measures to which incentives were attached. Student achievement measures needed to represent a significant portion of the measures (Glazerman et al. 2011). The local grantee was required to link these student measures to individual teachers, ideally resulting in a value-added score (Footnote 2) at the teacher level. Additionally, the local grantee was to combine these value-added scores with other evaluation measures, such as classroom observations, into a composite score representing a single evaluative judgment about a teacher’s overall effectiveness. At the time when the TIF program was created, connecting teacher evaluations to student test score gains had a strong momentum. This crest subsided in subsequent years as statistical and practical obstacles became increasingly apparent (David 2010).

The local TIF grant explored in this paper was awarded to, and directed by, a nonprofit organization (“the provider”). This provider developed a local version of TIF and supported implementation at three charter schools. Charter schools are publicly funded and governed by some state laws and regulations applying to public schools, such as testing, accountability, and teacher certification, but are semi-privately managed. Teachers in these privately managed schools did not have tenure and were not represented by a union. Their salary schedule allowed for differential pay beyond seniority. Contract renewal from year to year depended on performance. Two of these charter schools (school A and school B) serve grades 9–12, and the third (school C) also has a middle grades (6–8) division and an elementary school division. In school C, we focused on teachers in the secondary school grades (6–12). The director of the nonprofit and the principal in charge of school A at the time of grant writing conceptualized and wrote the TIF grant.

Once the award was received, the local TIF teacher evaluation system was designed by the nonprofit director and school leaders from all three schools. A system consisting of 16 indicators of quality was designed, among which, for a number of reasons, teaching evaluations became one of the most essential components while the other components were beset with design challenges that rendered them increasingly less salient for the schools over the life of the program. The paper, therefore, concentrates on teaching evaluations as one main feature of the TIF-inspired management system.

Teaching evaluations were to be conducted with the help of classroom visits and the submission of a sample video at the end of the school year that was scored externally by the provider. Classroom observations were formative, resulting in a Formative Evaluation of Teaching (FET) score, and summative, resulting in a Summative Evaluation of Teaching (SET) score, later renamed Sample of Effective Teaching score.

The evaluating team from the university, consisting of a professor and three doctoral students, came on board during the first year of planning the system and was paid through a contract with the provider, who allocated evaluation funds from the federal government grant. From the beginning, the evaluators were beholden to two stakeholders: the local grantees (the schools and the provider) and the federal program officers, who communicated their expectations and requirements to the evaluators, mostly through the provider. The evaluators submitted to the stakeholders a design for an evaluation that promised three things: the evaluators would collect information and report on federal program components, but would not audit compliance; they would accompany local design and implementation with data collection; and they would help with local design elements. Upon being selected as the evaluator, the university team became a regular presence at the leadership level. In the planning year, support in the design effort was more central, but by the first implementation year, the evaluators took a less active stance and became keen observers.

Pay-for-performance evaluation approaches

Several approaches have been used to evaluate pay-for-performance programs. The intent of this section is not to provide an exhaustive review of the existing pay-for-performance evaluation studies (for a review, see Yuan et al. 2012). Rather, this section discusses the relationship between evaluation designs and the findings that come into view as a result of the chosen design. It aims at positioning the CAS approach in this spectrum and weighing the strengths and weaknesses of each approach.

The six studies discussed here differ in purpose and intended use. Theoretically, one could imagine evaluations that test the funder’s program logic in a linear way by ascertaining a connection between outcomes, for example student achievement, and a few relevant intermediate variables, for example teacher beliefs or perceptions of practice changes. The evaluation, in this case, may be fixed to the funder’s program logic, i.e., program outcomes and a straightforward theory of change. To speak with Rogers, the evaluation assumes relative simplicity. For a multi-faceted program such as TIF that expects from grantees the implementation of a multi-indicator performance management system, evaluations may try to capture the intricacy of the change process in given contexts by operating with plentiful intermediate variables and studying these variables over time in several intervals, yet the evaluations may retain the frame of the funder’s program logic. With Rogers, we would say that these types of evaluations capture the complications of the change process. A different approach is taken when evaluators embed themselves in the local design development and implementation of the program and allow for program logics to shift and outcomes to emerge in unpredictable ways. Such evaluations open up to what Rogers would define as the complexity of the social system.

The evaluation studies discussed here deal with program complexity to varying degrees. One important reason that the studies differ in their approaches is that they pursue evaluations with different purposes. Some evaluations primarily serve funders’ interests. They examine if the funder’s program logic worked and if it produced the outcomes for which funders provided the resources. Some expand this interest towards understanding how the process worked within the program logic of the funder. Others want to understand how a program was interpreted locally, adapted by local leaders, and used towards achieving locally valued ends or outcomes that may partially, or even wholly, differ from funders’ original intent in respect to both program logic and outcomes. As discussed above, CAS evaluations are especially suitable for the latter purpose.

Because a CAS approach is intended to explore unintended consequences and multiple pathways to the same outcome (Morell 2010; Patton 2011), this approach does not correspond well with experimental or quasi-experimental approaches which necessitate linearity. Experimental and quasi-experimental approaches are often preferable because they allow for a causal interpretation. Causal inferences in a CAS approach are not as straightforward. However, if the program intervention outcomes and moderating variables are unclear, a randomized controlled trial or quasi-experimental design may not be appropriate because a clear counterfactual could not be accurately identified, thus limiting the usefulness of findings (Cook et al. 2002). As discussed, the purpose of an evaluation is not always to show a straightforward summative effect with clear-cut causality; often, the evaluation purpose is to develop a deeper understanding of the program itself. Therefore, while a CAS approach may be limited in its ability to render summative findings in a clear causal logic, it may yield more useful findings when a program is highly dynamic or early in its development.

We discuss the six examples of evaluation studies of US performance management systems in the following way. We begin by presenting a program logic, developed by Mathematica (Glazerman et al. 2011), that underlies many of the TIF evaluation studies and that also informs our own evaluation. We then summarize the six studies (some of them not funded by TIF), which differ in the degree to which they capture the complexity of the social system that is to implement a teacher evaluation and bonus pay system. The Mathematica (Glazerman et al. 2011) evaluation of the TIF program was intended to help address funders’ concerns related to program design, logic, and impact. The program logic (see Fig. 1) pivots on the power of incentives and summative evaluation: through accurately evaluating teachers and awarding a bonus salary based on performance scores, teachers are motivated to improve their practice and thus improve student achievement. Performance scores and corresponding bonus salary signal information for recruitment and retention of effective teachers.

Fig. 1 The federal TIF program logic adapted from Glazerman et al. (2011)

We summarize the six evaluation studies in a table that distinguishes them according to the degree to which they capture the complexity of the implementation or adaptation process. The table begins with studies with “simple” designs followed by studies that become increasingly more complicated (Table 1).

Table 1 Six evaluation studies of performance management in schools

In looking over the types of evaluation approaches that are represented by the six studies reviewed here, we find multiple purposes. When the purpose is to evaluate funders’ intent with respect to program logic and outcome, evaluation designs such as the ones by Springer et al. (2012) and Mathematica (Glazerman et al. 2011; Max et al. 2014; Chiang et al. 2015) are useful. They ascertain outcomes with quasi-experimental designs and pick up on intermediate factors relevant to the program logic. Some studies make simplicity assumptions and include just a few relevant factors (e.g., Springer et al. 2012); others allow for more complication (e.g., Max et al. 2014; Marsh et al. 2011). But because the evaluation’s intention is to determine the merit or worth of the program, outcomes and funders’ program logic are the parameters of the evaluation. Yet at a certain level of complication, the limits of these parameters become visible and organizational complexity shines through (Marsh et al. 2011). The weight shifts towards understanding local implementation processes in the Rice et al. (2012) study. The study captures these processes in great detail. In the case of Rice et al., outcomes are no longer the focus. Evaluation studies such as this one are capable of showing implementation impediments, breakdown of funders’ program logics, and local adaptations. In this way, they are akin to CAS evaluations. But in contrast to CAS evaluations, they still stay within the confines of testing a theory of action; in Rice et al., the theory tested is whether local actors can mitigate program flaws by attending to recognized implementation difficulties.

A CAS evaluation is similar to the Rice et al. study. The evaluator is involved with the design process and uses evaluation to help inform ongoing improvements. The CAS evaluator aims at a systematic ordering of the data according to specific constellations over time. This helps trace major fault lines or tensions in the funders’ program logic and opens up to the possibility that the program logic shifts from the funders’ logic to local leaders’ logic. In this way, CAS evaluations pick up on nuances that other evaluation approaches may overlook, nuances that are germane for program success through the eyes of local implementers and ultimately for the improvement of the funder’s program.

Shifting program logics and evaluating TIF with the CAS approach

CAS evaluations may be indicated for a variety of circumstances that are characterized by organizational complexity. Two characteristics for TIF stood out. From the start, we assumed that the TIF program logic was not linear, but beset by two fundamental tensions in the design of the program, one between incentives versus inducements and the other between summative evaluation versus formative evaluation. Both are discussed below. These tensions, we hypothesized, create complexity from the start—complexity that would result in different constellations of forces, program adaptations, and emergent outcomes. CAS is the evaluation approach that can capture these complexities.

TIF as incentive or inducement

Although the name suggests that the Teacher Incentive Fund, as an experimental policy instrument proffered by the US federal government, is mainly about incentives, for the jurisdictions that are involved in the program, TIF might be an inducement at heart. Often, the terms inducement and incentive are treated as synonyms, but for evaluation purposes, a clear distinction is necessary. Inducements are transfers of money to agencies in return for the production of certain goods that the government values. Inducements often come with regulations spelling out activities that recipients are expected to carry out. Hence, they obligate specific procedures or practices (McDonnell and Elmore 1987). With respect to teaching evaluations, schools, in a logic of inducement, would participate in video-based externally scored summative evaluations that result in differential scores. Doing so would qualify the schools to receive supplemental funds, the disbursement of which is supervised by the local provider. But this quid pro quo would not necessarily substantiate a performance management system that pivots on incentives. In a logic of inducement, whether evaluation scores and bonus pay actually manage performance would depend on the managers’ or local leaders’ intent. In other words, without local leaders’ intent, incentives would be relatively powerless. Program implementation would simply unfold, from the start, in a different logic (i.e., implementing procedures in return for extra money for the schools), a logic not driven by incentives. Alternatively, if we assume a program logic pivoting on incentives, according to funders’ intent, we would study bonus pay and evaluation scores as incentives, and we would ascertain whether or not the incentives were powerful enough to shape work and performance.

Formative and summative evaluation of teaching

Teaching evaluations have a summative and formative purpose. Summatively, teaching evaluations are to assure quality by creating fixed performance statuses based on teacher effectiveness measures that can potentially undergird accountability and supervisors’ decisions about promotions or dismissals. Formatively, evaluations render diagnostics and feedback that help employees learn, grow, and improve (Blase and Kirby 2008). Formative and summative purposes are difficult to integrate with one another (Darling-Hammond 2013). However, current evaluation reform initiatives, including the US federal TIF initiative, attempt to combine formative and summative approaches for the purposes of both teacher professional development and high-stakes salary and personnel decisions (Bill and Melinda Gates Foundation 2013). It was questionable, therefore, whether local grantees would succeed in this endeavor.

Appropriateness of CAS

It stands to reason that the tensions inherent in the program logic between inducement and incentive and between summative and formative purposes of teaching evaluations suggest an upfront complexity that cannot be captured with evaluation methods that are closed to multiple and shifting dynamics over time. To speak with Eoyang and Berkas (1999), the tensions inherent in the program upfront constituted for the evaluators a “sensitive dependence on initial conditions” (1999, p. 327). However, we do not claim that a CAS evaluation is superior for all evaluation purposes, merely that at a certain level of complexity of program and implementation context, a CAS evaluation is indicated as a means to render insights that remain invisible to other evaluation methods. We again discuss strengths and weaknesses of the CAS approach in relationship to other evaluation approaches after we have presented the findings for this study.


Given the “perpetual, but unpredictable dynamic behavior of a CAS” (Eoyang and Berkas 1999, p. 316), evaluation methods need to be selected that capture the changing patterns within the system. Predefined end points and long time lags in data collection schedules need to be avoided in favor of openness and ongoing data collection. Since CAS evaluators must not miss important feedback loops and turning points, they usefully become inside actors who directly participate in the development of the intervention and may shape it through feedback. Evaluation feedback may actually trigger or reinforce feedback loops in the system that have the potential to push the whole system onto different and unforeseen trajectories.

Not surprisingly, as Eoyang and Berkas point out (see also Patton 2011, pp. 253–257; Morell 2010), data collection strategies are often heavily qualitative, try to capture a relatively continuous stream, and cannot assume a fixed data collection schedule. Moreover, the insider position of the evaluator in the midst of relative unpredictability poses challenges for the reliability and validity of findings similar to those familiar to qualitative and action researchers (Coghlan and Brannick 2005, ch 2; Guba and Lincoln 1989). A robust data collection plan, observation and interview protocols that are both close-ended and open-ended, critical self-reflection, searching out disconfirming evidence, and collective interrogation of findings within the evaluation team and with the recipients of the evaluation are ways to shore up the dependability of interpreting the evidence. Guba and Lincoln also suggest that dependability of interpretations is enhanced when researchers or evaluators systematically track or “audit” (1989, p. 242) the link between conceptual abstractions, operationalizations, data collection protocols, and field data. In the research team, we double-checked these links repeatedly.

The evaluation team worked with the schools and the provider over a period of 4 years, a planning year and three implementation years. The evaluation drew on four sets of data: teacher surveys administered after the first introductory year of the project and in the spring of year 3 (survey 1: 52 respondents, 70% response rate; survey 2: 64 respondents, 90% response rate); two waves of lesson observations (total 62 lessons); a total of 105 interviews with teachers in eight waves over 4 years plus 25 interviews with leaders; and 65 hours of meeting observations. The unfolding of the CAS was captured by the qualitative data. Select items from the survey at beginning and end points give us some idea about baseline and outcome.

Codes for analysis of interviews were developed following the concepts derived from the relevant literatures (Miles et al. 2013). The interviews were coded using qualitative data analysis software (Dedoose). We developed 47 distinct codes, the majority of them derived from theory. Some additional codes captured emergent phenomena. The theory-derived codes were defined, operationalized, and illustrated with representative quotes from interviews. We trained a team of four coders for inter-rater reliability. Twenty percent of interview excerpts were coded by two coders, and the discrepancies were resolved collaboratively in order to clarify the concepts among the coders. The codes were grouped in five conceptual complexes: evaluation of teaching, performance-contingent payment bonus, teacher learning, and school professional culture (see Appendix Table 2).
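For readers who wish to quantify agreement on a double-coded subset of excerpts, one common option is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The paper does not report which agreement statistic, if any, was computed, so the choice of kappa here is an assumption, and the code labels and excerpt assignments below are invented for illustration only.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who each assign one code per excerpt."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of excerpts")
    n = len(coder_a)
    # Observed agreement: share of excerpts given the same code by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: based on how often each coder used each code.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Invented example: four excerpts, two hypothetical code labels.
coder_1 = ["evaluation", "evaluation", "bonus", "bonus"]
coder_2 = ["evaluation", "bonus", "bonus", "bonus"]
print(cohens_kappa(coder_1, coder_2))  # 0.5
```

In this toy case, the coders agree on three of four excerpts (0.75), chance agreement is 0.5, and kappa is therefore 0.5. With many codes and few double-coded excerpts, such a statistic would be unstable, which is one reason collaborative discussion of discrepancies remains valuable.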

Cases and sites

Data were collected in three charter schools, all three relatively small in size. Two of the schools are high schools, and one combines middle and high school grades. The schools are located in a mid-size city in northern California. High proportions of students are low-income, of color, and immigrants. By US standards, all three schools educate highly disadvantaged student populations. The demographics are similar across the three schools and across these charter schools and regular public schools located in the disadvantaged neighborhoods of the city. The three schools were independently managed, but belonged to a loose network of schools that subscribed to a philosophy of social justice, access to college, student centeredness, community participation, teacher professionalism, and rich collegiality. Within this overall orientation, one of the three schools strongly stressed academic success (C); another stressed youth social development (B); and a third placed the focus on critical citizenship and student empowerment (A).

A strategic site for the evaluation was the project’s steering committee that met once a month. It was regularly attended by the provider, the principals of the schools, and other instructional leaders from the schools on a rotating basis, as well as by members of the evaluation team, most often the professor and assistants on a rotating basis. Two meetings were held with federal program monitors. Regular meetings allowed the evaluators to engage in discussions with the steering committee regarding implementation challenges, program goals, and ongoing program development. The steering committee was also the place where most of the feedback from the evaluation was shared. A few times, the evaluators were invited to directly share data with school faculties, but for the most part, feedback was ongoing and ran through the steering committee.

From the start, the evaluators positioned themselves to serve two clients: the funder and the local designers/implementers. Both want to know if the program “worked.” Local designers and implementers also want to know how to make “it” work, and they may have shifting ideas of what constitutes the “it.” This in turn has consequences for program design, knowledge that is of interest to the funder. In our case, over the course of the program life cycle, implementer intent became increasingly salient as the intent of the government agency became increasingly untenable. The evaluators were sensitive to the needs of both clients: the funder’s desire to see an effective performance management system put in place that increased instructional quality and student achievement, and the local leaders’ desire to reward teachers financially and to spur learning about instruction through more precise feedback from data and evaluations. The evaluators were mindful of evaluation studies from other TIF initiatives that became available as the project unfolded, some of them discussed in the previous section. The steering committee, nevertheless, approached its task with a hopeful attitude, banking on its autonomy in designing its own evaluation system within the broad requirements of the TIF program.

In the stream of ongoing data collection, three distinct phases were clearly distinguishable that roughly followed the 3 years of implementation when the TIF program held sway. Identifying these distinct patterns enabled the evaluation team to understand how and why the performance management system was (or was not) working and to use these findings to suggest programmatic changes that helped local implementers to move forward and pursue valued ends. The causal diagrams show the complicated interaction between two sets of elements, one set following from the funder’s logic, in the diagrams displayed with white boxes, and one set following from the locally emerging logic, in the diagrams displayed with shaded boxes. The diagrams capture complexity by showing how interactions between local leaders, teachers, and the procedures that the program obligated undermined and transformed the funder’s performance management logic over time. The three diagrams reduce the complexity of the whole process by presenting patterns prevalent at three points in time.

The diagrams show three distinct constellations over three time periods: time 1 2011–2012, time 2 2012–2013, and time 3 2013–2014. The constellations are marked by an alternation of inducement and incentive dynamics which intersected with an emphasis on summative versus formative evaluation. These were the fault lines or “cracks” in the federal program logic that were theorized to generate complexity to begin with. The full complexity of the process might be captured by an intricate diagram that stretches over the entire length of time and represents the shifting dynamics from one time to another. We abstained from such a stretched-over diagram because it seemed to mask the important turning points that marked possible strategic reorientations on the part of the school leaders.

Time 1: TIF as inducement

Adoption of the TIF project began when the principals and the provider received the TIF grant. From the beginning, this group of leaders entertained two motives: first and foremost, to garner additional resources for the schools and teachers, and secondarily, to implement a performance management model that had the potential to improve teaching and student outcomes. While the TIF money was clearly a strong motive, the management philosophy behind TIF, its intent to improve instruction by rewarding high performers, made sense to the adopting leaders. Data, evidence, evaluation, incentives, and rewards, the buzzwords of the new performance management system, held a certain appeal for the charter school leaders and did not seem to be in conflict with the strong collegial cultures that had been established in the schools and held dear. On the first survey, high proportions of queried teachers indicated their agreement in general terms with differential pay for performance, but also indicated that they were skeptical about some of the pillars of the system, notably the state assessments and external teaching evaluation tools.

Observation tools and video

Concurrent with adoption, a system for evaluating teachers had to be crafted. The TIF provider, school administrators, and university-based researchers collaborated to design an observation tool. Several teacher focus groups gave input. Early on, the idea held sway that evaluation required a relatively simple tool that aimed at the basic effectiveness of lessons and allowed for precise feedback. Even though this idea conflicted with the critical and constructivist teaching philosophy that permeated the schools, both provider and administrators opted for this approach—as a start. TIF also introduced the regular use of lesson videos into the lives of the three schools. Lesson videos were to be produced at least once a year for summative scoring but could also be used for formative learning throughout the year.

Formative evaluation and learning

While the teachers appreciated that they received formative feedback in a supportive and constructive spirit from their supervisors, when TIF began they voiced the wish that supervisors, coaches, or mentors would look at teaching more closely and provide more precise feedback after observations. Teachers also applauded that because of TIF, their administrators would be required to be in classrooms more frequently and engage in formative evaluation and quarterly conferences.

The idea had been that quarterly formative feedback was to lead to a summative performance evaluated with the SET. Instructional supervisors, however, were reluctant to include SET criteria when they conferenced with teachers. Rather, formative feedback was, as it had been in the past, informed by the broad standards of the California Standards for the Teaching Profession, which placed few constraints on supervisors’ choices of conversation topics and required less technical skill in lesson observation and analysis, a skill which the principals may have lacked. At time 1, then, teachers were oriented towards formative evaluation and appreciated a higher frequency of classroom visits, but feedback as yet lacked clinical precision. Independent classroom observations, not related to the performance management system, were conducted by the evaluation team and revealed that many teachers were indeed challenged to teach lessons with basic effectiveness.

Bonus money

“Extra money is always welcome” was the pervasive sentiment at the beginning of TIF. Most teachers considered monetary rewards as recognition and validation for effort already expended on their work, and not as an incentive to motivate more effort or new behaviors. Money was an inducement to keep doing what one had been doing all along, and this inducement generated positive feelings. Administrators and the provider framed the TIF project in this way as well. TIF was an opportunity for the schools to attract more funds to augment salaries. Educators agreed with their administrators that rewards were deserved. TIF money was a token recognition of this universal deservingness.

Time 1 CAS diagram

Over the course of year 1 of implementation, TIF-inspired teacher evaluations seemed to be off to a good enough start. The idea of performance management meshed with a desire for formative evaluations yielding better feedback and with the idea that teachers universally deserved bonus pay due to their service to students in challenging living circumstances. Far from raising evaluative threat or defensiveness, TIF-inspired performance management was seen as an inducement to learning and a reward for good work. This constellation of forces is captured in the time 1 CAS diagram (Fig. 2). As was mentioned, the white boxes represent the funder’s program logic, and the shaded boxes represent the local leaders’ program logic. Figure 2 shows that local leaders prioritized a logic of inducement. Leadership intent, teachers’ initial motivation, and TIF-obligated procedures impacted how the evaluation system was communicated. During time 1, improved student outcomes remained the main goal. Although the stress was on formative purposes, summative purposes of evaluation were simultaneously present.

Fig. 2 Time 1 CAS diagram

The role of the evaluator during this phase was to repeatedly point out to school instructional leaders that if they wanted to serve all purposes of the system, i.e., rewarding teachers, improving on feedback on teaching, and motivating through summative evaluation, they needed to ensure that the measures of the system were well understood, accepted, even internalized as quality criteria, given the instructional challenges that independent observations had revealed. Capturing complexity in diagrams was more important for the evaluators than for the local actors. The evaluators needed to translate insights from discovered patterns into the deliberations of the steering committee in practical day-to-day terms. This is where we saw the most effect of our role.

Time 2: TIF as incentive

Events occurring among the leaders of the project, most often transpiring in the monthly TIF steering committee meetings, almost always preceded subsequent responses among teachers. A far-reaching decision was made early in the project: the substantial funds that the federal TIF grant provided for capacity building around TIF metrics and implementation were rolled over to the schools in a lump sum and were no longer available to the provider. All three schools used good portions of these funds to compensate for coincidental state budget cuts so that they could keep or hire additional staff or buy essential equipment. In an inducement logic, this decision made sense. As long as the recipient of funds engages in obligated procedures, the funds are justified. In an incentive logic, the decision was detrimental because it made it difficult to find time to familiarize faculties with the performance management system. The lack of familiarity with the system’s measures and judgments came into play once the performance scores were released.

Implementation quality

Implementation was affected by principal turnover, not an infrequent occurrence in schools in disadvantaged settings. In school A, a new co-principal in charge of TIF appeared to be unwilling to support TIF and was only reluctantly complying with obligated procedures. In school B, the new co-principal was simply overwhelmed with multiple duties.

It became apparent to provider and school administrators that data processing and data dashboard design were tasks much more demanding than envisioned. The vendor responsible for data management seemed ill equipped to deal with this complexity, yet no funds were available to procure additional services. The result was that principals were left to carry a large part of the technical side of the performance management system, a role that absorbed much energy for TIF. The whole management system consisted of 16 performance measures applied to the many varied roles staff play in schools. To find metrics that would create equitable opportunities for rewards across all these varied job responsibilities was a task at which the TIF steering committee eventually failed. The task resembled one of fighting a many-headed serpent, but in the early stages, the leadership was still confident that it could prevail.

All efforts in this regard failed, however, when it became apparent towards the end of time 2 that the state government would abandon its state assessments that were the linchpin of its accountability system, and with it the mechanics of the US federal No Child Left Behind sanctions system. Nationally, a whole new orientation to educational performance, codified in new Common Core Standards, came to hold sway among policy makers that eschewed the earlier push towards basic skills and emphasized higher order thinking skills and preparation for college. In the state of California, the political winds were shifting against a heavily punitive approach to accountability.

The state standard tests were the source of the majority of the 16 local TIF performance measures, and the signal of the state that it would replace the old standards tests with new ones within a time span of 2 years deflated the performance management system. However, two of the measures, summative and formative evaluations of teaching (as well as being paid extra for participation in professional development and work teams), were least affected by the crumbling of the intricate performance management architecture because teaching evaluations were controlled by the schools in collaboration with their support providers. And these two measures also entailed relatively simple data processing.

Summative evaluation

While in the early phases, teachers were primarily concerned about the clarity of the metrics and the process of generating and submitting videos, the situation changed dramatically after the first data dashboard release in the fall of 2012 when the first performance scores were released and teachers had to cope with performance judgments. SET scores were summative, based on observable behavioral indicators, and externally generated. FET scores were formative, internally generated by principals who composed a holistic and more impressionistic score from a wide variety of performance facets, such as instruction, collegiality, professionalism, and so on. FET scores were on the whole higher than SET scores.

When SET scores were released, they were considered conspicuously low, despite the fact that 61% of participating teachers received a score of “applying” that qualified them for a monetary award. For some participating teachers, a score below the reward threshold was not troublesome because they considered themselves novices or learners. But for more senior teachers who considered themselves effective, or who had been rated highly effective by FET metrics and had that reputation from years past, the SET scores provoked disbelief. The prevailing sentiment was to question the clarity, validity, fairness, and usefulness of the evaluations, a familiar pattern documented in much research on teacher evaluation and pay for performance. With this questioning, the motivation to learn from the information that the tool or the evaluations could potentially provide vanished. When the principals sensed that the tide of teacher sentiment had turned against the performance management system, they retreated. They ignored the SET tool in their own instructional conferencing and discouraged discussion of summative scores.

Bonus money

Attitudes towards monetary payouts need to be seen in relationship to all performance measures of the system. Differential bonuses were not openly discussed in any of the three schools, so it took a while for information on pay differences to seep out. For the principals and the provider who had designed the system, the flaws became apparent when teachers whose performance was largely evaluated on nonstandardized and internal school measures received substantially higher bonuses than those whose scores were drawn from external state standard tests. SET evaluation scores were not affected by this difference, as all teachers were treated in similar fashion, but the lack of transparency of the system and its suspected unfairness as a whole, in combination with unexpectedly low teaching evaluation scores, made all summative external judgment dubious. When it came time to ramp up for the year 2 SET video submissions, the skepticism towards the summative evaluation had spread across faculties and become a collective stance that expressed itself in teachers withdrawing and refusing to submit a video for scoring.

Time 2 CAS diagram

In time 2, the TIF-inspired performance management system revealed itself to teachers as one of generating summative judgment, differential rewards, and incentives for presumably better work performance. For the most part, summative judgments and incentives were rejected while formative learning and the sense of universal deservingness were frustrated. School leaders, rather than risking conflict with their faculties, retreated from the idea of managing by incentives. They quietly distanced themselves from the system, a pattern that has been documented in the literature on pay for performance (Marsh et al. 2011).

The time 2 CAS diagram (Fig. 3) represents this constellation. The program emphasis moves to incentives, and summative evaluation becomes the focus. Furthermore, student outcomes become a murky picture and with the end of state testing this dimension fades into the background for the TIF system. Local forces reshape the way the system is communicated. Disbelief and opposition, later followed by buffering or protecting teachers from the sting of negative judgment, move the idea of incentives to the sidelines while evaluation tools and procedures are no longer associated with formative learning. This constellation made it clear to the provider that the local TIF system was in serious trouble and that a course correction was needed. The CAS diagram suggested that a way would have to be found to reconnect teaching evaluations to teachers’ initial motivation to learn and to receive more precise feedback on their lessons through conferencing, coaching, and work in inquiry groups.

Fig. 3 Time 2 CAS diagram

The evaluators documented the backlash against the system over year 2, but their role became critical at the end of this phase. When school leaders and provider voiced their perplexity and helplessness in the face of an overwhelmingly complicated task that defied easy solutions, the evaluator validated the leaders’ concerns and encouraged the leaders to acknowledge failure where it had occurred. Out of these deliberations, the word emerged, and then subsequently spread, that TIF was just a “piñata,” in which bonus dollars were not meaningfully associated with performances.

Time 3: TIF as obligated procedure

As in the first two periods, new developments originated in the TIF leadership team. Provider, school administrators, and evaluators came together in jointly acknowledging that the incentive function of the performance management system had been a failure. The technical side of the TIF system was in shambles, aggravated by the phasing out of the state-standardized tests and the crumbling of the No Child Left Behind accountability regime. Participation in summative evaluations and submission of SET videos, the only part of the system that had any viability, had shrunk. As a new director took the helm of the provider organization, the provider changed course. From now on, a portion of every steering committee meeting was dedicated to analyzing videos submitted by teachers for summative evaluation. The purpose was for instructional leaders to analyze the videos with the help of the SET clinical observation tool and to sharpen the feedback given to teachers in formative or coaching conferences. The provider made it clear that compliance with TIF as a pre-condition for disbursement of money obligated schools to continue to engage in summative evaluations and submission of SET videos.

Incentives and inducements

In time 3, teachers and leaders in the three schools had found ways to insulate themselves from the discomfort of summative evaluations and the divisive effects of differential pay for unwarranted performance. Differential bonuses and summative performance scores were still present, but their incentivizing function had been blunted. They were ridiculed or simply treated with silence. The desire to keep the TIF money flowing induced leaders to obligate teachers to engage in SET submissions and videos. But the money no longer connected to the “innocence” of universal deservingness. Instead, it became clear to teachers that strategic action was required to maximize payouts. In time 3, the system was a hybrid of incentivizing strategic behaviors and inducing obligated procedures. These procedures were used by instructional leaders to orient internal professional development, de-emphasizing the summative quality of videos and the clinical SET tool.

Observation tools and videos

While the use of artifacts associated with teacher evaluation—namely the SET observation instrument and the videos—receded for summative and incentive purposes, their use in learning events advanced. When leaders viewed lessons submitted by their teachers, they became aware of problems, and analyzing these lessons with the SET tool created new capacities. Leaders, finding renewed value and interest in using the artifacts required by TIF, initiated a series of professional development sessions with a focus on lesson study or other forms of lesson inquiry. Leaders at schools A and B were especially receptive.

At school A, the school principal and an instructional coach organized a professional development series in a lesson study format. The SET tool, however, was merely in the background until it came close to SET submissions. The school used an internally crafted lesson observation tool and inquiry guide that picked up on some main dimensions of an effective lesson, but the clinical precision of the SET tool was not maintained, and no attempt at summary scoring was made. School B paid closer attention to the clinical nature of the SET. Led and organized by the school’s instructional coach, lesson study was structured by the different components of the SET tool (e.g., opening phase, modeling/co-constructing phase, etc.) and supported by video records of lesson segments that teachers would supply and analyze systematically. Again, summative scoring was ignored.

Time 3 CAS diagram

In time 3, TIF metrics faded as performance incentives, but they still incentivized strategic behavior that would maximize bonus payout. TIF induced the continuous use of evaluation procedures, most notably the clinical SET observation tool and the use of video artifacts. Instructional leaders improved their capacity to use the tool in analyzing videos and giving feedback. They selectively adapted the idea of analyzing lessons for their schools’ professional development. They deemphasized the summative and precise nature of the tool, but the submission of videos at year’s end was still in the room as a performance requirement.

The time 3 CAS diagram (Fig. 4) represents this constellation. The program focus moves back to inducement, and the connection between inducement and incentive becomes tenuous. Furthermore, as the motivation to learn in a formative way reconnects to the evaluation procedures, the tools enable more precise and clinical conversations. The local TIF system had improved in its formative learning function, and the program, despite being in crisis, could still be directed towards meaningful purposes. The evaluator played a key role during this phase in highlighting to the school leaders what their original intent had been at the inception of TIF and in supporting video analysis with the help of the SET observation instrument so as to motivate the school leaders to insert the original concern for lesson quality into the schools’ ongoing professional development.

Fig. 4 Time 3 CAS diagram


As was mentioned, on the first survey, high proportions of queried teachers agreed with differential pay for performance as an idea, but were skeptical about the state assessments and the SET process. The whole performance management system was oriented towards the state assessments. When the state abandoned them during year 2 of implementation, this outcome orientation was lost. This left the Summative Evaluation of Teaching and the quest for more effective lessons. In year 3, when instructional leaders and supervisors had begun to rediscover the SET videos and instruments as a way to create problem awareness about the state of lesson (in)effectiveness, one third of surveyed teachers indicated that they considered the SET a useful tool that helped them plan, raise expectations for themselves, and set a minimal standard of effectiveness for all.

As to lesson quality, the project started in the first year of implementation with 61% of lessons rated as effective and qualifying for a reward. That number declined to 53% in the third year. The average score of SET evaluations was 2.51 in the first implementation year and declined to 2.40 in the third year. Among those teachers (n = 13) who submitted in both year 1 and year 3, however, the share rated “effective” and qualifying for a reward increased from 54% to 76%. Thus, the effect of the program in the end is mixed and success is partial.


We began this paper with the claim that a CAS approach to evaluation was a preferred option when the evaluator is faced with a highly complex program that is beset by inconsistencies and tensions in its program logic, requires high implementation capacity from local actors, and is implemented in shifting organizational environments. The TIF program under study in this paper fulfilled all of these criteria. From the start, the program design allowed for a “logic of incentives,” the official funder’s logic, and a competing “logic of inducements.” We saw that local leaders considered the program first and foremost as a way to capture additional monies and to motivate teacher learning. Incentives were in the picture, but as soon as opposition among teachers spread, leaders abandoned the idea of managing with incentives. We saw that summative and formative purposes for teaching evaluations were in tension with each other, as could be expected from the literature on teacher evaluations, and that these two purposes overshadowed each other during distinct developmental phases of program implementation.

In a logic of inducement, engagement in obligated procedures persisted when the incentive logic had been rejected by local implementers. One important feature of a CAS evaluation is the active participation of evaluators with local implementers. CAS evaluations afford feedback loops for local stakeholders and joint reflection on findings with them. It was the critical feedback from the evaluators to the provider and the school leaders that enabled the TIF project steering committee to change course, dispense with the incentive feature of the program, and seek new formative learning opportunities in the program’s obligated procedures.

If we had evaluated the program at time 1, we would have had to conclude that the program was off to a good start, perhaps worrying a bit about the schools’ tendency to consider the program as a welcome infusion of new money when state finances were cut back. But this worry would have been offset by the desire among teachers to use the program to improve on instructional supervision. If we had evaluated the program at time 2, we would have confirmed the pattern found in the many evaluation studies: implementation difficulties coupled with teacher skepticism or opposition undermined the motivating role of incentives. At time 3, we would have highlighted the role of program-obligated procedures in professional development and teacher learning and we would have had to conclude that the program was partially successful. Success would have been partial because program implementation was never strong enough to establish a causal connection from evaluations plus money to better teaching, and ultimately to better student achievement. Clear outcome expectations got lost. But in the end, there was at least a chance for the schools to return to their initial learning motivation that was associated with the program. Bringing the three time periods together in a CAS evaluation helped local leaders to shift course and pursue TIF with their own logic, a logic that would not be predicted by the funder’s incentive-centered program logic, but gave local leaders the opportunity to use the program towards locally valued ends. Evaluators working with a complex adaptive system framework move beyond pre-specified logics and communicate local realities that may not only be relevant to the funders but also guide local program improvements.

A CAS evaluation in comparison

It is safe to say that a whole host of evaluations of pay for performance schemes in US education have found that these schemes are largely ineffective and that the program logic running from incentives through changes in beliefs, attitudes, and finally practices to recognition of teacher effectiveness and improved student outcomes largely breaks down (see Yuan et al. (2012) and an earlier review by Murnane and Cohen (1986)). Other studies, reviewed for this paper, such as the studies by Malen and Rice and associates, warrant a similar conclusion. The final evaluation report from the Mathematica study (Chiang et al. 2017), a study that used a quasi-experimental design with a limited number of quantitative impact and process variables, paints a slightly more positive picture: minuscule improvement in test scores and small improvements in lesson observation scores; slightly more negative perceptions about the schools’ professional culture and attitudes towards pay for performance during the first two years in the treatment schools; an evening-out of perceptions and attitudes between treatment and control schools beginning in the third year; and more positive responses in the treatment schools with respect to receiving feedback.

Our findings, regarding the TIF implementation we investigated, are consistent with this pattern, though as to test score gains, our study must necessarily be silent due to the state’s policy shift away from the old standard tests midstream during program implementation. We found as well that after year 2, attitudes shifted and improved, and teachers especially appreciated the feedback on their teaching. As in Chiang et al. (2017), teachers also improved their lesson observation scores between the first and the third year in our cases. Chiang et al. (2017) also found that teachers on the whole were not dissatisfied with the pay for performance scheme. In their case, almost everybody (around 70%) received some bonus. In our case, teachers welcomed additional money throughout despite misgivings, and about two thirds of the teachers received the reward.

While there are some striking consistencies in patterns between the Mathematica study and our study, the methodologies differ considerably. The Mathematica study is quasi-experimental, large scale, and mostly summative. Our study pursued both formative and summative purposes. It is a multiple-case study that traces the program as it unfolded within a complex adaptive system. A funding agency wanting to establish summative effects with relative certainty is better served by a Mathematica-style evaluation. Comparison between randomly assigned treatment and control cases allows causal inferences on the outcome indicators that the funding agency privileges, but the agency will receive very little information about how to change policy design when results appear disappointing or whether to abandon the approach altogether.

Consider these selected main findings reported by the Mathematica report:

“Most teachers and principals received a bonus, a finding inconsistent with making bonuses challenging to earn.”

“Many teachers in treatment schools did not understand that they were eligible to earn a performance bonus, and their understanding did not improve after the second year of implementation.”

“Most teachers received similar performance ratings from one year to the next.”

“Many 2010 TIF districts reported that sustainability of their program was a major challenge, and slightly fewer than half planned to offer pay-for-performance bonuses after their grant ended, but many more intended to continue measures of effectiveness and professional development for lower performing teachers.” (All quotes from Chiang et al. 2017, Executive Summary, pp. xxi–xxxi)

What are we to make of these findings in conjunction with the findings on outcomes? Somehow, the program did not function as intended. It seems not to have been an incentive program since most teachers received a bonus, effectiveness ratings did not move much, and teachers did not seem to have bothered to understand the mechanics of the bonuses. But they slightly improved on test and observation scores and liked the changes they perceived around feedback on their teaching. Many district administrators recognized some positive aspects of the program, but would not prioritize sustaining the pay for performance component.

A study, such as the one conducted by Mathematica, can only state that the program logic did not fully come to fruition and can only speculate on underlying reasons for it. Our CAS approach began with the possibility that, given structural complexities of program and implementation contexts, the funder’s program logic may be offset by a contradictory logic that recipients entertain from the inception of the program. In our case, it was the possibility that fundamentally the program would be received as an inducement, an “already deserved” infusion of funds regardless of measured performance, a sentiment that was widely shared among teachers, principals, and the support provider. We understand from our CAS study that teachers, upon encountering unwanted negative side effects of performance ratings, actively relegated the performance management system to the periphery of their attention and blunted the potential sting of incentives by declaring it a lottery.

But the lottery was appreciated overall, as it did bring additional monies to teachers and the schools. For a profession that participants in our CAS study widely considered seriously underpaid, these results should not come as a surprise. Money is welcome, no matter its source, and teachers, members of a profession with relatively little control over the contextual conditions of their work (Ingersoll 2009), may be used to accepting the “Good with the Ugly,” so to speak. This was an organizational-cultural response that took place in all three schools, and it occurred even though the schools could have been considered fertile ground for the power of incentives, given that labor in charter schools was much more deregulated than in traditional public schools.

Yet, certain aspects of the system were appreciated after the first experience of conflict and dissatisfaction: more firmness in quality criteria and feedback mechanisms. In our CAS study, we saw how local actors had to actively rearrange and reinterpret the program over time so that they could reconnect to their pre-existing quest for better and more regular feedback on teaching. Improvements in test scores and observation scores, especially slight ones, can always be attributed to an expedient narrowing of teaching to tested items and a mere strategic accommodation to observation criteria in order to garner bonuses. We need process data to plausibly validate impact. A CAS evaluation delivers these process data. We saw that in some schools, some teachers took the evaluation tool seriously and began to structure their collective learning around it. (But the observed learning was too halting and too incipient to expect major improvements to result from it.)

The nuanced approach that a CAS evaluation facilitates helps us uncover causal links between leaders’ intent, teachers’ sentiments, and collective action. These links influenced how obligatory procedures (in this case teaching evaluations) were taken up and eventually put to productive local uses while disrupting the funder’s incentive program logic. Through a CAS evaluation, evaluators can help local implementers understand how to use a program to serve locally valued ends while at the same time satisfying program requirements in the midst of an unstable policy environment.

We do not see the various evaluation approaches discussed in this paper as mutually exclusive. As we saw, quasi-experimental and randomized evaluation studies and CAS evaluations can complement each other. The strength of CAS evaluations, from our vantage point, is that they allow evaluators to see the system, a concern for policy designers, and to facilitate empowerment, a concern for local implementers.


  1. In other disciplines, causal diagrams refer to a priori specified diagrams that inform quantitative analyses. The “causal diagram” term used in this paper refers to an evaluation tool that changes as a result of a program’s evolution. Despite these differences in definition, both interpretations of causal diagrams are means of reflecting on complexity.

  2. Value-added scores are a way to link student test scores to teacher/school effectiveness. The term refers to student growth or academic gain attributed to a teacher or school, as opposed to using unadjusted mean levels of achievement or percent of proficient students.


  1. Barnes, M., Sullivan, H., & Matke, E. (2004). The development of collaborative capacity in health action zones: a final report from the national evaluation. Birmingham, U.K.: University of Birmingham.

  2. Blase, J., & Kirby, P. (2008). Bringing out the best in teachers: what effective principals do. Thousand Oaks, CA: Corwin Press.

  3. Bill and Melinda Gates Foundation. (2013). Ensuring fair and reliable measures of effective teaching. Seattle, WA: Bill and Melinda Gates Foundation.

  4. Chiang, H., Wellington, A., Hallgren, K., Speroni, C., Herrmann, M., Glazerman, S., & Constantine, J. (2015). Evaluation of the Teacher Incentive Fund: implementation and impacts of pay-for-performance after two years. Washington, DC: Mathematica Policy Research.

  5. Chiang, H., Speroni, C., Herrmann, M., Hallgren, K., Burkander, P., & Wellington, A. (2017). Evaluation of the teacher incentive fund: final report on implementation and impacts of pay-for-performance across four years. NCEE 2018–4004. Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, US Department of Education.

  6. Cook, T., Campbell, D., & Shadish, W. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.

  7. Cooperrider, D., & Srivastva, S. (1987). Appreciative inquiry in organizational life. Research in Organizational Change and Development, 1, 129–169.

  8. Darling-Hammond, L. (2013). Getting teacher evaluation right. New York, NY: Teachers College Press.

  9. David, J. (2010). Using value-added measures to evaluate teachers. Educational Leadership, 67(8). Retrieved from http://www.ascd.org/publications/educational_leadership/

  10. Eckert, J. (2010). Performance-based compensation: design and implementation at six Teacher Incentive Fund sites. Seattle, WA: Bill & Melinda Gates Foundation. Retrieved from http://www.tapsystem.org/publications/eck_tif.pdf

  11. Eoyang, G., & Berkas, T. (1999). Evaluation in a complex adaptive system: a view in many directions. In M. Lissack & H. Gunz (Eds.), Managing complexity in organizations. Westport: Quorum.

  12. Fetterman, D., Kaftarian, S., & Wandersman, A. (1996). Empowerment evaluation: knowledge and tools for self-assessment and accountability. Thousand Oaks, CA: Sage.

  13. Glazerman, S., McKie, A., & Carey, N. (2009). An evaluation of the Teacher Advancement Program (TAP) in Chicago: year one impact report. Final report. Washington, DC: Mathematica Policy Research, Inc.

  14. Glazerman, S., Chiang, H., Wellington, A., Constantine, J., & Player, D. (2011). Impacts of performance pay under the Teacher Incentive Fund: study design report. Washington, DC: Mathematica Policy Research.

  15. Guba, E., & Lincoln, Y. (1989). Fourth generation evaluation. Newbury Park, California: SAGE Publications.

  16. Hawe, P., Bond, L., & Butler, H. (2009). Knowledge theories can inform evaluation practice: what can a complexity lens add? New Directions for Evaluation, 2009(124), 89–100.

  17. Ingersoll, R. M. (2009). Who controls teachers’ work?: power and accountability in America’s schools. Harvard University Press.

  18. Liket, K. C., Rey-Garcia, M., & Maas, K. E. (2014). Why aren’t evaluations working and what to do about it: a framework for negotiating meaningful evaluation in nonprofits. American Journal of Evaluation, 35(2), 171–188.

  19. Marsh, J., Springer, M., McCaffrey, D. F., Yuan, K., Epstein, S., Koppich, J., et al. (2011). A big apple for educators: New York City’s experiment with schoolwide performance bonuses. Santa Monica, CA: RAND Corporation.

  20. Max, K., Constantine, J., Wellington, A., Hallgren, K., Glazerman, S., Chiang, H., & Speroni, C. (2014). Evaluation of the Teacher Incentive Fund: implementation and early impacts of pay-for-performance after one year. Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, US Department of Education.

  21. McDonnell, L. M., & Elmore, R. F. (1987). Getting the job done: alternative policy instruments. Educational Evaluation and Policy Analysis, 9(2), 133–152.

  22. Miles, M. B., Huberman, A. M., & Saldaña, J. (2013). Qualitative data analysis: a methods sourcebook (2nd ed.). Thousand Oaks, CA: Sage.

  23. Morell, J. A. (2010). Evaluation in the face of uncertainty: anticipating surprise and responding to the inevitable. New York, NY: Guilford Press.

  24. Murnane, R. J., & Cohen, D. K. (1986). Merit pay and the evaluation problem: understanding why most merit pay plans fail and a few survive. Harvard Educational Review, 56(1), 1–17.

  25. Ostrower, F. (2004). Attitudes and practices concerning effective philanthropy: survey report. Washington, DC: Urban Institute.

  26. Patton, M. (2011). Essentials of utilization-focused evaluation. Thousand Oaks, CA: Sage.

  27. Rice, J. K., Malen, B., Baumann, P., Chen, E., Dougherty, A., Hyde, L., & McKithen, C. (2012). The persistent problems and confounding challenges of educator incentives: the case of TIF in Prince George’s County, Maryland. Educational Policy, 26(6), 892–933.

  28. Rogers, P. (2008). Using programme theory to evaluate complicated and complex aspects of interventions. Evaluation, 14(1), 29–48.

  29. Springer, M. G., Pane, J. F., Le, V.-N., McCaffrey, D. F., Burns, S. F., Hamilton, L. S., & Stecher, B. (2012). Team pay for performance: experimental evidence from the Round Rock pilot project on team incentives. Educational Evaluation and Policy Analysis, 34, 367–390.

  30. Stufflebeam, D. (1983). The CIPP model for program evaluation. In G. Madaus, M. Scriven, & D. Stufflebeam (Eds.), Evaluation Models (pp. 117–141). Boston, MA: Kluwer-Nijhoff.

  31. Weisberg, D., Sexton, S., Mulhern, J., Keeling, D., Schunck, J., Palcisco, A., & Morgan, K. (2009). The widget effect: our national failure to acknowledge and act on differences in teacher effectiveness. Brooklyn: New Teacher Project.

  32. Yuan, K., Le, V.-N., McCaffrey, D. F., Marsh, J. A., Hamilton, L. S., Stecher, B. M., & Springer, M. G. (2012). Incentive pay programs do not affect teacher motivation or reported practices: results from three randomized studies. Educational Evaluation and Policy Analysis, 35(1), 3–22.

Correspondence to Rick Mintrop.



Table 2 Main codes for each complex

Cite this article

Mintrop, R., Pryor, L. & Ordenes, M. A complex adaptive system approach to evaluation: application to a pay-for-performance program in the USA. Educ Asse Eval Acc 30, 285–312 (2018). https://doi.org/10.1007/s11092-018-9276-6


  • Complex adaptive system evaluation
  • Evaluation method
  • Pay for performance
  • Teaching evaluations
  • Formative evaluation
  • Summative evaluation
  • Accountability
  • Performance management