Introduction

Reliance on randomised controlled trials (RCTs) for the impact evaluation of development projects is growing, but they cost more than is necessary, while alternative methods are routinely underfunded. Our aim is to bridge from problem to data to action (Morgan and Olsen 2007, 2008). Bridging to action means making sure that the evidence gathered can enable warranted arguments to emerge from the evaluation process. That process must have elements of action, participation and monitoring, as well as data gathering. If there are control groups, and there need not be, development practitioners may not become aware of the contrastive control-group findings until the very latest stages of the project, when the action research and survey research elements are brought together. Yet important points about how to intervene, what works where, and who is being affected should be brought to the attention of the development organisation as soon as possible, not after a long delay. Thus, using RCTs alone is often going to be unethical and too slow, as well as unwise.

To put the debate into perspective, I have broken down the impact evaluation process into stages (Figs. 1, 2, 3). These are the stages through which the scientific dialogue usually passes. When we are being honest and transparent, the scientific method itself generates arguments that may take the same structure.

Fig. 1 The evaluation Stages 1 and 2: the arena stage, data stage, and conclusion

Fig. 2 Development impact evaluations use qualitative methods too infrequently! Source: Derived from a Web of Science literature search on development impact evaluation methods, 2018

Fig. 3 The evaluation stages, showing the arena stage and the feedback loops of reasoning and discussion

Premises, problems and arguments constitute the arena; reasoning, data and evidence interpretation constitute the debating stage. Yet, even at the stage of identifying the problem to be solved by the intervention, arguments will occur. Opinions will differ about how best to proceed. Two disciplines may disagree on what this is all about. Two sets of practitioners may disagree on the best means of changing development practice. However, we agree to disagree, having settled the terms of our debate. This early stage includes ontological exploration: what exists, what pre-exists the present, what is considered to be relevant amongst all the extant entities, and therefore what areas of life matter. (This will then imply which disciplines will be brought to bear.)

In this paper, I argue that the second and concluding stages involve collective reflection and reflexivity, so that we actually return to Stage 1 to reconsider the framing and then generate even more data, that are both new and different. Therefore, Fig. 3 is a better representation (having feedback loops) than Fig. 1.

Evaluation Stage 1

In the arena stage, we are reading the literature, talking about our knowledge, and discerning what are the objects, histories, entities, narratives and meanings that we are going to investigate. We identify the black lines that mark out differences in the world. The entities to which we refer are real: examples such as caste, class, gender are all controversial, but the controversies are about the world, not just a bunch of voices arguing in a void. The “ontic” is what exists. The project’s ontology is its sense of where it is, what cases are involved, who are the actors, what processes they start off, what histories matter.

Next, the impact evaluation project goes on to do some activities, and people learn from these, and generate both data and reasoning about the findings. The stage of data interpretation is not only about interpreting quantitative data but it is also about learning (Befani et al. 2014). Crucially, it is also a debate about what words to use, what concepts to apply, what are true narratives and what are misleading ones. A second stage, a third stage and a fourth stage iteratively ask why these narratives, why these errors, and how we resolve disagreements (Fig. 3).

An example of how an evaluation agenda can be ‘open’ is found in Brunie et al. (2014):

Conduct a deep formative assessment to understand the needs, challenges, and values of the multiple potential programme stakeholders including direct and indirect beneficiaries (Brunie et al. 2014, p. 119).

The meaning of drawing a ‘conclusion’ is that the concluding claims rest firmly upon a mixture of all the preceding reasoning and data. If we take the advice of practical field researchers, such as Brunie et al., we will “Engage local partners for multi-sector interventions” (ibid.), and, through discussions, reach agreement on what is ready for change.

If so, then multiple conclusions are both possible and feasible. There is more than one warranted argument, based on the same original arena of debate. Now, we can have a scientific debate involving closer recourse to evidence. This does not mean depersonalising the debate. Some evidence will be heavily and intrinsically personal (as I will explain in the abduction section).

However, for a sophisticated study worthy of respect, the ‘arena’ stage must be carried out carefully—integrating a review of literature—and the analysis stage must involve transparent evidence (Byrne and Ragin 2009). This has been recognised through the development of alternatives to RCTs: systematic reviews, meta-analysis and other initiatives, such as “What Works…”, which offer useful tools to learn from previous experience.

Rebutting Three Misunderstandings

Some researchers have doubts about realism because it may seem to be too much of a move from epistemological to ontological foundations. One thing that is often misunderstood about ‘the real’ and realism is whether or not the entities you look at, like fishing or Lake Victoria, are changing or fixed over time. Of course, they are changing. Nevertheless, it is worth naming them. To name them is to select what is important from the background mass. A growing literature offers realist impact evaluation (Befani et al. 2014; Allmark and Machaczek 2018; Olsen 2009). The background has been set out explicitly by Maxwell and Mittapalli (2010), and implicitly by the case-comparative school (Rihoux and Grimm 2006; Rihoux 2006; Ragin 2008; Snow and Cress 2000).

Another misunderstanding occurs if one thinks realism is foundationalist, or that it favours the material world over the social world of social constructions. Far from it: Maxwell and Mittapalli (2010) expand on this topic. Social constructions are real in the specific sense that the group labels found in a society have causal influence. There are causal mechanisms embedded in them. For example, the phrase ‘ethnic groups of fisheries workers’ represents both the fisheries workers’ ethnic identity as a cognitive self-reflection and the social grouping that calls their group by an ethnic name, as well as the real patterns of behaviour and culture behind ethnic differences. Therefore, both labels and causes may be relevant. It is not foundationalist to be realist in our approach, but it does imply that we might at some point examine the evidence (by which I mean records of experience) to discern what these things are. In summary, although realism appears foundationalist, what is found to be ‘real’ is contingent on many contextual factors.

As an example, in the excellent analysis of the livelihood impacts of cash transfers across sub-Saharan Africa, Fisher et al. (2017, p. 306) pay attention to “Identity/status and inclusion/exclusion from [one’s] network”, by which they mean people’s awareness of assigned ethnic and tribal groups, and social exclusion. These are not just social constructions: the society and its social institutions have real effects upon people, and these effects were brought to attention during focus groups.

In epistemological terms, just because the intervention evaluation is realist, it does not mean it is naively so. It does not mean that we ignore social constructions, labels, naming conventions, translations or discourse. As Maxwell and Mittapalli (2010) explain, we use a scientific, not an “objectivist”, realism. There is a sophisticated form of realism (often called scientific realism or transcendental realism) and there are forms of naïve realism (details found in Layder 1993; Olsen 2012; Olsen, forthcoming). The world should be seen dynamically.

Murshed-e-Jahan et al. (2018) illustrate realism in a sophisticated value-chain study of fisheries in Nepal and Bangladesh. The realism in a study can be implicit, yet very helpful; and social constructionism of a weak kind can help in the participatory process by giving us a high awareness of meanings and discourses, not just what is factually asserted. By advocating participatory research, these authors advocate conversations.

The setting of an arena is also not a fixed, once-for-all stage. We can revisit Stage 1 after starting Stage 2; see Fig. 3. This is commonly recommended in qualitative textbooks (Blaikie 1993; 2000). The arena lets us talk about our agenda.

Eight illustrations show how flexible, iterative mixed methods have been proposed for use in development intervention processes.

  • Tremblay and Gutberlet (2010) show how community outreach, capacity building, and interviews helped in tandem over time to improve a recycling project.

  • Luo and Liu (2014) argue that cultural awareness and touching base with local cultural differentiation are extremely important, and that contextually well-grounded participatory research will help make the interventions effective.

  • Nathan et al. (2013, p. 3) used a “Research Reference Group comprising representatives from grant partners and participating schools formally … to advise on proposed study measures and recruitment of study participants.”

  • Taylor et al. (2012) showed that investing in leadership capabilities among water agency champions is a feedback-based method of invoking improvements which can be transferred to other development change processes.

  • Brink et al. (2011) show how sustainable game management is not being achieved by results-oriented research, whereas process-oriented, learning-based, adaptive co-management and co-regulation can work much better. They argue that not only is the new method better but also that the old top-down methods are failing utterly to keep game reserves sustainable in either human livelihood terms or in the sense of sustaining healthy natural populations.

  • Pollard and DuToit (2011) argue that a practice-based understanding of policy, governance-sensitive actions, self-organisation, and feedback loops helped interventions in an integrated water resource management context to be much better (more effective, more acceptable, more responsive) over time.

  • Gimenez and Perez-Foguet (2010) use participatory methods to good effect in a policy impact assessment.

  • Ngwenya et al. (2012) show that short-term intervention studies miss out the key gender issues, including gross invisibility of women’s work in a fisheries context in Botswana. Solutions are found through discussion.

Other concrete studies which use mixed data types effectively include Kambala et al. (2017), and King and Samii (2014).

In particular with reference to Ngwenya et al. (2012), if we compare the RCTs in Mali by Masset and Gelli (2013), the methodological contrasts are immense. Ngwenya et al. (2012) write in a transdisciplinary way with sensitivity and awareness of multiple stakeholder voices, whereas Masset and Gelli are aiming at a mono-disciplinary medical audience. The practice of publishing the RCT trial protocol first and the results later engages peer review in the Masset and Gelli (2013) case (see also Gelli et al. 2017, 2018), but it does not promote any feedback loops and longer-term engagement of grassroots standpoint holders. It would be possible to do both, but apparently rigid epistemological boundaries, often known as ‘epistemes’ or sets of mutually exclusive rules about data, block the RCT users from invoking the better development research practices (Ravallion 2009).

In brief, the three misunderstandings which often arise are: (1) determinism; (2) foundationalist naivety; and (3) closed stages.

Evaluation Stage 2

Most good impact evaluations are going to use mixed methods and gather mixed forms of evidence. In development contexts teams are guided in part by funding agencies such as the Department for International Development (DFID) (see UK Aid Connect 2018) to formulate a “theory of change”, and use this to derive the strategy for causal analysis (and RCT design). The theory of change approach involves a discussion around how narrow/wide the net is to be cast for a project. Discussions which introduce theories of change include Funnell and Rogers (2011), the Aspen Institute (2004), and UNDP/Hivos (2011), cited in UK Aid Connect (2018).

I am often asked whether impact evaluations therefore rest on pragmatism. Pragmatism was a key argument in Creswell’s (1994) overall discussion of mixed methods. Detailed summaries and critiques by Maxwell and Mittapalli (2010) and Allmark and Machaczek (2018) are helpful. The pragmatist ideas offered by Creswell and Clark (2018, pp. 38–40) are rather confusing. They cite Teddlie and Tashakkori (2003) in favour of abandoning metaphysical concepts, whereas the latter actually argued mainly in favour of abandoning the qual-quant paradigm wars (Teddlie and Tashakkori 2009, p. 8).

Overall, I am not convinced that pragmatism makes a strong contribution to impact evaluation. Pragmatism in philosophy means a number of things, and it offers very few implications for what you should do or may not do.

Instead, when one argues for scientific realism making reference to the real world, this is a fierce and firm philosophical standpoint. Fierce because it demands that evidence be more than personal, that it be worthy of respect and scrutiny; and firm because it holds that the world around us pre-dates us. It is a metaphysical claim, whereas pragmatism makes few metaphysical claims and is mainly about processes of making choices.

Implications of the Dialectic as Real

I would go even further. In general, the causality we observe in development does not work simply. It is complex for two reasons. First, life moves onward through various dialectical processes, and second, we ourselves as researchers are inside the changing world. We know that tensions exist, and we are able to influence the world, so these various tensions build up to social change. A dialectical change is one which has three stages: the initial stage, the building up of tension, and the resolution through some qualitative change. Often a project’s quantitative data avoids mentioning the very changes which are the ones that really shift a social situation.

Personal experience in social movements and political participation has proved to me that dialectical changes are a real, ontic thing, not just a figment of an author’s imagination. Also, multiple dialectics are going on all at once, making life challenging for all of us. I would argue that it is better to avoid being a determinist when it comes to causation. If this is the case, we then realise that evidence may belie the truth, or create a mask. For example, social class dynamics occur, so tension exists, so working class people may hide things from an elite observer. Language dynamics exist with a lot of tension during our modern period of rapid change, so we hide some facts that are best expressed in the minority language, and thus transcripts are faulty… and so on. We need to generate a critical perspective, enunciate our questioning views, check on all evidence, and get external reviews which may offer worthwhile insights. The core agents in an intervention are humans with multiple interests, and these combine to create a panoply of standpoints. (The agents’ voices do not just reflect arbitrary, subjective viewpoints; they offer glimpses of actual standpoints reflecting real interests.)

In turn, agents’ values and beliefs affect what is accepted as premises. Or as data. Or as reasoning. To accept that we will study a problem, or an intervention, does not mean to accept agents’ values and beliefs uncritically. It means to consider them critically. For a realist, the ‘science’ part involves comparing and making judgments about different arguments.

Interim findings from action researchers or participatory action research are of value in themselves. The findings from such activities are usually based on a mixture of retroductive and abductive logic. Retroduction involves asking why: why this activity choice; why this outcome occurred; and, more deeply, why is it this problem that you feel needs to be solved (Downward and Mearman 2007)? Abduction on the other hand refers to knowing ‘from inside’, in either a phenomenological or ethnographic context, what things really mean to people within the scene, such as a development project. Both these methods make use of personalised data. Instead of transparent or recorded data, both action research and ethnography depend heavily on a person’s bodily experience, memories, and ex post vocalisation. Such methods must and can be combined with surveys or interviews which involve much more record keeping, allow comparability but nevertheless do not require a randomised form of control group selection.

The evaluation process overall does need to provide a transparent evidence base. Transparency arises naturally from survey data collection, re-use of schools’ or NGOs’ data, or the transcription of interviews and focus groups. Meanwhile, at the concluding stage, the actors who are involved in the action research, monitoring and evaluation or participatory research must also re-evaluate their own positions (reflexively), and this is likely to occur in a private pre-publication dialogue, not in public. Admitting this will strengthen development policy debates, not weaken them. Therefore, we could decide to use rapporteur methods to bring notes of the late-stage discussions into a public, ongoing reflexive multilogue.

Looking at the process I have just described, we use more than mixed methods; we use a mixture of steps of analysis interspersed with scene-setting decisions, re-analysis, induction from larger datasets in the survey components as well as in the interview components, and deduction from the elements of quantitative data that belong in the overall database. Qualitative software, such as NVIVO, can be used to gather up these many threads. The human interactions of action research can be part of the ongoing evaluation efforts, and all documented into the NVIVO database for further retroductive analysis at the end.

To be really specific, the retroductive questions we use, in asking ‘why’ in a backward-looking way, would include:

  • Why did that not work?

  • Why did this disagreement happen?

  • What narrative reconstruction worked to resolve disagreements?

  • Why were forecasts not met from early in the project? Was it due to some internal discursive limitation, or a clash with an external body? And, if so, what were the boundaries that had to be breached?

Comparative case analysis may fit well with the analysis of the survey data (Aus 2009). This need not be an RCT-based analysis; it could be a qualitative comparative analysis (QCA, fsQCA, csQCA) or a process tracing through both survey and other enquiry forms (Hellstrom 2001).

All in all, my argument is that we must not limit one study to one logic from among the four choices: abductive, retroductive, inductive, deductive. We need to combine them and use them in sequence, or iteratively, as fits each study. Furthermore, this sequence first has to have an arena for the discussion, and that is the most important initial part of an evaluation project.

Further Development of Retroduction

There are many ways to do retroduction. Since to retroduce is to ask why, we can first distinguish open retroduction from closed retroduction.

If you consider Stage 1 closed and Stage 2 underway, you might only refer to existing recorded data (evidence) for retroduction. You may re-examine the existing variables, look for supporting quotes, and find answers to obvious questions in the data you have. This can include transcripts of interview data. This would be closed retroduction.

If you consider Stage 1 (arena–setting) to be open, then you can do open retroduction. You start to ask what needs to be re-conceived in order to get the answers to knotty problems in the research. Why did this intervention fail in this area? Why was it not implemented properly? What unexpected barriers were hit? As any experienced researcher knows, these questions involve re-opening the whole can of worms. This will mean a widened ontology, an openness to reconsider changing things that may have been considered fixed or irrelevant at the start. This would be open retroduction.

Important Background

RCTs are dominant, and examples abound. Costliness arises from the likely co-correlation of ‘unobserved’ confounders. The idea of a confounder implies closed retroduction. The costliness of the whole trial actually needs to be put on one side if we can do open retroduction to find out what is going wrong, or what went right, as quickly as possible in a definitive way.

Cluster trial methods are widely used among those who believe in closed retroduction, no retroduction, or pure deductivism in research (Kelcey et al. 2016; Taft et al. 2012 illustrate these). The cluster idea locates groups of treated and untreated people (or cases, e.g. schools) far apart in space. Then, no changes of strategy are allowed, as they would pollute the data. Diffusion, on the other hand, is a natural part of human communicativeness. Diffusion of ideas from the trial is discouraged by the RCT attitude, the RCT-committed team, and the RCT protocol (see Kikuchi et al. 2015 for an example). This purity depends on a deductive logic. It assumes the ‘data’ will lie in a hermetic seal. Validity in this logic is not the same as validity in the transparent science sense. Validity in this logic is achieved by “if … data and trial then… therefore…” logic. But transparent scientific logic could take a different form: “The problem is XX and we discovered a barrier to solving it, BB…” There can be evidence about BB and XX, but it would not have been foreseen at the start.

My advice is to start Stage 2 but then allow for a revisit to arena-setting if need be. This does not happen in cases like the RCTs of Lubinga et al. (2014) or Pradhan et al. (2013).

Furthermore, a basic reason for the costliness of RCTs is that the measurements must be both longitudinal (pre- and post-treatment measures, typically) and spread out far and wide to prevent people in the control arm from discussing the treatment or its effects with those in the treatment arm (an example in microfinance is McHugh et al. 2017; see Orr 2015). In other words, high-quality data is the key aim. This implies an epistemological value over other human values. It turns out to be deontological. (Deontology refers to principles-focused thought.) I have no commitment to deontology, because it tends to be highly conceptual and not realist.

Secondary to this aim of pure data, the choice of ‘cases’ often turns into a reductive search for atomistic units of society to be ‘affected’ by the treatment. Atomism itself has been roundly critiqued in the philosophy of social science (Sayer 2000). Whilst individualism is common in some circles of the academic world, notably those with Anglo-Saxon traditions, most of the world’s development science aims for a mixture of holism and atomism. Researchers try to achieve a healthy mixture of depth ontology and multiple levels so that the atomic units are seen in context. A depth ontology is explained with useful diagrams by Lopez (2000, pp. 33, 78, 88). In essence, we should assume open, interacting systems. I will give two examples illustrating the two extremes, and their costs.

On the atomistic side, examples like White (2013) give ‘guidance’ but implicitly send a message that holism is not wanted. On the mixed-methods side, we have authors like Befani et al. (2014) and Ssengooba et al. (2012) who argue in favour of more open forms of evaluation.

Alternative approaches also abound, notably participatory action research, action research, process tracing, QCA, and monitoring and evaluation. Mixed methods can also combine these.

Ontological Discussion Should Occur Explicitly

Creating an arena for debate is widely thought to be an epistemological activity. It is like trying to decide what we are talking about, so it seems it must be about knowledge. But really, constructing the conceptual ‘ground’ or fundamental list of entities and the key problem, and agreeing to talk about them for a while, is what I mean by setting up the arena. This is an ontological task.

Only once the ontology has begun to be worked out, with some boundaries on time and space, some aims and some objectives, concepts and names of things (such as who is a respondent, who is a participant, what members will be consulted, how action plans are to be written down if at all, and what will constitute evidence), can the project really start in earnest as teamwork. Before this moment, there is speculation, there are beliefs, there are lay narratives. These are not to be thought of dismissively, but the nature of development as a science, a social science, is that it can focus on specific areas of life and bring together a discussion around the evidence and experience in these areas. Yes, that can include history. No, it cannot avoid setting up an ‘arena’ of debate.

Evidence is created as part of a bridge to action. Therefore, the control group evidence could be interesting, even if it were polluted by knowledge of the intervention, as long as it still creates contrasts that are of value in relation to objects or entities that we have agreed to notice.

It is important to realise that, without action research or participation, people studying interventions are going to learn too slowly. The modes of learning must ideally be embedded in all the ‘development methods’ training. The tasks of research can be embedded in all the stages of the implementation of the intervention. Thus, experts can work in the field alongside others, and everyone’s expertise can be respected.

Otherwise, if we accepted the kind of standards of methods that are used by medics or by statisticians, who are focused on a set of identical cases reflected in the dataset (not the underlying reality), we could miss opportunities and waste a lot of research money. It is important how one argues about this.

I will now explain why I promote the use of process tracing, case comparison, context-focused QCA, and the discernment of causal mechanisms (see Sayer 2000 who sets out causality as real mechanisms).

These qualitative methods are consistent with choosing control groups with a restricted treatment group. What is not consistent is the use of merely deductive reasoning, or simplified purely mathematical methods, to draw policy conclusions.

Pro-Mixed Methods is Inherently Anti-Deductivist

Any sketch of the research process will tend to show mixed methods using multiple logics: abductive, retroductive, inductive, deductive.

Induction, in particular when combined with dialogue among diverse actors, creates a dialogic element. This combination brings complexity to bear during the reflexive stages. The actors in the process bridge to action. I cited earlier examples from African fisheries and land-use planning, and sustainable development, where processes of change were mediated by change agents whose growing awareness of multiple stakeholder standpoints and positions helped the group to move to better management practices. Bridging to action is a way to see development as practice. RCTs, on the other hand, postpone development practice and isolate the researchers from those who are being researched.

Studying interventions may not be as important as getting to grips with empowerment issues, barriers to human capabilities, and development objectives which are fundamentally not being addressed by current research relationships. Research is human, not just medical. Development research about an intervention can be valid, highly structured, systematic, transparent and expert, without having a control group at all.

Thinking of Ngwenya et al. (2012), who recommended co-managing research whilst innovatively co-managing the fisheries resources, that study did have a large structured part. Ngwenya et al. used both a local survey and crosstabulations, and they triangulated their data with government data to give a backbone and generality to the study (Ngwenya et al. 2012, pp. 111–116). The use of tabulated information is called systematic analysis (Olsen 2012). At the same time, the researchers reflected on their interviews to reach new opinions, which they did not hold at the start. To argue that women are excluded from recognised roles in fisheries, and that new policies had obliterated women’s roles, is to raise scientific, well-grounded objections.

In such cases of pluralist mixed methods, it is helpful to think of the author’s logic not as deduction, which argues from general premises to particular conclusions, but rather as a building up of complex arguments. 1. Many fishers are women, notably basket fishers (Ngwenya et al. 2012, p. 115). 2. Past policy on fishing favours men. 3. Favouring men occludes favouring women. 4. Not noticing women’s work leads to the ignoring and discouraging of women’s work. 5. Fish policy has discouraged women’s work in fishing. 6. Favouring men will not help the fishers. The argument is logical but not deductive. This is known as a warranted argument: it moves from premises to conclusions, with integration around key concepts. (A growing set of works by Alec Fisher, Bowell and Kemp, and Weston build up such logic into ‘critical thinking’, a skill which can be taught.)

In warranted arguments, the conclusion does not stand alone and it is not a subjective belief. Instead, it rests upon the roots which are the premises of the argument.

It is possible to include treatment and control groups without limiting the research to RCT methods of statistical analysis. Agarwal (2018) illustrates this by studying a sample of 70 groups of women, each doing rental farming collectively, in the states of Telangana and Kerala. She did follow-up interviews and had close liaison with the NGO and farmer leaders. She engaged in group-comparative statistical post hoc analyses of the overall outturn at the end of the study year. Here, there was no randomisation, but perfectly adequate comparative methods (ibid.). Agarwal’s team also liaised with government officials and women’s group leaders and members. This was teamwork with a scientist able to move about and conduct a cross-state comparative element.

Group reflexivity is generally important at the end of the control group data-cleaning stage, whether the evaluation uses case-comparative, group-comparative, or RCT methods. Here, we are making systematic comparisons. However, conclusions from data tables alone are not final. The numbers cannot speak. An argument is something humans construct; it is not something an artificial intelligence could construct. Instead, a circuit of discussion and revision is likely to be needed. We bridge to action again.

The ongoing nature of many interventions has meant a long delay in publishing findings of evaluations. This may be a good thing. Long-term effects are not the same as short-term effects, as Lam and Ostrom (2010) illustrate. Their study of watershed management in Nepal used data which showed short-term water gain improvements below the water management engineering systems, but the long-term picture was really the important scene. Their paper concluded that long-term water provision was a key outcome.

Another reason for waiting to publish is that the overall pattern of economic or measurable effects may not be visible or evident to participants; a project with a wide scope and ambitious coverage, requiring the assembly of a large and extensive dataset, will have complex data. Once the project has happened, the team need to keep funding aside for their last few meet-ups, and translation is also needed so that outsiders do not dominate at this key stage. Pattern discernment often takes three formats at this stage:

  1. Tables of means and comparison of means across groups.

  2. Adjusted means comparisons, after allowing for confounders or while using IV adjustment or a fixed-effects method. Sometimes, a whole regression is fundamentally based on a contrast of means by groups using ‘difference in difference’ (DID) methods. A DID estimate allows for change over time that is normal versus a different trend among the treated group. A simple DID is a pair of lines moving upward showing a growth curve of profits over time (or scores or size), with the treatment group rising more quickly from the same base as the control group. Variants use curves, multiple treatment amounts, and so on (a minimal numerical sketch of a DID contrast follows this list).

     Entropy-based propensity score matching (PSM) is also used to make fair contrasts of group means, but PSM sometimes allows cases to drop out (see discussions in King and Nielsen, forthcoming; and Duvendack et al. 2012). Textbook treatments of this topic by Hansen et al. (2011) and Lan and Yin (2017) offer innovations but not methodological pluralism.

  3. Regression in full format, attributing causality to one factor or another. Here, there is shifting ground for comparison over time. The initial random sampling is not enough to guarantee a base of support for claims focused on the treatment, because the base of support (i.e. the treated cases and their counterparts in the control group) may change due to compositional change over time. There are also some problems with the regression format: do you use, or avoid, interaction terms? Do you allow moderation or not, in other words? Do you allow for mediation of effects? Is not the treatment going to have multiple effects, and could any of these be external and not measured? If so, who records this, who speaks this truth, who brings it all together? Snow and Cress (2000) showed with sound ethnographic evidence that multiple outcomes may be meaningfully teased out, yet they would be ignored using the sophisticated atomistic statistical methods. Their study discovered four variants of policy success, R1, R2, R3 and R4, and then pursued the causal patterns for each of these. Such depth is too rarely achieved.
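To make format 2 concrete, here is a minimal numerical sketch of a DID contrast of means. The figures are hypothetical, invented purely for illustration, and the pandas layout is just one convenient way of holding the four group means rather than a prescribed procedure.

```python
# Minimal sketch of a difference-in-differences (DID) contrast of group means.
# The numbers are hypothetical; in practice they would come from the project's
# pre- and post-treatment survey rounds for the treated and control groups.
import pandas as pd

means = pd.DataFrame(
    {"pre": [100.0, 100.0], "post": [130.0, 115.0]},
    index=["treated", "control"],
)

change = means["post"] - means["pre"]        # within-group change over time
did = change["treated"] - change["control"]  # treated change minus the 'normal' change

print(means)
print(f"DID estimate of the treatment effect: {did}")  # (130-100) - (115-100) = 15
```

In the simple two-group, two-period case, the coefficient on the treatment-by-period interaction in a regression reproduces this same double difference; the regression format adds covariates and standard errors around it.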

The truth is that adjustments to improve the accuracy of steps 2 and 3 have been promoted to the extreme of now questioning all tables of means. The very obvious descriptive point, that one group did better than another, which can be augmented by subtracting the normal starting point, has been lost, and the audience has grown narrower and narrower.

Instead of merely ‘using’ ‘control variables’, we should consider the whole of how the treatment might take its effect. Changes in concomitant input variables would be expected. This is the ‘coincident necessary part of a sufficient pathway’ approach. Sufficiency of A for an outcome Y is analysed using a Boolean logic of factors that co-exist (A&B, also written AB or A∩B), their complements (~A, ~B), and alternatives (A or B, written A∪B). We write A => Y if A is a sufficient condition for Y to have occurred. (An alternative relationship is A IFF Y, where IFF means if and only if, which is a stronger relationship.) Much statistical practice treats a cause X of Y as if X IFF Y, so that sufficiency is assumed to run in both directions, but I will show that this is a false and misleading assumption.
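As a concrete illustration of what A => Y means in practice, the sketch below checks whether every case exhibiting condition A also exhibits outcome Y, which is the crisp-set version of a QCA consistency test. The six-case table is hypothetical and is included only to show the mechanics.

```python
# Hypothetical crisp-set table: 1 = condition/outcome present, 0 = absent.
# A is judged sufficient for Y when every case with A == 1 also has Y == 1.
import pandas as pd

cases = pd.DataFrame({
    "A": [1, 1, 1, 0, 0, 1],
    "Y": [1, 1, 1, 0, 1, 1],
})

with_a = cases[cases["A"] == 1]
consistency = (with_a["Y"] == 1).mean()  # share of A-cases that also show Y
print(f"Consistency of 'A => Y': {consistency:.2f}")  # 1.00: A appears sufficient here

# Note the asymmetry: one case shows Y without A, so A is not necessary for Y,
# and A IFF Y would be too strong a claim even for these invented data.
```

In fuzzy-set QCA the same idea generalises to a consistency score between 0 and 1 rather than an all-or-nothing test.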

Once the data are ready (which is a late stage in a project), the great achievement is possible. The masterful statistician discerns a small rise in a good outcome, or a small decrease in a bad outcome, which would not be noticeable to the naked eye or to participants, or without the regression controls or the adjustments.

There is now perhaps an ‘evidence capital’, a social capital of holding the evidence while deciding on the key findings, then noticing the key themes, and making sure they can be evidenced. This is like looking for something that might be obvious, like the emperor’s new clothes.

It is deductivist to think we cannot see the difference of outcomes using raw data (see Mock et al. 1993, pushing for case–control methods).

It is like saying we can only see Z in the pattern of X and Y if we use these cleaning methods, of which only an elite group will know.

Mixed Methods Authors Work in Teams

The numerous studies that follow DFID guidelines add a monitoring and evaluation aspect to each development project. DFID has invested in large numbers of evaluation experts. The idea of monitoring is not to make quantitative records, but to engage with participants and see how things are going at a midpoint and near the endpoint of an initial trials stage. Really listening is very important as it implies openness and narrative breadth, open questions, multidisciplinarity. These are all excellent bases for mixed methods but do require different skills from the experts in questionnaires and interviews. Therefore, it is usual to have at least 3 people working on an evaluation; 4 if you include the social statistician. Working in a team, these people discuss with each other. They disagree, explain, argue, and develop ideas. Two important ‘Why?’ moments arise:

  1. Why do you think that? What was the evidence that made you think that?

  2. Why are we disagreeing? What is the language/wording/conceptual difference that underpins our disagreement?

When answering these questions, ironing out every disagreement is not the aim; realising that disagreement is a fundamental part of the process is what matters. Retroduction leads to better arguments that explain not only the success but also the failure of the intervention in diverse circumstances.

Figure 3 illustrates how retroduction will lead to generating new data, having discussions about our shared premises, and creating other forms of feedback loop.

Mixed Methods Using Retroduction Is Now Common

The word ‘retroduction’ came to prominence in the 1980s during a debate on ‘scientific realism’. Its historical antecedents lie in a confusing argument among philosophers, which we must avoid (retrodiction being a different thing).

Numerous authors then posited four forms of logic for social research, all arguing that we can combine them (Blaikie op cit., for example). Other authors did not like to admit deductive logic had any purchase, and moved toward a supposed ‘constructivist’ pole. Most authors just try to avoid a sticky argument. But it is really simple. You can do induction for one month, deduction for a few days the next month, then write all that up and do abduction by observing in situ for a few weeks, and then discuss among your team and do retroduction for a week. ‘Doing’ each means creating the input and output stage of each. They are very different. But all are valuable in their own rightful sphere.

Mixed Methods with Multiple Pathways of Cause are Uncommonly Good

What is missing in the RCT statistics is a sense of equifinality. In reality, multiple pathways can lead to the same good outcome. For example, to limit ‘child labour’, achieving high household wealth will work, and achieving a welcoming school with breakfast and noon meals provided free to all children will also work. To take this further, the school may need toilets suitable for both girls and boys as separate blocks. We define this mechanism using a helpful terminology from logic:

  • Toilet provision is a necessary condition for one of the sufficient pathways.

  • High household wealth is sufficient as a pathway to having no children in ‘Child Labour’.

  • Having noon meals at school is not sufficient to eradicate ‘Child Labour’.

  • Having both breakfast and noon meals at school, and a welcoming school, combined with the toilet provision, is a sufficient pathway to removing ‘Child Labour’.

Thus, W or (N&B)&T is sufficient for removing ‘Child Labour’ at household level. (Also written W∪[(NB)∩T] => ~C, where ~C refers to the absence of child labour.) Here W refers to wealth, T to toilets, N to noon meals, and B to breakfast.
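Purely as an illustrative sketch, the pathway expression can be written out and evaluated for hypothetical household configurations; the function and the example households below are invented for illustration and are not drawn from any study cited here.

```python
# Evaluate the pathway expression W or ((N and B) and T) for hypothetical households.
# W = high wealth, N = noon meals, B = breakfast, T = suitable toilet provision.
# The expression being True is read as a sufficient pathway to ~C (no child labour).

def sufficient_pathway(W: bool, N: bool, B: bool, T: bool) -> bool:
    """True when at least one of the two sufficient pathways is present."""
    return W or ((N and B) and T)

households = [
    {"W": True,  "N": False, "B": False, "T": False},  # wealth alone suffices
    {"W": False, "N": True,  "B": True,  "T": True},   # both meals plus toilets suffice
    {"W": False, "N": True,  "B": False, "T": True},   # noon meals alone do not suffice
]

for h in households:
    print(h, "-> sufficient pathway present:", sufficient_pathway(**h))
```

The third household shows why a single treatment component, considered on its own, can fail even though it belongs to a genuinely sufficient pathway.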

Similarly, obstacles that block the good outcome can take many forms. When we think of a ‘bad’ outcome, like ‘child labour’, we need to consider both outcomes:

  • Factors affecting having Child Labour

  • Factors affecting the eradication of Child Labour.

A large literature shows with case-study methods that multiple pathways commonly occur. Known as ‘equifinality’, this leads to doubts about the closure of the statistical models often used in RCTs. If we have some unspecified necessary condition T (e.g. toilet provision), which is part of other apparently sufficient conditions for Y, we may not realise that Y in future will fail, due to a change in T. If we consider the treatment (noon meals and breakfasts) without considering T, yet T is crucial, then we do not have a good arena for the whole conversation about removing Child Labour.

The project should be learning about such factors. A learning community can be created. Ragin has argued for this in a series of books (2000, 2008; and Byrne and Ragin 2009).

Another example. At high school, to reach high scores, a student may have talent, support at home, a good school, or high resources in their school arising from government funding. Not all of these conditions are required. If such arguments are true, and they often are, then statistical methods that conclude ‘no effect’ in the face of ambiguity are wrongheaded. Yet it is an empirical question in each situation. The RCT proponents may tend to assume it is a simple empirical question. I would tend to assume that it is quite complex. Which factor is sufficient? What condition is INUS? (An INUS condition is an insufficient but necessary part of a pathway which is itself unnecessary but sufficient for the outcome.) Will we appreciate the difference?

A query to RCT supporters: If something is crucial for some cases, but irrelevant for others, can the data cover it? Yes, it can.

A second query to RCT supporters: Do you assume both necessary and sufficient status for every X, A, B, S, and T? Surely not. Statisticians tend to assume if X represents a real cause of Y then it is both necessary and sufficient for Y.

But if X is composite then this is not symmetrically true, even if X is sufficient for Y.

In general, if X is A&B&S, and X is sufficient for Y, we still have a problem:

With regard to counterfactuals, for such an X, Not-X does not imply (is not sufficient for) Not-Y, even if X is sufficient for Y (see Appendix). Nonreversal of causality in the strict case is typical.

In brief, if:

A & B & S => Y

Suppose A was structural background and B was the treatment including an INUS condition, and we have evidence that together they support the achievement of the outcome, then we can’t say that the absence of one or the other will cause the failure of the outcome!

This is non-intuitive to most statisticians. They assume that moving upward on a curve is symmetric to moving back downward on that curve. This implies they have conflated => (sufficiency) with <= (necessity) and with IFF. These are three different operators in logic.
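A small truth-table sketch can make the asymmetry explicit. It assumes, purely for illustration, a data-generating rule in which Y arises either from A & B & S or from a hypothetical alternative pathway Z (equifinality, as discussed above); under that assumption A & B & S is sufficient for Y, yet its absence is not sufficient for the absence of Y.

```python
# Truth-table sketch: X = A & B & S is sufficient for Y, yet ~X is not
# sufficient for ~Y once an alternative (hypothetical) pathway Z exists.
from itertools import product

def outcome(A: bool, B: bool, S: bool, Z: bool) -> bool:
    # Assumed data-generating rule: Y occurs via A & B & S, or via Z (equifinality).
    return (A and B and S) or Z

rows = list(product([False, True], repeat=4))

# 1) X is sufficient for Y: every row with A & B & S also has Y.
assert all(outcome(A, B, S, Z) for A, B, S, Z in rows if A and B and S)

# 2) ~X is not sufficient for ~Y: some rows lack A & B & S but still show Y.
counterexamples = [row for row in rows
                   if not (row[0] and row[1] and row[2]) and outcome(*row)]
print(f"Rows where X is absent yet Y still occurs: {len(counterexamples)}")  # 7 of 16
```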

Abductive Impact Evaluation is Unlikely to Work Alone But Does Work in Bridging to Action

Mixed methods are very good because the speakers in the abduction and monitoring part tell us what we are getting wrong. Through the practice of good listening, stakeholders and team members will make researchers realise what is important in the mass of data.

The use of anthropological and ethnographic methods is very popular in development evaluation, but the staffing costs have to sit alongside other costs and thus the project managers must be ready to defend and argue the case for ‘using’ ethnography. At one level, ethnography does not sit easily with project evaluation because ethnographers intrinsically want to be open about their findings and not have a closed agenda from the start. Therefore, no promises will be made. However, most development researchers deeply appreciate the multifaceted knowledge that the development community gains from ethnographic practice and the resulting publications. Evidence for this arises from the widespread inclusion of ethnography in the grants already awarded in the growing Global Challenges Research Fund (a UK-based initiative of £1.5 billion in research funds linked to development interventions). But does the anthropologist in the team try to bridge to action?

The essence of how this works is through teams of researchers (and team meetings), and through the circulation of knowledge during and after a project through dissemination. A bridge to action can occur even if the individual ethnographer has not planned it. As long as they are involved in communication and dissemination, their work will have an influence. I value this circuit of knowledge. I argue against allowing the abductive investigation to occur without publications or public presentations as follow-up. The whole of the academic international community agrees with this position. What is missing in some disciplines is a respect for what abduction offers. By arguing that the learnings from ethnography fit in as claims within warranted arguments, I have shown how teams can invoke abductive logic without resting a whole argument entirely upon that one logic.

Conclusions

The schisms in the methods literature are largely artificial. Instead, the situation is that different groups of authors set up diverse arenas for discussing how events are affected by development interventions. Due to the existence and influence of academic disciplines, the arena offered by some authors is not acceptable to others. This is the case, for example, when impact assessment is set up too narrowly, with an ontology and theoretical framework from within just one discipline area.

The big challenge is to set up arenas which mix up the disciplines’ sensitivities without becoming too wide ranging for a project to be feasible. For example, one has to allow for basic elements of wealth or social class; for institutionalised habits or enculturation; for gender and/or other structural elements that underpin inequality; and for aspects of geography in the region where the intervention has occurred. This article argued in favour of multi-disciplinary approaches to impact evaluation across the whole spectrum from medical and social to business and engineering disciplines.

A third issue is the micro-nature of some impact evaluations using RCTs. Nested cases always exist, so being too reductionist in data collection might reflect an arena that had little depth. Such an impact evaluation lacks holistic objects such as the government, the social history, organisational types, or the cultural grounding. Many development projects have had successful evaluations, often including both an ‘action’ part and a survey-based data analysis part. Each impact evaluation has specific characteristics, and these could limit the usefulness of their findings in other contexts, giving very restricted external validity. Mixed methods could be useful to better understand these specificities across the micro-, meso-, and macro-levels. Thus it is important to recognise the meso-level in a globalising, international world.

Thus, a particular form of pluralism is possible in development evaluation. This kind of pluralism has a technical name in the methodology literature: methodological pluralism. I advocate using methodological pluralism across the action research–systematic research divide (and thus also the qualitative–quantitative divide), and pluralism of disciplines, i.e. transdisciplinarity.