Managing performance and winning trust: how World Bank staff shape recipient performance

World Bank evaluations show that recipient performance varies substantially between different projects. Extant research has focused on country-level variables when explaining these variations. This article goes beyond country-level explanations and highlights the role of World Bank staff. We extend established arguments in the literature on compliance with the demands of International Organizations (IOs) and hypothesize that IO staff can shape recipient performance in three ways. First, recipient performance may be influenced by the quality of IO staff monitoring and supervision. Second, the leniency and stringency with which IO staff apply the aid agreement could improve recipient performance. Third, recipient performance may depend on whether IO staff can identify and mobilize supportive interlocutors through their networks in the recipient country. We test these arguments by linking a novel database on the tenure of World Bank task team leaders to projects evaluated between 1986 and 2020. The findings are consistent with the expectation that World Bank staff play an important role, but only in investment projects. There is substantial evidence that World Bank staff supervisory ability and country experience are linked to recipient performance in those projects. Less consistent evidence indicates that leniency could matter. These findings imply that World Bank staff play an important role in facilitating implementation of investment projects.


Introduction
Development finance agreements between the World Bank and recipient governments mandate a range of behaviors in implementation, including, for instance, commitment to the project, conducting stakeholder consultations, resolution of implementation issues, adequate financial management and adequate procurement. Data from the World Bank Independent Evaluation Group (IEG) evaluate the degree to which recipients fulfilled these obligations and "performed" well. The data show substantial variation between projects. Existing research explains these variations based on differences between countries (Girod and Tobin 2016). However, even when comparing projects in the same sector, from the same country, and approved in the same year, variation remains. Afghanistan, for instance, had two projects focusing on water and sanitation approved in 2005. In one, the IEG rated the performance as satisfactory, whereas in the other, it did not. This kind of variation is insufficiently understood in the literature on the effectiveness of development assistance and in broader debates on adherence to the agreements between recipients and International Organizations (IOs).
To explain variation in recipient performance, we build on studies that have shown that international institutions can differ in their service provision (Kilby 2013a; Smets et al. 2013) and complement this research by shifting the focus to the role of IO staff members. We emphasize that staff can foster recipient performance. The argument is built on a substantial body of literature focusing on the influence of international bureaucracies in IOs (Barnett and Finnemore 2004; Copelovitch 2010; Knill et al. 2019). These studies suggest that international bureaucrats have a considerable impact on policymaking at IOs in general (Barnett and Finnemore 2004; Hawkins et al. 2006; Knill et al. 2019) and the World Bank in particular (Honig 2020; Weaver 2008). Given that recipient performance can crucially hinge on the design or supervision of development finance agreements and development projects, the influence of staff members on recipient performance seems likely. Therefore, we pose the following question: to what extent do international bureaucrats influence recipient performance in World Bank projects? Our argument is built on recent research on the World Bank that suggests that not all staff members execute their tasks in similar ways. Staff preferences (Briggs 2019a) and staff effectiveness (Bulman et al. 2017; Denizer et al. 2013) seem to differ substantially. We claim that these disparities matter for recipient performance. To explain variations, we hypothesize that three staff characteristics can influence recipient performance. These three characteristics are derived from theoretical perspectives in the IO literature that explain why member states comply with IO demands. We argue that international bureaucrats can be better at supervising projects, more stringent or lenient in applying development finance agreements, and more successful in securing the trust of essential stakeholders in recipient countries. Through these three pathways, international bureaucrats may directly influence the degree to which recipient states follow the aid agreement and, thus, perform well in the eyes of the Bank.
Our empirical results show substantial support for the hypotheses that World Bank staff characteristics matter for recipient performance. We demonstrate that recipient performance increases with the supervisory ability and country experience of staff. While we also show some evidence for the role of leniency, those findings depend much more on the variables employed and the specification used. Generally, the evidence implies that staff play a critical role in facilitating the performance of recipients. However, the positive contributions of staff seem to materialize only for investment projects and not for development-policy operations. The findings are robust to extensive checks focusing on bias in ratings, employing alternative ways to measure the main independent variables, and employing a range of alternative model specifications. Furthermore, we probe whether self-selection might drive the results but find little evidence. Overall, evidence from a battery of different tests implies that World Bank staff are important in shaping recipient performance.
The article contributes to two debates in IO research: World Bank project performance and compliance with IO demands. Existing studies on World Bank project performance have primarily concentrated on project outcome ratings rather than recipient performance (Bulman et al. 2017; Denizer et al. 2013; Dreher et al. 2013; Honig 2019; Kilby 2015). They have demonstrated that the World Bank plays a substantial role in determining project outcomes: either through design (Smets et al. 2013) or through supervision (Kilby 2001). However, the literature on IOs more broadly has long argued that explanations focusing on the adherence to IO demands, often referred to as compliance (Börzel 2020; Girod and Tobin 2016), necessitate a different set of explanations than studies focusing on effectiveness (Simmons 1998). While research on the effectiveness of development assistance has concentrated on understanding the conditions under which development assistance leads to welfare gains, research on compliance has aimed to understand why member states adhere to IO demands in the absence of a centralized authority with coercive capacity (Hurd 1999). That said, whether welfare gains materialize through a project depends on much more than the adherence of recipients to the obligations spelled out in the aid agreement. Development impact might not materialize even if borrowers do everything that is expected of them, for example, because a project ignores essential local contexts. Critics of the Bank have often argued that Bank programs lack effectiveness or even do harm to development objectives (Easterly 2014). We deem it unlikely for an inappropriate project to become more effective, even when borrowers do what the World Bank expects of them. The crucial difference between outcomes and compliance is more than semantics. It has spurred debates that focus on explaining differences in performance and compliance, respectively, and emphasize different sets of variables (Simmons 1998). Our study explores recipient performance rather than project outcomes and complements the literature on World Bank project performance.
By doing so, we also contribute to the literature on compliance in International Relations. In their seminal study, Girod and Tobin (2016) contrast a strategic importance hypothesis with an alternative income hypothesis to convincingly demonstrate that countries with lower income dependence on World Bank projects are less likely to fulfil the expectations of the World Bank in the project. However, such differences between countries still cannot account for a substantial degree of the variation we observe in recipient performance. This emphasis on country-level differences is also a feature of a larger body of the quantitative literature in International Relations (IR) that focuses on the adherence to an international agreement. This research on compliance focuses almost entirely on country-level variations (Börzel et al. 2010; Koliev et al. 2020; Simmons 2009; Vreeland 2006). We add to this compliance debate in two important ways: first, we extend its arguments with insights on the characteristics of international bureaucrats from the respective literature in public administration. Second, we demonstrate empirically that the variation in adherence to World Bank assistance agreements can be explained by variations between the international bureaucrats tasked with overseeing the project.
We proceed in five steps. First, we discuss why we expect staff to influence recipient performance. Second, we extend well-established perspectives on compliance from international relations and develop three hypotheses on how World Bank staff can shape the stringency of enforcement, build the capacity of recipients, and foster the sympathy of interlocutors. Third, we introduce the data used to test these hypotheses. Fourth, we present evidence to evaluate the relative importance of country and staff level factors in explaining recipient performance. Finally, we conclude by discussing the implications of our findings for academic debates on both World Bank effectiveness and country adherence to the rules of the agreement with an IO.

Recipient performance and the role of task team leaders
Recipient performance in the context of World Bank projects can mean different things depending on the type of project. The World Bank uses two main types of lending instruments: policy-based lending and investment project loans.[1] Both categories of operations include an agreement that stipulates the obligations of borrowers in the context of the loan. For Development Policy Financing (DPF), the contract includes conditionality that borrowers are supposed to implement. The particular set-up has changed over time from Structural Adjustment Loans (SALs), where conditionality was implemented in exchange for disbursements, to Development Policy Loans (DPL), where the implementation of conditionality precedes loan approval.[2] Investment Project Financing (IPF), on the other hand, is tied to specific projects. Here, the World Bank commits money in the expectation that it is used for a particular project that was previously agreed upon with a recipient. The project is implemented by the borrowing government and implementing agencies in the recipient country. In IPF, recipient adherence to the aid agreement involves less precisely specified conditions and more generally prescribed behaviors. These behaviors include, for example, government ownership, stakeholder consultations, resolution of implementation issues, as well as adhering to adequate financial management, procurement guidelines and counterpart commitment.
We argue that World Bank staff can play an important, underappreciated role in facilitating recipient performance in World Bank projects. Specifically, we focus on one group of staff members: Task Team Leaders (TTL). They are the most relevant staff for individual projects (Briggs 2019a; Bulman et al. 2017; Denizer et al. 2013), although several different staff at different levels of the hierarchy are also involved in projects. Most notably, all work with recipients is managed by the country management unit and headed by the country director. In large countries, like India, the World Bank has one country office exclusively focusing on this country. In other cases, country directors are in charge of multiple countries, like the country director for Kenya, Rwanda, Eritrea and Uganda. Country directors work with the recipient government and World Bank sectoral staff to discuss project ideas. Once the general idea has been determined, a TTL is chosen who works as "the Bank's principal point of contact for the borrower for the project" (World Bank 2013a, p. 1). TTLs are usually recruited by the practice managers responsible for a project. The project team is headed by the TTL and further includes technical staff working on legal issues, financial management and safeguards.
[1] In the interest of parsimony, we speak of lending throughout the article even though some of the loans are partly or fully given as grants. Recent years have seen a rise in program-for-results funding, which could be classified as a third main lending type (Cormier 2016). However, this shift had not taken effect yet in the period under study.
[2] We refer to DPF when we talk about development-policy financing regardless of whether the specific kind of DPF the World Bank administered was called SAL, DPL, or another credit line that is subsumed under DPF.
TTLs have a substantial role to play both in the preparation and implementation of projects (World Bank 2013a). In the preparation of a project, TTLs lead several missions in the recipient country. These missions include the identification mission, the pre-appraisal mission, and the appraisal mission. TTLs are also in charge of drafting all relevant documents while designing the project. Such documents are the project information documents, the project concept notes and project appraisal documents. Thereby, TTLs spearhead planning on procurement, financial management, project objectives, risk identification, and project instruments. After the initial planning phase, TTLs head the negotiation team of the World Bank in the final negotiations with the recipient. They prepare those documents that the executive board uses when approving projects.[3] During implementation, World Bank projects are in the hands of the recipient government and implementing agencies. Nevertheless, TTLs play a fundamental role. They lead the World Bank's implementation support to the recipient government and are in charge of producing implementation reviews that are updated at a minimum every six months. These reviews document the progress the project has made toward its objectives and are also used as the information basis for the task team when deciding where to focus efforts, whether to reduce the resources available in the project or to propose restructuring a project. Furthermore, TTLs are in charge of producing reviews that are conducted roughly at the mid-point of the project. These mid-term reviews can lead to changes in closing dates, alterations to the targets of the project as well as plans on restructuring. The TTL leads decisions on proposing a restructuring of a project to the executive board and prepares all relevant documents for meetings discussing restructuring. Additionally, the TTL is in charge of identifying governance and corruption issues in the project. TTLs review selected contracts, procurements and financial management to identify these issues.
In a limited number of major decisions, TTLs do not have a meaningful say. Formally, they are not authorized to decide on additional financing requests (decided by the country director) or final decisions to submit the project for approval to the executive board (decided by the regional Vice President) and subsequent approval (decided by the executive board). However, TTLs are involved in drafting all documents that are the basis for the decisions of these actors (World Bank 2013a). Due to their wide-ranging responsibility throughout the project cycle, they can be seen as the most crucial World Bank staff members for each project (Briggs 2019a; Bulman et al. 2017; Denizer et al. 2013).

Three explanations for the influence of staff
Having discussed TTLs' centrality for World Bank projects in general, we derive hypotheses regarding variations between different TTLs in their ability to foster recipient performance in World Bank projects. To do so, we need to make assumptions about the actions of individual staff members and about the actions of borrowers in a project. We assume that individuals show consistency in their actions and that there are variations between individuals working in the same organization. The assumption is based on a substantial body of literature in psychology and public administration research that focuses on individual learning, routines and decision-making by "street-level" bureaucrats (Eckhard 2020; Lipsky 1980; May and Winter 2009). Furthermore, the assumption builds on the evidence presented by Denizer et al. (2013) on World Bank TTLs. These authors show that TTL-fixed effects explain a sizeable part of the variation in project outcomes. We present similar evidence for recipient performance below. Our findings imply consistency in the decisions of individuals across contexts as well as variations between individual staff members. However, which variations matter when explaining recipient performance?
To ascertain relevant TTL characteristics, we build on the burgeoning literature on compliance with international agreements in International Relations (Raustiala and Slaughter 2002; Simmons 1998; Tallberg 2002). It has focused on a central problem all social systems confront: "the problem of social control - that is, how to get actors to comply with society's rules" (Hurd 1999, p. 379). This problem is particularly pronounced in global governance because IOs lack the extent of coercive capacity that national governments commonly possess. In the case of the World Bank, the problem of social control materializes in the necessity to ensure that recipient governments do not just "take the money and run" (Girod and Tobin 2016, p. 209). The Bank wants to ensure that recipients adhere to the aid agreement and perform to its satisfaction. Notably, we derive expectations from three central perspectives that make different assumptions on the reasons why borrowing governments might be lacking in recipient performance. These three perspectives focus on either the ability or the willingness of actors in the recipient country to adhere to the aid agreement. We identify staff characteristics that would conceivably allow staff to make positive contributions. The derived hypotheses focus on building either the capacity (supervision) or the will (leniency, country experience) of recipients to perform well in the view of the Bank. We discuss the three perspectives in turn.

Supervisory ability and recipient performance
The first perspective, commonly referred to as the management school of compliance research, explains recipient performance based on the cost-benefit calculations of rational recipients. Non-adherence to a commitment results from a lack of capacity or imprecise rules (Chayes and Chayes 1993). This perspective corresponds to what has been called a "cooperative model" of World Bank-recipient relations in the World Bank literature (Kilby 2001, p. 205). The World Bank and the recipient have common interests, and any failure to "perform" on the part of the recipient must be due to technical concerns. Indeed, a range of issues during the implementation of a project can undermine the ability of a recipient government and its implementing agencies to perform well. The literature on the effectiveness of development assistance has emphasized corruption and state fragility as central inhibitors of the success of development projects (Dollar and Levin 2005; Dollar and Svensson 2000; Honig 2019). If lower-level officials in the government or the implementing agencies work to enrich themselves through the project, satisfactory performance is substantially less likely. The recipient government will have agreed to a project that it plans to implement, but it will not have the capacity to induce its representatives to do so. Donors react to the level of corruption in recipient countries and try to minimize the use of country implementation systems in specific contexts (Dietrich 2013; Knack 2013; Winters 2014). Similarly, if a country is faced with substantial fragility, the project environment is much less predictable, which necessitates more decentralized decision-making in implementation (Honig 2019; Marchesi and Masi 2020).
We expect some TTLs to be better able to support recipients through better supervision. For example, former World Bank staff member Paul Cadario directly discusses differences between staff in that regard: "with the exception of some of the weaker loan officers, people knew the country, the managers were very hands-on, they rolled their sleeves up, they read things, they gave you feedback (…), they edited things if they didn't have the right direction on them" (World Bank 2013b, p. 7). Supervision entails monitoring of project implementation and technical assistance to the recipient (Kilby 2001). Better supervision allows for earlier identification of implementation issues and better ability to remedy these problems. The efforts staff exert in supervision can vary. Authors have convincingly demonstrated that country-level factors like government ideology or geopolitical alignment to the World Bank's major shareholders can explain why efforts vary across staff (Smets et al. 2013; Wane 2004). We argue that individual differences between staff members also explain why the World Bank will vary in its quality of supervision. There are differences between TTLs in the importance they ascribe to project effectiveness (Briggs 2019a) and also their competence (Denizer et al. 2013; Limodio 2021). These differences shape the ability of staff to build capacity in recipient countries and help willing governments to adhere to the World Bank's expectations. The first hypothesis hence reads:
H1: The better a given international bureaucrat is at supervising a project, the more likely the recipient is to comply.

Leniency and recipient performance
The second perspective, which is often referred to as the enforcement school in the compliance literature, also argues that recipient actors make decisions based on cost-benefit calculations but shifts the focus to willingness rather than capacity (Downs 1998; Fearon 1998). In this view, recipients follow IO demands if enforcement inflates the costs of not doing so. This perspective has been called the "adversarial model" in the World Bank literature (Kilby 2001, p. 203). The argument is that recipient governments could implement as little as possible to retain disbursement while lacking real commitment to the project. Here, lacking commitment at the higher levels of the recipient government translates to less effort in the implementing agencies. This is the perspective Girod and Tobin (2016) take in their study of recipient performance. They argue that governments that rely substantially on World Bank assistance for their budgets and public services will be more committed to adhering to the aid agreement. On the other hand, those governments that have alternative sources of revenue, like resource rents or FDI, will be much less interested in diligent implementation (Girod and Tobin 2016). A second variant of enforcement approaches is presented by Stone (2002) based on an analysis of the IMF's role in economic transitions of several Eastern European countries. He argues that member states that are strategically important to the organization and its major shareholders will be able to secure disbursements even in the absence of compliance with IMF conditionality. Indeed, evidence on the World Bank suggests that strategically important countries seem to be able to secure more disbursements (Andersen et al. 2006; Kilby 2013b). Recent evidence, on project design rather than disbursement, suggests that preferential treatment might not occur because of the lobbying of powerful member states but rather because staff either internalize shared preferences or subconsciously try to "please the principal" (Clark and Dolan 2021, p. 4). Whether powerful shareholders lobby staff directly or whether staff internalize those shareholders' preferences, differences in disbursement might translate into lacking effort on the part of the recipient government.
We argue that bureaucratic preferences also play a decisive role in enforcement throughout the project. World Bank bureaucrats have discretion in enforcement decisions because staff are central for decisions to release disbursements. However, TTLs are caught between two objectives. On the one hand, they are expected to contribute positively to the effectiveness of projects. On the other hand, TTLs' prestige and career advancement depend on the disbursement of projects and getting projects with high loan volumes approved (Briggs 2019a; Weaver 2008). As one report by the Independent Evaluation Group highlights, staff have the "perception that individual success depends more on obtaining new deals and ensuring timely disbursement than on quality implementation" (World Bank 2016, p. 28). TTLs differ in how they square these two, at times competing, objectives. Recent survey evidence presented by Briggs (2019a), including responses by 115 TTLs, indicates that around half of TTLs think that project ratings are not at all or only slightly important for their career success, while about 25% think they are extremely or very important. Therefore, there are likely variations between staff in how stringent they are in basing disbursement decisions on the implementation record of the project. Anecdotal evidence illustrates this argument. Former World Bank staffer Jan Walliser describes the work of another staff member in his interview with the oral history project: "We went out to a site that had been identified for a multi-donor project but where construction was only about to begin. At that point, the project had actually encountered quite a bit of trouble. And I will never forget how she stood on the sides of the river there, interrogating the Russian contractor who had not properly mobilized the equipment to start the works despite some of the prepayments that had been made, and questioning him trying to get some answers on how the project could get off the ground" (World Bank 2018, p. 40). Given the incentives in the Bank, not all staff members show this same dedication to basing disbursement decisions on the efforts exerted by recipients. If recipients discover that a TTL is less stringent in gathering information on lacking implementation and aims to ensure disbursements regardless, they will start lagging behind. The corresponding hypothesis is:
H2: The more stringent an international bureaucrat is in enforcement, the more likely the recipient is to comply.

Country experience, sympathetic interlocutors, and recipient performance
The third perspective focuses on the will of recipient actors as well but shifts the focus from cost-benefit calculations to the recipients' receptivity to World Bank advice. It is derived from ideational perspectives on compliance (Checkel 1997; Simmons 2013). In the literature on International Financial Institutions, these perspectives have been discussed under the label of sympathetic interlocutors (Arpac and Bird 2009; Bazbauers 2019; Heinzel et al. 2020; Woods 2006). According to Woods (2006, p. 10), sympathetic interlocutors are understood as actors that are "both willing and able to embrace the priorities preferred by the institutions." A lack of sympathetic interlocutors can impede the national implementation of IO policies (Broome and Seabrooke 2015). Woods (2006) argues that the International Financial Institutions need to identify existing sympathetic interlocutors inside the national bureaucracy or persuade interlocutors to increase their commitment to reform. These two means of cultivation are further specified by Bazbauers (2019). To identify sympathetic interlocutors, the World Bank staff can either locate them through their public support for programs or based on their relationships built through prior engagements. Alternatively, World Bank staff can seek to persuade interlocutors of the importance of the project. Persuasion can work through direct interaction or by mobilizing already existing interlocutors (Bazbauers 2019).
Country experience crucially influences the ability of IO staff to identify sympathetic interlocutors and persuade recipients to embrace the World Bank's priorities. First, identification of sympathetic interlocutors in a recipient country is eased if staff have prior experience in this country. Experienced staff likely have a better sense of the actors in the national bureaucracy and implementing agencies that share the objectives of a project. Their networks in the recipient country will also allow them to seek out sympathetic interlocutors and try to work with them on the project. That way, their country experience will equip them with more of the necessary tools for the identification of people that may help steer the domestic implementation process. Second, persuasion of opponents of projects is also eased when staff have country experience and, thus, can utilize their existing personal and professional relationships to foster support for the new project. This insight is built on findings in social psychology and organizational studies that indicate that more prolonged exposure to individuals increases trust, ceteris paribus (Bornstein 1989; Kwan et al. 2015). Several studies on development projects have pointed to the importance of trust and communication between national and international implementation staff. These authors argue that when recipients trust World Bank staff, they will be more open to dialogue and persuasion, and be more inclined to adhere to their obligations in a project (Bazbauers 2019; Ika et al. 2012; Lannon and Walsh 2020). Indeed, research on social trust shows that trust renders policy recommendations more credible and leads to actors being more receptive to external advice and criticism (Keefer and Knack 2008). These factors, in turn, render persuasion more likely. For example, former World Bank staffer Jan Walliser discusses his experience of working with country counterparts: "I developed very close relationships with the counterparts, and the sense of trust, especially with the Finance Minister (…). And they were largely built on, that they knew or that they got over time, the sense that what I was saying was to help them rather than to make a point" (World Bank 2018, p. 10). This is not to say that all relationships between World Bank staff and interlocutors will be positive. Government counterparts do not automatically listen to staff just because they have worked in the country before. However, staff with country experience even have an advantage when prior experiences were negative, and a prior World Bank project was met with resistance. In these cases, they will have a better sense of who might not be receptive to their advice than those who lack country experience. Therefore, staff with country experience have an advantage over staff without country experience when trying to both identify and persuade interlocutors. Based on these considerations, the third hypothesis is:
H3: The more country experience a given international bureaucrat has, the more likely the recipient is to comply.

Research design
To assess the three hypotheses, we built a database of World Bank projects. It combines novel data collected on the main staff members in charge of the implementation of projects with available data on projects from the widely used World Bank projects database, Independent Evaluation Group (IEG) evaluation data and country-level variables (World Bank 2020a). We discuss the dependent variable, staff-level independent variables and country-level independent variables in turn.

Dependent variable
When measuring recipient performance, we draw on project evaluations produced by the IEG between 1986 and 2020. The IEG is the standard data source for analyzing World Bank project performance (Bulman et al. 2017;Denizer et al. 2013;Dreher et al. 2013;Kilby and Michaelowa 2019;Winters 2019) and recipient performance (Girod and Tobin 2016). At the end of a project, the World Bank staff evaluate each project based on a predefined set of criteria. The IEG desk reviews the projects to ensure consistency. Projects are sometimes re-evaluated by IEG. When multiple ratings are available for the same project, we use the most recent rating. 4 Projects are rated through two kinds of reports: Implementation Completion Reports (ICR) and Project Performance Assessment Reports (PPAR). 5 PPARs make up around 30% of evaluations. 6 They include both extensive review of World Bank documentation and visits to the recipient country to verify ratings in detail. In about one in five evaluations, IEG adjusts the ratings because of disagreements with the decisions of initial evaluators (Malik and Stone 2018).
4 We use the most recent available rating for all variables drawing on IEG data throughout the article.
5 Twenty-one of the earliest projects in the database are rated through Project Completion Reports. Project Completion Reports were issued between 1982 and 1994. They were replaced by ICR in 1995.
6 Kilby and Michaelowa (2019) provide a detailed discussion of IEG's selection of projects for PPARs. They find that many of the patterns of selection are consistent with IEG's mission. The IEG is more likely to choose very positive evaluations and evaluations where the ICR quality is rated as very poor. Projects with larger loan volumes are selected more often and IEG selects investment projects less frequently than program loans. However, they also show that PPARs are more often in countries where working conditions are easier and that are popular tourist locations. Finally, they find evidence that IEG seems to consider institutional power when determining the release date of the reports and that a seat in the UNSC can have an impact on the PPAR rating. These findings show that PPARs might not be as independent in general but they should still be more independent from the TTL than the ICRs (and we provide evidence that staff variables are not significant predictors of rating revisions in Appendix A5).
We draw on IEG data that measure "the extent to which the borrower (including the government and implementing agency or agencies) ensured quality of preparation and implementation, and complied with covenants and agreements" (World Bank 2015b, p. 21). The IEG rates recipient performance based on two criteria: the performance of the borrowing government and the performance of the implementing agency in member states. Ratings are produced on a scale of 1-6 from highly unsatisfactory (1) to highly satisfactory (6). Figure 1 displays the distribution of the recipient performance evaluations. As is the case with all IEG evaluations, they skew rather positively. We perform some robustness checks to account for the overall positive ratings (see below).

Staff-level variables
To operationalize the three hypotheses on the influence of World Bank bureaucrats, we draw upon a novel database on World Bank TTLs in charge of World Bank projects. We collected data on TTLs from the World Bank website. We scraped the names of the leading individuals, in charge of projects at implementation, using the publicly available World Bank API. We have data for 10,000 of the roughly 18,000 projects the World Bank has run in its history. 7 The data present us with two challenges when calculating the indicators: first, increasing coverage and second, rotation of staff. The data set has excellent coverage after 2010 (more than 90%), and for half of the projects in the 1990s (for details, see Fig. A2). However, it is unlikely that staff-specific factors influence the likelihood that the World Bank gives details on a particular project on its website. Instead, reporting is likely influenced by developments in technology and increased transparency over time. The pattern of missingness is not substantively related to observed rating quality, outcome or recipient performance ratings or project amounts (Figs. A3-A4).
In addition, staff rotation might be an issue for constructing the variables. The World Bank website records the TTL at the end of the project rather than the one that started the project. TTLs tend to rotate every 3-5 years (Denizer et al. 2013;Kilby 2013b). This rotation pattern is based on the so-called 3-5-7 rule of the World Bank. The rule mandates remaining in an assignment for a minimum of three years. After five years, staff are encouraged to pursue reassignment to another vice-presidential unit. When seven years have elapsed, regional managers actively facilitate the rotation of staff (World Bank 2015a, p. 62). To contextualize our analysis, we randomly selected 200 projects and collected data on the TTLs in charge of appraisal and at the time of the ICR from the ICR documents. Table 1 displays patterns of rotation in these 200 projects. The vast majority of projects see a change of TTLs from appraisal to implementation. This applies to both DPLs and IPF. The average length of those projects that share TTLs at appraisal and implementation is around 500 days shorter than when TTLs change. Substantially fewer TTLs change between implementation and evaluation.
7 The staff variables we use are based on past behavior and experiences of TTLs. This approach is built on Denizer et al. (2013) and Bulman et al. (2017) who estimate "TTL quality" based on the past outcome ratings of projects. We believe that we can get a reasonable estimation of differences between TTLs by drawing on their past record, because authors working on public administration have long argued that staff develop routines throughout their career that make their actions more consistent (Lipsky 1980). Nevertheless, our approach has the limitation that we ascribe project ratings to the past behavior and experiences of TTLs that they might not be the sole reason for.
While we cannot account for the patterns of rotation directly, we try to rectify these issues in the construction of our supervision variable by only drawing upon the supervision component of the indicator and by controlling for project length in a robustness check.
One can reasonably disagree with a range of choices we made when constructing the indicators. To account for such disagreements, we present several alternatives to calculate our variables in the appendix (Tables A8-A10). We now discuss the three variables we use in the main body of the article. If more than one TTL is listed for one project, we take the average value of each of the variables for each TTL listed on a given project. We do so for all three variables. 8
First, we measured the past performance of a given bureaucrat's projects to operationalize H1 on staff supervision. In constructing this indicator, we assume that past behavior is a predictor of future behavior. IEG evaluates the degree to which a project was designed well (quality at entry) and the degree to which staff helped recipient governments to identify and rectify issues in implementation (quality of supervision). We draw on supervision alone because our data do not include the TTL that was in charge of design for most projects (see above). To calculate the supervision indicator, we take the average supervision score for all projects that a given TTL was in charge of before the project of interest was approved (ranges from 1 to 6).
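The construction of the supervision indicator can be illustrated with a minimal sketch. It is not the exact procedure (the equations are in the appendix); the record layout, the field names and the toy ratings are hypothetical. The sketch further assumes that a TTL's "past" projects are those approved before the project of interest, and that co-TTLs without any past record are dropped before averaging.

```python
from datetime import date
from statistics import mean

# Hypothetical project records; field names and values are illustrative only.
projects = [
    {"id": "P1", "ttls": ["A"], "approved": date(2001, 5, 1), "supervision": 5},
    {"id": "P2", "ttls": ["A"], "approved": date(2004, 3, 1), "supervision": 3},
    {"id": "P3", "ttls": ["A", "B"], "approved": date(2008, 7, 1), "supervision": 4},
    {"id": "P4", "ttls": ["B"], "approved": date(2006, 1, 1), "supervision": 6},
]

def ttl_past_supervision(ttl, before, records):
    """Mean supervision rating (1-6) of a TTL's projects approved before `before`.

    Returns None when the TTL has no earlier rated project (first project).
    """
    past = [
        r["supervision"]
        for r in records
        if ttl in r["ttls"] and r["approved"] < before and r["supervision"] is not None
    ]
    return mean(past) if past else None

def project_supervision_indicator(project, records):
    """Average the past-supervision scores of all TTLs listed on a project.

    Assumption: co-TTLs without a past record are ignored; if no listed TTL
    has a record, the project gets None and would drop out of the sample.
    """
    scores = [ttl_past_supervision(t, project["approved"], records) for t in project["ttls"]]
    scores = [s for s in scores if s is not None]
    return mean(scores) if scores else None

# P3 is run by co-TTLs A (past ratings 5 and 3) and B (past rating 6).
p3_score = project_supervision_indicator(projects[2], projects)
```

In this toy example, TTL A's past average is 4 and TTL B's is 6, so the project-level indicator for P3 is their mean; P1 has no prior record and is returned as missing.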
Second, to test H2, we constructed an indicator for the leniency of staff. The operationalization of TTL leniency draws on the observed leniency in TTL's past projects. The reasoning is that TTLs that were lenient in the past are more likely to be lenient in the future. We use a measure of disbursement share conditional on recipient performance in projects. The disbursement indicator is calculated by dividing total disbursement by total commitment of each project according to the World Bank (2020a). We follow Malik and Stone (2018) in truncating disbursement share data between 0 and 1. To build the leniency indicator, we divide the disbursement of projects (range from 0 to 1) by their recipient performance rating (range from 1 to 6).
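A minimal sketch of the project-level ratio described above follows. The function names and the example figures are hypothetical, and the sketch covers only the ratio for a single past project; how these ratios are aggregated over a TTL's past projects is specified in the appendix and is not reproduced here.

```python
def disbursement_share(disbursed, committed):
    """Total disbursement divided by total commitment, truncated to [0, 1]."""
    if committed <= 0:
        raise ValueError("commitment must be positive")
    return min(1.0, max(0.0, disbursed / committed))

def project_leniency(disbursed, committed, recipient_rating):
    """Disbursement share (0-1) divided by the recipient performance rating (1-6).

    Higher values mean that more of the commitment was disbursed relative to
    how well the recipient was rated.
    """
    if not 1 <= recipient_rating <= 6:
        raise ValueError("recipient performance rating must lie between 1 and 6")
    return disbursement_share(disbursed, committed) / recipient_rating

# Over-disbursed project (share truncated to 1) with a poor rating of 2.
lenient = project_leniency(disbursed=120.0, committed=100.0, recipient_rating=2)
# Half-disbursed project with a good rating of 5.
strict = project_leniency(disbursed=50.0, committed=100.0, recipient_rating=5)
```

The first project yields a higher value than the second: full disbursement despite a low rating reads as lenient, while partial disbursement despite a high rating reads as stringent.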
Third, to operationalize H3 focusing on the trust of interlocutors, we measure the degree to which someone is embedded in local networks. We argue that TTLs that have worked previously in a given country and built substantial networks are better able to cultivate sympathetic interlocutors. We use the number of projects someone has run in a given country as an approximation of the integration into country networks. To do so, we draw on all projects listed in the World Bank projects database for which we have TTL data (10,196 projects). 9 We then take the natural logarithm (+1) to account for the decreasing marginal returns of experience (Yelle 1979). 10 The equations used to calculate all three staff variables can be found in the appendix (pages 2-4). One limitation of this approach is that the experience of TTLs that are newly hired is more fully captured than that of TTLs who have been working for the World Bank much longer. 11
8 These co-TTLs are teams of TTLs that work on one project together. Co-TTLs occur in the larger database on TTLs roughly 19% of the time. However, two or more TTLs work only on roughly 3% of the projects in the database. This is for two reasons. First, co-TTLs are much more likely for projects focusing on multiple countries or whole regions. These projects are excluded from the analysis here, to not artificially create country-level variables from projects spanning multiple countries. Second, co-TTLs have become more numerous over time. Since many of the most recent projects have not been evaluated (see Figure A2), these projects are not in the database. When multiple TTLs are staffed, we average staff variables for one project. We further include a dummy variable controlling for the presence of co-TTLs in all models.
9 They include IDA, IBRD and World Bank trust fund projects.
10 We add +1 to account for projects where experience is 0.
11 The earlier years are more impacted than the later years by the truncated records of TTL appointments. When looking at the whole TTL database, the mean number of projects the TTL has supervised in a given country is 0.19 in the 1980s (SD 0.588), 0.32 in the 1990s (SD 0.76), 0.43 in the 2000s (SD 0.93) and 0.91 in the 2010s (SD 1.58). While only circa 13% are indicated as having worked in the country before approval of a given project for projects approved in the 1980s, roughly 46% of TTLs have observed country experience in the 2010s.
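The country experience measure can be sketched as follows. The records and field names are hypothetical, and the sketch assumes that prior experience counts projects in the same country approved before the project of interest, in line with footnote 11.

```python
import math
from datetime import date

# Hypothetical TTL-project records; field names are illustrative assumptions.
records = [
    {"ttl": "A", "country": "KEN", "approved": date(2003, 1, 1)},
    {"ttl": "A", "country": "KEN", "approved": date(2007, 1, 1)},
    {"ttl": "A", "country": "UGA", "approved": date(2005, 1, 1)},
]

def prior_country_projects(ttl, country, before, rows):
    """Count the projects a TTL ran in `country` that were approved before `before`."""
    return sum(
        1
        for r in rows
        if r["ttl"] == ttl and r["country"] == country and r["approved"] < before
    )

def country_experience(ttl, country, before, rows):
    """ln(prior projects + 1): zero for newcomers, with decreasing marginal returns."""
    return math.log(prior_country_projects(ttl, country, before, rows) + 1)

first_project = country_experience("A", "KEN", date(2003, 1, 1), records)
third_project = country_experience("A", "KEN", date(2010, 1, 1), records)
```

Adding one before taking the logarithm keeps TTLs without prior country experience in the sample with a value of zero, and each additional project adds less to the indicator than the previous one.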

Control variables
We use control variables at the country and project level to minimize the possibility that our findings are driven by unobserved heterogeneity. When choosing control variables, we aim to account for variations the three perspectives on adherence to IO demands would deem crucial. All variables are measured in the year the project was approved unless explicitly discussed otherwise.
We employ a range of country-level variables to account for the different channels affecting recipient performance at the country level. First, we control for variations in the strategic importance of recipient countries. To identify the necessary control variables, we can draw on longstanding debates on the informal influence of major shareholders at the World Bank (Stone 2013). Qualitative case studies (Babb 2009;Wade 2002;Woods 2006), as well as quantitative analyses (Andersen et al. 2006;Clark and Dolan 2021;Vreeland 2019), have demonstrated that US allies often get a better deal from the World Bank. In line with recent research on the World Bank (Clark and Dolan 2021), we use the distance between ideal points calculated based on important UNGA votes to measure such geopolitical alignment (Bailey et al. 2017). Second, credibility in the application of development finance agreements hinges on the power of the recipient country (Börzel et al. 2010;Stone 2002). It is more difficult to pressure powerful recipients like China compared to small island states like Kiribati. In line with research on rule adherence, we include a country's GDP, from the World Bank World Development Indicators (World Bank 2020b), as a predictor to account for these variations (Börzel et al. 2010). Furthermore, GDP is also crucial because it can be seen as a measure of state capacity. Third, Girod and Tobin (2016) have argued that it is more challenging to enforce development finance agreements when countries have alternative sources of income, like FDI. Therefore, we include data on FDI inflows from the World Development Indicators (World Bank 2020b). Fourth, we include control of corruption as an indicator of capacity constraints. The World Bank has often stated that implementation of its projects is hindered by governance issues in recipient countries (Aguilar et al. 2010;Wolfensohn 1996) and the argument has been prominently featured in the literature on the effectiveness of development assistance (Dollar and Levin 2005;Knack 2013;Winters 2014). We use the Varieties of Democracy Project's (VDem) public corruption indicator to measure country-level corruption (Coppedge et al. 2019). Finally, we control for regime type in the recipient country to account for varying size of the electorate of policymakers in recipient countries that can shape the incentives of governments to adjust policy (Bueno De Mesquita and Smith 2009). To do so, we employ VDem's polyarchy index.
In addition to these country-level variables, we control for differences between projects. First, we include the (log) project amount to take differences between the relevance of projects into consideration. Some projects include higher monetary commitments than others. If these projects are not implemented diligently and the World Bank does not disburse, the recipient country would lose more money than in small projects (Girod and Tobin 2016;Smets et al. 2013). Data on project amounts are taken from the IEG evaluation database (World Bank 2020a). Second, we try to account for heterogeneity in administrative procedures of the World Bank by including a dummy indicating whether projects were DPF. To control for variation between recipients and projects that the discussed variables cannot account for, we further employ country-approval-year and sector-approval-year fixed effects in some specifications. Third, we operationalize arguments regarding the fit of international demands with government preferences by creating a dummy on whether the government changed during a project. The rationale is that governments are more likely to take ownership of, and thus be committed to, the successful implementation of projects they initiated (as compared to projects a previous government initiated). Data on government change come from the IADB's Database of Political Institutions (Scartascini et al. 2017). Finally, we control for projects supervised by multiple TTLs by employing a dummy variable that indicates whether the project was supervised by co-TTLs.

Analysis
We proceed in three steps to assess our argument that staff can influence recipient performance. First, we discuss the relative importance of country-level and staff-level variation in explaining the performance of recipients in the eyes of the World Bank. We do so to support our general argument that staff have a substantial impact on recipient performance. Second, we analyze the relative explanatory power of the three theory-driven explanations at the staff level and assess their magnitude in comparison to commonly discussed country-level variables. Third, we compare different loan types and actors at the national level that are involved in World Bank projects.

The relative importance of recipient country and World Bank staff in recipient performance
The basic premise of this article is that the literature has not paid sufficient attention to the role of staff in fostering adherence of recipients to development finance agreements. To probe that argument, we first discuss the relative importance of staff-level compared to country-level explanations.
We are not the first to ask the question of how much staff matter in World Bank projects. Indeed, Denizer et al. (2013) present evidence that World Bank staff can positively influence the outcome of World Bank projects. They show that the influence of World Bank staff is about as pronounced as all country factors combined. While their findings are highly relevant, it is not clear whether the same applies to recipient performance. Therefore, we replicate their analysis regarding the comparison of country and staff-level factors in Table 2. To do so, we limit the analysis to those staff members who were responsible for at least two projects in at least two countries in our database. We then use a simple two-way ANOVA to separate variation in recipient performance into variation due to staff and country fixed effects (Denizer et al. 2013). The model accounts for 48% of the variation in recipient performance ratings. Both the country and the staff fixed effects are jointly significantly different from zero. The mean sum of squares allows for a comparison of the relative explanatory power of country and staff factors. It accounts for the different number of fixed effects of both factors, which means that the sheer higher number of TTL fixed effects does not drive the results. The mean sum of squares of staff factors is on a similar order of magnitude as that of country factors. Therefore, it can be said that staff seem to have a substantial impact on recipient performance. The strength of the finding increases the importance of better understanding what drives the differences in staff's ability to contribute positively to recipient performance.
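The logic of comparing mean sums of squares can be illustrated with a minimal sketch. It assumes a balanced toy layout in which every TTL has exactly one rated project in every country, which permits the textbook additive two-way decomposition. The actual data are unbalanced and would require a regression-based ANOVA; the ratings and names below are hypothetical.

```python
from statistics import mean

# Hypothetical balanced layout: ratings[ttl][country] is one 1-6 rating per cell.
ratings = {
    "ttl_a": {"X": 5, "Y": 4, "Z": 5},
    "ttl_b": {"X": 3, "Y": 2, "Z": 3},
    "ttl_c": {"X": 4, "Y": 3, "Z": 5},
}

def two_way_mean_squares(table):
    """Additive two-way ANOVA (no interaction) for a balanced table."""
    staff = list(table)
    countries = list(next(iter(table.values())))
    grand = mean(table[s][c] for s in staff for c in countries)
    staff_means = [mean(table[s][c] for c in countries) for s in staff]
    country_means = [mean(table[s][c] for s in staff) for c in countries]
    # Sums of squares attributable to each factor.
    ss_staff = len(countries) * sum((m - grand) ** 2 for m in staff_means)
    ss_country = len(staff) * sum((m - grand) ** 2 for m in country_means)
    ss_total = sum((table[s][c] - grand) ** 2 for s in staff for c in countries)
    return {
        # Dividing by degrees of freedom adjusts for the number of levels per factor.
        "ms_staff": ss_staff / (len(staff) - 1),
        "ms_country": ss_country / (len(countries) - 1),
        "r_squared": (ss_staff + ss_country) / ss_total,
    }

result = two_way_mean_squares(ratings)
```

Because each sum of squares is divided by its degrees of freedom, a factor with many levels (such as TTLs) does not look more important merely because it has more fixed effects.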

Assessing the three explanations for staff influence
We assess the three hypotheses focusing on the contribution of staff to recipient performance in turn. 12 All models used in the main body of the article are OLS regressions 13 with standard errors clustered at the evaluation-year. 14 Our estimation approach allows for correlated errors due to similar IEG evaluation practices in the same financial year (Denizer et al. 2013). Descriptive statistics can be found in the appendix (Table A2). In Table 3, we present the results for the argument that staff's supervisory capacity influences recipient performance (H1). In the first three presented regressions, we alter the model specification by varying fixed effects to account for different sources of unobserved heterogeneity. Model 1 includes project-level and country-level controls as well as country, sector and approval-year fixed effects. While we account for a range of possible confounders at the country level directly, we might still face omitted variable bias in particular country-approval-years or sector-approval-years. Therefore, we include the respective fixed effects in Models 2 and 3. Finally, fostering recipient performance includes both higher-level political engagement with the government as well as lower-level administrative coordination with implementing agencies. We assess whether the explanatory potential of staff supervision differs between government performance and implementing agency performance in Models 4 and 5.
The findings presented in Table 3 indicate that the supervisory ability of TTLs can affect recipient performance. TTL supervision is a positive and statistically significant predictor of performance ratings in four of the five models. Even when accounting for country-approval-year (Model 2) and sector-approval-year fixed effects (Model 3), the coefficient is significant (p < 0.1). However, when disaggregating by government and implementing agency performance, we find that supervision seems to work more consistently through agency performance. While the coefficient is on a similar order of magnitude for both groups of recipient country actors, it fails to attain statistical significance for government performance (Model 4).
12 The sample size of models differs throughout because the staff supervision and leniency indicators draw upon the past project records of TTLs. When TTLs run their first project, such a record is not available and they are omitted from the sample. Furthermore, while we have data on TTLs' records for 10,196 projects, recipient performance evaluations are only available for 3,562 of these projects.
13 To increase computational efficiency we use the algorithm developed by Guimaraes and Portugal (2010) to fit models with high-dimensional fixed effects.
14 When discussing evaluation-years or approval-years throughout, we refer to financial years rather than calendar years.
In a second step, we evaluate the association between the leniency shown in past projects of a given TTL and recipient performance. The results are displayed in Table 4. The models mimic the specification choices explained above. Model 6 includes country, sector and approval-year fixed effects. Models 7 and 8 are estimated using country-approval-year and sector-approval-year fixed effects, respectively. Models 9 and 10 focus on government and agency performance. While there is some evidence that staff leniency could matter, the results are not robust across the models presented. While the coefficients for leniency point in the expected direction, the coefficient is significant at conventional thresholds (p < 0.1 or p < 0.05) only in Models 6 and 8. Therefore, the evidence for the importance of leniency is relatively weak. This could be for two reasons. First, leniency might not matter substantially for recipient performance. Research on the World Bank has long pointed out that staff have motivations to disburse loans due to a culture that rewards high-disbursing projects (Wapenhans 1992;Weaver 2008). Second, our measure of leniency could be imprecise. Past disbursement decisions reflect not just the individual preferences of TTLs and recipient performance but could also be influenced by other factors, like interference by powerful member states (Kilby 2013b).
Note: Standard errors in parentheses; errors clustered at the evaluation-financial-year; project-level controls: DPF dummy, project amount (log), government change throughout the project and co-TTL dummy; country-level controls: corruption, ideal point (based on votes the US State Department designates important) difference between recipient and US, GDP (log), FDI and democracy; * p < 0.1, ** p < 0.05, *** p < 0.01
The third hypothesis is evaluated through the models presented in Table 5. Again, we display the same five models, with varying fixed effects and focusing on recipient government and implementing agency performance, respectively. We present substantial evidence indicating the importance of TTLs' country experience for facilitating recipient performance. The coefficient is either statistically significant or marginally significant at conventional thresholds in all five models.
So far, we focused on identifying whether staff-level factors shape recipient performance. Now, we aim to understand better whether the identified associations are meaningful. One way of gauging the relative importance of staff factors is to compare their extent with country-level explanations. We compare the two variables with relevant variables on the country-level in Table 6. To do so, we estimate two sets of regressions for each independent variable. The first compares country and staff variables excluding country-fixed effects to contextualize the overall magnitude of the variables (Models 16, 18 and 20). The second employs country-fixed effects to focus on within-country differences (Models 17, 19 and 21). Overall, the models show that the coefficients of staff variables are of similar relevance as important country-level variables discussed in the World Bank literature. The coefficient for supervision is only slightly smaller than coefficients for corruption in Model 16, and it is about half as large as UNGA voting similarity with the United States in Model 17. The coefficient for leniency is similar to GDP (log) in strength in Model 18. However, leniency is substantially smaller than all country-level variables when employing country-fixed effects in Model 19. The same is apparent for country experience, where the coefficient is moderate compared to the main country-variables of interest.
Note: Standard errors in parentheses; errors clustered at the evaluation-financial-year; project-level controls: DPF dummy, project amount (log), government change throughout the project and co-TTL dummy; country-level controls: corruption, ideal point (based on votes the US State Department designates important) difference between recipient and US, GDP (log), FDI and democracy; * p < 0.1, ** p < 0.05, *** p < 0.01
Overall, the staff-level variables are similar in magnitude to FDI inflows, the central explanatory variable in extant analyses of recipient performance (Girod and Tobin 2016). Therefore, we can say that the three staff-level variables show a sizeable association with recipient performance, although not a stronger one than the central country-level variables discussed in the World Bank literature, such as UNGA voting and corruption in recipient countries.
Note: Standard errors in parentheses; errors clustered at the evaluation-financial-year; project-level controls: DPF dummy, project amount (log), government change throughout the project and co-TTL dummy; * p < 0.1, ** p < 0.05, *** p < 0.01
The final part of our analysis focuses on the differences between the two major types of loans the World Bank administers. As briefly discussed above, there are central variations in the administrative procedure of different World Bank projects. To understand the scope of our arguments regarding the ability of staff to foster recipient performance, we now focus on some of these differences. We highlighted that policy-based lending, DPF, necessitates a substantially different set of actions than investment loans, IPF. Since abolishing Structural Adjustment Loans (SALs) in 2004, the World Bank has mainly relied on DPLs. DPLs rely considerably more on so-called prior action conditions that need to be fulfilled before the project starts. After compliance with these conditions, DPLs are paid out as budget support. IPFs, on the other hand, focus on specific projects that are implemented over a more extended period of time. The World Bank expects recipients to adhere to a set of behaviors specified in the aid agreement and to perform satisfactorily. While we controlled for these differences through a dummy variable in the presented models, we also want to account for them more explicitly. In Table 7, we display results for both kinds of World Bank projects: DPFs 15 and IPFs.
Models 19-24 replicate Models 1, 6 and 10 but focus on sub-samples in the loan type of projects. We choose the less demanding specification for these models because we split the sample and employing country-approval-year or sector-approval-year fixed effects leads to the exclusion of a substantial number of projects that do not vary within these groups. Table 7 shows stark differences between loan types. In the case of DPF, staff variables fail to attain statistical significance at conventional thresholds. Two interpretations appear reasonable. First, it could be that the longer time frame and the greater need for supervision make IPFs more prone to the influence of staff than DPF. Satisfactory DPF performance, in the eyes of the World Bank, includes implementing a set of prior action conditions before the project even starts. Therefore, there is much less scope for staff influence in the implementation of these projects. This finding is supported by the much larger variance the fixed effects explain in the DPF than in the IPF models. Structural factors in the design stage seem to matter more for DPF than for IPF. The second explanation is that the substantially smaller number of DPF in the sample leads to less precision when estimating the association. However, even when running bivariate OLS regressions without any controls and a substantially larger sample, the coefficients remain statistically insignificant and small despite the exclusion of country-approval-year fixed effects. This gives some indication that the different findings for IPF and DPF are substantive rather than due to sample size.
15 Of the 341 to 551 projects in the DPF sub-samples, 50 to 100 are SALs (depending on the availability of staff data). Since they do not rely as much on prior action conditionality as the newer DPFs, they would warrant additional analysis. The findings focusing only on SALs are substantially similar to the findings regarding DPF in general. We refrain from reporting these results at length, because the small sample size potentially jeopardizes robust inference.
Note: Standard errors in parentheses; errors clustered at the evaluation-financial-year; project-level controls: DPF dummy, project amount (log), government change throughout the project and co-TTL dummy; country-level controls: corruption, ideal point (based on votes the US State Department designates important) difference between recipient and US, GDP (log), FDI and democracy; * p < 0.1, ** p < 0.05, *** p < 0.01

Robustness checks
The robustness checks aim to rectify three possible concerns with our results. These concerns are biased ratings, selection into projects, and omitted variable bias. We discuss each of them in turn.

Biased ratings
First, we address the potential that the staff might be able to bias ratings. Several authors have discussed concerns with bias in the IEG ratings in particular (Kilby and Michaelowa 2019;Malik and Stone 2018) and IO evaluations in general (Eckhard and Jankauskas 2019). Others have argued that it is relatively unlikely that World Bank staff would bias the ratings because there are few incentives to do so. Staff are primarily evaluated based on running larger projects and maximizing disbursement rather than on producing well-evaluated projects (Dreher et al. 2013;Wapenhans 1992;Weaver 2008). Nevertheless, we try to minimize the possibility that bias in ratings rather than actual differences in compliance drive the correlations we discussed.
There are two potential biases: halo effect and gamed ratings. First, some authors working with IEG data have pointed to a "halo effect" where evaluators mostly care about the outcome rating and then determine the recipient and bank performance ratings accordingly (Kilby 2013a). This would imply that recipient performance ratings do not include much new information beside what is already included in outcome ratings. Therefore, we estimate models controlling for IEG outcome ratings (Appendix Table A3). We find a significant association for leniency and country experience, even when holding outcome ratings constant. The coefficient of supervision decreases substantially and fails to attain statistical significance at any conventional threshold. The robustness check implies that supervision does not explain differences between outcome and recipient performance ratings. In light of the findings of Denizer et al. (2013) and Bulman et al. (2017), we are cautiously optimistic that this is due to the fact that "TTL quality" is also a key predictor of outcome ratings. In other words, while country experience and leniency might specifically work through influencing recipients, supervisory capacity seems to apply to other aspects of the project too.
Second, to address the possibility that staff can game ratings, we utilize the different ratings produced by IEG. It is much more difficult to bias the PPAR ratings than the ICR ratings because PPARs are done independently of the project team. Therefore, we re-estimate the models using only these ratings (Table A6). We lose a large number of observations in our sample. The directions of the coefficients are similar to the ones in the main models. Supervision and leniency are significant when excluding country fixed effects. When employing country fixed effects, none of the three variables stays significant. When we analyze the same projects' ICRs, the coefficients are similarly insignificant for experience and supervision. Therefore, the finding for those two variables is likely due to sample restrictions, rather than gamed ratings. In addition, we utilize the fact that IEG occasionally changes the rating outcome from ICRs to PPARs (Table A5). If ratings were gamed more by the staff members we focus on, a correlation between staff-level variables and negative rating changes should be observable. However, we do not find a statistically significant correlation between any of the three staff variables and rating changes. Furthermore, IEG rates the quality of the ICRs. If gamed ratings are related to the variations in staff we observe, then that could imply a lower quality of ICRs for those staff members. Therefore, we regress the ICR quality rating on the staff variables (Table A4). The approach we take is valid if one believes that IEG is able to detect gamed ratings at least in some cases. Furthermore, the robustness checks we use regarding the "halo effect" also have implications for the gamed-ratings argument. If staff gamed ratings, they would surely impact the outcome rating. There is little reason to bias the recipient performance rating, while not doing so for the outcome rating.
The fact that we get robust results for including the outcome ratings further increases our confidence that the results are not driven by gamed ratings.
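The rating-change check described above can be illustrated with a minimal sketch. It assumes that ICR and PPAR ratings are recorded as integers on the same six-point scale, with higher values indicating better performance. The function names and the example projects are hypothetical and are not taken from the article or its replication materials; how the article actually codes rating changes is not specified here.

```python
def rating_change(icr_rating: int, ppar_rating: int) -> int:
    """Return the PPAR rating minus the ICR rating for one project.

    Assumption: both ratings use the same 1-6 scale, higher is better.
    A negative value means the later PPAR rating is lower than the ICR rating.
    """
    for value in (icr_rating, ppar_rating):
        if not 1 <= value <= 6:
            raise ValueError(f"rating must be between 1 and 6, got {value}")
    return ppar_rating - icr_rating


def is_downgrade(icr_rating: int, ppar_rating: int) -> bool:
    """Flag projects whose rating was lowered between ICR and PPAR."""
    return rating_change(icr_rating, ppar_rating) < 0


# Hypothetical projects: (project id, ICR rating, PPAR rating).
projects = [("P1", 5, 4), ("P2", 4, 4), ("P3", 3, 5)]

# Collect the ids of projects with a negative rating change.
downgraded = [pid for pid, icr, ppar in projects if is_downgrade(icr, ppar)]
```

An indicator of this kind could then be related to the staff variables: if gaming were concentrated among particular staff, downgrades should correlate with those variables.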

Selection into projects
Another possible concern is that staff might be able to pick specific projects that are more promising. Recent research indicates that, if anything, high-performing staff are allocated to more difficult contexts (Limodio 2021). Nevertheless, this possibility warrants careful examination because it would introduce a selection bias. In that case, higher compliance rates would reflect less ambitious projects rather than genuinely higher compliance. We try to account for the ambition of the project. Malik and Stone (2018) code the number of objectives each project has. While this measure does not cover substantial differences in the ambition of individual objectives, we assume that projects with more objectives are also more ambitious. This assumption could be questioned. Therefore, we employ a further indicator: the overall project costs according to the World Bank project database. None of our staff variables predicts the number of project objectives. However, we do find that staff with country experience tend to run projects with lower overall costs (Table A8).

Omitted variables and alternative measurements
To address concerns with the calculation of the staff variables, we implement several alternative ways to calculate them in the appendix. First, we normalize the variables by active projects and use alternate data sources to construct the supervision indicators (Table A9). Second, we test for the robustness of leniency to using alternative indicators. The checks concentrate on normalization by active projects and the possibility that false equivalency was introduced through using recipient performance ratings as a denominator (Table A10). Third, we created several other versions of the country experience variable. They are used to account for criticisms regarding the aggregation of experience when co-TTLs are in charge of the project and doubts regarding the way in which experience increases networks (Table A11).
Additionally, we employ a range of different specification choices. We re-estimate the models using non-linear models. Those include a conditional logit with a binary dependent variable that codes values 1-3 as unsatisfactory and 4-6 as satisfactory. Furthermore, we use an alternative dependent variable that codes only 5 and 6 as satisfactory to account for the overall very positive ratings (Table A12). In addition, we estimate an ordered logit because the original ratings are given on an ordinal scale (Table A13). Beyond these alternative model choices, we alter the original specification of the models in several ways. Modifications include varying the fixed effects used in the models to account for relevant years beyond the approval year. We use effectiveness-year fixed effects and evaluation-year fixed effects instead of approval-year fixed effects (Table A16). Additionally, we cluster standard errors at the country level instead of the evaluation year (Table A15).
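The two binary codings of the dependent variable described above can be sketched as follows. This is a minimal illustration, assuming ratings are stored as integers from 1 to 6; the function name `is_satisfactory` and the example ratings are hypothetical and not taken from the article's replication materials.

```python
def is_satisfactory(rating: int, threshold: int = 4) -> int:
    """Recode a six-point rating into a binary satisfactory indicator.

    threshold=4 gives the first coding (1-3 unsatisfactory, 4-6 satisfactory).
    threshold=5 gives the stricter coding (only 5 and 6 satisfactory).
    """
    if not 1 <= rating <= 6:
        raise ValueError(f"rating must be between 1 and 6, got {rating}")
    return int(rating >= threshold)


# Illustrative ratings only (not data from the article).
ratings = [1, 3, 4, 5, 6]

# First coding: values 4-6 count as satisfactory.
main_coding = [is_satisfactory(r) for r in ratings]

# Stricter coding: only values 5 and 6 count as satisfactory.
strict_coding = [is_satisfactory(r, threshold=5) for r in ratings]
```

The stricter threshold moves the rating of 4 into the unsatisfactory category, which is what makes the second coding more demanding when ratings cluster at the upper end of the scale.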
Further auxiliary control variables are employed in robustness checks. There is a trade-off in controlling for project-level factors because staff influence most project-level variables. It could be these very factors that staff can use to increase recipient performance. We do not want to bias our results by accounting for project-level factors that staff use to achieve a higher recipient performance. This is the reason we are cautious in including project-level controls in the main models of this article. Nevertheless, we control for the duration of the project, sub-sector fixed effects, a count of the number of co-TTLs, and whether there was an election during the project (Table A14). Finally, we estimate models that include the different TTL variables together. While they are theoretically distinct, there are empirical overlaps between the different TTL competencies. Therefore, we hold the respective TTL variables constant and estimate their conditional association with recipient performance (Table A17).
The robustness checks support the conclusions drawn in the main body of the article. Supervision is statistically significant or marginally significant in most alternative specifications used. However, it fails to attain statistical significance when controlling for IEG outcome ratings (Model 28), in some of the models focusing on the sub-sample of projects where PPARs are available (Models 41 and 42), as well as when incorporating some additional project-level controls (Models 80 and 82). The evidence for leniency is much less stable. Leniency is insignificant in nearly all models employing country-approval-year fixed effects, when using any alternative indicator (Models 65-67), and when clustering standard errors at the country level. Experience is statistically significant or marginally significant in all models used throughout the appendix, except when using the smaller sample where PPARs are available (Models 46-48). Finally, when including all three TTL variables in Table A16, we find that supervision and leniency fail to attain significant results. This is likely because, while leniency and supervision are theoretically distinct, they overlap. TTLs that supervise more diligently might also base disbursements more on recipient performance (see appendix Table A17). Nevertheless, when holding the two other variables constant, country experience consistently attains statistically significant results (Models 99-101). Overall, we find considerable support that supervision and experience shape recipient performance in World Bank projects, both throughout the main specifications and the robustness checks. Furthermore, we find some indications that leniency might play a role.

Conclusion
The article started from the observation that there is substantial variation left unexplored in the empirical literature on recipient performance in World Bank projects. We built on research that has shown that these variations in IO service provision can matter (Smets et al. 2013) but shifted the focus to variations between international bureaucrats in their ability to shape recipient performance positively.
The findings of this article indicate that variation between the international bureaucrats tasked with supervision is highly relevant. In the case of the World Bank, variations between staff appear to shape recipient performance nearly as strongly as variations between countries. Staff seem to be able to influence recipient performance in a wide variety of country contexts. We explored variations between staff members by extending three perspectives on reasons for member states' adherence to IO demands to this new group of actors. Evidence from a battery of statistical tests implies three major takeaways. First, high-performing staff members can foster recipient performance. Second, staff members who have better country knowledge and networks can positively contribute to recipient performance. Third, there is some evidence that leniency of staff may play a role, but it is much less robust.
One important nuance in the results is that the importance of the different staff variables varies by administrative arrangement (IPF or DPF) and by the recipient actor that the World Bank works with (government or implementing agency). We show that staff seem to matter in shaping recipient performance for IPF, which made up around US$ 12 billion or 52% of IBRD and IDA commitments in 2020. However, the results also illustrate that the same cannot be said about DPF (US$ 8.3 billion or 34% of World Bank operations in 2020).16 Staff do not seem to be able to make a difference in persuading or aiding recipients to adhere to the prior action conditionality included in DPF, which often requires considerable political will. This finding could imply that World Bank staff make a difference in cases where they interact with administrative counterparts but not in cases where high-level politics shapes recipient performance. More research could aim at further unpacking these differences.

16 The amounts are calculated on the commitments made by IDA and IBRD in 2020 according to the World Bank project and operations database as of November 2020. The remaining 14% were program for results. These loans have substantially increased in recent years but there are not enough evaluated loans to analyse them in this article.
The results presented in this article contribute to two debates in IO research. First, although the empirical section of the article focused on the case of World Bank assistance agreements, we think that our findings are potentially applicable to a larger universe of aid agencies. Recent research has found that findings generated on the case of the World Bank seem to apply to a larger number of aid agencies, both bilateral and multilateral (Briggs 2020). Furthermore, analysis of the impact of TTLs on projects of the Asian Development Bank yields results similar to those from the World Bank (Bulman et al. 2017). More research is needed to unpack how variations between international bureaucrats shape the implementation of development projects.
Second, our findings reinforce calls for extending debates on adherence to and compliance with international agreements beyond state-centric accounts to include the variety of actors that are crucial for compliance processes (Dai 2005; Mitchell and Hensel 2007). In their adherence to World Bank assistance agreements, recipient governments and their implementing agencies seem to be influenced by international bureaucrats. This implication seems all the more important as the positive contributions to compliance seem to occur irrespective of the country context. Therefore, the compliance literature could benefit from engaging more with debates on the varying influence of international bureaucrats (Bayerlein et al. 2020; Chwieroth 2013; Eckhard and Ege 2016).
Finally, the findings imply that the World Bank and other development agencies could benefit from re-thinking their organizational incentive structure for staff. It seems to matter who is supervising the project. More targeted staffing policies could, therefore, help to foster recipient performance. Better performing staff and staff with more country experience seem to be able to shape recipient performance. Recent research shows that World Bank staff think that their career advances are primarily dependent on moving money (Briggs 2019a). The findings of this article reinforce calls for changing staff incentives to reward skills and networks, rather than getting large projects approved and disbursed. Reforming the staff incentive structure to reward those who contribute to the organizational mandate could prove beneficial for the World Bank.