Introduction

Background

The challenge of improving research translation or implementation

Translation of scientific knowledge to routine, evidence-based practice in healthcare settings ensures optimal care and improved outcomes for patients [1, 2]. Despite this, the translation of research knowledge to evidence-based practice is often slow or poor [3,4,5,6]. A foundational study by McGlynn [7, 8] found that during a two-year period between 1998 and 2000, patients in the United States received 55% of evidence-based care, with great variance among medical conditions in the rate of evidence-based care received. Furthermore, a 2005 systematic review by Schuster et al. [9] found that 30–40% of patients were missing out on treatment that had been proven to be effective, while 20–25% of patients were receiving treatments that they did not need or that could cause them harm. A more recent Australian study by Runciman et al. [10] in 2012, with a sample of 1154 participants, found that participants received appropriate care at 57% of healthcare encounters, again varying across medical conditions (from 32% to 86%). McGlynn [8] suggests that despite attempts to address these deficits in evidence-based care, there have been no large-scale studies in the United States measuring the provision of evidence-based care since 2003, and that although smaller studies indicate improvements in some areas, there has been little change in healthcare overall. This failure to translate knowledge to evidence-based practice can result in poor outcomes for patients, including sub-optimal treatment, exposure to unnecessary or harmful treatment, poorer quality of life, and loss of productivity [2, 6]. For healthcare systems, this failure can result in ineffective organisations and unnecessary expenditure [2, 6].

In healthcare, evidence-based practice refers to the translation or implementation of clinical research and knowledge into healthcare practice [6]. The two key steps toward evidence-based practice are: first, the translation of basic scientific knowledge to clinical practice, and second, the implementation of evidence-based practices that have been found to be effective in the local setting into routine healthcare and policy [6, 11]. Barriers to successful implementation can be individual, structural, or rooted in organisational culture [6, 12], and include commitment from management, access to research, capacity issues, financial disincentives, inadequate skills within an organisation, and a lack of requisite facilities or equipment, staffing, peer morale and commitment, or leadership [6, 12]. Implementation strategies and frameworks assume or include important roles for leaders. Leadership has been shown to be an integral factor in nurturing a culture of evidence-based practice in clinical settings including cancer care, substance abuse, weight management, palliative care, and physiotherapy [3, 13,14,15,16,17,18]. Consequently, leadership behaviours can encourage or discourage change and innovation within healthcare organisations [13, 19].

Despite leadership being considered a determining factor in implementing and sustaining evidence-based practices [1, 4, 20,21,22,23,24], the term remains an ambiguous concept in research [16]. Leadership has been conceptualised as a series of inherent personal traits, as learned behaviours, and as responses to particular situations or contexts [23]. Various types of leadership have been proposed, including transformational leadership, transactional leadership, distributive leadership, charismatic leadership, heroic leadership, empowering leadership, engaging leadership, authentic leadership, collective leadership, servant leadership, and passive or avoidant leadership [25,26,27,28,29]. A systematic review by Reichenpfader et al. [16] found that across 17 studies in the field of implementation science, the term was used imprecisely and inconsistently. For the purpose of this paper, the authors will use Reichenpfader et al.’s [16] definition of leadership, being “a process of exerting intentional influence by one person over another person or group in order to achieve a certain outcome in a group or organization”. Likewise, the authors will consider leaders to be those people, formal or informal, who exert influence on group or organisational outcomes.

Formal leaders, or positional leaders (managers or supervisors whose responsibilities include the oversight of staff, budgets, and operations), have the ability to procure and disperse funding and resources, and to design and enforce implementation policies [19, 30]. Formal leaders have the responsibility to ensure that healthcare organisations support the implementation of evidence-based practice through adequate funding and resources, supportive plans, practices, and strategies, as well as by providing a work environment conducive to implementation [19]. The Consolidated Framework for Implementation Research (CFIR) [31] considers formal leaders to be the people who project manage and coordinate implementation. In healthcare settings, the implementation of practice change often requires leadership from multiple professional groups, including nurses, physicians, and allied health [32]. Powell et al. (2015) have suggested implementation strategies that leverage formal leaders, including recruiting, designating, and training leaders for the change [33].

However, it is not only formal leaders who influence implementation. Change champions, who may be formal or informal leaders and are also referred to as opinion leaders, implementation leaders, facilitators, and change agents throughout the literature [34], also play a critical role in effective implementation [3, 19, 30]. Change champions are people within an organisation who are invested in implementing change, work hard to bring that change to fruition, are often personable, and are influential [3, 34]. Change champions may be frontline staff, with or without a formal management role, who frequently and positively influence others’ attitudes or behaviours [3, 6, 30, 34]. Change champions acquire their influence through demonstrated technical competence and through their accessibility and availability to their peers [6]. The CFIR suggests that formal or informal change champions are those who are dedicated to supporting and driving implementation and who influence attitudes toward implementation [31]. Implementation strategies utilising change champions identified by Powell et al. [33] include identifying change champions, preparing them for the intervention, and ensuring they are informed so they may influence the support of their colleagues. It is these champions who have the responsibility to foster implementation-friendly climates in healthcare organisations by gaining support from senior management and formal leaders, as well as from their peers [19].

Despite the critical role of both formal and informal leaders in facilitating the implementation of evidence-based practice in healthcare organisations, there is relatively little empirical study of how various aspects of leadership may be directly related to the efficacy or speed of research translation, or to the delivery of evidence-based practice [2]. Although it is clear that leadership is critical to the successful implementation and sustainability of innovations [1, 35], it is unclear how leadership traits and behaviours can be identified, measured, and developed [2, 3, 5, 19].

Consequently, the study of the relationship between leadership and research translation in healthcare requires accurate and relevant leadership scales. Leadership and change management is a growing area of scholarship [36,37,38,39,40,41], and some progress has been made in identifying and synthesising scales which measure leadership traits and behaviours and in validating the psychometric properties of these scales [42,43,44]. Given the need for a variety of health professionals to be involved in the leadership of practice change, a leadership scale cannot be considered valid and reliable for administration with health professionals until it has been tested with a broad cross-section of such health professionals. However, a systematic review of general implementation scales (i.e. not leadership-specific) has highlighted a gap in the development and availability of validated scales which can be applied to the assessment of leadership traits and behaviours [45]. This gap inhibits the ability of implementation researchers and health professionals to identify evidence-based traits and behaviours, and thereby to identify the formal and informal leaders who may be integral in the promotion and delivery of evidence-based healthcare.

Methods

The aim of this systematic review was to identify published leadership scales whose psychometric properties (reliability, validity, or acceptability) have been assessed with clinical health professionals.

This review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) (see Additional File 1) [46]. The synthesis methods of this review were guided by Clinton-McHarg et al.’s [45] 2016 work which examined the psychometric properties of scales developed in public healthcare and community settings [45]. This review was registered with PROSPERO (Registration Number CRD42019121544).

Search strategy

MEDLINE, EMBASE, PsycINFO, Cochrane, CINAHL, Scopus, ABI/INFORMIT, and Business Source Ultimate were searched to identify relevant studies published in English between January 2000 and December 2018. A second search was conducted with the same criteria between January 2019 and January 2020. These time periods were selected to optimise the currency of the findings, given that very few (if any) relevant studies were published prior to 2000. Prior to the database searches being conducted, search terms were developed through an iterative process guided by the PICO (problem, population, intervention and comparison, and outcome) Statement [47, 48]. These terms were refined in consultation with a senior librarian from the University of Newcastle, Australia, to capture the relevant studies and to ensure the correct use of Boolean operators, truncation, and subject headings. The selected search terms for all databases related to the key concepts explored, being healthcare leadership (problem), health clinicians (population), the type of scale (intervention and comparison), and assessment of psychometric properties (outcome), with additional terms related to health included for non-health focussed databases (population). The full search strategy for the MEDLINE database is shown in Fig. 1.

Fig. 1
figure 1

Search strategy

Eligibility

Publications were included if they: (1) were peer-reviewed journal articles reporting original research results; (2) reported data collected from or about practising health professionals; and (3) identified and assessed a leadership-related scale for reliability, validity, or acceptability (see Table 1 for selection criteria and key definitions).

Table 1 Selection criteria key definitions

Study selection

The initial search yielded 4593 records. Of these, 1779 duplicate records were excluded. From the remaining pool of 2814 records, the titles and abstracts of a randomly selected subset of 100 records were independently screened by two authors (CP and MC) to pilot the application of the inclusion and exclusion criteria. Titles and abstracts of an additional subset of 500 randomly selected studies were then independently screened by the two authors (CP and MC), with the remainder screened by one author (MC). Studies that did not meet the inclusion criteria were excluded. The full-text manuscripts of the remaining 462 studies were then sourced. Of these 462 studies, the full texts of 160 (~35%) were screened by two authors (MC and CP). The remaining 302 studies were screened by one author (MC). Of the 462 full-text manuscripts screened, 274 did not meet the inclusion criteria and were subsequently excluded, leaving 188 eligible publications. After further discussion, the criteria for a leadership scale were refined to exclude any scales that did not specifically address leadership (i.e. those measuring burnout, implementation, non-technical skills, organisational context, patient safety, task/event-based leadership, or work roles). Using these criteria, a further 149 records were excluded, leaving 39 records for extraction (see Fig. 2 for the PRISMA diagram).

Fig. 2
figure 2

PRISMA flow diagram

Data collection process & data items

The following information was extracted and tabulated from publications that met the inclusion criteria: (1) author(s); publication year; setting (e.g., oncology, cardiology); country of study; participants (e.g., physicians, nurses, multidisciplinary); study aim; methods; leadership assessment (namely, the type and name of the scale or tool); outcome assessment; and findings; and (2) psychometric properties, including face validity, content validity, internal reliability, test-retest reliability, construct validity, criterion validity, responsiveness, acceptability, feasibility, revalidation, cross-cultural validation, convergent validity, and discriminant validity.
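
To make the extraction template concrete, the sketch below represents one extracted record as a typed Python structure. This is a hypothetical illustration only: the field names mirror the items listed above, but the class, example values, and coding of psychometric properties are assumptions for illustration, not the actual extraction form used in the review.

```python
# Hypothetical representation of one data-extraction record; illustrative only.
from dataclasses import dataclass, field

@dataclass
class ExtractionRecord:
    authors: str
    publication_year: int
    setting: str                  # e.g. oncology, cardiology
    country: str
    participants: str             # e.g. physicians, nurses, multidisciplinary
    study_aim: str
    methods: str
    leadership_scale: str         # type and name of the scale or tool
    outcome_assessment: str
    findings: str
    # Psychometric properties, each coded e.g. 'Y' (satisfied), 'N', or 'U' (unclear)
    psychometrics: dict = field(default_factory=dict)

record = ExtractionRecord(
    authors="Example et al.", publication_year=2018, setting="acute care",
    country="USA", participants="nurses", study_aim="validate a leadership scale",
    methods="cross-sectional survey", leadership_scale="Hypothetical Leadership Scale",
    outcome_assessment="psychometric testing", findings="adequate internal consistency",
    psychometrics={"internal_reliability": "Y", "construct_validity": "U"},
)
```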

Summary measures

Setting, sample, and characteristics of the innovation being assessed

Settings, sample, and characteristics of the innovation were extracted including the country and setting where the scale was validated, as well as the gender and profession of the sample and the sample response rate.

Face and content validity

Face validity assesses whether a scale is meaningful and relevant to those who use the scale [49]. Scales were considered to have face validity where administrators and/or test-takers agreed through a formal process that the scale measures what it is designed to measure [49]. Content validity assesses whether the scale fully captures the concept and sample it is designed to measure. The scale was considered to have content validity if the paper described how the items were selected and assessed, which revisions were made, and how they were made, or the theories and/or framework guiding the scale design [50].

Internal reliability and test-retest reliability

Scales or subscales were considered to have internal consistency if the Cronbach’s alpha was >.70 [51]. Where a paper only reported a range of Cronbach’s alphas for the scale’s subscales and part of the range was <.70, internal consistency was rejected. Repeated administration of a scale with the same sample and within 2–14 days was necessary to consider the scale’s test-retest reliability (i.e. a re-administration period outside of 2–14 days did not satisfy our criteria) [52]. Further, test-retest reliability was achieved if correlations between scores from the two administration time points had an intraclass correlation coefficient (ICC) of >.70 [45, 50].
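
As a concrete illustration of the internal-consistency criterion, the following minimal Python sketch computes Cronbach’s alpha from a respondents-by-items matrix and checks it against the > .70 threshold. The data and function are illustrative assumptions, not taken from any included study; the test-retest (ICC) criterion would additionally require scores from two administrations and is not reproduced here.

```python
# Minimal sketch: Cronbach's alpha and the > .70 internal-consistency check.
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: array of shape (n_respondents, n_items), one column per item."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative data: 6 respondents answering a 4-item subscale on a 1-5 Likert scale.
responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
    [3, 4, 3, 3],
])
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}; meets .70 criterion: {alpha > 0.70}")
```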

Construct and criterion validity

Exploratory and/or confirmatory factor analysis (EFA/CFA) results were primarily used to determine a scale’s construct validity (i.e. internal structure). If both an EFA and CFA were conducted for a single scale, cut-offs were applied to the CFA results. When interpreting an EFA, scales were considered to have construct validity if retained factors had eigenvalues of > 1 and/or > 50% of the variance was explained by the scale [53, 54]. In studies where only the percentage of variance explained was reported, eigenvalues of > 1 were assumed. When interpreting a CFA, scales were considered to have construct validity where the analysis yielded a root mean square error of approximation (RMSEA) < .08 and a comparative fit index (CFI) > .95 [55, 56]. While an RMSEA of <.06 is supported by Clinton-McHarg (2016) [45], in the healthcare leadership literature an RMSEA of <.08 was more commonly treated as an acceptable cut-off, often with reference to Hu and Bentler (1999) [56]. A scale was considered to have criterion validity if different scores were obtained for subpopulations with known differences (e.g., general nurse versus nurse manager) [57].
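
The EFA side of this decision rule can be sketched as follows, assuming only numpy: eigenvalues of the item correlation matrix are checked against the Kaiser criterion (> 1), and the proportion of variance retained by those factors is checked against 50%. This uses a principal-components eigendecomposition as a simple stand-in for a full EFA; it does not reproduce the CFA fit indices (RMSEA, CFI), which would come from dedicated SEM software. The data are randomly generated for illustration only.

```python
# Illustrative sketch of the eigenvalue/variance-explained decision rule.
import numpy as np

def efa_criteria(item_scores):
    corr = np.corrcoef(item_scores, rowvar=False)   # item correlation matrix
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]    # sorted, largest first
    n_retained = int((eigenvalues > 1).sum())       # Kaiser criterion: eigenvalue > 1
    variance_explained = eigenvalues[:n_retained].sum() / eigenvalues.sum()
    return n_retained, variance_explained

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 8))                    # hypothetical 8-item scale, 200 respondents
factors, var_explained = efa_criteria(data)
print(f"Factors with eigenvalue > 1: {factors}; "
      f"variance explained by retained factors: {var_explained:.0%} (criterion: > 50%)")
```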

Responsiveness, acceptability, feasibility, revalidation, and cross-cultural adaptation

A scale’s ability to detect change over time (i.e. responsiveness) was determined based on a reported moderate effect size (> 5%) and/or minimal floor and/or ceiling effects (< 5%) [50, 58]. A scale was considered acceptable based on a low proportion of missing items and feasible based on time taken to complete, interpret, and score the scale. It was also noted if a scale was revalidated with additional populations or samples, or adapted across cultures or languages.
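
As an illustration of the floor and ceiling criterion, the short Python sketch below computes the proportion of respondents scoring at the scale minimum and maximum and compares each against the < 5% threshold described above. The scores, scale range, and function name are hypothetical.

```python
# Illustrative floor/ceiling check for responsiveness.
import numpy as np

def floor_ceiling_effects(total_scores, scale_min, scale_max):
    scores = np.asarray(total_scores)
    floor = np.mean(scores == scale_min)    # proportion at the lowest possible score
    ceiling = np.mean(scores == scale_max)  # proportion at the highest possible score
    return floor, ceiling

# Hypothetical total scores on a scale ranging from 5 to 25.
scores = np.array([12, 18, 25, 25, 9, 14, 22, 25, 7, 16])
floor, ceiling = floor_ceiling_effects(scores, scale_min=5, scale_max=25)
print(f"Floor: {floor:.0%}, ceiling: {ceiling:.0%} (criterion: each < 5%)")
```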

Convergent and discriminant validity

A scale’s convergent and discriminant validity were determined by Pearson’s correlation coefficients of r > .40 with similar scales and r < .30 with dissimilar scales, respectively. Where convergent or discriminant validity was reported for a scale but testing did not involve correlating the scale with other similar or dissimilar validated scales, the scale was marked as unclear when determining satisfaction of the criteria.
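
This decision rule can be illustrated with a short sketch: Pearson correlations between a target scale and hypothetical similar and dissimilar comparison scales are checked against r > .40 and r < .30 respectively. All scores below are invented for illustration; in practice, as noted above, the comparison instruments would themselves need to be validated scales.

```python
# Illustrative convergent/discriminant validity check using Pearson's r.
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.corrcoef(x, y)[0, 1]

target    = [3.2, 4.1, 2.8, 4.5, 3.9, 2.5, 4.0, 3.4]  # target leadership scale scores
similar   = [3.0, 4.3, 2.9, 4.4, 3.6, 2.7, 4.2, 3.1]  # conceptually similar scale
unrelated = [2.1, 3.9, 4.0, 2.5, 3.3, 3.8, 2.6, 3.0]  # conceptually dissimilar scale

r_conv = pearson_r(target, similar)
r_disc = pearson_r(target, unrelated)
print(f"Convergent r = {r_conv:.2f} (criterion: > .40); "
      f"discriminant r = {r_disc:.2f} (criterion: < .30)")
```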

Synthesis of results

Given that the publications varied considerably in their use and description of methodologies and measurements, a narrative synthesis rather than a meta-analysis was required. Popay et al. (2006:5) suggest that, unlike narrative reviews, which ‘are typically not systematic or transparent in their approach’ [59], narrative synthesis denotes ‘a process of synthesis that can be used in systematic reviews focusing on a wide range of questions, not only those relating to the effectiveness of a particular intervention … [It] is part of a larger review process that includes a systematic approach to searching for and quality appraising research-based evidence as well as the synthesis of this evidence’ [59]. For the purpose of this review, studies were synthesised according to their expressed aim(s).

Results

Of the 2814 records screened at the title and abstract stage, 2352 records were excluded. The 462 records remaining were screened at the full text stage. Of those records, 274 were excluded, leaving 188 eligible publications. After further discussion, the criteria for a leadership scale were refined to exclude any scales that did not specifically address leadership (i.e., those measuring burnout, implementation, non-technical skills, organisational context, patient safety, task/event-based leadership, or work roles). Using these criteria, a further 149 records were then excluded, leaving 39 unique records remaining for extraction (See Fig. 2 for PRISMA diagram).

Study characteristics

Setting and characteristics of study sample for assessed scale

Of the 33 scales, the majority were validated in English-speaking countries, including the USA (n = 15) and Canada (n = 4), although some translations were validated and used in Europe and Asia. The Implementation Leadership Scale was validated with five separate types of health professionals, more than any other of the included 33 scales. This was followed by the Multifactor Leadership Questionnaire and the Evidence-Based Practice Nursing Leadership Scale, which were both validated with two separate types of health professionals. The majority of studies validated scales with nurses (n = 27), followed by allied health (n = 10); only two studies validated scales with a sample that included physicians, and no scales were validated with most other types of health professionals. It is also worth noting that women were overwhelmingly represented within the samples. The percentage of women in the studies ranged from 39% to 99.5%, with the average percentage of women across the 26 studies that reported gender being 75%. Given that the studies with the lowest rates of women in their samples were those that included non-nurse health professionals, this is likely due to nursing being a female-dominated profession. These data are reported in Table 2.

Table 2 Characteristics of study sample for assessed scales

Psychometric properties of the scales including face and content validity, internal reliability, test-retest reliability, construct and criterion validity, responsiveness, acceptability, feasibility, revalidation and cross-cultural validation, were assessed and reported in Table 3.

Table 3 Summary of psychometric properties reported for each scale

Face and content validity

Of the 39 studies, face and content validity were evaluated and satisfied in 18 and 33 studies (16 and 30 scales), respectively.

Internal reliability

Of the included 33 scales, 29 scales (88%) achieved internal consistency, as indicated by Cronbach’s alphas >.70. All five studies reporting on the ILS indicated adequate internal consistency [19, 76,77,78,79], with two reporting for the entire scale [19, 78], and three for individual subscales (e.g. ‘Y (only subscales reported)’) [76, 77, 79]. Of the two studies reporting on the MLQ, one reported adequate internal consistency of the whole scale [85] and one of the individual subscales [86]. Of the remaining 27 scales that reported internal consistency, 16 reported for the entire scale [43, 60,61,62,63,64, 66, 70,71,72,73, 75, 78, 80, 83, 84, 87, 89], and ten for individual subscales [65, 67,68,69, 74, 81, 82, 93, 95, 96]. Three papers [64, 66, 72] reported only the range of Cronbach’s alpha values of the scale’s subscales, indicating one or more subscales with a Cronbach’s alpha of <.70, and thus did not satisfy our criteria for confirming the whole scale’s internal reliability.

Test-retest reliability

Of the 33 included scales, nine were tested for test-retest reliability [62, 71, 80, 86, 88, 90,91,92]. Considering the Pearson’s correlation coefficient cut-off of >.70 alone, seven scales achieved adequate test-retest reliability [62, 71, 80, 88, 90,91,92] and two did not [86, 87]. Re-administration periods ranged from within 2–14 days (n = 5) [71, 88, 90,91,92], to between 14 and 30 days (n = 3) [62, 80, 87], to one year (n = 1) [86]. Our criteria for adequate test-retest reliability required both an r of >.70 and a re-administration period of between 2 and 14 days. The five scales re-tested within 2–14 days [71, 88, 90,91,92] fulfilled this criterion. One scale [80] demonstrated high test-retest reliability (r = .96) slightly outside the recommended re-administration period (15 days post-initial assessment), and was deemed to have satisfied our criteria.

Construct and criterion validity

Thirty-three studies reported their scale’s internal structure using an EFA (n = 10, of which 7 used PCA), a CFA (n = 10), or both (n = 12). Of the five studies [19, 76,77,78,79] reporting on the ILS, three [19, 77, 78] reported acceptable thresholds for good construct validity and two [76, 79] did not. Of the remaining 26 scales, 54% (n = 14) satisfied the acceptable thresholds for good construct validity, in that the EFA indicated > 50% of variance explained by the final model and eigenvalues of > 1, and/or the CFA indicated acceptable RMSEA (< .08) and CFI (> .95) values. Five scales were marked as marginally unsuccessful (i.e. ‘N*’) [60, 74, 75, 87, 90] in satisfying our criteria for construct validity, indicating either an RMSEA value <.08 but not <.06, and/or a CFI value >.90 but not >.95. One study [63] reported only the scale’s RMSEA value (< .08) and so was marked as unclear (‘U’) when determining adequacy of construct validity (i.e. both the RMSEA and CFI were needed to determine adequacy). Two further scales [82, 92] were marked ‘U’ as, although mentioning factor analysis or construct validity, they did not report RMSEA or CFI values. Four scales [64, 67, 80, 95] did not satisfy our criteria for adequate construct validity.

Of the 33 included scales, five [62, 68, 73, 75, 93] demonstrated criterion validity and one [60] was marked as unclear. Ten scales were correlated against existing scales to evaluate convergent and/or discriminant validity, as indicated by Pearson’s correlations (r). Eight of these scales (including the ILS, as convergent validity was tested and achieved in three of the five ILS studies) [60, 63, 66, 68, 74,75,76, 93] were considered to have convergent validity (r > .40), and two scales (the iLead and the ILS) [19, 75, 79] were considered to have both convergent and discriminant validity (r < .30). Three studies [67, 76, 87] reported on convergent and/or discriminant validity that did not involve correlating the scales with other validated scales and were thus marked unclear (‘U’). Only one scale (the Survey of Transformational Leadership) [93] achieved acceptable construct, criterion, and convergent validity.

Responsiveness, acceptability, feasibility, revalidation, and cross-cultural adaptation

Of the 39 studies, only five reported on responsiveness, three of which included scales that satisfied our criteria for floor and ceiling effects of < 5% [62, 71, 90]. One scale [67] had a ceiling effect, with scores skewed toward the higher end of the scale (14–62% of people obtaining the highest possible score for each item). The three papers that reported on their scale’s acceptability [67, 90, 94] demonstrated low proportions of missing items. Only one study recorded the time taken to complete the scale (5–10 min) [67]. Other studies mentioned the expected time to complete the test in their methodology but did not record the actual time taken by test-takers. Of the eight scales that underwent a process of revalidation in additional settings and subpopulations, five were successful in language retranslation and use with additional populations [62, 69, 71, 90, 91], two were unsuccessful within our criteria [64, 80], and one was unclear [87].

Discussion

The objective of the review was to inform healthcare implementation regarding appropriate scales for assessing the traits and behaviours of formal or informal leaders who can successfully implement change. Notably, a large number of scales (n = 33) were identified as having undergone some form of psychometric testing with health professionals. However, only three of the scales had been tested on multiple occasions: the Implementation Leadership Scale (n = 5), the Multifactor Leadership Questionnaire (n = 2), and the Evidence-Based Practice Nursing Leadership Scale (n = 2). The Implementation Leadership Scale was found to have sound face and content validity with Registered Nurses; construct validity with Child Welfare Workers, Registered Nurses, and Mental Health Clinicians; internal consistency with Child Welfare Workers, Registered Nurses, and Mental Health Clinicians; and convergent validity with Mental Health Supervisors and Mental Health Clinicians. The Multifactor Leadership Questionnaire was found to have acceptable face validity, content validity, construct validity, and internal consistency with Nurses. The Evidence-Based Practice Nursing Leadership Scale was found to have acceptable face validity, content validity, construct validity, internal consistency, test-retest reliability, and responsiveness, and was also cross-culturally validated. Most of the identified scales were tested in English-speaking, high-income countries such as the USA or Canada, predominantly with samples of nurses or samples of health professionals that included nurses (n = 27). Only two validation studies included physicians, which may suggest a limited number of scales proven suitable for assessing leadership in this group. Given that leadership roles can be occupied by physicians (e.g., department heads), nurses (e.g., nursing team leads), or others (e.g., rehabilitation team leads, mental health team leads) who are often involved in the implementation of interventions, it is important that scales for assessing leadership are tested in varied settings and known to be robust enough for research involving physicians, nurses, allied health professionals, and others who have a leadership role in practice change. It is also important to consider the roles of gender and cultural variation in leadership. Therefore, future work should consider validating leadership scales with a wider variety of diverse health professionals and in a variety of contexts.

The psychometric properties found to be strong for most scales were content validity and internal consistency. These properties have similarly been found to be strong in the wider literature on testing leadership scales with non-health professional samples [77, 97,98,99,100]. Examples include the Servant Leadership Survey (SLS), which has been validated with 638 workers in three Spanish-speaking countries (Spain, Argentina, and Mexico) [99]; the Ethical Leadership Behaviour Scale (ELBS), which has been validated with 405 workers in Brazil [98]; the School Counsellors Leadership Survey (SCLS), which has been validated with 776 school counsellors and school counselling supervisors in the USA [97]; and the Implementation Leadership Scale (ILS), which has been cross-validated with 214 child-welfare providers in the USA [77]. Glasgow et al. [101] suggest that a scale with acceptable internal consistency may also have a high number of items and consequently be more burdensome for users. They further suggest that it may be more pragmatic to consider content validity [101], which assesses how well the scale measures the concept and sample it is designed to measure. Content validity was strong in most (n = 30) scales in this study, including the Implementation Leadership Scale, the Multifactor Leadership Questionnaire, and the Evidence-Based Practice Nursing Leadership Scale.

The findings in relation to construct validity are potentially concerning, in that only 15 of the 33 scales were found to satisfy the acceptable thresholds for good construct validity. This potential concern has not been clearly identified in the literature regarding the testing of leadership scales with non-health professional samples [102,103,104]. For example, one study found that although a more recent revision of the Multifactor Leadership Questionnaire (MLQ) exhibited high internal consistency, previous literature employed older versions that lacked discriminant validity [102]. Another study testing the construct validity of the Servant Leadership Scale (SLS) found the construct validity to be sound; however, the authors suggested that previous studies had not adequately tested the construct validity of the scale [71].

In relation to the remaining psychometric characteristics – test-retest reliability, responsiveness, acceptability, cross-cultural revalidation, convergent validity, discriminant validity, and criterion validity – very limited testing has occurred.

There are seven scales that stand out as likely to be psychometrically sound for use with health professionals (at least for nurses and allied health professionals), in that they are reported to have satisfied most of the reliability and validity criteria. Of the scales tested in the English language, the iLead scale demonstrated good internal reliability and face, content, criterion, convergent, and discriminant validity, and was only marginally outside our cut-off for having satisfied construct validity (CFI > .90 but not > .95). It is important to note that several studies deemed a CFI of > .90 adequate for good construct validity. The Supportive Leadership Behaviours Scale also satisfied internal and test-retest reliability and face, content, and construct validity, and was successfully revalidated. The Survey of Transformational Leadership (STL) demonstrated internal consistency and good construct, content, criterion, and convergent validity. Finally, the Implementation Leadership Scale has been evaluated several times and repeatedly demonstrates strong internal consistency, face and content validity, and convergent and discriminant validity. There are some inconsistencies in the scale’s construct validity, with two of the five evaluations of the ILS not satisfying our criteria for adequate construct validity. Of the scales tested in languages other than English, the Brazilian adaptation of the Charismatic Leadership Socialised Scale demonstrated inadequate construct validity and internal consistency, and so was not successfully revalidated. The Authentic Leadership Self-Assessment Questionnaire (Polish version) (ALSAQ-P) reported on and satisfied seven of the 11 criteria, including internal and test-retest reliability, content, construct, and criterion validity, and evidence of good responsiveness and revalidation. The Persian version of the Spiritual Leadership Questionnaire (SLQ) demonstrated good internal and test-retest reliability and face and content validity. Moreover, the Persian SLQ was deemed responsive, acceptable, and feasible, and achieved revalidation in the Persian language. This scale, like the iLead scale, had a CFI of > .90 but did not meet our cut-off of a CFI > .95. The Chinese translation of the Evidence-Based Nursing Leadership Scale (EBP Nursing Leadership Scale) achieved internal and test-retest reliability, construct, face, and content validity, good responsiveness, and revalidation. In summary, seven scales were found to have acceptable psychometric properties for use in healthcare, being the Authentic Leadership Self-Assessment Questionnaire (Polish version), the iLead scale, the Spiritual Leadership Questionnaire (Persian version), the Supportive Leadership Behaviours Scale, the Survey of Transformational Leadership, the Evidence-Based Nursing Leadership Scale (Chinese translation), and the Implementation Leadership Scale.

Few studies assessed the degree to which a scale might be considered pragmatic, such as the time required to complete the scale or the acceptability and feasibility of the scale. Given the importance of identifying validated leadership scales in implementation science [45], and the key role of acceptability, feasibility, and cost (including time and resources) in assessing implementation outcomes [105], this represents a significant gap in the literature. However, it must be acknowledged that the search strategy did not focus extensively on pragmatic aspects of scales, for which tools are now emerging (e.g., Stanick, 2021) [106]. The availability of a quick, acceptable, and validated leadership scale would provide opportunities for researchers, leaders, and clinicians to assess health professionals in busy clinics for evidence-based leadership to drive evidence-based healthcare.

Limitations

Due to the diversity of the literature on leadership, the chosen set of search terms may have excluded some relevant studies. The review inclusion criteria resulted in the exclusion of a large number of studies relating to leadership in the context of developing or demonstrating specific or technical skills (e.g., surgical skills). While these types of scales were considered too narrow or purpose-specific to be of benefit for assessing healthcare leadership more generally, it is possible that they could be useful if adapted or modified. In addition, as noted by a number of authors [101], the pragmatic aspects of scales are important for implementation but have not been thoroughly addressed here. Inclusion of such assessment would be a useful addition to the field. The assessment of construct validity in this review focussed on factor analysis, as this was the approach generally taken in these studies. It is acknowledged that other approaches, such as assessing a construct’s relation to theory, are also important for establishing construct validity.

Additionally, women were overwhelmingly represented in the samples, perhaps due to the high number of scales validated with nurses. A working paper by the World Health Organisation (WHO) analysed gender equity among health professionals in 104 countries [107]. It found that women make up 67% of health professionals in the included countries; however, in most countries occupations such as physician, dentist, and pharmacist are dominated by men, while professions such as nursing and midwifery are mostly comprised of women [107]. A 2017 systematic review of medical leadership in hospital settings [108] found 28 studies exploring physician leadership. Of those 28 studies, nine described ‘leading change’ as an activity performed by physician leaders. This suggests there may be a role for physicians as formal or informal change champions. Boateng et al. [109] propose that one component of best practice in scale development and validation is to conduct the validation with the population the scale is intended to be used with. Given that most of these scales have been validated primarily with nurses and allied health professionals, who are predominantly female, it is difficult to claim that these scales are suitable for assessing leadership traits and behaviours in healthcare professional groups which are mostly male, or in professional groups other than nurses and allied health professionals. Therefore, future work may consider validating these scales with a wider variety of health professionals.

Conclusion

There are seven scales which may be sufficiently sound to be used with nurses and allied health professionals. These are the Authentic Leadership Self-Assessment Questionnaire, the iLead scale, the Spiritual Leadership Questionnaire, the Supportive Leadership Behaviours Scale, the Survey of Transformational Leadership, the Evidence-Based Nursing Leadership Scale, and the Implementation Leadership Scale. There is a research gap in assessing the leadership traits and behaviours of physicians, and it appears that males have been underrepresented in some validation studies. Given the role of leadership in driving best practice in healthcare, there is a need for further psychometric assessment and validation of existing scales with physicians and with males, and for assessing and understanding gender and cultural differences in implementation leadership. This gap limits the confidence with which the available scales can be used across healthcare disciplines in implementation research and practice, but it also provides an opportunity for advancing the science of implementation leadership.