Scoping review and Coordinated Multicentre Team Protocol registrations

  1. PROSPERO Protocol Registration: CRD42021137932 https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=137932

  2. Open Science Framework (Coordinated Multicentre Team Protocol): https://osf.io/gvr7y/

  3. Coordinated Multicentre Team Protocol publication: https://systematicreviewsjournal.biomedcentral.com/articles/10.1186/s13643-018-0879-2

Background

The emphasis on and number of studies involving health research partnerships have grown substantially over the last decade [1]. Despite this growing popularity and mounting demand for the systematic quantification of partnership outcomes and impacts, the assessment of health research partnerships has not kept pace [2]. Here, we refer to health research partnerships as those involving “individuals, groups or organizations engaged in collaborative, health research activity involving at least one researcher (e.g. individual affiliated with an academic department, hospital or medical centre), and any partner actively engaged in any part of the research process (e.g. decision or policy maker, healthcare administrator or leader, community agency, charities, network, patients, industry partner, etc.)” (p. 4) [3].

Although quantitative tools for assessing the outcomes and impacts of health research partnerships emerged in the late 1980s to early 1990s [5,6,7], available tools are largely simplistic, and the assessment of outcomes and impacts in the health research partnership domain remains nascent [5, 7,8,9,10,11,12,13]. Available studies are often hampered by a lack of rigorous measurement, including psychometric testing to establish evidence of tool validity and reliability. The limitations of existing studies fall into three categories. First, many primary studies select single-use, locally relevant tools as a core part of the partnership process, focusing on monitoring their partnerships’ progress and on bespoke outcomes and impacts of highest relevance to them [5, 9]. Although most tool studies aim to incorporate partner views, track individual partnership progression and capture partner perspectives, few aim to create standardized tools that can be applied more broadly or used in replication studies [10]. Second, many such studies are limited by small sample sizes and a lack of iterative tool testing, which in turn contributes to the lack of psychometric evidence and of evidence across a broader range of contexts. Third, primary studies in this domain are often limited by inconsistent, interchanged terminology, a lack of discrete concept definitions, problems associated with literature indexing, location and retrieval [3, 14, 15], and multiple tool-specific challenges including construct identification, definition, refinement and application [5,6,7,8,9,10, 12].

Cumulatively, these challenges inhibit the evolution of partnership assessment and ultimately slow the advancement of partnership science [9, 10]. A recent overview of reviews examining quantitative measures to evaluate impact in research coproduction suggests that investigators must “engage more openly and critically with psychometric and pragmatic considerations when designing, implementing, [evaluating] and reporting on measurement tools” (p. 163) [8]. There is an established rationale for developing robust, pragmatic measures that are both relevant to partners and usable in real-world settings; pragmatic tools are viewed as a critical accompaniment to pragmatic designs [16,17,18]. In this light, health research partnership tools should be relevant to partners, be actionable, have a low completion burden, and demonstrate adequate validity and reliability. Importantly, there is a need for tools that are broadly applicable, can be used for benchmarking with accompanying norms to aid interpretation, and that demonstrate strong psychometric and theoretical underpinnings, without causing harm [16]. Closing these gaps would help to facilitate tool use, advance the systematic measurement of partnerships and drive improvements in partnership science [8].

Numerous tools for assessing health partnership outcomes and impacts have been identified in previous reviews focused on specific partnership domains, partner groups or contexts [5,6,7,8,9,10,11,12]; however, scope restrictions in these reviews preclude a comprehensive understanding of tools across health research partnership traditions. These reviews also reveal that information about tool psychometric and pragmatic properties remains lacking. This study reviewed and systematically assessed globally available tools for assessing health research partnership outcomes and impacts, addressing documented gaps in knowledge of both the psychometric and pragmatic characteristics of these tools.

Our primary research question was as follows: what are the globally available, valid, reliable and acceptable tools for assessing the outcomes and impacts of health research partnerships? Our secondary research questions pertained to tool characteristics, including the following: what are the reported purposes of the tools, are outcomes and/or impacts measured, and what are the reported theoretical underpinnings and psychometric and pragmatic properties of the tools? (Additional file 1: Appendix S1). Secondary research questions pertaining to partnership characteristics were captured and will be reported in a forthcoming publication to preserve manuscript clarity.

Methods

This review is part of a comprehensive, multisite synthesis effort by the Integrated Knowledge Translation Research Network (IKTRN) [3, 19] and was guided by a collaboratively built conceptual framework [3]. In this review, we define tools as “instruments (e.g. survey, measures, assessments, inventory, checklist, questionnaires, list of factors, subscales or similar) that can be used to assess the outcome or impact elements or domains of a health research partnership” (p. 5) [3, 20].

The overall approach to the review was guided by the steps outlined by Arksey and O’Malley [21], with refinements [22,23,24], and additional guidance from the Centre for Reviews and Dissemination (CRD) guidance for undertaking reviews in healthcare [25], the Cochrane Handbook for Systematic Reviews [26] and the Joanna Briggs Institute Reviewers’ Manual [27]. This manuscript was structured and reported using the newly updated Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) reporting standards [28]. Operational terms and definitions were published a priori as part of the multicentre approach [3]; additional definitions are provided in Additional file 1: Appendix S2 and detailed in the PROSPERO registered protocol, including key questions, inclusion–exclusion criteria and a priori specified methods [29]. All protocol deviations and accompanying rationale are detailed in Additional file 1: Appendix S1.

Search strategy and data sources

In consultation with an academic medical librarian (MVD), we iteratively developed a comprehensive search strategy using key papers and audit-improvement rounds to refine study catchment and feasibility [30]. The resulting health research partnership term clusters and the search strategy development methods have been applied to subsequent, parallel reviews [2, 3, 14, 15, 31]. We tested the strategy in Ovid MEDLINE to balance search sensitivity and scope [32]. The partnership search term cluster underwent peer review [33, 34] by an academic librarian to test for conceptual clarity across multiple partnership approaches. The overall strategy was subjected to the Peer Review of Electronic Search Strategies (PRESS) checklist review by a second academic network librarian, resulting in the spelling correction of a single term. No restrictions for date, design, language or data type were applied. The search strategy was translated for all four databases (Additional file 1: Appendix S3).

Electronic databases

Using the a priori, unrestricted strategy, we searched MEDLINE (Ovid), Embase, CINAHL Plus and PsycINFO from inception through 2 June 2021, including two updates. The search generated a total of 56 123 citations, resulting in the screening of 36 027 de-duplicated records [35] and 2784 full-text papers, managed with EndNote™ X7.8.

Eligibility and screening

We retained studies involving health research partnerships that (i) developed, used and/or assessed tools (or an element or property of a tool) to evaluate partnership outcomes or impacts [5, 36] as an aim of the study, and (ii) reported empirical evidence of tool psychometrics (e.g. validity, reliability). We excluded studies in which the main purpose of the partnership was recruitment and retention of study participants. Conference abstracts were excluded from the eligible literature only after full-text assessment or after confirmation that the citations were preliminary or duplicate records or lacked sufficient abstraction detail [37]. Abstracts in languages other than English were passed through title/abstract (level 1 [L1]) screening but translated prior to full-text assessment (Table 1).

Table 1 Study inclusion–exclusion criteria

All titles/abstracts (L1) and eligible full-text studies (L2) were screened and assessed independently, in duplicate (KJM with JB, LP, LN, SS, SM, MK, CM, AG, LS, KA), and tracked in a Microsoft (MS) Excel [38] citation database and screening spreadsheets. We tested and revised screening tools at each stage of the review and employed a minimum calibration rule (Cohen’s κ ≥ 0.60) [39] to align team members’ shared understanding of concepts and the application of eligibility criteria [40,41,42,43]. To balance abstraction burden with data availability and complexity, full-text abstraction (study and tool characteristics) was undertaken using a hybrid strategy [22, 44]. Eligible papers were independently abstracted by KJM and independently validated (MK, SS, SM, KP) [45] using a predefined coding manual. We resolved all discrepancies by consensus discussion [21, 41]. Study investigators were contacted only to locate missing tools or to assist in differentiating linked citations [43]. At least two attempts were made to locate corresponding authors and tools when contact details or tools were incorrect or missing [3, 5, 14]. The assessment and abstraction/scoring of psychometric, pragmatic tool evaluation and study quality characteristics were also undertaken independently and in duplicate, with discrepancies resolved in the same way.
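
To illustrate the calibration rule in concrete terms, the following is a minimal sketch, not the review team's actual workflow (which used MS Excel), of how agreement between two screeners could be checked against the κ ≥ 0.60 threshold; the use of Python with scikit-learn and the column names are assumptions.

```python
# Minimal sketch (assumed workflow): checking screener calibration against the kappa >= 0.60 rule.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical screening decisions from two independent reviewers (1 = include, 0 = exclude).
decisions = pd.DataFrame({
    "reviewer_1": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
    "reviewer_2": [1, 0, 1, 0, 0, 0, 1, 0, 1, 1],
})

kappa = cohen_kappa_score(decisions["reviewer_1"], decisions["reviewer_2"])
print(f"Cohen's kappa = {kappa:.2f}")

# Calibration rule reported in this review: proceed only once agreement reaches kappa >= 0.60.
if kappa >= 0.60:
    print("Calibration threshold met; proceed to independent duplicate screening.")
else:
    print("Below threshold; revisit the screening tool and eligibility criteria before proceeding.")
```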

Study and tool characteristics

Data pertaining to study and tool characteristics were abstracted per the protocol [29]. We anticipated challenges associated with consistent use of terminology as are commonly reported in this research domain (e.g. outcomes/impacts, partnership approaches, tool type) [3, 8, 14, 15]. When this occurred, we used the terms most prominent in methodological descriptions. We coded health subdomains inductively based on key words and study purposes [46]. More than one code per study was used to describe the study subdomain, as required.

Empirical evidence of tool psychometrics

The empirical psychometric evidence for each identified tool was evaluated. Informed by previous studies [6,7,8,9,10,11,12] and best-practice recommendations [17, 18, 36, 47, 48], we created an initial list of psychometric evidence types and expanded this list iteratively when new sources were identified in included studies (Additional file 1: Appendix S3). Only studies reporting empirical psychometric evidence were retained in this review to (i) address the documented lack of research reporting psychometric evidence for health research partnership outcomes and impacts assessment tools, and (ii) advance our understanding of the presence and types of psychometric evidence available in existing literature beyond simple dichotomous labels (e.g. valid/not valid or reliable/not reliable). By synthesizing the presence of psychometric evidence across studies, we also aimed to highlight areas in which the nature and type of psychometric evidence could be improved and to advance the science of partnership assessment. This approach necessarily focused on later testing and evaluation stages of tool development [49] but does not diminish the importance of conceptual and theoretical sources of evidence as precursors to establishing tool reliability and validity. As previously reported, the identification and reporting of psychometric data were complex and varied substantially in level of detail. This was mitigated through iterative review, piloting and calibration; all abstraction discrepancies were considered independently and then collectively, and resolved by consensus through recurrent discussion.

Pragmatic tool evaluation criteria

We modified a set of consensus-built criteria developed by Boivin et al. [7, 50] as an alternative to applying the Psychometric and Pragmatic Evidence Rating Scale (PAPERS) criteria [17, 18], given the quality of reported data. The main purpose of the criteria checklist was to appraise the tools from the perspective of those intended to use them [7]. Team members iteratively modified and piloted the revised items. A final set of 20 criteria (five questions in each of four domains: Scientific Rigour, Partner Perspective, Comprehensiveness and Usability) was generated. Piloting confirmed that these criteria were a better fit for the level of detail present in the literature under examination, and provided a comprehensive, easily interpretable (single-score) evaluation of scientific, partner, comprehensiveness and usability/accessibility properties for each tool (Additional file 1: Appendix S4). It is important to note that the original criteria were intended for use as a checklist, not a quality assessment [7]; we likewise used them as a checklist in our review. The modified criteria were applied independently and in duplicate to all tools [51], with discrepancies resolved by consensus. Tools were coded as toolkits in studies where multiple tools were described and intended for collective use; in these cases, tool characteristics were scored cumulatively and reported as a single tool.
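
To make the scoring structure concrete, here is a minimal sketch under the assumption that each of the five questions per domain is marked as met (1) or not met (0), yielding a domain score out of 5 and an overall percentage out of 20; the function and data structures are illustrative and do not reproduce the published criteria.

```python
# Illustrative sketch (assumed scoring scheme): 4 domains x 5 yes/no criteria per tool.
DOMAINS = ["Scientific Rigour", "Partner Perspective", "Comprehensiveness", "Usability"]

def score_tool(ratings):
    """Compute per-domain scores (out of 5) and an overall percentage (out of 20 criteria)."""
    domain_scores = {d: sum(ratings[d]) for d in DOMAINS}
    total = sum(domain_scores.values())
    return {"domains": domain_scores, "total": total, "percent": 100 * total / 20}

# Hypothetical duplicate-abstracted ratings for one tool after consensus resolution.
example = {
    "Scientific Rigour":    [1, 1, 1, 0, 1],
    "Partner Perspective":  [1, 0, 1, 0, 1],
    "Comprehensiveness":    [1, 1, 1, 1, 0],
    "Usability":            [1, 1, 0, 1, 1],
}
print(score_tool(example))  # e.g. {'domains': {...}, 'total': 15, 'percent': 75.0}
```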

Study quality assessment: the Quality of Survey Studies in Psychology (Q-SSP) checklist [52]

Study quality assessments typically evaluate the degree to which adequate measures were taken to minimize bias and avoid errors throughout the research process [53], and are hence design-focused. After piloting several quality appraisal tools with the eligible literature, we found that the best-fitting tool was an assessment of survey methods, namely the Q-SSP appraisal checklist and guide (Additional file 1: Appendix S5). The Q-SSP checklist was developed to address a wide variety of research and to help investigators differentiate broadly acceptable from lower-quality studies [52], using a four-stage process comprising evidence review, expert consensus, checklist refinement and criterion validity testing [52]. Q-SSP assessments were undertaken independently, in duplicate, and we resolved discrepancies by consensus.
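
The sketch below shows, as an assumption rather than the official Q-SSP scoring script, how item judgements could be rolled up into the domain ratios and overall quality percentage reported in the Results, with studies scoring ≥ 75% categorized as “acceptable” by convention; the domain maxima mirror the domain scores reported later in this paper.

```python
# Assumed roll-up of Q-SSP item judgements into domain ratios and an overall quality percentage.
QSSP_DOMAIN_MAXIMA = {   # domain -> maximum points, mirroring the domain scores reported below
    "Introduction": 4,
    "Participant": 3,
    "Data": 10,
    "Ethics": 3,
}

def qssp_summary(points):
    """Return per-domain ratios, the overall quality (%) and the conventional category (75% cutoff)."""
    max_total = sum(QSSP_DOMAIN_MAXIMA.values())                 # 20 points in total
    total = sum(points.values())
    percent = 100 * total / max_total
    category = "acceptable" if percent >= 75 else "questionable"
    ratios = {d: f"{points[d]}/{QSSP_DOMAIN_MAXIMA[d]}" for d in QSSP_DOMAIN_MAXIMA}
    return {"ratios": ratios, "percent": round(percent, 1), "category": category}

# Hypothetical study scoring 3 + 2 + 6 + 2 = 13 of 20 points.
print(qssp_summary({"Introduction": 3, "Participant": 2, "Data": 6, "Ethics": 2}))
# -> {'ratios': {'Introduction': '3/4', ...}, 'percent': 65.0, 'category': 'questionable'}
```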

Analysis

Basic descriptive statistics, including means, standard deviations and frequencies, were calculated to synthesize quantitative study, tool, psychometric and pragmatic characteristics in MS Excel [38] and Stata v13.1 [54]. The synthesized data were consolidated into tables. Scores for each of the pragmatic and tool evaluation criteria (mean/standard deviation) were synthesized and reported by criterion, domain and overall sample. We synthesized qualitative variables using thematic analysis [46] in NVivo v12.7 [55], in keeping with the overarching descriptive-analytical approach for the review [56], and used existing reporting guidelines to organize the findings [57,58,59]. Finally, study quality assessments (Q-SSP) [52] were documented by calculating an overall quality score (%) and four domain-specific scores (ratios) for each study.
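
For readers who prefer a scriptable equivalent of this descriptive synthesis (the review itself used MS Excel and Stata), a minimal pandas sketch follows; the column names and values are illustrative assumptions, not review data.

```python
# Minimal sketch of the descriptive synthesis; columns and values are illustrative only.
import pandas as pd

tools = pd.DataFrame({
    "tool_type": ["survey", "questionnaire", "survey", "scale", "survey"],
    "usability_score": [3.8, 2.9, 4.2, 3.1, 3.6],   # hypothetical domain scores
})

# Frequencies (counts and percentages) for categorical characteristics.
counts = tools["tool_type"].value_counts()
print(pd.DataFrame({"n": counts, "%": (100 * counts / len(tools)).round(1)}))

# Mean and standard deviation for continuous scores.
print(tools["usability_score"].agg(["mean", "std"]).round(2))
```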

Results

The search generated 36 071 de-duplicated records, and screening yielded 49 eligible full-text papers (48 studies and one companion report), as depicted in the study flow diagram (Fig. 1).

Fig. 1 PRISMA systematic review study flow diagram

The team Cohen’s kappa was 0.66 (95% CI 0.64–0.67) at L1 title/abstract screening and 0.74 (95% CI 0.72–0.76) at L2 full-text review; both values fall within the “substantial” agreement range [39, 42].

Study characteristics

Eligible studies comprised English-language reports and a single French-language report, originating mostly in North America (39) and Europe (9), with a small remainder from South Africa (3), Australia (1) and Taiwan (1). Five dual-site studies involved the United Kingdom and South Africa (3), Canada and Australia (1), and Mexico and the United States (1) (Table 2).

Table 2 Characteristics of included studies (n = 48 studies, 1 companion report)

The eligible literature was widely dispersed: half of the publications (24, 50%) each appeared in a different journal. Several small publication clusters were identified, including seven studies in Health Education & Behavior (15%); three each in the American Journal of Community Psychology and Global Health Promotion, and three theses (each 6%); and two each in Health Promotion International, Public Health Nursing, Evaluation and Program Planning and Health Promotion Practice (each 4%). As shown in Fig. 2, 20 studies (42%) were published after 2014, and the earliest study was published in 1996.

Fig. 2 Year of publication for included studies (n = 48 studies)

Studies employed cross-sectional (28, 58%), mixed-methods with embedded survey (14, 29%), case/multi-case (3, 6%), and post-only or pre–post (2, 4%) designs, along with a single nested longitudinal study (1, 2%) (Table 2). Studies used quantitative (31, 65%) or mixed methods (17, 35%); of the 17 mixed-methods studies, most combined quantitative and qualitative methods (14, 82%), and the remainder were mixed qualitative (2, 12%) or mixed quantitative (1, 6%) designs.

The studies were conducted in multiple health subdomains (Fig. 3), including health promotion, prevention and public health (19), and disease-specific domains [i.e. cancer, mental health and substance use/harm reduction, and sexually transmitted/blood-borne infections and sexual health (12)]. The smaller subdomains included community health and development (7), special populations (e.g. primary care, paediatric/adolescent health, and immigrant and geriatric health) (6), partnerships (6), health equity (4) and health services research (3).

Fig. 3 Health subdomain clusters

Most studies reported explicit conceptual underpinnings (44, 91%). Methodologically, studies were multifocal, contributing to the health research partnership assessment literature through tool validation (44, 92%), development (25, 52%), modification (21, 44%) and evaluation (13, 27%), and measured outcomes (25, 52%), impacts (2, 4%) or both outcomes and impacts simultaneously (21, 44%). Explicit definitions for the terms outcome and impact were available in less than half of studies (20, 42%), and the terms were frequently used interchangeably.

Tool characteristics

Included studies yielded 58 tools; their characteristics are summarized in Table 3. With one exception, studies were exclusively English-language, and six contained non-English-language tools (English–Spanish, 3 [60,61,62]; English–French, 2 [63,64,65]; and Dutch, 1 [66]). Tools targeted multiple partner groups, including partnership members (28, 43%), community members (11, 17%), researchers (10, 15%), patients, the public and coalition staff (4, 6% each), and, to a lesser extent, research staff (3, 5%), healthcare staff and partner organizations (2, 3% each), and education staff members (1, 2%). Surveys (21, 36%), questionnaires (17, 29%) and scales (12, 21%) were the most common tool types identified, although these categories were complicated by the interchangeable use of the terms survey, questionnaire and scale and by variable categorization across reports. We also identified several toolkits (3, 5%), indices and rubrics (2, 3% each), and a single checklist (2%).

Table 3 Tool characteristics (n = 58 tools)

Almost all tools assessed process (55, 95%). Half of the tools assessed outcomes (30, 52%), just under half assessed both outcomes and impacts (26, 45%), and very few focused on impact assessment alone (2, 3%); however, we observed inconsistencies in the use and definition of these terms. We identified multiple forms of empirical evidence for validity (50, 86%) and reliability (55, 95%) across the tools. The presence of conceptual underpinnings at the tool level (52, 90%) was comparable to that at the study level (91%).

Pragmatic tool evaluation scores

Tables 4 and 5 present a synthesis of pragmatic tool evaluation criteria [7] (Additional file 1: Appendix S4). Mean domain scores were highest for Comprehensiveness (3.79, SD 0.75) and Scientific Rigour (3.58, SD 0.87), followed by Usability (3.19, SD 1.38). The lowest mean domain score was for Partner Perspective (2.84, SD 1.04), a surprising finding given the review’s focus on health research partnership assessment.

Table 4 Pragmatic tool evaluation consolidated scores (n = 58 tools)
Table 5 Health research partnership tool evaluation—study scores (n = 48 with 1 companion report; n = 58 tools)

Tool Comprehensiveness was high in terms of documenting outcomes and/or impacts (100%), partnership process (95%) and context (97%); however, only a third of tools were deliberately designed for recurrent monitoring of partnerships (33%).

In terms of Scientific Rigour, tools were not typically informed by systematic evidence (17%) but were conceptually grounded (90%) and presented evidence for both validity and reliability (90% and 93%, respectively, inclusive of both empirical and theoretical/conceptual sources). Only half of the tools were explicitly based on the experiences and expertise of partners (55%).

Overall, tool Usability was mixed. Tool purpose was always stated (100%), but only about half of the tools were freely accessible (50%) or considered easy to read and understand (53%), and only somewhat more were accompanied by instructions (57%) or available in a readily usable format (62%).

Tools were generally designed to be self-administered (97%), but not for reporting back to partners (28%). The level of partner involvement was reported for only 28% of tools, and partners were deliberately involved as co-designers in only 59% of studies, despite frequent capture of partner influence (76%).

The overall tool evaluation mean score was 66.64% (SD 15.54%), with scores ranging from 35% to 90% (Fig. 4).

Fig. 4 Pragmatic tool assessment—criteria total scores (n = 58 tool scores)

The domain and total score analysis highlighted strengths for several tools. Twelve tools scored high (4 or 5) across all four domains (≥ 85%) [61,62,63,64, 67,68,69,70,71,72], and an additional two tools [73, 74] had lower Partner Perspective domain scores (3) but still achieved a high total score (85%) across the remaining three domains. Several tools demonstrated top scores for Comprehensiveness [69, 73, 75,76,77,78,79], while others scored higher in Scientific Rigour [61, 66, 70,71,72, 74, 80] and Usability [61,62,63,64, 67, 68, 71,72,73,74, 77, 81,82,83]. Few achieved top scores in the Partner Perspective domain [62, 70] (Tables 4 and 5).

Psychometric assessment

Psychometric testing and reporting were widely variable and challenging to assess, primarily because testing and reporting were inconsistent or incomplete and the level of reported detail varied. Almost three quarters of studies presented two or more forms of psychometric evidence for validity (35, 73%); eight studies (17%) presented two forms of evidence for reliability. Iterative assessment and abstraction of psychometric evidence revealed reliability evidence in four categories (internal consistency, test–retest reliability, inter-rater reliability and other). The most frequently occurring form of reliability evidence was internal consistency (83%). Validity evidence was found in 11 categories [construct validity (convergent, factorial, discriminant, known groups, other), criterion validity (predictive, concurrent), structural validity (dimensionality), responsiveness, face validity and content validity] (Table 6). The most frequent forms of validity evidence were convergent construct validity (43, 27%) and predictive criterion validity (31, 20%). We also abstracted norms and two forms of evidence for interpretability (ceiling/floor effects and interpretability); however, both forms of evidence were rare.

Table 6 Consolidated tool psychometric evidence (n = 58 tools)

We identified 18 studies with more advanced and comprehensive assessment and reporting of psychometric evidence for validity and reliability [60, 61, 65, 68, 69, 71, 72, 74, 78,79,80, 82,83,84,85,86,87,88]; several of these studies overlapped with high-scoring tools identified using pragmatic tool evaluation criteria [61, 68, 69, 71, 72, 74].

Study quality assessment (Q-SSP)

The Q-SSP assessment revealed an overall mean study quality score of 58.02% (SD 12.32%), with scores ranging from 25 to 80%. Most studies (42, 88%) scored < 75%, and thus were categorized as having “questionable” quality by convention; very few studies (6, 12%) scored ≥ 75% or within the “acceptable” range [61, 65, 71, 81, 88, 89] (Table 7).

Table 7 Q-SSP assessments by item, domain and total score for included studies (n = 48 studies, 1 companion report)

Across studies, the Introduction domain mean score was 3.04/4.00 points (SD 0.82), the Participant domain mean score was 1.77/3.00 points (SD 0.78), the Data domain mean score was 5.27/10.00 points (SD 1.62), and the Ethics domain mean score was 1.52/3.00 points (SD 0.71).

The problem and target population were generally well described, and participant sampling and recruitment details were present, but operational definitions (32, 67%), research questions and hypotheses (24, 50%) and sample size justifications (35, 75%) were often lacking. There were strong links between the proposed and presented analyses (46, 96%), but the study measures themselves were frequently missing from reports or supplements (17, 35%). Validity evidence for the included measures was lacking in almost a third of studies (14, 29%), and most studies lacked detail about those collecting the data (42, 88%), the duration of data collection (29, 60%) and the study context (25, 52%). Explicit reference to informed consent/assent and the inclusion of participants in post-data-collection debriefing were largely absent or unclear across included studies (29, 60%, and 37, 77%, respectively).

Overall, four of the six studies with “acceptable” quality overlapped with studies reporting more comprehensive psychometrics [61, 65, 71, 88], but only two overlapped with those reporting higher pragmatic tool criteria scores [61, 71].

Evidence summary: tool validity, reliability, pragmatics and study quality

This review identified 58 tools underpinned by empirical psychometric evidence in the assessment of health research partnership outcomes and impacts. When considered with pragmatic tool evaluation criteria and study quality score findings, four noteworthy groups of studies and accompanying tools emerged (22, 46%). First, only two studies (2, 4%) reported more comprehensive psychometrics and had both high pragmatic tool criteria and Q-SSP study quality scores [61, 71]. A second group of studies (7, 15%) reported more comprehensive psychometrics and either high pragmatic tool criteria scores [68, 69, 72, 74, 80] or high study quality scores [65, 88]. The third group (8, 17%) had more comprehensive psychometrics [60, 78, 79, 82,83,84,85, 87], and the last set of studies (5 plus companion report, 10%) scored high on pragmatic tool evaluation criteria [62,63,64, 67, 70, 73].

Discussion

This systematic review identified 58 tools for assessing health research partnership outcomes and impacts that are supported by empirical psychometric evidence, and documented their pragmatic characteristics. We were able to identify a group of noteworthy tools, distinguished by their psychometric evidence, pragmatic characteristics and study quality scores.

Key study-level comparative findings

Overall, the presence and reporting of empirical psychometric evidence and pragmatic characteristics appeared improved in our study compared with previous reviews, yet several challenges related to the nascency of this research field remain (e.g. lack of key term definitions and measurement clarity, term switching, a lack of studies deliberately focused on tool development, testing, evaluation and improvement, and variable and inconsistent reporting). Future research to advance partnership measurement and science should consider both psychometric improvements (with specific emphasis on increased consistency, greater tested and reported detail, and dedicated study) and pragmatic considerations (specifically, accessible tools that are better informed by partner experiences and expertise, designed for partnership monitoring, and of quantifiable readability). In examining tools with empirical psychometric evidence, this study contributes to our understanding of existing partnership tool measurement strengths and gaps. Our review provides practical ways to advance partnership measurement and, ultimately, partnership science.

At the study level, our findings aligned with previous reviews in that most included studies were North American and English-centric, with a wide publication dispersion pattern and emergence in the mid-2010s [2, 7, 8, 11]. We also experienced previously reported challenges in locating tools and obtaining author responses [5, 7]. Our study differed from others documenting a predominance of qualitative methods and a relative rarity of quantitative tools, designs and methods [9, 12, 70, 90,91,92]. By contrast, our review deliberately sought and identified tools with empirical psychometric and pragmatic characteristics encompassing diverse health research approaches. This review identified studies employing cross-sectional and mixed-method/embedded survey designs and quantitative and mixed methods; this catchment is likely a function of our study inclusion criteria but may also reflect an increasing overall trend towards the quantification of partnership assessment [1, 7, 11,12,13, 92, 93].

Key tool-level comparative findings

On a tool level, we found similarities and differences between our study and previous, related reviews, but these studies differed in scope (e.g. literature, search period, research domains other than health, focus of measurement) and definitions of partnership, generating very different samples and eligible primary literature [2].

Our findings demonstrate the need for research deliberately focused on tool development, testing and evaluation. Like other related health research partnership reviews [7, 8, 10, 94], we found that while tool purpose was universally reported, investigators focused almost exclusively on assessing and understanding the characteristics of their own bespoke partnerships. This was a consistent finding despite the diverse scope and focus of these reviews (i.e. patient/public evaluation tools, community coalitions, coproduction impacts, and research collaboration quality and outcomes, respectively). Very few primary studies in our review focused specifically on tool validation or psychometric testing, although most involved one or more such activities. Furthermore, most studies were multifocal, that is, they encompassed one or more tool development, modification, use, evaluation or validation activities simultaneously. These findings support previous reports regarding the paucity of focused health research partnership tool evaluation research [10, 94]. Our findings strengthen existing recommendations targeting the systematic assessment of psychometric and pragmatic tool properties [8] and more deliberate funding of research on tool design, testing, improvement and evolution [49]. These aspects are considered key to advancing partnership measurement and partnership science as a field [8, 9, 70, 95].

Conceptually, our study revealed a much higher presence of theoretical underpinnings at both the study and tool levels (91% and 90%, respectively) compared with levels reported in other partnership tool reviews of patient/public and community coalition evaluation tools [7, 94]. However, the implications of this finding remain unclear. Some authors have observed that theoretical/conceptual connections to both partnership and measurement theory rarely translate into operationalized tool elements [8, 17]; this is an important area for future inquiry.

The tools we reviewed measured outcomes at a rate similar to that in a recent review of patient/public partnership evaluation tools (52% vs 56%) [7]; however, in our study, we found that explicit definitions for outcome and impact terms were present only intermittently and were often interchanged. Terminology challenges have been reported in other systematic studies in the health research partnerships domain, which note the significant variance, overlap and omission of key term definitions from reports (i.e. terms for outcomes/impacts, partnership approaches and tool types) [9, 14, 15, 96]. While comparative research and crosstalk among research partnership traditions are relatively recent phenomena [4, 6, 96,97,98,99], clarity on key concepts, terminology, definitions, core measures and tools is fundamental to advancing partnership measurement and scientific inquiry [8, 9, 49, 70].

Comparative findings: tool pragmatic characteristics, validity and reliability

Pragmatic tool evaluation scores were generally higher in our review than in Boivin and colleagues’ review of patient partnership evaluation tools [7]. In our study, the highest mean domain scores were for Comprehensiveness and Scientific Rigour, whereas Scientific Rigour was the lowest-scoring domain in the Boivin review [7]. Importantly, we found that only a single tool overlapped between the reviews. This lack of overlap can be accounted for by differences in review scope, targets and inclusion criteria: the Boivin review focused on patient and public involvement evaluation tools and included tools for assessing engagement in both health system decision-making and health research, with narrower search terms over a shorter time span, whereas our review deliberately selected studies reporting empirical tool validity and reliability evidence.

Tool validity (86%) and reliability (95%) evidence in our study was markedly higher and contrasted starkly with prior work [7, 8], in which evidence for validity was found in only 48% and 7% of studies, respectively [7, 8], and evidence for reliability was found in 45% and 35% of studies, respectively [7, 8]. As noted previously, there was little to no overlap in captured tools between these reviews (n = 1 [7] and n = 13 [8], respectively), which can be similarly accounted for by differences in scope that generated different primary and secondary literature sets. The MacGregor overview of reviews [8] focused solely on reviews of tools to assess the impacts of research coproduction, differing by time span, key partnership terminology and key domains. As a result, only four of the eight identified reviews were considered in-scope; thus, the number of overlapping tools was limited (n = 13).

Future research

Boateng et al. [49] describe the steps, activities, and key precursor and concurrent factors required for robust tool development, testing and evaluation. Specific attention to these steps and components could enable more deliberate tool evolution in the health research partnership assessment domain. The authors call for graduate-level training in the development and evaluation of tools to build this expertise among graduate students and research teams. They also caution that this research can be “onerous, jargon-filled, unfamiliar, and resource intensive” (p. 1) [49]. Specific accommodations may be required to offset resource and time intensity and the higher participant burden associated with larger sample sizes. Health research partnership assessments must meet the needs of both researchers and end-users by balancing rigour and resource intensity in a way that remains fit for purpose. Both deliberate funding and the use of hybrid study designs will help provide the required focus and generate robust evidence to address persistent psychometric and pragmatic gaps in future research.

Study limitations

This review has several key limitations. We observed challenges with respect to the evidence for, and the testing of, tool psychometric properties. Like Sandoval et al. [5], we experienced challenges related to the reporting of psychometrics at multiple levels (e.g. scale, index, subscale, item and tool), as well as mismatched use of psychometric evidence (e.g. justification or application of previous scale-, subscale- or item-specific psychometrics to other levels of testing). To mitigate this risk, we approached psychometric evidence in eligible studies with these issues in mind and relied on strict methodological processes (independent duplicate abstraction and review, with resolution of all discrepancies through consensus discussion) to ensure accurate interpretation and representation of abstracted data.

As mentioned previously, the variable use of terminology may have compromised our ability to clearly describe and assess health research partnership tools. Further efforts to consolidate terms and definitions across health research partnership traditions will help resolve these issues in future work.

This study was limited in several ways by the accessibility and reporting concerns documented in previous reviews [3, 5, 7, 14, 15]. Most included studies were multifocal and did not often explicitly refer to tool development, testing or evaluation in their purpose statements. To mitigate the risk of missing potentially relevant studies, we deliberately kept our inclusion criteria broad at the title and abstract (L1) screening phase. However, this strategy also produced a large set of L2 full-text assessments, negatively affecting study feasibility. Consensus and consolidation of evidence in this research domain, as well as more focused, explicit reporting of health research partnership assessment, tools and psychometric and pragmatic characteristics, will facilitate more efficient literature location, retrieval and assessment in the future.

Finally, we noted a potential gap in the scope of a question we modified as part of the pragmatic tool evaluation criteria: was the tool informed by literature generated from a systematic literature search? In retrospect, we surmise that this question was too narrow to capture evidence derived from historical hypothesis testing generated by theoretically driven research (e.g. dimensionality tests) [49]. In addition to synthesis-level evidence for relevant components, tools or tool components that are informed by iterative tests of components derived from conceptual framework testing could play an equally or more important role in identifying and refining key tool constructs. Theoretically grounded components may also progressively improve the psychometric quality of health research partnership outcome and impact assessment tools. We recommend amending this question in future tool evaluation studies to better capture the full scope of relevant evidence underlying assessment tools.

Conclusions

This large-volume systematic review successfully identified empirically evidenced tools for the assessment of health research partnership outcomes and impacts. Our findings signal some promising improvements in the presence of conceptual, methodological and psychometric characteristics in measurement tools, and the availability of pragmatic tool characteristics. Persistent challenges linked to the nascency of the research partnership field and its measurement remain. Practically, the comprehensive tool characteristics presented here can help researchers and partners choose assessment tools that best fit their purposes and needs. Finally, our findings further strengthen calls for more deliberate and comprehensive tool development, testing, evaluation and reporting of psychometric and pragmatic characteristics to advance research partnership assessment and research partnership science domains.

Advancing knowledge of health research partnership outcomes and impacts assessment and partnership science are mandated aims of the IKTRN [100]. The IKTRN is a research network based at the Centre for Practice-Changing Research at the Ottawa Hospital and supported by the Canadian Institutes of Health Research. The IKTRN comprises researchers from more than 30 universities and research centres and research users from over 20 organizations, with a broad research agenda focused on best practices and their routine application to ensure effective, efficient and appropriate healthcare [101, 102].