Background

Planning for injury prevention and control activities relies upon good quality data from surveillance [1]. Most information on injuries is obtained from data collections that are intended for other purposes, such as hospital admission collections [2], which may not provide the core information needed for injury surveillance (i.e. what injuries occurred, to whom, where and when they occurred, and why [3, 4]). An assessment of a data collection is desirable to identify its capacity to perform injury surveillance and the likely accuracy and validity of conclusions that may be drawn from its data [5–7]. It is important to know the strengths and limitations of data collections as these define the limits for interpreting the analysis of the data in the collection.

Evaluation frameworks have been developed to assess public health [8, 9], syndromic [10], and communicable disease [11] surveillance systems. However, these frameworks each recommend the assessment of a different selection of characteristics of a surveillance system, and all suggest varying definitions of, and methods to assess, these characteristics. This lack of a standard approach makes it difficult to compare evaluation results across different surveillance systems. In addition, none of these frameworks provide details of how they were developed or why particular evaluation characteristics were included. The aim of this paper is to describe the development of a framework for the evaluation of an injury surveillance system. The availability of a clearly defined evaluation framework using agreed evaluation characteristics will provide a sound and reproducible basis both for analysing the extent to which a data collection can be used for a particular purpose and for comparing data collections.

Methods

There were four main stages in the development of an evaluation framework for injury surveillance systems (EFISS). The first stage involved a review of the literature to identify the characteristics that have previously been used, or recommended for use, to evaluate surveillance systems. The second stage reviewed these characteristics by testing them against a well-recognised set of criteria, the SMART criteria; characteristics that did not meet these criteria were dropped. The third stage used expert judgments obtained through a modified-Delphi study to assess the remaining characteristics. The fourth and final stage created a system for rating each characteristic. Ethics approval for the conduct of this research was obtained from the University of Queensland's Medical Research Ethics Committee.

Stage 1: Identification of surveillance system characteristics

The aim of this stage was to review the literature to identify characteristics that had previously been used, or recommended for use, to evaluate a surveillance system. For this research, a characteristic was considered to be any attribute that might be assessed for a surveillance system. The review was undertaken using Medline (1960–2006), Embase (1982–2006), CINAHL (1960–2006), Web of Science (1960–2006), and Google™ using various combinations of key words related to the evaluation of surveillance systems: 'surveillance', 'evaluation', 'guidelines', 'framework', 'injury surveillance', 'injury', 'comparison', 'review', 'assess', and 'quality'.

Stage 2: Review of surveillance system characteristics

The aim of this stage was to review the characteristics identified from the literature by testing them against a well-recognised set of criteria. The characteristics were first categorised and then reviewed using SMART criteria (described below). The SMART criteria are based on goal-setting theory [12] and have been used in a wide range of settings to aid decision-making [13–15]. The SMART criteria were adapted so as to apply to evaluating characteristics of an injury surveillance system. Each characteristic was evaluated against the five criteria of the SMART framework, defined as:

  • Specific – the characteristic should be as detailed and specific as possible;

  • Measurable – it should be possible to objectively assess or monitor the characteristic;

  • Appropriate – the characteristic should be suitable to assess an injury surveillance system and provide information central to injury surveillance;

  • Reliable – the characteristic should be able to provide information that is consistent and reliable; and

  • Time-consistent – it should be possible to measure or monitor the characteristic consistently over time.

Each characteristic was reviewed using the SMART criteria by two authors (RM and AW). Characteristics for which agreement was not initially reached were discussed and final SMART ratings for these characteristics were agreed upon. The characteristics that met all of the SMART criteria moved to the next stage of EFISS development.
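To illustrate the screening logic of this stage (a sketch only, not a procedure actually implemented by the authors), each candidate characteristic can be represented with a flag for each SMART criterion and retained only if all five are met; the characteristics shown and the particular criterion judged unmet are placeholders.

```python
# Illustrative sketch of the SMART screening step. The example characteristics and
# the criterion judged unmet are placeholders, not the authors' actual judgments.
from dataclasses import dataclass, astuple

@dataclass
class SmartRating:
    specific: bool
    measurable: bool
    appropriate: bool
    reliable: bool
    time_consistent: bool

    def meets_all(self) -> bool:
        """A characteristic proceeds to the Delphi stage only if all five criteria are met."""
        return all(astuple(self))

# Hypothetical consensus judgments agreed between the two reviewers
candidates = {
    "data completeness": SmartRating(True, True, True, True, True),
    "stability of the system": SmartRating(True, False, True, True, True),  # fails one criterion
}

retained = [name for name, rating in candidates.items() if rating.meets_all()]
print(retained)  # -> ['data completeness']
```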

Stage 3: Assessment of characteristics by expert opinion

The aim of this stage was to use subject-matter experts in injury surveillance systems to provide expert opinion on the characteristics proposed for inclusion in the EFISS. Characteristics were tested in this stage using a two-round modified-Delphi study [16, 17]. A modified-Delphi study conducted using electronic questionnaires was selected (as opposed to in-person discussions) because the experts were widely distributed around Australia. It also allowed panel members to provide their points of view anonymously and all opinions to be considered, which is not always possible during in-person meetings, where discussion can be dominated by a few individuals [18–20] or opinions can be influenced by other individuals [19, 21, 22]. Moreover, participants could complete the Delphi rounds at their leisure, allowing more time for contemplation of responses [23].

Expert panel members were selected based on seven criteria: (i) working in the field of injury prevention; (ii) familiarity with the evaluation of surveillance systems; (iii) awareness of the strengths and limitations of surveillance systems; (iv) having published on the evaluation of a surveillance system; (v) awareness of quantitative evaluation methods; (vi) familiarity with Australian injury data collections; and (vii) willingness to contribute. Fourteen potential panel members residing in Australia were identified from an examination of international and national conference proceedings and of publications in the peer-reviewed literature that they had authored or co-authored on the evaluation of data collections. These people were contacted via email and invited to participate. All potential panel members worked in senior positions in injury research centres or public health research facilities. Seven ultimately participated (three epidemiologists and four public health professionals), a response rate of 50%. Of the non-participants, four did not reply to the original invitation, one declined due to excessive work commitments, one declined due to family reasons, and one initially agreed but later withdrew.

Panel members were given a generic user name and password and a unique link to an internet site to download a Microsoft® Excel [24] file containing the questionnaires and background material for each modified-Delphi round. Each completed questionnaire was then uploaded to the internet site and the responses accessed and directly downloaded into SPSS [25] for analysis. Each questionnaire was pilot tested for content ambiguities on two individuals not familiar with the research.
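As a purely illustrative sketch of this data flow (the study itself imported responses into SPSS for analysis), a completed questionnaire workbook could be read for analysis as follows; the file name and column names are hypothetical.

```python
# Hypothetical example of reading one panel member's completed round-one
# questionnaire; the study used SPSS, so this pandas version is an analogue only.
import pandas as pd

# One uploaded Excel workbook per panel member, with one row per characteristic.
responses = pd.read_excel("delphi_round1_panelist.xlsx", sheet_name="ratings")

# Hypothetical column layout for the three round-one Likert items.
print(responses[["characteristic", "definition_appropriateness",
                 "practicality", "importance"]].head())
```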

Round one of the modified-Delphi focused on the subset of characteristics for which there was no consistent definition in the literature relevant to injury surveillance. The aim of this round was to reach consensus from a panel of experts on the suitability of 11 proposed characteristic definitions (i.e. six data quality, three operational and two practical characteristics) for an injury surveillance system, the importance of these 11 characteristics for injury surveillance, and the practicality of assessing these 11 characteristics in an injury surveillance system. For this exercise, experts were asked to rate one proposed definition for each characteristic, usually the most common definition identified in the review. However, all definitions of characteristics identified in the literature review were provided to the expert panel as background material. The panel was asked to rate: (i) the appropriateness of the proposed definition of each characteristic; (ii) the practicality of assessing these characteristics; and (iii) the perceived importance of these characteristics for injury surveillance, and to suggest any modifications to the proposed definitions.

A 5-point Likert scale, from 'not at all' to 'extremely', was used to rate each item. The expert panel was considered to have reached high consensus on an item when 70% or more of the panel's ratings were in agreement, moderate consensus when 50% to 69% of ratings were in agreement, and low consensus when less than 50% of ratings were in agreement [26].
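As an illustration only (not the study's actual analysis code), this consensus rule can be expressed as a short function; the interpretation that the proportion refers to the panel's most common response category, and the example ratings, are assumptions.

```python
# Illustrative sketch of the round-one consensus rule [26]. The assumption that
# the relevant proportion is the share of ratings in the modal response category,
# and the example ratings, are not taken from the study itself.
from collections import Counter

def consensus_level(ratings: list[str]) -> str:
    """Classify consensus by the share of panel ratings in the most common category."""
    modal_share = Counter(ratings).most_common(1)[0][1] / len(ratings)
    if modal_share >= 0.70:
        return "high"
    if modal_share >= 0.50:
        return "moderate"
    return "low"

# Hypothetical ratings from a seven-member panel
print(consensus_level(["very/extremely"] * 6 + ["moderate"]))  # -> "high"
print(consensus_level(["very/extremely"] * 3 + ["moderate"] * 2 + ["not at all/somewhat"] * 2))  # -> "low"
```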

A second round of the modified-Delphi had two purposes. Each panel member was asked to provide feedback on the appropriateness of the revised round one characteristic definitions and to rate the importance of all 28 characteristics for one of the three EFISS areas (i.e. data quality, operational, or practical characteristics of an injury surveillance system). For this round, the expert panel was provided with a summary of the panel's ratings and comments from round one, a summary of the revisions made to definitions following the panel's round one comments, and the revised definitions. The expert panel rated the appropriateness of the revised characteristic definitions using the same 5-point Likert scale as in the previous round. The panel rated the importance of all characteristics to assess either the data quality, operational, or practical capabilities of an injury surveillance system using a 7-point Likert scale from 'not at all' to 'extremely'. The 7-point Likert scale was selected to elicit more variability in responses, with the mean and median used to measure the central tendency of the distribution of the panel's ratings, and the standard deviation (SD) and interquartile range used to measure variability across the panel members' ratings.

Ratings were judged on the basis that a characteristic scored consistently high across raters. For a characteristic to be considered 'important', it had to be judged as such by the majority of the panel, with a mean rating of 6.0 or higher adopted as a general cut-off to indicate a reasonably high level of importance. However, this meant that only data completeness, sensitivity, and representativeness would be included as data quality characteristics, and specificity and positive predictive value (PPV) would be excluded. As both specificity and PPV were rated very close to the mean cut-off of 6.0 (i.e. 5.9) and both had high consensus (low SD), it was decided to consider both of these characteristics as also important for data quality. The SD was adopted as a measure of consensus: a tight spread of scores (an SD between 0 and 1) indicated high consensus, a medium spread of scores (an SD greater than 1.0 and less than 2.0) indicated moderate consensus, and a wide spread of scores (an SD greater than 2.0) indicated weak or low consensus [26]. Specifically, a high score was judged as a mean rating of 5.9 and above, and consistency was judged as high if the SD of ratings was 1 or less, moderate if it was between 1 and 2, and low if it was more than 2. For inclusion in the EFISS, a characteristic needed both a high mean score and high consistency.
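A minimal sketch of this inclusion rule is shown below; the ratings in the example are hypothetical and are not the panel's actual data.

```python
# Minimal sketch of the round-two inclusion rule: a characteristic is recommended
# for the EFISS when its mean importance rating on the 7-point scale is 5.9 or
# higher and the SD of the ratings is 1 or less (high consensus). Example ratings
# are hypothetical.
import statistics

def include_in_efiss(ratings: list[int]) -> bool:
    mean = statistics.mean(ratings)
    sd = statistics.stdev(ratings)    # spread of importance ratings across panel members
    return mean >= 5.9 and sd <= 1.0  # high mean score and high consistency required

print(include_in_efiss([6, 6, 7, 5, 6, 6, 7]))  # -> True (mean ~6.1, SD ~0.7)
print(include_in_efiss([7, 3, 6, 2, 7, 5, 6]))  # -> False (mean ~5.1, SD ~1.9)
```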

Stage 4: Development of a rating system for the EFISS characteristics

The aim of this stage was to identify an appropriate rating system to use with the EFISS and to identify what would be considered to be both low and high ratings of each characteristic. Rating systems have been used in a wide range of areas to assist in the interpretation of assessment results, to facilitate comparison, and to obtain an overall rating of performance. A number of rating systems were investigated for possible application to the EFISS, including systems developed to rate the quality of scientific evidence [27, 28], credit risk [29, 30], tractor safety [31], professional sports [32], and vehicle safety [33, 34]. In addition, the injury surveillance literature was reviewed to estimate what would be considered to be either high or low ratings for each characteristic.

Results

Stage 1: Identification of surveillance system characteristics

Twenty-four journal articles, book chapters, and reports were located that provided guidelines or made recommendations regarding characteristics that should be evaluated in a surveillance system. From these, a list of 40 characteristics was identified (Table 1). These characteristics were: data completeness [4, 8–10, 35–37], sensitivity [4, 8, 9, 35–47], specificity [4, 41–43, 47], representativeness [4, 8–10, 35–37, 39–47], positive predictive value (PPV) [8, 9, 35–37, 39–41, 44–46], positive likelihood ratio (LR+) [48], clear purpose and objective(s) [8, 10, 35–37, 40], data collection process [8–10, 35–37, 41], clear case definition [8, 35, 37, 45, 46, 49], type of data collected is adequate for injury surveillance [8, 38], use of uniform classification systems (i.e. standardised classification system) [4, 8, 10, 36, 45, 46, 49, 50], system can be integrated with other data collections/compatible data collections [8, 10, 50], legislative requirement for collection of data [8–10, 41, 51], simplicity [4, 8, 9, 35–37, 39–46], timeliness [4, 8–10, 35–37, 39–46, 50, 52], flexibility [4, 8–10, 35–37, 39–46], quality control measures [36, 37, 52], data confidentiality [8–10, 36, 51, 52], individual privacy [8–10, 36, 52], system security [8–10], stability of the system [8–10, 41, 47], data accessibility [4, 10, 47, 49, 50], acceptability [4, 8–10, 35–46], usefulness [6, 8–10, 35–37, 40–44], data linkage potential [10, 39, 52], geocoding potential [10, 39], compatible denominator data [36, 37, 39, 49, 50], routine data analysis [4, 8, 10, 35–37, 41, 44–46], guidance material for data interpretation [36, 41], routine dissemination of information [4, 8, 10, 35–37, 41, 44–46, 49, 50, 52], adequate resources/cost [8–10, 36, 38, 40–44, 49, 50, 52], communication support [41], coordination support [41], effectiveness of system in supporting programs [9, 47], efficiency of resource use [9], portability [10], practicality of system [6, 47], relevance of data to users [47], supervision support functions [41], and training support functions [41, 50].

Table 1 Relevance of characteristics identified from the literature for inclusion within an evaluation framework for injury surveillance systems

Stage 2: Review of surveillance system characteristics

The characteristics were grouped into three categories based on the nature of the information they provide on injury surveillance. These were: (1) data quality characteristics, which provide evaluative information regarding the quality of the information obtained from a surveillance system; (2) operational characteristics, which describe key aspects or processes governing the way a surveillance system works; and (3) practical characteristics, which describe the functional capabilities and practical elements of a system. Each characteristic was assessed by two of the authors using the SMART criteria and initial agreement was reached for 80% of characteristics. The remaining characteristics were discussed and final SMART ratings for these characteristics determined. Twenty-eight characteristics were judged to meet all five SMART criteria. These included all characteristics in the data quality group, all but one operational characteristic (i.e. stability of the system), and just under half of the practical characteristics (Table 1).

Stage 3: Assessment of characteristics by expert opinion

Modified-Delphi round one

The aim of round one of the modified-Delphi was to review the proposed definitions of the eleven characteristics for which the literature review failed to identify any consistent definition. The appropriateness, practicality, and importance of each of these characteristics were assessed.

The results for the panel's ratings of the appropriateness of the proposed definitions for the 11 characteristics that had not been consistently defined in the literature are shown in Table 2. For two characteristics, sensitivity and timeliness, the panel reached 100% agreement on the proposed definitions. The panel rated most of the remaining definitions as 'very/extremely' or 'moderately' appropriate. The definitions of the nine characteristics that did not achieve 100% agreement by the expert panel were revised based on the panel's comments.

Table 2 Expert panel rating of the appropriateness of the definition of each characteristic to assess an injury surveillance system (modified-Delphi rounds 1 and 2)

The panel's ratings of the practicality of each characteristic varied. Four characteristics (usefulness, simplicity, data completeness, and timeliness) were most commonly rated by the panel as 'very/extremely' practical to assess in an injury surveillance system. Five characteristics (acceptability, sensitivity, specificity, representativeness, and flexibility) were most commonly rated as 'moderate'. The PPV was rated equally often as 'moderate' and as 'not at all/somewhat' (both 42.9%), and the LR+ was most commonly rated as 'not at all/somewhat' practical to assess.

The ratings of the importance of each characteristic showed that almost all characteristics were rated as important. The exceptions were flexibility, which was mainly rated as 'moderate' (71.4%), and the PPV and LR+, which were rated by the panel as 'not at all' or 'somewhat' important (both 42.9%).

Modified-Delphi round two

The aim of the second Delphi exercise was to review the definitions modified after the first round and to obtain ratings of the importance of all characteristics that remained after the SMART criteria assessment. To evaluate the appropriateness of the revised definitions, ratings of the revised definitions were compared with the ratings of the original definitions used in round one. Ratings improved for almost all characteristics. All data quality characteristics, one operational (i.e. timeliness) and one practical (i.e. usefulness) characteristic were rated by the majority of the panel as 'very/extremely' appropriate in round 2 (Table 2).

Ratings of the importance of the six data quality characteristics showed high consensus and high scores for all characteristics except LR+ (Table 3). Ratings of the importance of the 14 operational characteristics showed high scores and consensus for nine characteristics. Three characteristics had low mean scores (i.e. simplicity, flexibility and system integration) and two were rated both low and inconsistently (i.e. legislative requirement for collection of data and data adequacy for injury surveillance) (Table 3). Ratings of the importance of the eight practical characteristics showed high mean ratings and consensus for all characteristics except potential for data linkage, potential for geocoding and routine dissemination of information (Table 3).

Table 3 Expert panel rating of the importance of each characteristic to assess either the data quality, operation or practical ability of an injury surveillance system (modified-Delphi round 2)

At the conclusion of the modified-Delphi study, the definitions of six data quality, two operational and one practical characteristic of an injury surveillance system were all rated as appropriate and were considered suitable for use in an EFISS (see Table 4). The definitions of flexibility and acceptability were not rated as 'very/extremely' appropriate by the majority of experts and were, therefore, not considered suitable for use in an EFISS and were removed from the framework. Tables 5, 6 and 7 show the final definitions for each included characteristic.

Table 4 Recommended characteristics for inclusion in an evaluation framework for injury surveillance systems
Table 5 Rating criteria for the data quality characteristics of the evaluation framework for injury surveillance systems
Table 6 Rating criteria for the operational characteristics of the evaluation framework for injury surveillance systems
Table 7 Rating criteria for the practical characteristics of the evaluation framework for injury surveillance systems

Stage 4: Development of a rating system for the EFISS characteristics

The framework adopted to create the rating scales for each EFISS characteristic was the same framework used in the evidence-based medicine (EBM) field [27, 28]. This framework was chosen because its hierarchical structure and its use of clearly defined rating criteria have been successfully applied in other areas, such as public health interventions [53].

There was only minimal guidance from the literature regarding what might be considered to represent either high or low ratings of each EFISS characteristic. For example, Hasbrouck et al. [54] considered the sensitivity of the detection of violent injuries in Kingston, Jamaica, which ranged from 62% to 69%, to be 'adequate' and a PPV of 86% to be 'high'. Similarly, Hedegaard et al. [55] stated that a PPV of 89% was 'high' in the confirmation of firearm-related injuries in Colorado, while Wiersema et al. [56] reported a PPV of 99.6% to be 'very high' and referred to a sensitivity of 99.6% as 'extremely sensitive' in the detection of firearm-related injuries in Maryland. At the other end of the spectrum, McClure and Burnside [57] considered the sensitivity of the detection of injuries in the Australian Capital Territory Injury Surveillance and Prevention Project, at 31%, to be 'low'.

The rating scales developed for the EFISS are based on the (limited) previous research and the authors' professional judgment. A four-level rating scheme is proposed for most characteristics, composed of I 'very high', II 'high', III 'low', and IV 'very low'. For five characteristics, a dichotomous scale using only levels I and IV is proposed. These scales are set out in Tables 5, 6 and 7.
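For illustration, the proposed rating scheme could be represented as follows; the assignment of particular characteristics to the four-level or dichotomous scale shown here is a placeholder (the actual rating criteria are set out in Tables 5, 6 and 7).

```python
# Illustrative sketch of the proposed EFISS rating levels. Which characteristics
# use the dichotomous scale is defined in Tables 5-7; the assignments below are
# placeholders only.
from enum import Enum

class EfissRating(Enum):
    I = "very high"
    II = "high"
    III = "low"
    IV = "very low"

FOUR_LEVEL = (EfissRating.I, EfissRating.II, EfissRating.III, EfissRating.IV)
DICHOTOMOUS = (EfissRating.I, EfissRating.IV)  # used for five characteristics

allowed_scales = {
    "data completeness": FOUR_LEVEL,        # placeholder assignment
    "clear case definition": DICHOTOMOUS,   # placeholder assignment
}

def record_rating(characteristic: str, level: EfissRating) -> EfissRating:
    """Record a rating, enforcing the scale defined for that characteristic."""
    if level not in allowed_scales[characteristic]:
        raise ValueError(f"'{characteristic}' is rated on a restricted (dichotomous) scale")
    return level

print(record_rating("data completeness", EfissRating.II))  # -> EfissRating.II
```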

Discussion

Injury surveillance systems have lacked a systematic framework by which they could be evaluated. Evaluation frameworks exist for other public health surveillance systems [8–11], but none of these is specific to an injury surveillance system, and their methods of development are generally poorly described. This paper describes the development of an evaluation framework specifically designed to evaluate an injury surveillance system. It used a systematic process involving several rounds of review and analysis and has resulted in a framework of 18 characteristics. The strengths of this new framework are that the included characteristics have been tested by relevant subject-matter experts for both clarity of definition and importance in evaluating injury data collections. The new framework could be applied to any type of injury data.

The process revealed considerable disagreement among experts as to the meaning and relevance of some of the potential characteristics, but high levels of agreement for others. In general, characteristics that were rated low in importance, or whose importance was considerably disputed, were excluded. One characteristic, acceptability, was also excluded because of disagreement about its definition despite being rated high in importance. This highlights the problem of using loosely defined characteristics in evaluations. Future refinement of this framework may consider incorporating this characteristic in some way. The remaining 18 characteristics have the advantages of clear definitions and relatively high-rated importance.

It could be argued that the standards adopted for including characteristics in the EFISS were too high and that some additional characteristics should be included. Indeed, the core set of characteristics could be supplemented with an optional additional set comprising all or some of the ten characteristics that were excluded. This would involve one additional data quality characteristic (i.e. LR+), five additional operational characteristics (i.e. legislative requirement for data collection, data adequacy for injury surveillance, simplicity, flexibility, and system integration) and four additional practical characteristics (i.e. acceptability, potential for data linkage, potential for geocoding, and routine dissemination). However, the definitions of some of these (e.g. flexibility and acceptability) would need further refinement, and a number of these characteristics were rated as low in importance and with little consistency between raters (e.g. legislative requirement for data collection, and adequacy of data for injury surveillance). While it is certainly true that the characteristics employed in an evaluation can vary with the purpose of the evaluation, poorly defined characteristics that result in inconsistent ratings between raters will never be useful. Furthermore, there is a core set of characteristics of any data collection, such as data completeness and a clear case definition, that forms the basis for its use no matter what the purpose. As the purpose for evaluation of injury data was not specified for the expert raters, it is not surprising that these core characteristics emerged as the most important.

The EFISS includes a rating system for assessing the adequacy of each characteristic, the first such attempt for a public health-related surveillance system [8–10, 41, 58]. Further work may ultimately lead to refinement of the rating system, although the most appropriate rating criteria will likely vary with context.

There are several strengths of the current study. First, it adopted a broad literature search strategy that included reports prepared by government and non-government organisations as well as academia. This captured the broadest range of existing evaluation frameworks, since many were not published in the peer-reviewed literature. Second, the study used generally accepted criteria (SMART) as well as expert judgment for testing potential evaluation characteristics [8–11]. Lastly, the study took a systematic, a priori approach to defining consensus during the modified-Delphi study by specifying a consensus range (i.e. high, moderate, low) [26].

It is arguable that the results of this research may have been influenced by the nature of the Delphi panel. Even though the selection of panel members attempted to include all of the major experts in injury surveillance in Australia, the panel members self-selected to an extent, as not all responded and three declined for understandable reasons such as other commitments. There were no obvious differences in the characteristics (age, experience, work context) between participants and non-participants. Furthermore, while all participants were from Australia, many had worked with international data collections and so were familiar with a range of types of injury data collections and with the different purposes to which they could be put. Whether the results would change with a larger group of injury surveillance experts working in another country remains to be established. The expert panel consisted of only seven members and, while there are no strict rules governing the number of Delphi panel members [59], this low number was not ideal, as the opinion of one or two experts could notably alter the results. There is little or no agreement regarding the appropriate size of an expert panel for use in a Delphi study [59–61]. Mitchell [62] states that a panel should have at least eight to ten members, but may be as large or as small as resources and time allow. On the other hand, Brockhoff [63] considers that five to nine participants can perform well using the Delphi process. Therefore, the number of panellists in the current study is not outside the limits of what is considered appropriate, or practical, for a Delphi study. Furthermore, although some Delphi studies use multiple rounds of review, only two rounds were used in this study to reduce the likelihood of questionnaire fatigue among participants [64, 65]. However, it is possible that additional Delphi rounds may have resulted in more characteristics being included in the EFISS, as further revision of characteristic definitions may have produced higher ratings of appropriateness and importance by the expert panel.

Conclusion

The EFISS has built upon existing evaluation frameworks for surveillance systems to produce a framework to guide the evaluation of an injury surveillance system. It is offered as a prototype evaluation framework that has clear developmental foundations. While it can be used in its current form, it could certainly be developed further. For example, the EFISS could include a weighting system to adjust for the importance of different EFISS characteristics. In addition, the interrelationships between characteristics may also be considered within the rating system. Further testing may result in more precise and hence more useful definitions of problem characteristics like acceptability. In the meantime, the EFISS is offered to assist agencies operating injury surveillance systems to identify areas for data quality and system improvement.