Introduction/background

Peripheral arterial disease (PAD) describes the process of progressive atherosclerosis affecting arteries, most frequently in the lower limb. The prevalence in the general population has been estimated at up to 19% in people over the age of 55 years [1], with incidence increasing with advancing age and in the presence of smoking, inactivity and obesity [1, 2]. The presence of PAD is associated with increased risk of mortality and morbidity from cardiac atherosclerosis [2], and, in its advanced stages, can result in lower extremity ulceration and amputation [3]. Diabetes mellitus is an independent risk factor for the development of PAD [4], and in people with diabetes, atherosclerotic plaques tend to have a more distal and diffuse distribution and there is a more aggressive disease presentation [5].

Due to the high risk of concurrent cardiovascular morbidity, mortality and lower limb complications associated with PAD, accurate and reliable diagnostic testing methods are required for screening and ongoing monitoring [2, 6]. Early detection of PAD allows for intervention and management to reduce the risk of mortality and morbidity related to atherosclerosis through lifestyle modification, pharmacotherapy (e.g. statins, antiplatelets) and measures to address systemic risk factors such as hypertension or diabetes [7]. Current recommendations for non-invasive lower limb vascular assessment include using the ankle-brachial index (ABI) as an objective measurement of peripheral blood flow [7, 8]. The ABI represents the ratio of ankle to brachial systolic pressure. It is recommended that the ABI be calculated by dividing the higher of the systolic pressures in the dorsalis pedis and tibialis posterior vessels at the ankle by the higher of the systolic pressures measured in the brachial arteries of both arms [7, 8].

The ABI is widely used to screen for PAD in different clinical settings and by different health professionals, from general medical practitioners to specialist vascular technicians [9, 10]. Reliability of the test for accurate ongoing monitoring of lower limb vascular status has the potential to be affected by a number of factors. As the ABI is an operator-dependent test, these include the experience and skills of the clinician, particularly as multiple clinicians are frequently involved in ongoing monitoring measurements [11, 12]. There are also a number of types of equipment (e.g. automated versus manual) and methods used to measure ankle and arm blood pressures (e.g. stethoscope, Doppler, photoplethysmography probe), with variable findings as to whether the results are interchangeable [13,14,15,16]. The pre-test protocol and test environment have also been demonstrated to affect the resting ABI at measurement, with variations in body position [17], recency of tobacco smoking, caffeine intake [18, 19] and exercise [20, 21], and pre-measurement rest time [22] all likely to introduce error to the measurement and affect the test-retest reliability.

Objectives

Given that the ABI is the recommended method for screening for the presence and progression of PAD, it is important that it is reliable. Therefore, the aim of this review was to systematically evaluate the literature to determine the inter- and intra-rater reliability of the ABI in adults.

Methods

Search strategy

A search of relevant biomedical journal databases from the University of Newcastle library website was performed to identify studies that consider the reliability of ABI measurement from database inception to January 2019 using MEDLINE (1946+), EMBASE (1947+), and CINAHL Complete. Truncated versions of some search terms were used to ensure that relevant studies were included (Table 1).

Table 1 Search terms: searches were limited to human studies

Inclusion and exclusion criteria

The review was conducted with reference to the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) statement [23]. The following criteria had to be satisfied for inclusion in the review: published original research evaluating the reliability of the ABI in adults. Studies were excluded if the test-retest time frame was long enough that results were likely to be affected by disease progression (e.g. > 12 months). No language restrictions were applied to the database searches.

Other sources

Hand searching of the reference list of appropriate articles was also conducted.

Data collection and analysis

All abstracts obtained were assessed independently by SC and SL for inclusion. There were no instances of disagreement between reviewers, so arbitration by a third person (VC) was not necessary. Data extraction was performed by SC and SL. It was pre-determined that a meta-analysis of reliability outcomes for inter- and intra-rater reliability would be conducted provided there were sufficient studies that reported the estimator of interest and a measure of uncertainty for this estimator (e.g. standard error, 95% confidence interval, non-truncated p-value). Given the expectation of a high degree of study heterogeneity, we believed a fixed effect meta-analysis would generally not be appropriate, so we aimed to pool estimates only by using a random effects approach, and only if there were at least 5 studies [24].
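The planned pooling step can be illustrated with a minimal sketch. The review does not name the random effects estimator it intended to use, so the DerSimonian-Laird method below is an assumption, as are the function name `pool_random_effects` and the example inputs, which are hypothetical and not taken from any included study. The sketch shows why each study must supply both an estimate and a standard error, and it enforces the five-study minimum stated above.

```python
import math

MIN_STUDIES = 5  # minimum number of studies stated in the review methods


def pool_random_effects(estimates, std_errors):
    """Pool estimates with a DerSimonian-Laird random effects model (assumed method).

    Returns (pooled_estimate, pooled_standard_error, tau_squared).
    """
    k = len(estimates)
    if k != len(std_errors):
        raise ValueError("each estimate needs a matching standard error")
    if k < MIN_STUDIES:
        raise ValueError("fewer than %d studies: pooling not attempted" % MIN_STUDIES)

    variances = [se ** 2 for se in std_errors]
    fixed_weights = [1.0 / v for v in variances]
    sum_w = sum(fixed_weights)
    fixed_mean = sum(w * y for w, y in zip(fixed_weights, estimates)) / sum_w

    # Cochran's Q: weighted scatter of the studies around the fixed-effect mean.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(fixed_weights, estimates))
    c = sum_w - sum(w ** 2 for w in fixed_weights) / sum_w
    # Between-study variance, truncated at zero.
    tau_squared = max(0.0, (q - (k - 1)) / c)

    random_weights = [1.0 / (v + tau_squared) for v in variances]
    sum_rw = sum(random_weights)
    pooled = sum(w * y for w, y in zip(random_weights, estimates)) / sum_rw
    return pooled, math.sqrt(1.0 / sum_rw), tau_squared


# Hypothetical estimates and standard errors, for illustration only.
pooled, se, tau2 = pool_random_effects(
    [0.5, 0.6, 0.7, 0.8, 0.9], [0.05, 0.05, 0.05, 0.05, 0.05]
)
print(round(pooled, 3), round(se, 3), round(tau2, 4))
```

A study that reports an estimate without a standard error or confidence interval cannot be given a weight in this scheme, which is why incomplete reporting blocks pooling.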

Methodological quality assessment

The studies that met the inclusion criteria were appraised for risk of bias using the Quality Appraisal of Reliability (QAREL) Checklist and qualitative methodological assessment [25]. All full-text papers were assessed for methodological quality independently by two reviewers (SC/SL), and as there were no disagreements arbitration by a third reviewer (VC) was not necessary.

Results

A total of 1703 articles were retrieved, of which 36 were identified as suitable for full-text review. Twenty-one papers were excluded based on the exclusion criteria (Fig. 1): 10 papers reported comparison of methods [26,27,28,29,30,31,32,33,34,35], five studies did not report measures of reliability [36,37,38,39,40], two studies compared raters’ experience [41, 42] and one reported a novel trial design, for which the reliability results were duplicated in another included paper [43]. One paper used measures repeated up to 365 days apart, with a mean time between measures of 228 days, which is long enough to encompass changes attributable to progression of PAD [44]. Two papers were conference abstracts, for which the full text could not be obtained as the authors did not respond to a request for further information [40, 45].

Fig. 1 PRISMA flow chart

Of the included papers, seven measured inter-tester reliability [12, 46,47,48,49,50,51], four assessed intra-tester reliability [52,53,54,55], and four considered both inter- and intra-tester reliability [13, 16, 56, 57].

Characteristics and overview of included studies

The 15 studies in this review included a total of 916 participants, with data collected from a combination of one and both lower limbs (1396 limbs in total). Two studies did not state the number of limbs included [52, 53]. Eleven studies assessed inter-rater reliability [12, 13, 16, 46,47,48,49,50,51, 56, 57], and eight studies reported intra-rater reliability [13, 16, 52,53,54,55,56,57]. The characteristics of included studies are described in Table 2. Eleven studies reported participants’ gender, with more men (n = 416, 56.4%) overall than women, whilst gender was unreported in four studies [12, 46, 49, 50]. Most of the studies included predominantly older participants (age range 41–92 years) [12, 13, 16, 47,48,49, 51, 53,54,55, 57]; however, two studies recruited only younger adults (age range 22–30 years) [46, 56], one study included 18–80 year olds [52] and one study did not report participants’ ages [50]. The majority of studies [12, 47,48,49,50,51, 55, 57] included only participants with suspected PAD, or risk factors for atherosclerosis; three studied a mixed population including those without risk factors or clinical indicators of PAD [13, 16, 52]; two studies included only participants with diabetes [53, 54], and two studies included only healthy individuals [46, 56].

Table 2 Participant characteristics and reliability measure

There was little consistency in the training and qualifications of the raters used, with experience ranging from students [46, 47] to experienced vascular technicians and/or vascular specialist doctors [12, 13, 48, 54, 57]. Six studies did not state the background of the personnel performing the test [49, 51,52,53, 55, 56]. The majority of the studies used Doppler and manual sphygmomanometer to measure systolic blood pressures [12, 13, 16, 46,47,48,49, 51,52,53, 56]; however, three studies used an automated device to obtain some or all of the pressure readings [54, 55, 57] and one study did not report the method used [50]. The reported pre-measurement rest time varied from 5 min [55] to 15 min [48], with six studies not reporting a period of rest before testing commenced [47, 49, 50, 52,53,54]. The time between repeat testing varied from 5 min [46, 56] to 4 weeks [52]; six studies did not report time between repeated measures [12, 49,50,51, 54, 55]. Several different methods were used to calculate the ABI. The majority of studies [47,48,49, 51, 52, 56, 57] divided the highest ankle pressure by the higher brachial pressure measurement, two [13, 16] used the highest ankle pressure and the mean brachial value, and one used the lowest ankle pressure and the highest brachial pressure [55]. One study used a fully automated device that calculated the ABI value [54], and four did not state how the ABI was calculated [12, 46, 50, 53].
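To illustrate how the choice of calculation method alone can change the result, the following minimal sketch applies the three calculation rules described above to one set of pressures. The function name `abi_variants` and the pressure values are hypothetical and are not drawn from any included study.

```python
from statistics import mean


def abi_variants(dorsalis_pedis, posterior_tibial, brachial_left, brachial_right):
    """Return the ABI for one limb under three calculation rules.

    All arguments are systolic pressures in mmHg.
    """
    ankle = (dorsalis_pedis, posterior_tibial)
    brachial = (brachial_left, brachial_right)
    if min(ankle + brachial) <= 0:
        raise ValueError("pressures must be positive")
    return {
        # Rule used by the majority of included studies.
        "highest_ankle_over_highest_brachial": max(ankle) / max(brachial),
        # Highest ankle pressure over the mean of the two brachial pressures.
        "highest_ankle_over_mean_brachial": max(ankle) / mean(brachial),
        # Lowest ankle pressure over the highest brachial pressure.
        "lowest_ankle_over_highest_brachial": min(ankle) / max(brachial),
    }


# Hypothetical pressures for a single limb, for illustration only.
for rule, value in abi_variants(130, 110, 140, 130).items():
    print(rule, round(value, 2))
# highest_ankle_over_highest_brachial 0.93
# highest_ankle_over_mean_brachial 0.96
# lowest_ankle_over_highest_brachial 0.79
```

With these illustrative pressures the same limb yields values from 0.79 to 0.96, so reliability estimates from studies using different rules are not directly comparable.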

Methodological quality

The quality of studies was variable with regard to reported blinding of raters, order of examination and the time between repeated measurements, with no study clearly addressing all of these variables. While most studies used appropriate statistical measures of agreement, reporting of results was frequently incomplete and the true extent of reliability could not be determined (Table 3).

Table 3 QAREL Checklist

Meta-analysis

A number of the eligible papers identified lacked sufficient data relating to the main outcomes to allow for inclusion in a meta-analysis. For example, the paper by Chesbro et al. [46] provided no details on the intra-rater reliability of measurements taken using a Doppler, which was the main outcome being assessed in this review. Similarly, papers by De Graaff et al. [57] and Demir et al. [52] detailed no measure of variability for the intraclass correlation coefficients (ICC) reported, which is required when pooling results in a meta-analysis. It is not clear whether Chesbro et al. [46, 56] used data from the same population in both studies, and the authors did not respond to a request for clarification. Finally, for the paper by Aboyans et al. [13] the type of ICC calculated was not reported, and while pooling of these data would be possible, understanding which ICC was used is preferred to allow for accurate and appropriate calculation of the standard error. As only a small number of eligible papers were identified, data from all articles would be required to allow for appropriate pooling of ICCs. Thus, as a consequence of the small number of papers reviewed and the insufficient data reported by several of them, it was not possible to conduct a meta-analysis as part of this review. None of the authors responded to requests for missing data. A narrative review of results is presented instead.

Inter-rater reliability

Inter-rater reliability results are included in Table 2. Statistical methods for calculating reliability were inconsistent. Of the eleven included studies, five reported levels of agreement with ICCs [13, 16, 46, 56, 57]. Of these, only three [13, 46, 56] reported 95% confidence intervals, which limits the interpretation of reliability in the context of clinically meaningful results. Based on ICC values alone, inter-rater reliability was highly variable, ranging from poor (ICC: 0.42) [16], to excellent (ICC: 1.0) [46].
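For readers unfamiliar with how an ICC is obtained, the following is a minimal sketch of one common form, often labelled ICC(2,1) (two-way random effects, absolute agreement, single measurement). The included studies did not always report which ICC form they used, so this choice is an assumption, and the function name `icc_2_1` and the example readings are hypothetical.

```python
def icc_2_1(ratings):
    """Compute ICC(2,1) from a table of ratings.

    ratings: list of rows, one row per limb, one value per rater in each row.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 limbs and 2 raters, with no missing values")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: limbs (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ABI readings: five limbs, each measured by two raters.
readings = [[0.95, 0.90], [1.10, 1.12], [0.70, 0.78], [1.00, 0.94], [0.85, 0.88]]
print(round(icc_2_1(readings), 2))
```

A point estimate such as this says nothing about its own precision, which is why the absence of 95% confidence intervals limits interpretation.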

Other estimates of reliability reported in included studies were coefficient of variation between raters [12, 49] (ranging from 3.2 to 5.9%), inter-observer reliability of 10% for raters [48], and a moderate Pearson’s correlation coefficient of 0.52 in a population with suspected PAD [50].

Of the remaining studies, one demonstrated statistically significant differences in ABI between raters in a population with severe PAD and in those with no disease, which did not occur in those participants with mild to moderate PAD [47], suggesting increased reliability with this disease state. In contrast, another paper reported Kappa coefficients of 0.4 (low agreement) for healthy limbs, 0.7 (good agreement) for limbs with PAD, and 0.43 (moderate agreement) for limbs with medial arterial calcification (MAC) (p < 0.001 for all values) [51].
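A Kappa coefficient measures agreement between raters on a classification rather than on the ABI value itself. The following minimal sketch assumes Cohen's kappa for two raters and an ABI cut-off of 0.90 or below for classifying a limb as having PAD; neither the kappa variant nor the cut-off used in the cited study is stated here, so both are assumptions, and the function names and ABI values are hypothetical.

```python
PAD_THRESHOLD = 0.90  # assumed cut-off, not taken from the cited study


def classify_pad(abi_values, threshold=PAD_THRESHOLD):
    """Label each limb True (PAD) when its ABI is at or below the threshold."""
    return [abi <= threshold for abi in abi_values]


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who classified the same limbs."""
    rater_a, rater_b = list(rater_a), list(rater_b)
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("both raters must classify the same, non-empty set of limbs")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions.
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ABI values from two raters on the same four limbs.
labels_a = classify_pad([0.85, 0.88, 0.95, 1.05])
labels_b = classify_pad([0.86, 0.93, 0.97, 1.02])
print(cohens_kappa(labels_a, labels_b))  # 0.5
```

In this illustration the two raters differ by only 0.05 on the second limb, yet the limb changes category, so kappa is sensitive to small measurement differences near the cut-off.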

Intra-rater reliability

Intra-rater reliability results are included in Table 2. Various methods of calculating reliability were used. Of the eight included studies, four reported ICCs [13, 52, 56, 57], with ICC values ranging from poor (ICC: 0.42) [56] to excellent (ICC: 0.98) [57]. Interpretation of the results was again limited by the fact that only two articles reported 95% confidence intervals [13, 56]. Other estimates of reliability included coefficient of variation [53,54,55] (range 4.95% [54] to 15.8% [55]), and an intra-observer variance of 8% [16].

Discussion

The findings of this review are that the inter- and intra-tester reliability of the ABI across a number of mixed populations appears to be acceptable; however, statistical tests of reliability in included papers were heterogeneous and levels of statistical reporting were inconsistent and incomplete. This makes interpretation of the reliability of the ABI in the context of clinical detection, evaluation and ongoing monitoring of peripheral arterial supply challenging, and prevented meta-analysis. For example, where studies lack 95% confidence intervals for ICCs, interpretation of the value is limited because the lowest level of reliability that it may represent is unknown. Similarly, for coefficients or estimates of variation, values between 3.2 and 15.8% were reported. Whilst this is considered an acceptable level of variation for many clinical tests, for the ABI it can represent a range of values that spans both normal and pathological results, which could reduce the ability of the ABI to reliably determine the presence and extent of PAD. For example, assuming a variation of 15%, an ABI of 1.0 (which is considered ‘borderline’ when ABI is used as a screening tool [6]) could represent a true value between 0.85 (indicative of PAD) and 1.15 (‘normal’).
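The arithmetic behind this example can be sketched as follows. Treating the reported percentage variation as a simple symmetric bound around the measured value follows the simplification used in the example above and is not a formal confidence interval; the function name `abi_range` is hypothetical.

```python
def abi_range(measured_abi, variation_percent):
    """Return (lower, upper) ABI values implied by a given percentage variation."""
    if measured_abi <= 0 or variation_percent < 0:
        raise ValueError("ABI must be positive and variation must not be negative")
    margin = measured_abi * variation_percent / 100.0
    return measured_abi - margin, measured_abi + margin


# Lowest and highest variation reported in the included studies, plus the
# 15% figure used in the worked example, all applied to a measured ABI of 1.0.
for variation in (3.2, 15.0, 15.8):
    lower, upper = abi_range(1.0, variation)
    print(variation, round(lower, 3), round(upper, 3))
# 3.2 0.968 1.032
# 15.0 0.85 1.15
# 15.8 0.842 1.158
```

At the lowest reported variation the implied range stays close to the measured value, whereas at the highest it extends from values indicative of PAD to values considered normal.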

Further complicating the interpretation and generalisability of the inter- and intra-rater reliability results of included studies was the heterogeneity of participant populations. Whilst the majority of studies included older people with PAD risk factors or suspected PAD, three studies also included healthy participants [13, 16, 52], and two used an exclusively young and healthy population [46, 56]. In clinical practice, ABI is used to evaluate peripheral arterial supply in people with risk factors for atherosclerosis, and in those with clinical signs and symptoms of PAD. The variation in the disease status of participants across the studies included in this review makes it difficult to evaluate how the studies’ findings apply to the people in whom the ABI would clinically be used. The study that reported near-perfect inter- and intra-tester reliability included only healthy individuals under the age of 30 [56]. This population would not typically undergo vascular screening, and the results obtained do not indicate the ability of the ABI to perform reliably in the presence of pathology, where the result is likely to be lower and the change in result indicative of worsening pathology is therefore likely to be small. In contrast, inter-tester and intra-tester reliability was found to be poor in several populations in which this test is recommended, including people with diabetes and without MAC [51], and older people with risk factors for PAD [16].

Methodological differences between studies are also likely to have contributed to variable reliability outcomes, with automated oscillometric devices demonstrating marginally better reliability than manual assessment using Doppler [49, 55], while Doppler evaluation was found to be more reliable than the use of pulse palpation [13] or stethoscope [46]. Higher ABI reliability was found in more experienced raters [47]. Whilst most of the studies reported that participants rested for 5–15 min prior to testing [12, 13, 16, 46, 48, 51, 55,56,57], six studies did not describe any pre-test preparation [47, 49, 50, 52,53,54], and only one paper took steps to ensure that participants did not consume alcohol, caffeine or tobacco (which are known to affect blood pressure) in the two hours prior to testing [55]. The absence of such controls may have affected measurements, particularly when taken across two different testing sessions. This lack of reporting of the methodology used to obtain systolic blood pressure measurements makes it difficult to compare results across the included studies, as it is unknown how much external factors are likely to contribute to measurement variability.

Two papers identified the presence of diabetes mellitus as a factor that may affect reliability of the ABI [12, 51]; however, only one study included a large enough sample of this cohort to perform statistical tests [51]. This study, which used only participants with diabetes, reported the Kappa coefficient for inter-tester measures for participants classed as having PAD or not, rather than calculating ICCs on the measures obtained. The authors reported ‘good’ reproducibility of the ABI (κ = 0.7) in people classified by their ABI measurement as having PAD, but low reproducibility in those without PAD and in those with MAC. Previous research has also shown that people with diabetes demonstrate a different response to pre-measurement rest [22], and that brachial blood pressure measurement is also less reliable in these individuals [58]. Diabetes-related autonomic neuropathy has been shown to affect blood pressure regulation, with a lack of vasoconstriction arising from reduced sympathetic input, particularly in response to changes in temperature and position [59, 60].

Limitations

While the search methods employed in this study were designed to be robust, there may be some evidence that was not captured, for example unpublished data. Further limitations to this study are the inability to perform meta-analysis in order to obtain a quantitative analysis of the available reliability data for the ABI, and the inability to perform any sub-analyses relating to individual populations such as those with diabetes, or methods of measurement such as automated or manual methods. Furthermore, there has been some disagreement in the literature about which pressure measurement should be used to calculate the ABI [61, 62], with no studies exploring the effect of calculation method on reliability. Consequently, the method of calculation cannot be excluded as a factor affecting reliability that this review was unable to consider.

Conclusion

Results of included studies suggest the inter- and intra-tester reliability of the ABI is acceptable. However, inconsistencies in obtaining systolic pressure measurements and calculating ABI values, together with incomplete reporting of methodologies and statistical analysis, make it difficult to determine the validity of the results of included studies. Further research on ABI reliability is required in populations with vascular pathology and at risk of PAD, using a more consistent approach to study design and implementation and more detailed reporting of results. Based on currently available data, clinicians should ensure that they interpret ABI results in the context of other vascular assessment findings, and that patient management is not based upon this measurement alone.