This pilot shows a trend towards an increased number of correct decisions to treat PTB with TIRS among DRs in the routine clinical setting in Malawi, with improvements in both sensitivity and agreement between CXR interpretation and mycobacterial culture. For COs the outcomes were different: there was no increase in the number of correct decisions, but a trend towards increased sensitivity. The associated loss in their specificity indicates a shift in decision-making from under-diagnosis to over-diagnosis. This difference between the two subgroups might be explained by the fact that DRs had received significantly more formal background training in reading CXRs; the tool may enable recall of prior knowledge, including alternative CXR diagnoses. The latter would also account for their improved specificity, as opposed to the COs’ reduction in specificity. For the subset of smear-negative films there was no difference in the number of correct decisions, which poses a challenge, as this is a group of patients in whom CXR is particularly important [2, 3]. This may relate to the high HIV prevalence, where CXR has lower accuracy.
Interpretation by COs and DRs overlapped with expert interpretation both at baseline and with TIRS (Fig. 2). Sensitivity was similar to previous reports of non-expert readings from Nepal and Malawi and to expert readings from Kenya and South Africa (Table 4) [3, 12, 19]. Specificity was lower, which is most likely related to the high HIV prevalence in our test set and to the use of expert reference standards, rather than mycobacterial culture, in previous non-expert reading studies. Standard deviations (SD) for non-expert readers were wide, despite the relatively large number of readers. This reflects the difficulties non-experts encountered. Interestingly, the number of correct decisions by non-experts at baseline, although achieved almost at random (κ < 0.2068), is very close to that of experts, and performance for both groups is limited, with 61–68 % correct decisions. However, when analysed in more detail, experts appear to make more informed decisions (with higher agreement corrected for chance) and err on the side of false-positive treatment decisions (higher sensitivity). With TIRS, non-experts and, in particular, DRs not only improve their overall performance, but also alter their decision-making to coincide more closely with expert decision patterns. In the context of inadequate TB case-detection in many developing countries, a tendency to over-treat is better for TB control than a tendency to under-treat.
The CRRS training and standardised reporting were developed for prevalence surveys. Good inter-observer agreement between two experts was reported. However, subsequent use in HIV screening showed sensitivity and specificity similar to several other studies using non-standardised reporting (Table 4). CRRS is advocated for use in the clinical setting, but to our knowledge has not been validated there. We observed an increase in the number of correct decisions from 63 % with TIRS to 71 % using the CRRS proforma, with an associated increase in agreement corrected for chance. However, this increase was less marked (63 % to 68 %) when using the question “Would you treat?”, and specificity in particular dropped to baseline levels. As completing the CRRS proforma increased reading time from 1 min to 4 min per film, this poses a challenge in a busy clinical setting.
Non-experts are generally aware of their limited diagnostic accuracy and, therefore, we were interested to assess whether TIRS might increase confidence and reduce the desire for second opinions. Although DRs showed improvement in accuracy, confidence did not alter. Confidence of COs did increase, regardless of their decision, in effect providing a false sense of security with a potentially negative effect on referral patterns.
A major strength of this pilot is our rigorous assessment of the impact of TIRS, in keeping with current guidelines on reporting of diagnostic accuracy studies. Studies of CXR performance in PTB vary widely in outcome (Table 4). This is partly related to variations in prevalence and imaging characteristics in the population tested, but more often to the methodology applied, such as the number of readers, the choice of reference standard and the presentation of results as observer agreement or as sensitivity and specificity. This creates confusion in the interpretation of study outcomes and the perceived validity of CXRs. For example, good observer agreement between two experts does not necessarily equate to high diagnostic accuracy against a culture gold standard. In addition, an expert panel is often the only reference standard available, but in PTB it is particularly prone to error. As we have shown, our expert readers were incorrect in 30 % of cases. To our knowledge, this pilot is the first assessment of non-expert performance against a gold standard of culture. In addition, we corrected for chance agreement with the gold standard and for clustering within raters. This resulted in a rigorous assessment of effect. For example, COs apparently improved with TIRS, but once corrected for chance, may actually still be guessing. Similarly, using a large number of non-expert readers and dividing them into subgroups of DRs and COs, with information on levels of training, helped to explain the different effect of TIRS on each subgroup.
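The distinction between the raw proportion of correct decisions and agreement corrected for chance can be illustrated with a minimal sketch. It computes sensitivity, specificity, proportion correct and Cohen's kappa from a single reader's 2×2 table against culture. The function name `diagnostic_metrics` and the counts used are hypothetical and are not data from this study; the sketch also does not include the correction for clustering within raters that was applied in our analysis.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Summarise one reader's treat / do-not-treat decisions against culture.

    tp: decided to treat, culture positive
    fp: decided to treat, culture negative
    fn: decided not to treat, culture positive
    tn: decided not to treat, culture negative
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    correct = (tp + tn) / n  # observed agreement with culture

    # Agreement expected by chance alone, from the marginal proportions
    # of "treat" and "do not treat" for the reader and for culture.
    p_both_positive = ((tp + fp) / n) * ((tp + fn) / n)
    p_both_negative = ((fn + tn) / n) * ((fp + tn) / n)
    expected = p_both_positive + p_both_negative

    # Cohen's kappa: agreement beyond chance, as a fraction of the
    # maximum possible agreement beyond chance.
    kappa = (correct - expected) / (1 - expected)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "correct": correct,
        "kappa": kappa,
    }


# Hypothetical counts for 100 films (60 culture positive, 40 culture negative).
metrics = diagnostic_metrics(tp=40, fp=18, fn=20, tn=22)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the reader is correct in well over half of the films, yet kappa remains low, because a reader who simply treats most patients in a high-prevalence set would reach a similar proportion of correct decisions by chance.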
This pilot has several limitations. The results for the CRRS readings should be viewed with caution, as only two COs could attend. This highlights the prohibitive expense of attending such courses in settings similar to Malawi. Using a simulation film set does not fully reflect clinical practice. Similarly, excluding patients with a previous history of TB may have influenced results. Paper prints of X-rays limit image resolution and the visibility of small nodules, which are potentially important for PTB diagnosis. However, most films in low-resource settings are of limited quality, with similar limitations in resolution. Further validation, in a clinical setting and including cases with previous TB, will be required. In addition, we used the same film set for all readings, which may have biased results, despite films being re-shuffled. In the culture-negative subset not all diagnoses could be confirmed, owing to limited resources, and culture-negative PTB may have been present.
In the population tested, it is arguable whether improving non-expert performance is required, as experts performed only marginally better. On the other hand, although improvements were modest, some promise for the DRs was noted, even in this population. Evaluation of a larger number of non-expert clinicians in a range of populations (e.g. low HIV prevalence, or screening in people living with HIV) may reveal superior results. If so, low cost and simplicity are strengths that may justify implementation, even if benefits are small. The lack of effect in the smear-negative group is a limitation, but when access to laboratory tests is limited, the tool may be helpful. Until fast turn-around molecular or microbiological diagnosis of PTB is universally available, CXR will still be used across the globe by non-expert readers to inform treatment decisions. We suggest that low-cost, easily delivered interventions aimed at improving CXR interpretation will have greater overall impact than more expensive and time-consuming options such as training courses or teleradiology.
In conclusion, a pulmonary tuberculosis CXR image reference set increased the number of correct decisions to treat pulmonary tuberculosis by non-experts in the operational setting in Malawi. This effect was more marked for doctors than clinical officers. Further evaluation of this tool in clinical practice may provide a validated, simple, low-cost intervention to improve non-expert reader performance.