Inter-reader variability of SPECT MPI readings in low- and middle-income countries: Results from the IAEA-MPI Audit Project (I-MAP)

Background. Consistency of results between different readers is an important issue in medical imaging, as it affects portability of results between institutions and may affect patient care. The International Atomic Energy Agency (IAEA), in pursuing its mission of fostering peaceful applications of nuclear technologies, has supported several training activities in the field of nuclear cardiology (NC) and SPECT myocardial perfusion imaging (MPI) in particular. The aim of this study was to verify the outcome of those activities through an international clinical audit on MPI where participants were requested to report on studies distributed from a core lab.
Methods. The study was run in two phases: in phase 1, SPECT MPI studies were distributed as raw data and full processing was requested as per local practice. In phase 2, images from studies pre-processed at the core lab were distributed. Data to be reported included summed stress score (SSS); summed rest score (SRS); summed difference score (SDS); left ventricular (LV) ejection fraction (EF) and end-diastolic volume (EDV). Qualitative appraisals included the assessment of perfusion and presence of ischemia, scar or mixed patterns, presence of transient ischemic dilation (TID), and risk for cardiac events (CE). Twenty-four previous trainees from low- and middle-income countries participated (core participants group) and their results were assessed for inter-observer variability in each of the two phases, and for changes between phases. The same evaluations were performed for a group of eleven international experts (experts group). Results were also compared between the groups.
Results. Expert readers showed an excellent level of agreement for all parameters in both phases 1 and 2. For core participants, the concordance of all parameters in phase 1 was rated as good to excellent. Two parameters which were re-evaluated in phase 2, namely SSS and SRS, showed an increased level of concordance, up to excellent in both cases. Reporting of categorical variables by expert readers remained almost unchanged between the two phases, while core participants showed an increase in phase 2. Finally, pooled LVEF values did not show a significant difference between core participants and experts. However, significant differences were found between LVEF values obtained using different software packages for cardiac analysis.
Conclusions. In this study, inter-observer agreement was moderate-to-good for core group readers and good-to-excellent for expert readers. The quality of reporting is affected by the quality of processing. These results confirm the important role of the IAEA training activities in improving imaging in low- and middle-income countries.
Electronic supplementary material. The online version of this article (10.1007/s12350-018-1407-4) contains supplementary material, which is available to authorized users.


INTRODUCTION
The International Atomic Energy Agency (IAEA) is an independent, intergovernmental science and technology-based organization which is part of the United Nations family of organizations. 1 The IAEA works with its 170 Member States (MS) and multiple partners worldwide to promote the safe, secure and peaceful use of nuclear technologies. The IAEA supports nuclear medicine through activities of the Nuclear Medicine and Diagnostic Imaging Section (NMDI) within a quality assurance framework. 2,3 The nuclear medicine programme contributes to achieving the sustainable development goals (SDGs) set by the United Nations, one of which is "by 2030, reduce by one third premature mortality from non-communicable diseases through prevention and treatment and promote mental health and well-being". 4 Considering the burden of cardiovascular diseases (CVD) as a major threat to public health worldwide, 5,6 and the important role of nuclear techniques such as myocardial perfusion imaging (MPI) in the management of patients with ischemic heart disease (IHD), 7-9 the NMDI Section took the strategic decision to strengthen capacity building in nuclear cardiology (NC), providing training through national and regional projects, 10 supported by the Technical Cooperation Programme (TCP), which is the IAEA's main mechanism for transferring nuclear technology to Low- and Middle-Income Countries (LMICs). 11 Educational activities in NC include several Regional Training Courses (RTC) carried out over the past ten years. This paper reports the results of an audit of NC practices (the I-MAP study), initiated in 2015 to assess whether and how training provided through RTCs impacted the quality of clinical practice. The primary goal was to assess homogeneity (i.e. intra- and inter-observer variability) within a group of core participants from LMICs.
The secondary goals were: a) to evaluate the impact of IAEA activities in NC; b) to compare the readings of MPI studies in limited-resource centres with those of international experts; c) to evaluate the quality of reporting; and d) to assess the impact of the reconstruction of MPI studies on the quality of reporting.

METHODS
Recorded contact data for all attendees of RTCs in NC were retrieved. In the preceding 10 years, 896 participants had attended a total of 41 RTCs. Their regional distribution is reported in the Appendix (Table 5). To make sure that those trainees, as prospective participants in this study, were still actively involved in NC, the list was cross-checked with data from an international database managed by the IAEA. 12 Of the 896 participants, 275 were identified as being currently active as nuclear cardiologists and were approached for potential participation in this study. Of these, 24/275 (8.7%) participated in the study. They formed the group referred to as "core participants." Figure 1 reports their distribution around the world. The "core participants" group included physicians trained in nuclear medicine, with limited formal training in nuclear cardiology, in most cases acquired through short-term fellowships supported by the IAEA and/or "on the job." Their yearly average volume of MPI studies was 880, with a minimum of 559 and a maximum of 1200.
The second group of "expert readers" consisted of eleven international experts identified by the Agency from a pool of its consultants and lecturers, and internationally recognized nuclear cardiologists. Overall, for the experts, the yearly volume of SPECT-MPI studies was on average double that of the core participants.
Both core participants and expert readers were requested to report anonymized case studies provided by a Core Lab, chosen on the basis of sound NC practice and a significant record of research. The core lab identified 15 studies which, after anonymization, were uploaded onto a cloud-based collaborative platform (SharePoint™) and then downloaded by both core participants and experts.
All studies were carried out with the two-day protocol, using Tc99m labelled perfusion agents, and patients were imaged only in supine position. To provide readable studies for centres with limited technical resources, the core lab was asked to exclude studies processed with resolution recovery, scatter correction or attenuation correction, as well as studies acquired with CZT cameras. Clinical data, including patients' history, rest   Table 1.
We designed I-MAP to be run in two phases. In Phase 1, all 15 patient studies were provided as raw data. Both groups were requested to process them according to their own routine practice. For Phase 2, the same 15 cases were re-submitted in a different order, but pre-processed at the core lab using Myovation v3 software (GE Health Care; Haifa, Israel) with an iterative ordered subset expectation maximization reconstruction algorithm (2 iterations, 10 subsets) and motion correction. The "cool" GE colour scale was applied for the representation of tomographic slices. Both groups of participants were unaware that they were re-reading the same studies. This second phase was aimed at assessing whether reconstruction could have any impact on the overall quality of the study and consistency of interpretation. An example of a pre-processed patient study, as distributed in phase 2, is illustrated in Figure 2.
We used standardized forms for data collection which were forwarded to the core lab for statistical analysis. After onsite processing for phase 1, and based on images provided by the core lab for phase 2, readers were requested to score tracer uptake in polar maps using a 17-segment model (Figure 3A). An important distinction is that while in phase 1 readers could accept any score given by the cardiac software, in phase 2 they had to enter their own interpretation. The severity of perfusion defects in each of the 17 myocardial segments, as defined by the American Heart Association, 13 was scored on a 0-4 scale. For left ventricular function, quantitative data were reported on Left Ventricular Ejection Fraction (LVEF) and End Diastolic Volume (EDV), while regional wall motion was reported based on visual assessment. Other qualitative, or visual, appraisals included the assessment of perfusion, classified as normal or abnormal. In the latter case, readers had to report the presence of ischemia, scar or mixed patterns. Another parameter visually analysed was the presence or absence of Transient Ischemic Dilation (TID). Both groups were also requested to provide an overall judgment about whether patients were at high risk or not (PHR). Furthermore, we aimed at assessing the relationship between the overall judgment of the status of perfusion, either normal or abnormal, and uptake scores (SSS; SRS) as the sum of scores assigned to each single segment. To this purpose, and to avoid the possibility that high SSS values could just be the result of the sum of mild defects scattered throughout the myocardial wall, not representing significant perfusion defects, we defined a "hypoperfusion cluster" as the presence of a real perfusion defect, when two adjacent segments scored ≥2. Then, we assessed the relationship between SSS values and the number of hypoperfusion clusters identified in the polar maps.
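The cluster definition above (two adjacent segments with a score of 2 or more) can be illustrated with a minimal sketch. The segment adjacency map and the counting rule (each connected group of at least two qualifying segments counts as one cluster) are assumptions made for illustration; the study does not state how adjacency or cluster counts were operationalized, and the function names are hypothetical.

```python
def build_adjacency():
    """Illustrative neighbour map for the AHA 17-segment model.

    ASSUMPTION: neighbours are the segments next to each other within a
    ring, plus the radially matching segments of the next ring.
    """
    adj = {s: set() for s in range(1, 18)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    # Basal ring (1-6) and mid ring (7-12): circular neighbours.
    for start in (1, 7):
        for i in range(6):
            link(start + i, start + (i + 1) % 6)
    # Each basal segment touches the mid segment of the same wall.
    for s in range(1, 7):
        link(s, s + 6)
    # Apical ring (13-16): circular neighbours.
    for i in range(4):
        link(13 + i, 13 + (i + 1) % 4)
    # Six mid segments map onto four apical segments.
    for mid, apical in [(7, 13), (8, 14), (9, 14), (10, 15), (11, 16), (12, 16)]:
        link(mid, apical)
    # The apex (17) touches all four apical segments.
    for s in range(13, 17):
        link(17, s)
    return adj


def count_hypoperfusion_clusters(scores, threshold=2):
    """Count connected groups of >= 2 segments scoring >= threshold.

    `scores` maps segment number (1-17) to its 0-4 uptake score.
    """
    adj = build_adjacency()
    abnormal = {s for s, v in scores.items() if v >= threshold}
    seen, clusters = set(), 0
    for seg in sorted(abnormal):
        if seg in seen:
            continue
        # Flood-fill the connected group of abnormal segments.
        stack, group = [seg], set()
        while stack:
            cur = stack.pop()
            if cur in group:
                continue
            group.add(cur)
            stack.extend(adj[cur] & (abnormal - group))
        seen |= group
        if len(group) >= 2:
            clusters += 1
    return clusters


# Hypothetical polar map: three mild defects, one isolated moderate defect
# (segment 6) and one pair of adjacent defects (segments 10 and 15).
example = {s: 0 for s in range(1, 18)}
example.update({1: 1, 2: 1, 3: 1, 6: 2, 10: 3, 15: 2})
print(sum(example.values()), count_hypoperfusion_clusters(example))  # 10 1
```

In this hypothetical map the summed score is 10, yet only one cluster is found, which is the kind of distinction between scattered mild defects and a contiguous defect that the definition is meant to capture.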
To evaluate the inter-reader concordance of hypoperfusion assessments, SDS values were stratified into three categories: a) SDS ≤ 3; b) 4 ≤ SDS ≤ 7; and c) SDS ≥ 8. 14 For each study, each group of readers (both experts and core participants), and for both phase 1 and 2, the rate of responses  For phase 1 we also tested the consistency of quantitative data, such as LVEF and EDV, since they were calculated using different software. This evaluation was run only for phase 1, since in phase 2 participants were provided pre-processed studies. Variables LVEF post stress and LVEF rest were analyzed using univariate analysis of variance (ANOVA).
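The summed scores and the three SDS categories can be made concrete with a short sketch. It assumes the conventional definition SDS = SSS - SRS; the function names are hypothetical, and the category labels follow the risk wording used for these categories in the Results.

```python
def summed_scores(stress_scores, rest_scores):
    """Return (SSS, SRS, SDS) from two lists of 17 segment scores (0-4)."""
    for scores in (stress_scores, rest_scores):
        if len(scores) != 17 or any(not 0 <= s <= 4 for s in scores):
            raise ValueError("expected 17 segment scores in the range 0-4")
    sss = sum(stress_scores)
    srs = sum(rest_scores)
    # Assumed conventional definition: difference of the summed scores.
    return sss, srs, sss - srs


def stratify_sds(sds):
    """Map an SDS value onto the three categories used for stratification."""
    if sds <= 3:
        return "low"
    if sds <= 7:
        return "intermediate"
    return "high"


# Hypothetical reading: three stress defects, one of them partly fixed at rest.
stress = [0] * 17
rest = [0] * 17
stress[3], stress[9], stress[14] = 2, 3, 2
rest[9] = 1
sss, srs, sds = summed_scores(stress, rest)
print(sss, srs, sds, stratify_sds(sds))  # 7 1 6 intermediate
```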
Finally, we tested the repeatability of LVEF values when different processing software was used. To avoid the increased risk of Type I errors because of the multiple simultaneous hypotheses being tested, we adjusted P values using the Bonferroni method. 15
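As a reminder of what the Bonferroni adjustment does, the following minimal sketch multiplies each P value by the number of comparisons and caps the result at 1. The P values shown are hypothetical and the function name is illustrative; the study's adjustment was performed in SPSS.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted P values (each capped at 1.0)."""
    m = len(p_values)  # number of simultaneous comparisons
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw P values from three pairwise software comparisons.
raw = [0.01, 0.04, 0.5]
adjusted = bonferroni_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(significant)  # [True, False, False]
```

The second comparison illustrates the point of the correction: a raw P value of 0.04 would pass an unadjusted 0.05 threshold but does not survive adjustment for three comparisons.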

Statistical Analysis
For statistical analysis, data were collected in Excel spreadsheets and analysed using the Statistical Package for Social Sciences (IBM® SPSS® Statistics, Release 24). For hypothesis testing, Student's t-test, analysis of covariance (ANCOVA), ANOVA, and the Chi-square test for proportions were used as appropriate, the latter for assessing differences in response rates between groups and phases. Intra-rater and inter-rater agreement were assessed:
• by means of the intra-class correlation coefficient (ICC), for continuous measurements (EDV, LVEF, SSS, SRS, SDS). ICC is a measure of agreement that combines information on both the correlation and the systematic differences between readings; 16,17 using ICC, the level of agreement is classified into four categories;
• by means of Fleiss' kappa (κ), for categorical variables (Function, Perfusion, TID, SDS strat, patient high risk). Using Fleiss' kappa scores, the level of agreement is classified into seven categories. [18][19][20]
Values for both SSS and SDS reported from the two groups in phase 1, when MPI studies were supplied as raw data and each participant had to completely process and assess them using their own software, were compared with those reported from phase 2, where studies were supplied pre-processed at the core lab and participants had to visually score segmental perfusion.
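The two agreement measures can be sketched in a few lines of plain Python. The ICC form shown is the two-way random-effects, absolute-agreement, single-measure coefficient, often written ICC(2,1); this choice is an assumption, since the text does not state which ICC model was selected in SPSS. The function names and the example data are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per study and one column per reader."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    msr = ss_rows / (n - 1)                      # between-study mean square
    msc = ss_cols / (k - 1)                      # between-reader mean square
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = readers placing study i in category j."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every study must be rated by the same number of readers")
    total = n_subjects * n_raters
    # Share of all ratings falling in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # Observed agreement for each study.
    p_obs = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
             for row in counts]
    p_bar = sum(p_obs) / n_subjects
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical LVEF readings (%): the second reader is always 5 points higher,
# so correlation is perfect but absolute agreement is not.
lvef = [[50, 55], [60, 65], [70, 75]]
print(round(icc_2_1(lvef), 3))  # 0.889

# Hypothetical normal/abnormal calls by three readers on four studies.
calls = [[3, 0], [0, 3], [3, 0], [0, 3]]
print(fleiss_kappa(calls))  # 1.0
```

The first example shows why ICC is described as combining correlation and systematic differences: a constant offset between readers lowers the coefficient even though the readings are perfectly correlated.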

RESULTS
For continuous variables (EDV; LVEF; SSS; SRS) ICC values and the corresponding concordance category are reported in Figure 3 and in Table 2.
Metrics for EDV and LVEF were assessed only for phase 1, as in phase 2 these data had already been calculated at the core lab. Expert readers showed an excellent level of agreement for all parameters in both phases 1 and 2, spanning from 0.85 for LVEF at rest to 0.94 for EDV post-stress. In phase 1, concordance levels for core participants were rated as good for all parameters (from 0.64 to 0.71), except for LVEF at rest and EDV post-stress, which were rated as excellent (0.75 and 0.76, respectively). Interestingly, both parameters which were re-evaluated in phase 2, namely SSS and SRS, showed an increased level of concordance, up to excellent in both cases.
Fleiss' kappa values for categorical variables are summarized in Figure 4 and Table 3, along with the significance of concordance. In this case, reports from phases 1 and 2 are compared for all variables. For those variables, categories of agreement for expert readers between the two phases remained almost unchanged, with the exception of TID, while core participants showed an increase for all variables.
The relationship between SSS values, as reported by both experts and core participants, and the number of "hypoperfusion clusters" derived from polar maps is summarized in Figure 5. In more detail, Figures 5A and B represent results from experts in phase 1 and phase 2, respectively, while Figures 5C and D report the same analysis for core participants.
If we consider SSS mean values as a function of cluster number and then we determine a linear interpolation between the experimental data, we observe a tendency towards statistical significance (F=3.64 and p=0.057) for curve slopes only between phase 1 and phase 2 for core participants (Table 4).
As already described, patients were stratified (SDS-strat) according to their SDS value as "low risk" (SDS ≤ 3), "intermediate risk" (4 ≤ SDS ≤ 7) or "high risk" (SDS ≥ 8). Analysing differences in risk stratification between phases, we found a significant difference in SDS strat between phases 1 and 2 for 3 of 15 studies in the core participants group and for 2 of 15 in the experts group.
As already described, participants were encouraged to analyze and report submitted studies according to their daily routine, including use of their own cardiac software. Because the choice of software may affect calculated values such as LVEF and EDV, information was also collected on the type of cardiac software used. For both core participants and experts, the distribution of the different cardiac software packages available on the market is reported in the Appendix (Table 6).
Evaluations on LVEF values included the following factors: Group (2 levels: Experts, Core Participants); cardiac software (5 levels: 4DMCardio; CedarsSinai; EmoryCardiacToolBox; InterView; Other); Case Study number (15 levels: patient studies 1-15). Changes in variables were assessed as a function of factors and interaction between factors themselves. Results are shown in Table 7 of the Appendix. There were significant differences in the LVEF values calculated both post-stress and at rest and for values calculated from the different types of software. The Bonferroni post-hoc analysis of multiple comparisons shows that one of the software packages (EmoryCardiacToolBox) systematically produces an LVEF value significantly lower than 4DMCardio, CedarsSinai, and Other software (range of differences: -8.2% to -10.8%); while no significant differences are found with the InterView software (see Table 8 in Appendix for details).
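To illustrate how a software effect on LVEF can be tested, the sketch below computes a one-way ANOVA F statistic across software groups. This is a deliberate simplification: the analysis reported here was a multi-factor univariate ANOVA (group, software, case study and their interactions) run in SPSS. The LVEF values and the package labels are hypothetical and do not come from the study data.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of name -> list of values."""
    samples = list(groups.values())
    k = len(samples)
    n_total = sum(len(g) for g in samples)
    grand = sum(sum(g) for g in samples) / n_total
    means = [sum(g) / len(g) for g in samples]
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(samples, means))
    # Variability of individual readings around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(samples, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical LVEF readings (%) of one case, grouped by software package.
lvef_by_software = {
    "package_A": [61, 62, 63],
    "package_B": [62, 63, 64],
    "package_C": [52, 53, 54],  # systematically lower readings
}
f_stat, df_b, df_w = one_way_anova_f(lvef_by_software)
print(df_b, df_w)  # 2 6
```

A large F statistic relative to the F distribution with the printed degrees of freedom indicates that at least one package differs; identifying which one requires post-hoc pairwise comparisons such as the Bonferroni-adjusted tests described above.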
Overall, LVEF post-stress values are not significantly different between core participants and experts (Table 9 in Appendix). Average SD levels for the readings of core participants were about twice as high as the average SD levels for the experts group (10.4% vs 5.8%), a finding which was also reflected in the higher ICC for the latter group (Figure 3). Case 11, which produced relatively larger SD values in both the core participants and expert readers groups (18.5 and 19.4, respectively), is shown in Figure 6.

DISCUSSION
Often, in medical imaging, interpretation of results is subjective [21][22][23][24][25][26][27][28][29][30] and can be influenced by technical considerations. Quality plays a pivotal role when analysing and reporting an imaging study. Several factors can affect the results of the analysis and the value of the studies. This is true for all modalities and in the case of SPECT MPI, [31][32][33][34][35] which is the subject of this study, it is crucial to ensure that the acquisition and reconstruction parameters are consistent and optimized, thus allowing accurate and reproducible results.
Several factors, in different phases of the procedure, might influence the final results of MPI studies and require scrutiny. They range from, but are not limited to, pre-examination checks, such as appropriateness of reference, QA/QC of equipment and radiopharmaceutical preparation, to steps to be taken during examination, such as QA/QC of acquisition parameters and of processing and reporting. We geared the I-MAP study towards assessing the quality of processing and reporting.
We examined the reliability of SPECT MPI studies using inter-observer variability within two groups of participants: one made up of practitioners from LMICs, who are the target of the IAEA's educational activities, and a second group of expert readers. The first group of "core participants" was composed of nuclear cardiology professionals who attended training events managed by the IAEA, many of them working in settings where financial resources might be limited, and therefore with limited experience and limited resources for improving their expertise. The study was run in two phases, and in both of them participants had to report the same group of 15 cases, with the important difference that in phase 1 all participants were provided raw data and were requested to process them according to their routine practice and then report. In phase 2, all participants were given, in a different order, the same 15 cases pre-reconstructed and were requested to provide their segmental uptake score, visually assessed, as well as other qualitative interpretations. Both groups were unaware that in phase 2 they were re-evaluating the same studies.
For quantitative data such as EDVs and LVEFs, an excellent level of concordance was found within both groups for both phase 1 and 2 (Table 2; Figure 3). Concordance was also excellent within the experts group for SSS and SRS values in both phases.
Interestingly, for the latter two parameters (SSS and SRS), core readers showed excellent intra-group agreement in phase 2, when they had to provide their own evaluation on pre-processed images (0.87 and 0.86 for SSS and SRS, respectively), while in phase 1, when they had to process the studies and scores were automatically calculated by their software, concordance was only good (0.66 and 0.64 for SSS and SRS, respectively).
It should be remembered that while in phase 1 readers could accept segmental scores from their own software, or override them if needed, in phase 2 scores had to be visually assessed and manually entered into the forms, therefore reflecting a qualitative rather than a semi-quantitative evaluation. We therefore relate this improvement to the central role of processing: when less experienced readers are presented with well-processed studies and are required to score perfusion status themselves, their readings are as good as experts' readings. This finding confirms that processing remains a crucial step for the overall SPECT MPI evaluation and that experience and training play a major role in good quality processing. Furthermore, this finding tells us that, besides the physicians who actually read the studies, IAEA training events should also involve technologists, who often perform the processing.
Further confirmation of the importance of processing is found when we compare performances between the two groups for risk stratification. In this case, when we analysed differences between the experts panel and the core participants group, we found that in phase 1 a significant difference could be seen in 2/15 cases, while no difference could be seen between the two groups for phase 2, when the core lab distributed pre-processed studies.
Fleiss' kappa value is a rather stringent index, very sensitive to even small deviations between readers which may cause an important worsening of calculated values. In this study, it showed that experts, as expected, had a greater concordance in interpretation, in both phases of the study, while for core participants concordance improved significantly between phase 1 and 2. This finding holds true for both the analysis of continuous variables and for SSS and SRS indexes. Once more, this finding supports the notion that interpretation in itself is not the issue, but what is going to be interpreted is. When study processing is not properly carried out, then interpretation suffers.
A tendency of core participants to give an overall evaluation of "normal perfusion" even in the presence of significant SSS values and hypoperfusion clusters was observed (Figure 5).
The greater variability in interpreting on-site processed images, as requested in phase 1, might well be affected by poor alignment of slices because of bad selection of left ventricular axes, valve planes and apex. So, while experts were able to minimize the impact of processing on the quality of images, this was not the case for core participants, who indeed markedly improved their performance when they were given studies which had been pre-processed at the core lab. Pre-processing included motion correction, careful slice realignment between stress and rest acquisitions, correct choice of slice thickness to avoid artefacts due to partial volume effect, and correct colour scale levelling in presence of extracardiac hot-spots such as sub-diaphragmatic activity.
Finally, we found, as reported by other groups, 36,37 that important parameters such as LVEF, calculated through gated SPECT, may differ significantly when different processing software packages are used, as shown in Table 8. One software package deviates substantially and significantly from almost all the other packages, with a systematic bias in LVEF of -8.3% down to -10.8%, which could be clinically significant when LVEF is used in clinical decision making, such as in longitudinal studies of cardio-oncological patients.
The univariate analysis of variance for LVEF poststress and LVEF at rest was run considering the different factors involved and their interactions. Results of that analysis reported in Table 7 also show significant differences for LVEF values calculated both post-stress and at rest, and for values calculated from the different types of software.
Overall, LVEF values are not significantly different between the two groups, core participants and experts, as shown in Table 9. A relatively wide SD shown for case #11 could be attributed to factors such as patient movement during acquisition (which could have been corrected for by readers), small heart with partial volume effect, hypertrophic left ventricular walls due to hypertension, and attenuation due to obesity (Figure 6).

NEW KNOWLEDGE GAINED
This study has shown that the quality of processing remains a crucial step for SPECT MPI and that experience helps overcome possible artefacts that may hamper the quality of reporting. As concerns the IAEA, this study shows that the outcomes of training events in NC are satisfactory, as the performance of NC professionals from LMICs does not differ significantly from expert readers in many circumstances, and particularly when good quality processing was applied to clinical studies. This latter consideration supports the concept that training courses should necessarily cover basic issues such as study processing. In addition, this study shows that LVEF values may differ significantly depending on the cardiac package employed and this should be kept in mind particularly when patients are studied in different institutions or when an institution adopts a different software package.

LIMITATIONS OF THE STUDY
The small sample of 24 participants from LMICs reflects a very low response rate for survey data, which challenges the generalizability of the findings. Furthermore, we do not know to what extent "core participants" are representative of the reading pattern in LMICs. This is, however, unavoidable when dealing with centres in the developing world, because of difficult communication as well as technical problems affecting data transfers and report transmission, which may affect active participation.
One more important limit of the study design is the choice of not requiring participants to provide images along with reporting forms. This choice was made to minimize image transmission problems, but prevented full quality checks from being performed for the processed studies.

CONCLUSIONS
The quality of reporting SPECT MPI could be rated as moderate-to-good for participants from emerging economies and good-to-excellent for expert readers. It is clearly affected by the quality of processing. Indeed, when readers with less experience are asked to report on studies pre-processed at an experienced core lab and by professionals well-trained to avoid sources of artefacts, inter-observer agreement between readers with less experience improves substantially. To our knowledge, this is the first study reporting these findings.
Significant differences were found between LVEF values obtained using different software packages for cardiac analysis. This should be kept in mind particularly when patients are studied in different institutions or when an institution adopts a different software package.
This study calls for attention from scientific societies on the issue of the quality of study processing, suggesting the need for more stringent guidelines about this aspect of NC practice.
Finally, these results suggest that the outcomes of training events conducted by the IAEA in NC are satisfactory. However, in order to improve the quality of processing, future training courses should necessarily cover this issue, and should also involve technologists.