Explainable AI to improve acceptance of convolutional neural networks for automatic classification of dopamine transporter SPECT in the diagnosis of clinically uncertain parkinsonian syndromes

Purpose: Deep convolutional neural networks (CNN) provide high accuracy for automatic classification of dopamine transporter (DAT) SPECT images. However, CNN are inherently black-box in nature, lacking any kind of explanation for their decisions. This limits their acceptance for clinical use. This study tested layer-wise relevance propagation (LRP) to explain CNN-based classification of DAT-SPECT in patients with clinically uncertain parkinsonian syndromes.
Methods: The study retrospectively included 1296 clinical DAT-SPECT with visual binary interpretation as “normal” or “reduced” by two experienced readers as standard-of-truth. A custom-made CNN was trained with 1008 randomly selected DAT-SPECT. The remaining 288 DAT-SPECT were used to assess classification performance of the CNN and to test LRP for explanation of the CNN-based classification.
Results: Overall accuracy, sensitivity, and specificity of the CNN were 95.8%, 92.8%, and 98.7%, respectively. LRP provided relevance maps that were easy to interpret in each individual DAT-SPECT. In particular, the putamen in the hemisphere most affected by nigrostriatal degeneration was the most relevant brain region for CNN-based classification in all reduced DAT-SPECT. Some misclassified DAT-SPECT showed an “inconsistent” relevance map more typical for the true class label.
Conclusion: LRP is useful to provide explanation of CNN-based decisions in individual DAT-SPECT and, therefore, can be recommended to support CNN-based classification of DAT-SPECT in clinical routine. Total computation time of 3 s is compatible with busy clinical workflow. The utility of “inconsistent” relevance maps to identify misclassified cases requires further investigation.
Supplementary Information: The online version contains supplementary material available at 10.1007/s00259-021-05569-9.

nucleus, separately in both hemispheres. The predefined ROIs were much bigger than the anatomical putamen and caudate in order to guarantee that these structures were completely included in the standard ROIs in each individual patient, independent of some residual anatomical inter-subject variability after stereotactical normalization. The number of hottest voxels to be averaged was fixed to a total volume of 10 ml for the unilateral putamen and 5 ml for the unilateral caudate. The total volume of 15 ml for caudate and putamen is compatible with the normal range of the striatum volume in healthy subjects [4]. The unilateral SBR in putamen or caudate was calculated as the mean voxel intensity (scaled to the reference region) in the corresponding hottest ROI voxels, minus 1. From these, the following additional semi-quantitative parameters were derived: putamen-to-caudate SBR ratio in the left and in the right hemisphere, left-right asymmetry of putamen SBR, and left-right asymmetry of caudate SBR (left-right asymmetry = 200 * abs(left - right) / (left + right), where left and right denote the SBR in the left and right hemisphere). In order to eliminate variability of no interest associated with mainly left-sided versus mainly right-sided nigrostriatal degeneration (that might affect the performance of the unilateral semi-quantitative parameters), minimum and maximum of putamen and caudate SBR and of the putamen-to-caudate SBR ratio of both hemispheres were considered rather than left and right values. Thus, the following eight semi-quantitative parameters were tested for differentiation between positive and negative DAT-SPECT: minimum of left and right putamen SBR, maximum of left and right putamen SBR, minimum of left and right caudate SBR, maximum of left and right caudate SBR, minimum of left and right putamen-to-caudate SBR ratio, maximum of left and right putamen-to-caudate SBR ratio, left-right asymmetry of putamen SBR, and left-right asymmetry of caudate SBR.
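The derivation of the eight parameters from the four unilateral SBR values can be illustrated with a minimal Python sketch. It is not the software used in this study. The function and key names are hypothetical, and the number of hottest voxels (n_hottest) is left as an input because it depends on the voxel size, which is an assumption here: it would be the voxel count corresponding to 10 ml (putamen) or 5 ml (caudate).

```python
from statistics import mean


def hottest_voxel_sbr(scaled_intensities, n_hottest):
    """SBR = mean of the n hottest voxel intensities minus 1.

    scaled_intensities: ROI voxel values already scaled to the reference region.
    n_hottest: hypothetical voxel count matching the fixed volume (10 ml or 5 ml).
    """
    hottest = sorted(scaled_intensities, reverse=True)[:n_hottest]
    return mean(hottest) - 1.0


def asymmetry(left, right):
    """Left-right asymmetry in percent: 200 * abs(left - right) / (left + right)."""
    return 200.0 * abs(left - right) / (left + right)


def semi_quantitative_parameters(put_l, put_r, cau_l, cau_r):
    """Derive the eight side-independent parameters from the unilateral SBR values."""
    ratio_l = put_l / cau_l  # putamen-to-caudate SBR ratio, left hemisphere
    ratio_r = put_r / cau_r  # putamen-to-caudate SBR ratio, right hemisphere
    return {
        "putamen_sbr_min": min(put_l, put_r),
        "putamen_sbr_max": max(put_l, put_r),
        "caudate_sbr_min": min(cau_l, cau_r),
        "caudate_sbr_max": max(cau_l, cau_r),
        "ratio_min": min(ratio_l, ratio_r),
        "ratio_max": max(ratio_l, ratio_r),
        "putamen_asymmetry": asymmetry(put_l, put_r),
        "caudate_asymmetry": asymmetry(cau_l, cau_r),
    }
```

Taking the minimum and maximum over both hemispheres makes the parameters independent of whether the left or the right side is more affected, which is the motivation given above.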
The cutoff for each of these parameters was determined in the DAT-SPECT training set by the Youden criterion [5] applied to its receiver operating characteristic curve for identification of positive cases (supplementary Fig. 2). The cutoffs determined in the training set were then applied for classification of the DAT-SPECT images in the test set. Overall accuracy, sensitivity, specificity, positive predictive value and negative predictive value were determined as performance measures. Training set and test set were the same as for training and testing of the CNN.
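The cutoff selection and evaluation described above can be sketched as follows. This is an illustrative sketch, not the analysis code of this study. It assumes that the candidate cutoffs are the observed parameter values, that labels are 1 for positive and 0 for negative DAT-SPECT, and that both classes are present. The flag positive_if_low is a hypothetical name: it would be True for SBR-type parameters (low value indicates a positive scan) and False for the asymmetry parameters.

```python
def predict(values, cutoff, positive_if_low=True):
    """Classify each value as positive (True) or negative (False) at a given cutoff."""
    if positive_if_low:
        return [v <= cutoff for v in values]
    return [v >= cutoff for v in values]


def youden_cutoff(values, labels, positive_if_low=True):
    """Return (cutoff, J) maximizing Youden's J = sensitivity + specificity - 1."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(values)):
        pred = predict(values, cut, positive_if_low)
        tp = sum(1 for p, y in zip(pred, labels) if p and y)
        tn = sum(1 for p, y in zip(pred, labels) if not p and not y)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


def performance(values, labels, cutoff, positive_if_low=True):
    """Accuracy, sensitivity, specificity, PPV and NPV at a fixed cutoff."""
    pred = predict(values, cutoff, positive_if_low)
    tp = sum(1 for p, y in zip(pred, labels) if p and y)
    tn = sum(1 for p, y in zip(pred, labels) if not p and not y)
    fp = sum(1 for p, y in zip(pred, labels) if p and not y)
    fn = sum(1 for p, y in zip(pred, labels) if not p and y)

    def ratio(num, den):
        # Undefined measures (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(labels)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

In this sketch, youden_cutoff would be called on the training set only, and the returned cutoff would then be passed unchanged to performance on the test set, mirroring the separation of training and test data described above.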

The best overall accuracy by a single semi-quantitative parameter was achieved by the minimum putamen SBR (96.9%, supplementary Tab. 1), followed by the maximum putamen SBR (93.1%) and the minimum putamen-to-caudate SBR ratio (91.3%). Left-right asymmetry of putamen and caudate SBR achieved the lowest overall accuracy (80.2% and 81.6%, respectively), which was still considerably above chance level (50%).

Classification and regression tree analysis
Classification and regression tree (CRT) analysis was tested for automatic classification of the DAT-SPECT included in this study. CRT analysis was selected as the multivariable machine learning method because both the decision rules learned by the CRT during training and its decision in an individual case are particularly easy for users to understand. CRT analysis for the identification of positive DAT-SPECT included the eight semi-quantitative SBR parameters described in the supplementary section "Conventional semi-quantitative analysis" as continuous variables. The CRT was trained in the training set (same as for CNN training) using the chi-square automatic interaction detection technique. The depth of the tree was fixed to two levels, and the minimum number of cases for parent/child nodes was set to 100/50. IBM SPSS Statistics version 27 was used for the CRT analysis.
The CRT selected the minimum putamen SBR for branching at the root level (supplementary Fig. 3). In the test set (same as for CNN testing), the proportion of positive DAT-SPECT increased from 47.9% in the whole test set to 92.5% in DAT-SPECT with reduced minimum putamen SBR, and it decreased to 1.4% in DAT-SPECT with relatively normal minimum putamen SBR. The maximum putamen-to-caudate SBR ratio was selected for second-level branching of DAT-SPECT with reduced minimum putamen SBR. The proportion of positive DAT-SPECT further increased to 98.3% in the DAT-SPECT with reduced maximum putamen-to-caudate SBR ratio. In contrast, left-right asymmetry of the putamen SBR was selected for second-level branching of DAT-SPECT with relatively normal minimum putamen SBR. The proportion of positive DAT-SPECT further decreased from 1.4% to 0% in the DAT-SPECT with more left-right symmetric putamen SBR. Overall classification accuracy in the test set was 95.5%. Thus, the CRT achieved about the same overall accuracy as the CNN (95.5% versus 95.8%).
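The branching structure described above can be written down as a short Python sketch that assigns each case to one of the four leaves and tabulates the proportion of positive cases per leaf. It only mirrors the structure of the tree. The cutoff values are not reported in this text, so they are passed in as hypothetical inputs, and the leaf names and dictionary keys are likewise illustrative assumptions. This is not the SPSS procedure used in the study.

```python
def leaf_for_case(params, cutoffs):
    """Return the leaf of the two-level tree that a single case falls into.

    params: dict with 'putamen_sbr_min', 'ratio_max' and 'putamen_asymmetry'.
    cutoffs: dict with one hypothetical cutoff per branching parameter.
    """
    if params["putamen_sbr_min"] <= cutoffs["putamen_sbr_min"]:
        # Reduced minimum putamen SBR: branch on the maximum putamen-to-caudate ratio.
        if params["ratio_max"] <= cutoffs["ratio_max"]:
            return "reduced_putamen/reduced_ratio"
        return "reduced_putamen/normal_ratio"
    # Relatively normal minimum putamen SBR: branch on putamen asymmetry.
    if params["putamen_asymmetry"] <= cutoffs["putamen_asymmetry"]:
        return "normal_putamen/symmetric"
    return "normal_putamen/asymmetric"


def positive_fraction_by_leaf(cases, labels, cutoffs):
    """Proportion of positive cases (label 1) in each leaf that contains cases."""
    counts = {}  # leaf name -> [number of positives, number of cases]
    for params, label in zip(cases, labels):
        leaf = leaf_for_case(params, cutoffs)
        n_pos, n_all = counts.get(leaf, [0, 0])
        counts[leaf] = [n_pos + (1 if label else 0), n_all + 1]
    return {leaf: n_pos / n_all for leaf, (n_pos, n_all) in counts.items()}
```

Because each case follows exactly one path of at most two comparisons, the reason for an individual decision can be read directly from the returned leaf name, which is the interpretability property that motivated the choice of CRT analysis.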