In this section we use two medical image datasets, chest X-rays (CXRs) and mammograms, to evaluate the classification and localization performance of STL. All training CXRs and mammograms are resized to 500\(\,\times \,\)500. The network architecture used in this experiment is a slightly modified version of the network from [9]Footnote 1. For the localizer, \(15\times 15\) activation maps for each class are obtained via a \(1\times 1\) convolution operation. Two global pooling methods, max pooling [10] and average pooling [15], are applied to the activation maps. The network is trained via stochastic gradient descent with momentum 0.9, and the minibatch size is set to 64. STL has an additional hyperparameter \(\alpha \) that determines the relative importance of the classifier and the localizer. We set its initial value to 0.1 so that the network focuses more on learning representative features at the early stage, and increase it to 0.9 after 60 epochs to fine-tune the localizer.
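As a concrete illustration, the localizer head and the \(\alpha \)-weighted objective described above can be sketched as follows. This is a minimal NumPy sketch under our own naming, not the authors' implementation; in particular, the weighted-sum form of the joint loss is an assumption, since the text only states that \(\alpha \) balances the classifier and the localizer.

```python
import numpy as np

def localizer_head(features, weights, pooling="avg"):
    """Hypothetical sketch of the localizer head: a 1x1 convolution maps a
    C-channel feature map to one 15x15 activation map per class, then global
    pooling reduces each map to a single class score.

    features: (C, H, W) feature tensor; weights: (num_classes, C).
    """
    # A 1x1 convolution is a per-pixel linear map over channels.
    maps = np.tensordot(weights, features, axes=([1], [0]))  # (num_classes, H, W)
    if pooling == "avg":
        scores = maps.mean(axis=(1, 2))  # global average pooling [15]
    else:
        scores = maps.max(axis=(1, 2))   # global max pooling [10]
    return maps, scores

def joint_loss(cls_loss, loc_loss, alpha):
    """Assumed weighted-sum objective: alpha starts at 0.1 (classifier-
    dominated feature learning) and is raised to 0.9 after 60 epochs to
    fine-tune the localizer, per the schedule in the text."""
    return (1.0 - alpha) * cls_loss + alpha * loc_loss
```

With \(\alpha = 0.1\), the classifier term dominates early training; raising \(\alpha \) to 0.9 shifts the gradient signal toward the localizer.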
To compare classification performance, we use the area under the ROC curve (AUC), accuracy, and the average precision (AP) of each class. For STL, the class probabilities obtained from the localizer are used to measure performance. For the localization task, we use a performance metric similar to that in [10]. It is based on AP, but differs in how true positives and false positives are counted. In classification, a test image is a true positive if its class probability exceeds some threshold. Under the localization metric, a test image whose class probability is greater than the threshold (i.e., a true positive in the classification sense) but whose maximal response in the activation map does not fall within the ground-truth annotations, allowing some tolerance, is counted as a false positive. In our experiment, only the positive class is considered for localization AP, since there is no ROI for the negative class. Concretely, the activation map of the positive class is first resized to the size of the original image via simple bilinear interpolation; we then examine whether the maximal response falls within the ground-truth annotations under a 16-pixel tolerance, which is half of the global stride 32 of the considered network architecture. If the maximal response is located inside the annotations, the test image is counted as a true positive; otherwise, it is counted as a false positive.
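The per-image decision rule for the localization metric can be sketched as follows. This is our own illustrative helper, not the evaluation code used in the paper; the bilinear resize is written out in plain NumPy, and interpreting the 16-pixel tolerance as a Chebyshev distance is an assumption, since the text only says "within 16 pixels".

```python
import numpy as np

def bilinear_resize(a, H, W):
    """Simple bilinear interpolation of a 2-D array to size (H, W)."""
    h, w = a.shape
    ys = np.linspace(0, h - 1, H)
    xs = np.linspace(0, w - 1, W)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = a[np.ix_(y0, x0)] * (1 - wx) + a[np.ix_(y0, x1)] * wx
    bot = a[np.ix_(y1, x0)] * (1 - wx) + a[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def localization_hit(act_map, gt_mask, tol=16):
    """Hypothetical per-image test: resize the positive-class activation map
    to the original image size, take its maximal response, and count the
    image as a localization true positive only if that peak lies within
    `tol` pixels of the ground-truth annotation (16 = global stride 32 / 2).
    """
    H, W = gt_mask.shape
    resized = bilinear_resize(act_map, H, W)
    y, x = np.unravel_index(np.argmax(resized), resized.shape)
    gy, gx = np.nonzero(gt_mask)
    if gy.size == 0:
        return False
    # Chebyshev-distance tolerance (assumption, see lead-in).
    return bool(np.min(np.maximum(np.abs(gy - y), np.abs(gx - x))) <= tol)
```

Images for which `localization_hit` returns `False` despite an above-threshold class probability are the false positives specific to this metric.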
Tuberculosis Detection. We use three CXR datasets, namely the KIT, Shenzhen, and MC sets, in this experiment. All CXRs used in this work are de-identified by the corresponding image providers. The KIT set contains 10,848 DICOM images from the Korean Institute of Tuberculosis, consisting of 7,020 normal and 3,828 abnormal (TB) cases. The ShenzhenFootnote 2 and MCFootnote 3 sets are available for research purposes from the authors of [1, 7, 8]. We train the models on the KIT set, and test classification and localization performance on the Shenzhen and MC sets. Since these test sets do not contain any annotations of TB lesions, we obtained detailed annotations from a TB clinician to evaluate localization performance.
Table 1 summarizes the experimental results. For both classification and localization tasks, STL consistently outperforms the other methods. The best-performing model is STL+AvePool; global average pooling works well for localization, which is consistent with [15]. Since the localization AP is, by definition of the metric, always less than the classification AP, it is helpful to compare improvement ratios. Regardless of the pooling method, the localization APs on both the Shenzhen and MC sets improve much more over the baselines (i.e., MaxPool and AvePool) than the classification APs do. This indicates that STL indeed helps the localizer find the most important ROIs, namely those that define the class label. Figure 2 clearly shows the advantages of STL: faster training and better feature learning. Localization examples on the test sets are visualized in Fig. 3.
Mammography. We use two public mammography databases, the Digital Database for Screening Mammography (DDSM) [4, 5] and the Mammographic Image Analysis Society (MIAS) database [13]. DDSM and MIAS are used for training and testing, respectively. We preprocess DDSM images to carry one of two labels, positive (abnormal) or negative (normal). Originally, abnormal mammographic images contain several types of abnormalities, such as masses and microcalcifications. We merge all abnormality types into the positive class to distinguish any abnormality from normal, which yields 4,025 positive and 6,338 negative images in the training set (DDSM). The test set (MIAS) contains 112 positive and 210 negative images. Note that, although the training set provides boundary information for the abnormal ROIs, we use no information beyond image-level labels for training. The boundary information of the test set is used to evaluate localization performance.
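The label preprocessing above is a simple many-to-one mapping from abnormality types to a binary class. A minimal sketch (the helper name and the abnormality-type strings are our own illustrations, not DDSM field names):

```python
def binarize_label(abnormality_types):
    """Hypothetical helper: a DDSM case lists zero or more abnormality
    types (e.g. "mass", "microcalcification"); any abnormality maps the
    image to the positive class (1), and no abnormality to negative (0)."""
    return 1 if len(abnormality_types) > 0 else 0
```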
Table 1. Classification and localization performance for CXRs and mammograms (subscripts + and - denote the positive and negative class, respectively)
Table 1 reports the classification and localization resultsFootnote 4. As we can see, classification of mammograms is much more difficult than TB detection. First of all, the mammograms used for training are low-quality images containing some artifacts and distortions introduced by scanning films to create digital images. Moreover, the task is inherently complicated because the negative class also exhibits quite a few irregular patterns, caused by the various shapes and characteristics of normal tissues. Nevertheless, STL is confirmed to be significantly better than the other methods for both classification and localization. Again, regardless of the pooling method, the localization performance improves much more over the baselines than the classification performance does. Figure 3 shows some localization examples on the test set.