Self-Transfer Learning for Weakly Supervised Lesion Localization

Part of the Lecture Notes in Computer Science book series (LNIP, volume 9901)


Recent advances in deep learning have achieved remarkable performance in various computer vision tasks, including weakly supervised object localization. Weakly supervised object localization is practically useful since it does not require fine-grained annotations. Current approaches overcome the difficulties of weak supervision via transfer learning from models pre-trained on large-scale general images such as ImageNet. However, they cannot be used in the medical image domain, where no such priors exist. In this work, we present a novel weakly supervised learning framework for lesion localization named self-transfer learning (STL). STL jointly optimizes both classification and localization networks to help the localization network focus on correct lesions without any type of prior. We evaluate the STL framework on chest X-rays and mammograms, and achieve significantly better localization performance than previous weakly supervised localization approaches.


  • Weakly supervised learning
  • Lesion localization
  • Convolutional neural networks

1 Introduction

Recently, deep convolutional neural networks (CNNs) have shown promising performance in various computer vision tasks such as classification [6, 9], localization [2], and segmentation [12]. Among those tasks, object localization (lesion localization in medical images) is one of the most challenging. Object localization requires many training images annotated with bounding boxes or pixel-level regions of interest (ROIs). However, building a dataset with such location information requires heavy annotation effort. The annotation costs are much greater for medical images because only experts can interpret them. To alleviate the problem, several methods for weakly supervised localization that use only a weakly labeled (i.e. image-level label) dataset have been proposed [10, 11, 14]. These approaches require models pre-trained on relatively well-localized datasets (e.g., ImageNet [3]) to transfer good initial features for localization. Therefore, we cannot expect good performance in the medical image domain, since no such domain-specific well-trained features are available.

In this work, we propose a self-transfer learning (STL) framework for weakly supervised lesion localization in medical images. STL co-optimizes both classification and localization networks simultaneously in order to guide the localization network with the most discriminative features in terms of the classification task (see Fig. 1). The proposed method requires neither location information nor any type of prior for training. We show that previous approaches are not effective by themselves without good initial features, since errors are back-propagated through a restricted path or with insufficient information.

Fig. 1.

Overall architecture of STL (\(\mathbf{F}_{\mathbf{cls}}\) denotes fully-connected classification layers, \(\mathbf{C}_{\mathbf{loc}}\) and \(\mathbf{P}_{\mathbf{loc}}\) denote a \(1\times 1\) convolutional layer and a global pooling layer, respectively). The final objective function \(\mathbf{Loss}_{\mathbf{total}}\) is a weighted sum of \(\mathbf{Loss}_{\mathbf{cls}}\) and \(\mathbf{Loss}_{\mathbf{loc}}\) with a controllable hyperparameter \(\alpha\). Self-transfer learning is realized by re-weighting \(\alpha\) adaptively during training.

Related Work.  We consider recent CNN-based methods that show promising performance on weakly supervised object localization [10, 11, 14, 15]. Their common strategy is to produce an activation map (in other words, a score map) for each class, and to select or extract a representative activation value from it. The dimensions of those maps are determined by the network architecture. If such a network is trained well, a target object can be expected to be localized easily by examining the activation map corresponding to its class.

To select or extract the representative activation for each class, standard pooling methods can be used. In [10], global max pooling is used, and its classification and localization performance is verified on general images. Another choice is global average pooling. As discussed in [15], it might be better for localization, since global max pooling focuses only on the most discriminative part, while global average pooling discovers as many of those parts as possible. Inferring a segmentation map is more challenging than object localization, since it requires pixel-level classification. The authors of [11] adopt a Log-Sum-Exponential pooling method, which is a smooth version of max pooling, to explore the entire feature map. Smoothing priors are also considered to obtain fine-grained segmentation maps.
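The three pooling choices mentioned above can be compared on a toy score map. The following is a minimal sketch in plain Python, not code from any of the cited works. The function names are hypothetical, and the Log-Sum-Exp form used here, \(\frac{1}{r}\log\big(\frac{1}{n}\sum \exp(r\,x)\big)\) with a sharpness parameter `r`, is one common formulation that is assumed for illustration.

```python
import math

def global_max_pool(score_map):
    """Return the largest activation in a 2-D score map (list of rows)."""
    return max(v for row in score_map for v in row)

def global_avg_pool(score_map):
    """Return the mean activation over all locations."""
    values = [v for row in score_map for v in row]
    return sum(values) / len(values)

def global_lse_pool(score_map, r=5.0):
    """Log-Sum-Exp pooling, a smooth approximation of max pooling.

    r is an assumed sharpness parameter: a large r approaches max pooling,
    a small r approaches average pooling.
    """
    values = [v for row in score_map for v in row]
    peak = max(values)  # subtract the maximum for numerical stability
    mean_exp = sum(math.exp(r * (v - peak)) for v in values) / len(values)
    return peak + math.log(mean_exp) / r

# Toy 2x2 activation map for one class (values are made up).
score_map = [[0.0, 1.0], [2.0, 5.0]]
pooled = {
    "max": global_max_pool(score_map),
    "avg": global_avg_pool(score_map),
    "lse": global_lse_pool(score_map),
}
```

With this formulation the Log-Sum-Exp value always lies between the average and the maximum, which is why it is described as a smooth version of max pooling.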

Those approaches can be interpreted as variants of multiple instance learning (MIL), which is designed for classification problems where labels are associated with sets of instances, called bags, instead of individual instances. In image classification tasks, the full-size image and its subsampled patches are considered as a bag and its instances, respectively. For instance, using global max pooling to select a representative value among the activations of patches is equivalent to using a single well-classified patch to build the decision boundary. Strictly speaking, however, current approaches are not generally applicable, since they essentially require features that are well trained on semantically similar datasets.

2 Self-Transfer Learning for Weakly Supervised Learning

STL consists of three main components: shared convolutional layers, fully connected layers (i.e. the classifier), and localization layers (i.e. the localizer) (see Fig. 1). The key features of STL are twofold. First, it simultaneously propagates errors backward from both the classifier and the localizer to prevent the localizer from wandering over the loss surface and settling in a poor local optimum. Second, an adjustable hyperparameter \(\alpha\) is introduced to control the relative importance of the classifier and the localizer. Two losses, \(\mathbf{Loss}_{\mathbf{cls}}\) from the classifier and \(\mathbf{Loss}_{\mathbf{loc}}\) from the localizer, are computed in the forward pass, and the weighted sum of those errors is propagated in the backward pass. The errors from the classifier train the filters from a global view, while those from the localizer are backpropagated through the subsampled region that is the most important window for classifying the training set. At the early stage of training, the errors from the classifier should be weighted more heavily than those from the localizer to prevent the localizer from falling into a bad local optimum. By reducing the effect of errors from the localizer, filters with discriminative power can be trained well even if the localizer fails to find objects associated with the class label. As training proceeds, the weight for the localizer increases so that the network focuses on the subsampled region of the input image. At this stage, the network's filters are fine-tuned for the task of localization.
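The two-branch forward pass described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`conv1x1`, `localizer_forward`, `classifier_forward`) are hypothetical, a single linear layer stands in for the fully connected layers, bias terms are omitted, and the shared feature maps are given directly instead of being computed by convolutional layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def conv1x1(features, weights):
    """1x1 convolution: features[d][y][x] and weights[k][d] give maps[k][y][x]."""
    depth = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [[sum(w_k[d] * features[d][y][x] for d in range(depth))
          for x in range(width)]
         for y in range(height)]
        for w_k in weights
    ]

def localizer_forward(features, loc_weights, pool=max):
    """Localizer branch: one activation map per class, then global pooling."""
    maps = conv1x1(features, loc_weights)
    scores = [pool([v for row in m for v in row]) for m in maps]
    return softmax(scores), maps

def classifier_forward(features, fc_weights):
    """Classifier branch: flatten the shared features, then one linear layer."""
    flat = [v for channel in features for row in channel for v in row]
    logits = [sum(w * v for w, v in zip(w_k, flat)) for w_k in fc_weights]
    return softmax(logits)

# Toy shared features: 2 channels of size 2x2 (values are made up).
features = [[[0.0, 1.0], [2.0, 3.0]],
            [[1.0, 0.0], [0.0, 0.0]]]
loc_weights = [[1.0, 0.0], [0.0, 1.0]]   # K=2 classes, 2 input channels
fc_weights = [[0.1] * 8, [-0.1] * 8]     # K=2 classes, 8 flattened inputs

y_loc, activation_maps = localizer_forward(features, loc_weights)
y_cls = classifier_forward(features, fc_weights)
```

Both branches read the same `features`, so in a trainable version the gradients of both outputs would reach the shared layers, which is the property the framework relies on.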

Consider a data set of N input-target pairs \(\{\mathbf {x}_i,\mathbf {t}_i\}_{i=1}^N\), where \(\mathbf {x}_i\) and \(\mathbf {t}_i\) denote the i-th image and the corresponding K-dimensional true label vector, respectively, and K is the number of classes. Assuming each image has a single class label, the objective function to be optimized is a weighted sum of the cross-entropy losses from the classifier and the localizer, defined as follows:

$$\begin{aligned} \mathbf{Loss}_{\mathbf{total}}&= (1-\alpha )\,\mathbf{Loss}_{\mathbf{cls}} + \alpha \,\mathbf{Loss}_{\mathbf{loc}} \\&=-(1-\alpha ) \sum _{i=1}^{N} \mathbf {t}_{i}^\intercal \mathbf {log}(\mathbf {y}_i^{cls}) - \alpha \sum _{i=1}^{N} \mathbf {t}_{i}^\intercal \mathbf {log}(\mathbf {y}_i^{loc}) \end{aligned}$$

where \(\mathbf {y}_i^{cls}\) and \(\mathbf {y}_i^{loc}\) are the K-dimensional class probability vectors from the classifier and the localizer, respectively, for the i-th input, and \(\mathbf {log}(\cdot )\) denotes an element-wise log operation.
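As a worked illustration of this objective, the weighted sum of the two cross-entropy terms can be computed as follows. This is a sketch in plain Python with hypothetical function names; the small `eps` added inside the logarithm is an assumption made here to avoid `log(0)` and is not part of the formula above.

```python
import math

def cross_entropy(target, probs, eps=1e-12):
    """-t^T log(y) for one sample; eps guards against log(0) (an assumption)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

def stl_loss(targets, y_cls, y_loc, alpha):
    """(1 - alpha) * Loss_cls + alpha * Loss_loc, summed over all samples."""
    loss_cls = sum(cross_entropy(t, y) for t, y in zip(targets, y_cls))
    loss_loc = sum(cross_entropy(t, y) for t, y in zip(targets, y_loc))
    return (1.0 - alpha) * loss_cls + alpha * loss_loc

# One sample, K=2 classes, true class is the first one (values are made up).
targets = [[1.0, 0.0]]
y_cls = [[0.5, 0.5]]     # classifier output
y_loc = [[0.25, 0.75]]   # localizer output

early = stl_loss(targets, y_cls, y_loc, alpha=0.1)  # dominated by the classifier
late = stl_loss(targets, y_cls, y_loc, alpha=0.9)   # dominated by the localizer
```

Setting `alpha` to 0 recovers the classifier loss alone and setting it to 1 recovers the localizer loss alone, so the single hyperparameter interpolates between the two training signals.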

The effect of the proposed STL can be explained by examining the backpropagation process at the end of the shared convolutional layers \(\mathbf{C}\). Suppose that node i is a particular node in \(\mathbf{C}\) which is connected with H nodes in \(\mathbf{F}_{\mathbf{cls}}\) and K nodes in \(\mathbf{C}_{\mathbf{loc}}\). Note that \(\mathbf{C}_{\mathbf{loc}}\) is obtained by \(1\times 1\) convolution on \(\mathbf{C}\) as shown in Fig. 1, and K is equal to the number of activation maps (i.e. the number of classes). If the ReLU activation function is used for node i, the backpropagated error \(\delta _i\) at node i can be written as follows:

$$\begin{aligned} \delta _i = \max (0, \delta ^{cls}_i + \delta ^{loc}_i)~~\text {where}~~\delta ^{cls}_i = \textstyle \sum _{j=1}^{H} {w_{ji} \delta _j},~\delta ^{loc}_i = \sum _{k=1}^{K} {w_{ki} \delta _k} \end{aligned}$$

It should be noted that the relative importance of the classifier and the localizer is already reflected in the errors \(\delta ^{cls}_i\) and \(\delta ^{loc}_i\) through the weighted loss function \({\mathbf {Loss}_\mathbf{total }}\). Without \(\delta ^{cls}_i\), the errors \(\delta ^{loc}_i\) are backpropagated in an undesirable way because of the special treatment, a global pooling, applied to the activation maps in \(\mathbf {C}_{\mathbf {loc}}\). For instance, if global max pooling is used to aggregate the activations within each activation map and the location corresponding to node i in \(\mathbf {C}\) is not selected as the maximum, all \(\delta _k\)'s to be backpropagated from \(\mathbf {C}_{\mathbf {loc}}\) will be zero. Therefore, the computed errors will be zero for most nodes in \(\mathbf {C}\), except for the nodes whose locations correspond to the maximal response of each activation map. In the case of global average pooling, the zero errors are merely replaced with a mean of errors. This situation is certainly not desirable, especially when we train the network from scratch (i.e. without pre-trained filters). By incorporating the classifier into the network architecture, the shared convolutional layers \(\mathbf {C}\) can be consistently improved even if the backpropagated errors \(\delta ^{loc}_i\) from the localizer do not contribute to learning useful features.
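The restricted error path can be made concrete with a small sketch of how a global pooling layer distributes an upstream error over the locations of one activation map. The function names are hypothetical and the code is a plain-Python illustration of the standard gradients of max and average pooling, not the authors' implementation.

```python
def max_pool_grad(score_map, upstream):
    """Global max pooling: only the arg-max location receives the error."""
    width = len(score_map[0])
    flat = [v for row in score_map for v in row]
    best = flat.index(max(flat))  # first maximum if there are ties
    return [[upstream if y * width + x == best else 0.0
             for x in range(width)]
            for y in range(len(score_map))]

def avg_pool_grad(score_map, upstream):
    """Global average pooling: every location receives the same share."""
    n = len(score_map) * len(score_map[0])
    return [[upstream / n for _ in row] for row in score_map]

# Toy 2x2 activation map and an upstream error of 1.0 (values are made up).
score_map = [[0.1, 0.9], [0.3, 0.2]]
grad_max = max_pool_grad(score_map, upstream=1.0)
grad_avg = avg_pool_grad(score_map, upstream=1.0)
```

With max pooling a single location carries the whole error and all others receive zero, while with average pooling every location receives the same value regardless of its activation. In both cases the localizer alone provides little location-specific signal, which is the gap the classifier branch is meant to fill.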

It should be noted that STL is different from multi-task learning (MTL). They look similar because of the branched architecture and the multiple objectives. However, both branches of STL solve exactly the same task, and therefore STL does not need any extra supervision, whereas MTL jointly trains several different tasks with separate losses. Therefore, it is more appropriate to see the classifier in STL as an auxiliary component for successful training of the localizer.

3 Computational Experiments

In this section, we use two medical image datasets, chest X-rays (CXRs) and mammograms, to evaluate the classification and localization performance of STL. All training CXRs and mammograms are resized to 500\(\,\times \,\)500. The network architecture used in this experiment is a slightly modified version of the network from [9] (see footnote 1). For the localizer, \(15\times 15\) activation maps for each class are obtained via a \(1\times 1\) convolution operation. Two global pooling methods, max pooling [10] and average pooling [15], are applied to the activation maps. The network is trained via stochastic gradient descent with momentum 0.9, and the minibatch size is set to 64. STL has an additional hyperparameter \(\alpha \) that determines the relative importance of the classifier and the localizer. We set its initial value to 0.1 so that the network focuses more on learning representative features at the early stage, and increase it to 0.9 after 60 epochs to fine-tune the localizer.
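The re-weighting of \(\alpha\) described above can be written as a simple step schedule. This is a sketch under assumptions: the text gives only the two values and the switch point, so the hard step, the zero-based epoch counter, and the function name `alpha_schedule` are choices made here for illustration.

```python
def alpha_schedule(epoch, switch_epoch=60, alpha_early=0.1, alpha_late=0.9):
    """Return the localizer weight alpha for a zero-based epoch index.

    Assumption: a hard step, with the first `switch_epoch` epochs using
    `alpha_early` and all later epochs using `alpha_late`.
    """
    return alpha_early if epoch < switch_epoch else alpha_late

# Weights applied to the classifier and localizer losses at a few epochs.
weights = {epoch: (1.0 - alpha_schedule(epoch), alpha_schedule(epoch))
           for epoch in (0, 59, 60, 100)}
```

A smoother ramp between the two values would also fit the description of adaptive re-weighting; the step form is used here only because it is the simplest reading of the stated settings.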

To compare classification performance, the area under the ROC curve (AUC), accuracy, and average precision (AP) of each class are used. For STL, class probabilities obtained from the localizer are used for measuring performance. For the localization task, a performance metric similar to that of [10] is used. It is based on AP, but differs in how true positives and false positives are counted. In classification, a test image is a true positive if its class probability exceeds some threshold. Under the localization metric, a test image whose class probability is greater than the threshold (i.e. a true positive in classification) but whose maximal response in the activation map does not fall within the ground truth annotations, allowing some tolerance, is counted as a false positive. In our experiment, only the positive class is considered for localization AP, since there is no ROI in the negative class. First, the activation map of the positive class is resized to the size of the original image via simple bilinear interpolation. Then it is checked whether the maximal response falls within the ground truth annotations with a tolerance of 16 pixels, which is half of the global stride of 32 of the considered network architecture. If the response is located inside the true annotations, the test image is counted as a true positive. If not, it is counted as a false positive.
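The hit test at the core of this localization metric can be sketched as follows. Two simplifications are assumed and differ from the procedure in the text: instead of resizing the activation map with bilinear interpolation, the arg-max cell is mapped to the pixel coordinates of its center, and the ground truth is represented as axis-aligned boxes `(y0, x0, y1, x1)`. The function names are hypothetical.

```python
def peak_location(activation_map, image_size):
    """Map the arg-max cell of the activation map to pixel coordinates.

    Simplification: uses the cell center rather than bilinear upsampling.
    """
    rows, cols = len(activation_map), len(activation_map[0])
    best_y, best_x = max(
        ((y, x) for y in range(rows) for x in range(cols)),
        key=lambda pos: activation_map[pos[0]][pos[1]],
    )
    scale_y, scale_x = image_size[0] / rows, image_size[1] / cols
    return ((best_y + 0.5) * scale_y, (best_x + 0.5) * scale_x)

def is_localization_hit(peak, boxes, tolerance=16):
    """True if the peak lies inside any box enlarged by `tolerance` pixels."""
    py, px = peak
    return any(
        y0 - tolerance <= py <= y1 + tolerance
        and x0 - tolerance <= px <= x1 + tolerance
        for (y0, x0, y1, x1) in boxes
    )

# Toy 3x3 activation map on a 300x300 image (values are made up).
activation_map = [[0.1, 0.2, 0.3],
                  [0.0, 0.1, 0.9],
                  [0.2, 0.1, 0.0]]
peak = peak_location(activation_map, image_size=(300, 300))
boxes = [(100, 200, 180, 240)]  # hypothetical ground truth box
hit_with_tolerance = is_localization_hit(peak, boxes, tolerance=16)
hit_without_tolerance = is_localization_hit(peak, boxes, tolerance=0)
```

In this toy case the peak falls 10 pixels outside the box, so it counts as a hit only when the 16-pixel tolerance is applied. For a positively classified test image, a hit would be counted as a true positive and a miss as a false positive when computing localization AP.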

Tuberculosis Detection.  We use three CXR datasets in this experiment, namely the KIT, Shenzhen and MC sets. All the CXRs used in this work are de-identified by the corresponding image providers. The KIT set contains 10,848 DICOM images, consisting of 7,020 normal and 3,828 abnormal (TB) cases, from the Korean Institute of Tuberculosis. The Shenzhen (see footnote 2) and MC (see footnote 3) sets are available for research purposes only from the authors of [1, 7, 8]. We train the models using the KIT set, and test the classification and localization performance using the Shenzhen and MC sets. Since the test sets, Shenzhen and MC, do not contain any annotations for TB lesions, we obtain detailed annotations from a TB clinician to evaluate the localization performance.

Fig. 2.

Training curves and 1st layer filters at 5000 iterations in case of average pooling

Table 1 summarizes the experimental results. For both classification and localization tasks, STL consistently outperforms the other methods. The best-performing model is STL+AvePool. Global average pooling works well for localization, which is consistent with [15]. Since the value of localization AP is always less than that of classification AP (by the definition of the measure), it is helpful to look at the improvement ratio when comparing performance. Regardless of the pooling method, the improvement of the localization APs over the baselines (i.e. MaxPool and AvePool) for both the Shenzhen and MC sets is much larger than the improvement of the classification APs. This means that STL certainly helps the localizer find the most important ROIs that define the class label. Figure 2 clearly shows the advantages of STL: faster training and better feature learning. Localization examples from the test sets are visualized in Fig. 3.

Mammography. We use two public mammography databases, the Digital Database for Screening Mammography (DDSM) [4, 5] and the Mammographic Image Analysis Society (MIAS) database [13]. DDSM and MIAS are used for training and testing, respectively. We preprocess the DDSM images to have two labels, positive (abnormal) and negative (normal). Originally, abnormal mammographic images contain several types of abnormalities such as masses, microcalcifications, etc. We merge all types of abnormalities into the positive class to distinguish any abnormality from normal, so the numbers of positive and negative images in the training set (DDSM) are 4,025 and 6,338, respectively. In the test set (MIAS), there are 112 positive and 210 negative images. Note that we do not use any information other than image-level labels for training, although the training set has boundary information for abnormal ROIs. The boundary information of the test set is used to evaluate the localization performance.

Table 1. Classification and localization performances for CXRs and mammograms (subscripts + and - denote positive and negative class, respectively)

Table 1 reports the classification and localization results (see footnote 4). As we can see, classification of mammograms is much more difficult than TB detection. First of all, the mammograms used for training are low-quality images that contain some degree of artifact and distortion introduced by the scanning process used to create digital images from films. Moreover, this task is inherently complicated, since quite a few irregular patterns also exist in the negative class, caused by the various shapes and characteristics of normal tissues. Nevertheless, STL is confirmed to be significantly better than the other methods for both classification and localization. Again, regardless of the pooling method, the improvement over the baselines is much larger for localization performance than for classification performance. Figure 3 shows some localization examples from the test set.

Fig. 3.

Localization examples for chest X-rays and mammograms. The top row shows test images with ground-truth annotations. The rows below show the results from MaxPool, AvePool, STL+MaxPool and STL+AvePool, in that order. The activation map for the positive class is linearly scaled to the range between 0 and the maximum probability.

4 Conclusion

In this work, we propose a novel framework, STL, which enables training a CNN for lesion localization with neither location information nor pre-trained models. Our framework jointly learns both the classifier and the localizer using a weighted loss as the objective function in order to prevent the localizer from falling into a bad local optimum. Self-transfer is realized via a weight controlling the relative importance of the classifier and the localizer. The effect of the classifier on the localizer is also discussed to provide the rationale behind the advantages of the proposed framework. Computational experiments on lesion localization given only image-level labels show that the proposed framework outperforms existing approaches in terms of both classification and localization performance metrics.


  1. We add one convolutional layer (i.e. the 6th convolutional layer) since the resolution of the input image is relatively high compared to the input images for [9].

  2. The Shenzhen set has 326 normal and 336 TB cases from Shenzhen No. 3 People’s Hospital, Guangdong Medical College, Shenzhen, China.

  3. The MC set is from the National Library of Medicine, National Institutes of Health, Bethesda, MD, USA. It consists of 80 normal and 58 TB cases.

  4. For global max pooling without STL, the training loss does not decrease at all, i.e., the model cannot be trained. Therefore, its localization performance is not reported in Table 1.


  1. Candemir, S., et al.: Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Trans. Med. Imaging 33(2), 577–590 (2014)

  2. Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in breast cancer histology images with deep neural networks. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 411–418. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_51

  3. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)

  4. Heath, M., Bowyer, K., Kopans, D., Kegelmeyer Jr., P., Moore, R., Chang, K., Munishkumaran, S.: Current status of the digital database for screening mammography. In: Proceedings of the Fourth International Workshop on Digital Mammography, pp. 457–460 (1998)

  5. Heath, M., Bowyer, K., Kopans, D., Moore, R., Kegelmeyer, W.P.: The digital database for screening mammography. In: Proceedings of the 5th international workshop on digital mammography, pp. 212–218 (2000)

  6. Hwang, S., Kim, H.E., Jeong, J., Kim, H.J.: A novel approach for tuberculosis screening based on deep convolutional neural networks. In: Proceedings of SPIE Medical Imaging (2016)

  7. Jaeger, S., Karargyris, A., Candemir, S., Siegelman, J., Folio, L., Antani, S., Thoma, G.: Automatic screening for tuberculosis in chest radiographs: a survey. Quant. Imaging Med. Surg. 3(2), 89–99 (2013)

  8. Jaeger, S., et al.: Automatic tuberculosis screening using chest radiographs. IEEE Trans. Med. Imaging 33(2), 233–245 (2014)

  9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)

  10. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Is object localization for free?-weakly-supervised learning with convolutional neural networks. In: CVPR, pp. 685–694 (2015)

  11. Pinheiro, P.O., Collobert, R.: From image-level to pixel-level labeling with convolutional networks. In: CVPR, pp. 1713–1721 (2015)

  12. Roth, H.R., Lu, L., Farag, A., Shin, H.-C., Liu, J., Turkbey, E.B., Summers, R.M.: DeepOrgan: multi-level deep convolutional networks for automated pancreas segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 556–564. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24553-9_68

  13. Suckling, J., Parker, J., Dance, D., Astley, S., Hutt, I., Boggis, C., Ricketts, I., Stamatakis, E., Cerneaz, N., Kok, S., et al.: The mammographic image analysis society digital mammogram database. Excerpta Medica Int. Cong. Ser. 1069, 375–378 (1994)

  14. Wu, J., Yu, Y., Huang, C., Yu, K.: Deep multiple instance learning for image classification and auto-annotation. In: CVPR, pp. 3460–3469 (2015)

  15. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. arXiv preprint (2015). arXiv:1512.04150

Author information


Correspondence to Sangheum Hwang.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Hwang, S., Kim, HE. (2016). Self-Transfer Learning for Weakly Supervised Lesion Localization. In: Ourselin, S., Joskowicz, L., Sabuncu, M., Unal, G., Wells, W. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. MICCAI 2016. Lecture Notes in Computer Science, vol 9901. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46722-1

  • Online ISBN: 978-3-319-46723-8

  • eBook Packages: Computer Science, Computer Science (R0)