Abstract
The performance of convolutional neural network based image retrieval depends on the characteristics and statistics of the data used for training. We show that for training datasets with a large number of classes but a small number of images per class, the combination of cross-entropy loss and center loss works better than either loss alone. While cross-entropy loss tries to minimize misclassification of data, center loss minimizes the embedding space distance of each point in a class to its center, bringing together data points belonging to the same class.
1 Introduction
A common approach to identifying features in content-based image retrieval (CBIR) is to train a multi-class deep model with a large fully supervised training set, and then use features from various layers of the network as a basis for coding database images (which need not be drawn from the classes used to train the network). Early attempts at retrieval were based on cross-entropy loss. Triplet loss has also been used to train networks for image retrieval [4]. However, optimizing triplet loss is challenging because the level of relative similarity or dissimilarity in each training triplet determines how fast the network learns.
In this paper we study the use of center loss [14, 16] for image retrieval. Center loss reduces the distance of each data point to its class center. It is not as difficult to train as triplet loss, and its performance does not depend on how the training data points (triplets) are selected. Combining it with a softmax loss prevents the embeddings from collapsing.
Our experiments show that for training datasets with few images per class but a large number of classes, center loss yields a significant improvement in retrieval.
2 Related Works
Some of the classical papers in image retrieval include [3, 7, 8, 13]. Most of the recent work is based on training CNN models [2, 5, 12]. Both [11] and [17] review these techniques.
[2] achieved huge performance improvements by training the network on datasets related to the query. [9] showed that intermediate layers capture local patterns of objects and perform better for image retrieval than the final layer output. Similarly, [15] uses the regional maximum activations of convolutions, R-MAC, for the same purpose. R-MAC uses a CNN to obtain a local descriptor of the image, which is then max pooled over different regions in a rigid grid, normalized, whitened and sum-aggregated to give a compact output vector. [4] uses a similar process, but with region proposals instead of the rigid grid to define the aggregation regions.
Center loss was first used for face recognition by [16]. They update the centers per mini-batch based on the gradient of the center loss, and combine center loss with softmax loss for stability. [14] used a similar idea for few-shot learning, applying softmax over center distances. Instead of updating the centers, they recalculate them per mini-batch from the image classes in the support set of the mini-batch, using episodic learning.
3 Our Algorithm
Our technique combines center loss with cross-entropy loss on a ResNet18 [6] based network, as shown in Fig. 1. Suppose there are K classes and that the \(k^{th}\) class has \(N_{k}\) images. Let \(f^{1}_{y_{i}}(x_{i})\) be the pre-final layer output obtained by passing the \(i^{th}\) image (\(x_{i}\)) with label \(y_{i}\) through the network. Similarly, let \(f^{2}_{y_{i}}(x_{i})\) be the final FC layer output and let B be the number of images per batch.
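First, the training images are passed through a network pre-trained on ImageNet, giving the feature descriptor \(f^{1}_{y_{i}}(x_{i})\) for each image. The center \(c_{k}\) of the \(k^{th}\) class is then computed as the mean of the descriptors of its \(N_{k}\) images:
\[c_{k} = \frac{1}{N_{k}} \sum_{i \,:\, y_{i} = k} f^{1}_{y_{i}}(x_{i}) \qquad (1)\]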
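A minimal sketch of the center computation, assuming each center is the per-class mean of the feature descriptors. The function name `class_centers` and the plain-list representation of descriptors are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def class_centers(features, labels):
    """Return {label: mean feature vector} over all descriptors of that label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        # Accumulate the per-dimension sum for this class.
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    # Divide each accumulated sum by the class size N_k.
    return {k: [s / counts[k] for s in total] for k, total in sums.items()}


# Toy 2-D descriptors: two images of class 0, one image of class 1.
features = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]]
labels = [0, 0, 1]
centers = class_centers(features, labels)
```

Calling the same function on the descriptors of the whole training set, rather than on a mini-batch, gives a global recomputation of the centers.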
The distance \(d_{ik}\) from the feature descriptor of the \(i^{th}\) image to each class center \(c_{k}\) is the squared Euclidean distance:
\[d_{ik} = \left\| f^{1}_{y_{i}}(x_{i}) - c_{k} \right\|_{2}^{2} \qquad (2)\]
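The distance matrix can be sketched as follows, assuming the squared Euclidean distance between each descriptor and each center. The name `squared_distances` and the toy values are hypothetical.

```python
def squared_distances(features, centers):
    """Return D with D[i][k] = squared Euclidean distance from feature i to center k."""
    return [
        [sum((f - c) ** 2 for f, c in zip(feat, center)) for center in centers]
        for feat in features
    ]


# Two descriptors and two class centers in 2-D.
features = [[0.0, 0.0], [3.0, 4.0]]
centers = [[0.0, 0.0], [3.0, 0.0]]
D = squared_distances(features, centers)
```

Each row of `D` corresponds to one image and each column to one class center, so a batch of B images and K classes gives a B x K matrix.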
Let D be the matrix with elements \(d_{ik}\). Each \(d_{ik}\) is inverted element-wise to give \(\frac{1}{D}\), so that the setup matches a standard cross-entropy loss model in which the input to the loss layer is an array of scores: a small distance to a center becomes a large score for that class. Let each row of \(\frac{1}{D}\) be denoted \(\frac{1}{d_{i}}\), with corresponding label \(y_{i}\). Passing the rows of \(\frac{1}{D}\) and their labels into a cross-entropy loss function yields the center loss, \(L_{c}\). This is combined with a standard cross-entropy loss, \(L_{s}\), applied to the final fully connected layer, whose output size is the number of classes. The total loss L combines \(L_{s}\) and \(L_{c}\).
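A rough sketch of the center loss described above: the inverse distances are treated as class scores and passed through softmax cross-entropy. Several details are assumptions rather than the authors' settings: the small `eps` that guards against division by zero, averaging over rows, and the hypothetical `weight` used when adding the two losses.

```python
import math


def cross_entropy(scores, label):
    """Softmax cross-entropy for one row of scores (log-sum-exp for stability)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def center_loss(D, labels, eps=1e-6):
    """Mean cross-entropy over the rows of 1/D. eps is an assumed guard for d = 0."""
    losses = []
    for row, y in zip(D, labels):
        inverse = [1.0 / (d + eps) for d in row]  # small distance -> large score
        losses.append(cross_entropy(inverse, y))
    return sum(losses) / len(losses)


def total_loss(loss_s, loss_c, weight=1.0):
    """Combine the two losses; the weight is a hypothetical knob, not from the paper."""
    return loss_s + weight * loss_c


# Each image is close to one center (0.5) and far from the other (4.0).
D = [[0.5, 4.0], [4.0, 0.5]]
loss_correct = center_loss(D, labels=[0, 1])  # labels match the nearest centers
loss_wrong = center_loss(D, labels=[1, 0])    # labels match the farthest centers
```

In this toy case the loss is lower when each image is labelled with its nearest center, which is the behavior the center loss is meant to reward.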
This is similar to the loss in [16], except that we replace the squared Euclidean center loss with a cross-entropy function applied to this distance, as in [14]. The difference from [14] is that we use the inverse of the distance instead of its negative. Applying the cross-entropy function to the squared Euclidean distance helps remove the instability of the center loss. At the end of each epoch we use Eq. 1 to recompute the centers globally over the entire dataset. By contrast, [16] updates the centers with an update formula and [14] recomputes them, but both do so only at the mini-batch level, not globally.
4 Dataset
Google Landmark [10] has 14951 classes with about 1 million images in the original train dataset. We split this into a training set consisting of the first 8951 classes and a query set containing the remaining 6000 classes, so the training and query partitions have no classes in common. Since each query class needs at least 2 images (one as the query and the other for the retrieval/index set), the classes containing only one image are not used. We take a maximum of 10 images per class. The final dataset has 8951 training classes with 72244 images, 5943 query classes with 1 query image per class, and an index set of 42709 images from these 5943 classes.
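The split can be sketched as below. The function name, the dictionary layout of the input, and the file names are hypothetical. The sketch also assumes that the first image of each query class serves as the query, that the cap of 10 images counts the query image, and that only query classes are dropped for having a single image; the text does not spell out these details.

```python
def split_landmark_classes(images_by_class, n_train_classes, max_per_class=10):
    """Split an ordered {class_id: [image ids]} dict into train, query and index sets."""
    class_ids = list(images_by_class)
    train, queries, index = {}, {}, {}
    # The first n_train_classes classes form the training set.
    for cid in class_ids[:n_train_classes]:
        train[cid] = images_by_class[cid][:max_per_class]
    # The remaining classes are used for querying.
    for cid in class_ids[n_train_classes:]:
        images = images_by_class[cid][:max_per_class]
        if len(images) < 2:
            continue  # need one query image plus at least one index image
        queries[cid] = images[0]
        index[cid] = images[1:]
    return train, queries, index


# Toy data: four classes, of which the first two are used for training.
images_by_class = {
    "c0": [f"c0_{i}.jpg" for i in range(12)],
    "c1": ["c1_0.jpg"],
    "c2": ["c2_0.jpg", "c2_1.jpg", "c2_2.jpg"],
    "c3": ["c3_0.jpg"],
}
train, queries, index = split_landmark_classes(images_by_class, n_train_classes=2)
```

Here class `c3` is dropped because it has a single image, mirroring how the 6000 candidate query classes shrink to 5943.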
5 Results
We use ResNet models in PyTorch, pre-trained on ImageNet, as initialization. The final layer is resized to match the number of classes in our training set and is initialized using Xavier uniform initialization. The output size of the pre-final layer is model dependent (512 for ResNet18) and determines the size of the feature descriptor for an image. For all networks, we used Adam optimization with a weight decay of 2e-4. The initial learning rate was set to 0.001, and a stepwise scheduler with a drop rate of 0.92 per epoch was used. We ran the experiments with a batch size of 224.
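As a worked example of the learning-rate schedule, the sketch below assumes that "drop rate of 0.92 per epoch" means the learning rate is multiplied by 0.92 after every epoch. In PyTorch this reading would correspond to `torch.optim.lr_scheduler.StepLR` with `step_size=1` and `gamma=0.92`. The helper name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, initial_lr=0.001, drop_rate=0.92):
    """Learning rate at a given epoch (0-based) under a per-epoch multiplicative decay."""
    return initial_lr * drop_rate ** epoch


# Learning rates for the first few epochs and after ten epochs.
schedule = [learning_rate(e) for e in range(3)]
lr_after_ten = learning_rate(10)
```

Under this assumption the learning rate falls to a little under half of its initial value after ten epochs.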
Mean average precision (mAP) was used as the evaluation criterion. For the Google Landmark dataset, given a query image, all other images from the same class are correct retrieval results and images from other classes are incorrect retrieval results.
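A small sketch of how mAP can be computed under this relevance definition. It assumes that each query is ranked against the full index set, so every relevant image appears somewhere in the ranked list. The function names and toy labels are illustrative.

```python
def average_precision(ranked_labels, query_label):
    """Average precision for one query, given the class labels of the ranked results."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """rankings: list of (query_label, ranked_labels) pairs."""
    return sum(average_precision(r, q) for q, r in rankings) / len(rankings)


# Two queries of class "a" with different ranked result lists.
rankings = [
    ("a", ["a", "b", "a"]),  # relevant at ranks 1 and 3
    ("a", ["b", "a"]),       # relevant at rank 2
]
ap_first = average_precision(rankings[0][1], rankings[0][0])
ap_second = average_precision(rankings[1][1], rankings[1][0])
map_score = mean_average_precision(rankings)
```

The first query scores higher because one of its relevant images is ranked first, while the second query only retrieves its relevant image at rank 2.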
Table 1 shows that when the training dataset has few (\(\le 10\)) images per class, center loss improves retrieval performance. To understand the behavior of the center loss based network, we conducted a t-SNE analysis of all 3 models in Table 1, shown in Fig. 2.
One main difference from previous work is that we train on a very different data distribution, with a huge number of classes and few images per class. Unfortunately, no previous results are available for models trained on a data distribution similar to the partial Google Landmarks set, so we cannot make a direct comparison.
6 Conclusion
We explored the effect of center loss training on image retrieval applications. A combination of center loss and cross-entropy loss performs better than using either cross-entropy loss or center loss alone. Also, applying cross-entropy to the center distances to compute the center loss, instead of using the squared Euclidean distance directly, stabilizes the center loss network. Any of the earlier techniques, including VLAD encoding of intermediate layers, R-MAC, etc., can be used on top of this network for better results. The center loss based network is most useful when the training dataset has a large number of classes with few images per class. In the future, we plan to apply the model to other applications such as clustering and few-shot learning.
References
Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part I. LNCS, vol. 8689, pp. 584–599. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_38
Chen, Y., Wang, J.Z., Krovetz, R.: Clue: cluster-based retrieval of images by unsupervised learning. IEEE Trans. Image Process. 14(8), 1187–1201 (2005)
Gordo, A., Almazán, J., Revaud, J., Larlus, D.: Deep image retrieval: learning global representations for image search. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part VI. LNCS, vol. 9910, pp. 241–257. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_15
Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. Int. J. Comput. Vis. 124(2), 237–254 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Liu, Y., Zhang, D., Lu, G.: Region-based image retrieval with high-level semantics using decision tree learning. Pattern Recognit. 41(8), 2554–2570 (2008)
Manjunath, B.S., Ma, W.Y.: Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 837–842 (1996)
Ng, J.Y.H., Yang, F., Davis, L.S.: Exploiting local features from deep networks for image retrieval. arXiv preprint arXiv:1504.05133 (2015)
Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with attentive deep local features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3456–3465 (2017)
Rafiee, G., Dlay, S.S., Woo, W.L.: A review of content-based image retrieval. In: 7th International Symposium on Communication Systems Networks and Digital Signal Processing (CSNDSP), pp. 775–779. IEEE (2010)
Salvador, A., Giró-i Nieto, X., Marqués, F., Satoh, S.: Faster R-CNN features for instance search. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 394–401. IEEE (2016)
Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 19(5), 530–535 (1997)
Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems, pp. 4080–4090 (2017)
Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015)
Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part VII. LNCS, vol. 9911, pp. 499–515. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_31
Zheng, L., Yang, Y., Tian, Q.: SIFT meets CNN: a decade survey of instance retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 40(5), 1224 (2017)
Acknowledgement
This work was supported by the DARPA MediFor program under cooperative agreement FA87501620191, “Physical and Semantic Integrity Measures for Media Forensics”. The authors acknowledge the Maryland Advanced Research Computing Center (MARCC) for providing computing resources.
© 2019 Springer Nature Switzerland AG
Ghosh, P., Davis, L.S. (2019). Understanding Center Loss Based Network for Image Retrieval with Few Training Data. In: Leal-Taixé, L., Roth, S. (eds) Computer Vision – ECCV 2018 Workshops. ECCV 2018. Lecture Notes in Computer Science(), vol 11132. Springer, Cham. https://doi.org/10.1007/978-3-030-11018-5_63
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11017-8
Online ISBN: 978-3-030-11018-5