Abstract
Fine-grained image classification is an active research area in computer vision. Animal breed classification in particular is an arduous task due to the challenges posed by camera-trap images, such as occlusion, camouflage, poor illumination and pose variation. In this paper, we propose a fine-grained animal breed classification model using supervised clustering based on a Multi Part-Convolutional Neural Network (MP-CNN) and Expectation–Maximization (EM) clustering. The proposed model follows a straightforward pipeline that combines deep feature extraction using a CNN pre-trained on ImageNet with EM clustering of the unlabelled data. Further, we propose a multi-discriminative-part selection and detection scheme for precise classification of animal breeds without using bounding boxes or part annotations in either the training or testing phase. The model is tested on several benchmark animal datasets, including the largest camera-trap dataset, Snapshot Serengeti, and achieves a cumulative accuracy of 98.4%. The results from the proposed model strengthen the belief that supervised training of a deep CNN on a large and versatile dataset extracts better features than most traditional approaches, even for unsupervised tasks.
References
Swanson A, Kosmala M, Lintott C, Simpson R, Smith A, Packer C (2015) Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna. Sci Data 2:150026
Deng J, Dong W, Socher R, Li LJ, Li K, Fei LF (2009) ImageNet: a large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 248–255
Guérin J, Gibaru O, Thiery S, Nyiri E (2017) CNN features are also great at Unsupervised Classification. arXiv:abs/1707.01700
Feng H, Wang S, Ge SS (2018) Fine-grained visual recognition with salient feature detection. arXiv:abs/1808.03935
Gómez Villa A, Salazar A, Vargas F (2017) Towards automatic wild animal monitoring: identification of animal species in camera-trap images using very deep convolutional neural networks. Ecol Inform 41:24–32. https://doi.org/10.1016/j.ecoinf.2017.07.004
Norouzzadeh MS, Nguyen A, Kosmala M, Swanson A, Palmer MS, Packer C, Clune J (2018) Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proc Natl Acad Sci 115(25):5716–5725
Jaskó G, Giosan I, Nedevschi S (2017) Animal detection from traffic scenarios based on monocular color vision. In: 2017 13th IEEE international conference on intelligent computer communication and processing (ICCP), IEEE, pp 363–368
Sharma SU, Shah DJ (2016) A practical animal detection and collision avoidance system using computer vision technique. IEEE Access 5:347–358
Meena SD, Agilandeeswari L (2020) Stacked convolutional autoencoder for detecting animal images in cluttered scenes with a novel feature extraction framework. In: Soft computing for problem solving, Springer, Singapore, pp 513–522
Meena SD, Agilandeeswari L (2019) Adaboost cascade classifier for classification and identification of wild animals using movidius neural compute stick. Int J Eng Adv Technol (IJEAT) 9(1S3):495–499. https://doi.org/10.35940/ijeat.a1089.1291s319
Gupta P, Verma GK (2017) Wild animal detection using discriminative feature-oriented dictionary learning. In: 2017 International conference on computing, communication and automation (ICCCA), IEEE, pp 104–109
Antônio WH, Da Silva M, Miani RS, Souza JR (2019) A proposal of an animal detection system using machine learning. Appl Artif Intell 33(13):1093–1106
Xie L, Tian Q, Hong R, Yan S, Zhang B (2013) Hierarchical part matching for fine-grained visual categorization. In: International conference of computer vision (ICCV), pp 1641–1648
Berg T, Belhumeur P (2013) Poof: part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 955–962
Branson S, VanHorn G, Belongie S, Perona P (2014) Bird species categorization using pose normalized deep convolutional nets. arxiv:1406.2952
Zhang N, Donahue J, Girshick R, Darrell T (2014) Part based R-CNNs for fine-grained category detection. In: European conference on computer vision (ECCV), pp 834–849
Lin D, Shen X, Lu C, Jia J (2015) Deep lac: deep localization, alignment and classification for fine-grained recognition. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 1666–1674
Huang S, Xu Z, Tao D, Zhang Y (2016) Part-stacked CNN for fine-grained visual categorization. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 1173–1182
Yao H, Zhang S, Zhang Y, Li J, Tian Q (2016) Coarse-to-fine description for fine-grained visual categorization. IEEE Trans Image Process (TIP) 25(10):4858–4872
Xu Z, Huang S, Zhang Y, Tao D (2016) Webly-supervised fine-grained visual categorization via deep domain adaptation. IEEE Trans Pattern Anal Mach Intell (TPAMI)
Xu Z, Tao D, Huang S, Zhang Y (2017) Friend or foe: fine-grained categorization with weak supervision. IEEE Trans Image Process (TIP) 26(1):135–146
Xie L, Tian Q, Wang M, Zhang B (2014) Spatial pooling of heterogeneous features for image classification. IEEE Trans Image Process (TIP) 23(5):1994–2008
Krause J, Jin H, Yang J, Fei-Fei L (2015) Fine-grained recognition without part annotations. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 5546–5555
Simon M, Rodner E (2015) Neural activation constellations: unsupervised part model discovery with convolutional networks. In: International conference of computer vision (ICCV), pp 1143–1151
Lin TY, Chowdhury AR, Maji S (2015) Bilinear CNN models for fine-grained visual recognition. In: International conference of computer vision (ICCV), pp 1449–1457
Zhang X, Xiong H, Zhou W, Tian Q (2016) Fused one-vs-all features with semantic alignments for fine-grained visual categorization. IEEE Trans Image Process (TIP) 25(2):878–892
Zhang L, Yang Y, Wang M, Hong R, Nie L, Li X (2016) Detecting densely distributed graph patterns for fine grained image categorization. IEEE Trans Image Process (TIP) 25(2):553–565
Zheng H, Fu J, Mei T, Luo J (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In: International conference of computer (ICCV), pp 5209–5217
Liu J, Kanazawa A, Jacobs D, Belhumeur P (2012) Dog breed classification using part localization. In: European conference on computer vision, Springer, Berlin, pp 172–185
Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV (2012) Cats and dogs. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, pp 3498–3505
Khosla A, Jayadevaprakash N, Yao B, Li FF (2011) Novel dataset for fine-grained image categorization: Stanford dogs. In: Proc. CVPR workshop on fine-grained visual categorization (FGVC), vol 2, no 1
Mulligan K, Rivas P (2019) Dog breed identification with a neural network over learned representations from the Xception CNN architecture. In: 21st International conference on artificial intelligence (ICAI 2019)
Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV (2019) Autoaugment: learning augmentation strategies from data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 113–123
Touvron H, Vedaldi A, Douze M, Jégou H (2019) Fixing the train-test resolution discrepancy. In: Advances in neural information processing systems, pp 8250–8260
Kolesnikov A, Beyer L, Zhai X, Puigcerver J, Yung J, Gelly S, Houlsby N (2019) Large scale learning of general visual representations for transfer. arXiv preprint arXiv:1912.11370
Lee J, Won T, Hong K (2020) Compounding the performance improvements of assembled techniques in a convolutional neural network. arXiv preprint arXiv:2001.06268
Meena SD, Agilandeeswari L (2019) An efficient framework for animal breeds classification using semi-supervised learning and multi-part convolutional neural network (MP-CNN). IEEE Access 7:151783–151802
Liu X, Xia T, Wang J, Yang Y, Zhou F, Lin Y (2016) Fully convolutional attention networks for fine-grained recognition. arXiv preprint arXiv:1603.06765
Zheng H, Fu J, Mei T, Luo J (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE international conference on computer vision, pp 5209–5217
Sun M, Yuan Y, Zhou F, Ding E (2018) Multi-attention multi-class constraint for fine-grained image recognition. In: Proceedings of the European conference on computer vision (ECCV), pp 805–821
Dubey A, Gupta O, Guo P, Raskar R, Farrell R, Naik N (2018) Pairwise confusion for fine-grained visual classification. In: Proceedings of the European conference on computer vision (ECCV), pp 70–86
Sun G, Cholakkal H, Khan S, Khan FS, Shao L (2019) Fine-grained recognition: accounting for subtle differences between similar classes. arXiv preprint arXiv:1912.06842
Hu T, Qi H, Huang Q, Lu Y (2019) See better before looking closer: weakly supervised data augmentation network for fine-grained visual classification. arXiv preprint arXiv:1901.09891
Zhuang P, Wang Y, Qiao Y (2020) Learning attentive pairwise interaction for fine-grained classification. arXiv preprint arXiv:2002.10191
Guo J, Ma S, Guo S (2019) MAANet: multi-view aware attention networks for image super-resolution. arXiv preprint arXiv:1904.06252
Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H (2019) Dual attention network for scene segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3146–3154
Hu T, Yang P, Zhang C, Yu G, Mu Y, Snoek CG (2019) Attention-based multi-context guiding for few-shot semantic segmentation. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 8441–8448
Zhang L, Nizampatnam S, Gangopadhyay A, Conde MV (2019) Multi-attention networks for temporal localization of video-level labels. arXiv preprint arXiv:1911.06866
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Yan X, Ai T, Yang M, Yin H (2019) A graph convolutional neural network for classification of building patterns using spatial vector data. ISPRS J Photogram Remote Sens 150:259–273
Liu JE, An FP (2020) Image classification algorithm based on deep learning-kernel function. Sci Program 2020
Huang C, Li H, Xie Y, Qingbo W, Luo B (2017) PBC: Polygon-based classifier for fine-grained categorization. IEEE Trans Multimed (TMM) 19(4):673–684
Guérin J, Boots B (2018) Improving image clustering with multiple pretrained cnn feature extractors. arXiv preprint arXiv:1807.07760
Long X, Gan C, De Melo G, Wu J, Liu X, Wen S (2018) Attention clusters: purely attention based local feature integration for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7834–7843
Appendix
Two important factors in deep learning models are the dataset split and dataset balancing. The proportion of training and test data plays an important role in computational time and complexity, besides influencing performance. Similarly, dataset balancing is an essential step in training. Hence, the first set of preliminary results with SC-MPEM analyses the effects of dataset splitting and data balancing.
The effects of different proportions of training data on the accuracy of the proposed system are evaluated on the benchmark animal datasets, and the results are presented in Table 10. The proposed model is validated with different proportions of training data: 10%, 20%, 30%, 40% and 60%. SC-MPEM achieves good accuracy even with 10% training data, and accuracy increases with the proportion of training data. However, accuracy starts saturating at around 40% training data, beyond which additional training data has little impact on performance while increasing computational cost and complexity. Hence, the proposed system is trained with 40% training data, and the following experiments use a 40:40:20 split, where 40% each is used for training and testing and the remaining 20% for validation. The effect of a balanced vs. imbalanced dataset is studied and the result is presented in Fig. 22. Following the previous experiment, the dataset proportion is maintained at 40:40:20 on all the benchmark datasets.
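The 40:40:20 split described above can be sketched as a class-stratified split, so that every breed keeps the same proportions in each subset. This is a minimal illustration, not the paper's implementation; the function name and the toy two-class data are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, ratios=(0.4, 0.4, 0.2), seed=0):
    """Split (sample, label) pairs into train/test/validation sets,
    preserving the per-class proportions (here the paper's 40:40:20)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    train, test, val = [], [], []
    for y, items in by_class.items():
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_test = int(len(items) * ratios[1])
        train += [(s, y) for s in items[:n_train]]
        test += [(s, y) for s in items[n_train:n_train + n_test]]
        val += [(s, y) for s in items[n_train + n_test:]]
    return train, test, val

# Toy example: 10 samples per class, two classes -> 4/4/2 per class
samples = list(range(20))
labels = [i // 10 for i in samples]
train, test, val = stratified_split(samples, labels)
print(len(train), len(test), len(val))  # 8 8 4
```

Stratifying per class (rather than splitting the pooled data) matters for multi-class animal datasets, where rare species could otherwise vanish from the validation set.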
It is inferred that an imbalanced dataset affects performance; the difference between the balanced and imbalanced datasets is hard to ignore. Generally, a balanced dataset is preferable for any classification problem, as an imbalanced dataset biases the model towards the majority class. The problem is predominant in multi-class classification, where more than one class may have minimal data. Hence, we synthetically balanced the dataset using SMOTE. However, a balanced dataset does not always perform better: forcing the classes to be balanced may discard valuable patterns in the data, so a large dataset is preferred even if it is unbalanced. Balancing a large dataset with data augmentation techniques makes little difference in the overall results. Among the datasets, Snapshot Serengeti is the largest, so data augmentation had minimal effect on its performance compared to the other three datasets; hence the Snapshot Serengeti dataset was not balanced, while the remaining datasets were. There is also a trade-off in the choice of performance metric: for a balanced dataset, accuracy is the best metric, whereas for an imbalanced dataset, precision and recall are the appropriate measures.
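The SMOTE balancing step [53] interpolates between a minority-class sample and one of its k nearest minority neighbours to synthesize new samples. The sketch below is a minimal numpy rendition of that idea, not the reference implementation (libraries such as imbalanced-learn provide a production version); the function name and toy data are assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Generate synthetic minority-class samples by interpolating between
    each selected sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                # pick a minority sample
        j = neighbours[i, rng.integers(k)] # pick one of its k neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = smote_like_oversample(X_min, n_new=4)
print(X_new.shape)  # (4, 2)
```

Because new points lie on segments between existing minority samples, they stay inside the minority class's local region of feature space rather than duplicating existing samples.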
The second set of results discusses the details of transfer learning (see Table 11), specifically the layer from which the features are extracted, the NMI (Normalized Mutual Information) score and the time complexity. For estimating the NMI, we ensured that no image carries more than one label, since multiple labels make it difficult to judge whether the clustering classified the images correctly. To study these factors, we clustered the dataset using 5 different CNN architectures, 8 different clustering algorithms and various choices of layers, assessing the results by NMI score and time complexity. For simplicity, we used the default values for all hyper-parameters. Both KM and MKM were randomly initialized. Every experiment was run 10 times and the average over these runs is reported. In Table 11, the layer names are unchanged from the architecture and the NMI score is given in bold; the time complexity of each clustering for the various layers is given below the NMI score. The best results are highlighted in bold italics.
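The NMI score used above measures agreement between cluster assignments and ground-truth labels, independent of cluster numbering. A minimal sketch with geometric-mean normalization is shown below (scikit-learn's `normalized_mutual_info_score` offers a reference implementation); the function name and toy labels are assumptions.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information: I(U;V) / sqrt(H(U) * H(V))."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    # Contingency table of class-vs-cluster joint counts
    cont = np.array([[np.sum((labels_true == c) & (labels_pred == k))
                      for k in clusters] for c in classes], dtype=float)
    pij = cont / n                          # joint distribution
    pi = pij.sum(axis=1, keepdims=True)     # class marginal
    pj = pij.sum(axis=0, keepdims=True)     # cluster marginal
    nz = pij > 0
    mi = np.sum(pij[nz] * np.log(pij[nz] / (pi @ pj)[nz]))
    h_true = -np.sum(pi[pi > 0] * np.log(pi[pi > 0]))
    h_pred = -np.sum(pj[pj > 0] * np.log(pj[pj > 0]))
    if h_true == 0 or h_pred == 0:
        return 1.0 if h_true == h_pred else 0.0
    return mi / np.sqrt(h_true * h_pred)

# Perfect clustering under a label permutation still scores 1.0
print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 4))  # 1.0
```

Since NMI is invariant to permutations of the cluster indices, it suits this setting, where EM clustering assigns arbitrary cluster IDs that need not match the breed label encoding.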
From the table, it is inferred that features extracted from the penultimate layer of the Inception v3 network and clustered using EM clustering produce the state-of-the-art result. Although AHC had the lowest time complexity, it is not considered due to its poor NMI score. Besides, the tenfold cross-validation also stands in favor of the Inception v3 network. The result of each model is depicted in Fig. 23.
The final set of results discusses the misclassifications and how they are handled by MHKC. Figure 24 represents the training accuracy for the class pre-trained using MHKC. The highest score for the chosen images is 6.60 and the lowest is 1.96. The precision on the training dataset is found to be 100%, which is a good indicator of better testing accuracy. With MHKC, we improved the accuracy on the misclassified horse images to 100%; with 100% precision on training data, we achieved 99.97% on testing data. This is far better than the accuracy obtained with TensorFlow, as depicted in Fig. 25. Since the feature vector is pre-computed, the accuracy of the classifier depends solely on its kernel functions.
As additional information on the performance of our proposed model on the Snapshot Serengeti dataset, we present its confusion matrix in Fig. 26. Despite being the largest camera-trap dataset, it has not been widely used for testing, partly because of its size and the lack of resources to train and test such a huge dataset [5]. To counter this problem, Gomez et al. [5] trained only 26 of the 40 mammalian classes. In fact, the original Snapshot Serengeti has 48 classes, of which 40 are mammalian species; the remaining 8 classes include humans, birds, rodents, etc. We intentionally excluded these 8 classes, as they are not animal species and including them would needlessly increase the computational burden. Thus, we trained and tested only the 40 mammalian species, as in Norouzzadeh et al. [6].
Cite this article
Sundaram, D., Loganathan, A. A New Supervised Clustering Framework Using Multi Discriminative Parts and Expectation–Maximization Approach for a Fine-Grained Animal Breed Classification (SC-MPEM). Neural Process Lett 52, 727–766 (2020). https://doi.org/10.1007/s11063-020-10246-3