
A New Supervised Clustering Framework Using Multi Discriminative Parts and Expectation–Maximization Approach for a Fine-Grained Animal Breed Classification (SC-MPEM)

Neural Processing Letters

Abstract

Fine-grained image classification is an active research area in the field of computer vision. Animal breed classification, in particular, is an arduous task owing to the challenges in camera-trap images such as occlusion, camouflage, poor illumination and pose variation. In this paper, we propose a fine-grained animal breed classification model using supervised clustering based on a Multi Part-Convolutional Neural Network (MP-CNN) and Expectation–Maximization (EM) clustering. The proposed model follows a straightforward pipeline that extracts deep features using a CNN pre-trained on ImageNet and classifies unlabelled data using EM clustering. Further, we propose multi discriminative part selection and detection for the precise classification of animal breeds without using bounding boxes or part annotations in either the training or testing phase. The model is tested on several benchmark animal datasets, including Snapshot Serengeti, the largest camera-trap dataset, and has achieved a cumulative accuracy of 98.4%. The results from the proposed model strengthen the belief that supervised training of a deep CNN on a large and versatile dataset extracts better features than most traditional approaches, even for unsupervised tasks.




References

  1. Swanson A, Kosmala M, Lintott C, Simpson R, Smith A, Packer C (2015) Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna. Sci Data 2:150026

  2. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 248–255

  3. Guérin J, Gibaru O, Thiery S, Nyiri E (2017) CNN features are also great at unsupervised classification. arXiv preprint arXiv:1707.01700

  4. Feng H, Wang S, Ge SS (2018) Fine-grained visual recognition with salient feature detection. arXiv preprint arXiv:1808.03935

  5. Gomez-Villa A, Salazar A, Vargas F (2017) Towards automatic wild animal monitoring: identification of animal species in camera-trap images using very deep convolutional neural networks. Ecol Inform 41:24–32. https://doi.org/10.1016/j.ecoinf.2017.07.004

  6. Norouzzadeh MS, Nguyen A, Kosmala M, Swanson A, Palmer MS, Packer C, Clune J (2018) Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proc Natl Acad Sci 115(25):5716–5725

  7. Jaskó G, Giosan I, Nedevschi S (2017) Animal detection from traffic scenarios based on monocular color vision. In: 2017 13th IEEE international conference on intelligent computer communication and processing (ICCP), IEEE, pp 363–368

  8. Sharma SU, Shah DJ (2016) A practical animal detection and collision avoidance system using computer vision technique. IEEE Access 5:347–358

  9. Meena SD, Agilandeeswari L (2020) Stacked convolutional autoencoder for detecting animal images in cluttered scenes with a novel feature extraction framework. In: Soft computing for problem solving, Springer, Singapore, pp 513–522

  10. Meena SD, Agilandeeswari L (2019) Adaboost cascade classifier for classification and identification of wild animals using movidius neural compute stick. Int J Eng Adv Technol (IJEAT) 9(1S3):495–499. https://doi.org/10.35940/ijeat.a1089.1291s319

  11. Gupta P, Verma GK (2017) Wild animal detection using discriminative feature-oriented dictionary learning. In: 2017 International conference on computing, communication and automation (ICCCA), IEEE, pp 104–109

  12. Antônio WH, Da Silva M, Miani RS, Souza JR (2019) A proposal of an animal detection system using machine learning. Appl Artif Intell 33(13):1093–1106

  13. Xie L, Tian Q, Hong R, Yan S, Zhang B (2013) Hierarchical part matching for fine-grained visual categorization. In: International conference on computer vision (ICCV), pp 1641–1648

  14. Berg T, Belhumeur P (2013) Poof: part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 955–962

  15. Branson S, Van Horn G, Belongie S, Perona P (2014) Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952

  16. Zhang N, Donahue J, Girshick R, Darrell T (2014) Part based R-CNNs for fine-grained category detection. In: European conference on computer vision (ECCV), pp 834–849

  17. Lin D, Shen X, Lu C, Jia J (2015) Deep lac: deep localization, alignment and classification for fine-grained recognition. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 1666–1674

  18. Huang S, Xu Z, Tao D, Zhang Y (2016) Part-stacked CNN for fine-grained visual categorization. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 1173–1182

  19. Yao H, Zhang S, Zhang Y, Li J, Tian Q (2016) Coarse-to-fine description for fine-grained visual categorization. IEEE Trans Image Process (TIP) 25(10):4858–4872

  20. Xu Z, Huang S, Zhang Y, Tao D (2016) Webly-supervised fine-grained visual categorization via deep domain adaptation. IEEE Trans Pattern Anal Mach Intell (TPAMI)

  21. Xu Z, Tao D, Huang S, Zhang Y (2017) Friend or foe: fine-grained categorization with weak supervision. IEEE Trans Image Process (TIP) 26(1):135–146

  22. Xie L, Tian Q, Wang M, Zhang B (2014) Spatial pooling of heterogeneous features for image classification. IEEE Trans Image Process (TIP) 23(5):1994–2008

  23. Krause J, Jin H, Yang J, Fei-Fei L (2015) Fine-grained recognition without part annotations. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 5546–5555

  24. Simon M, Rodner E (2015) Neural activation constellations: unsupervised part model discovery with convolutional networks. In: International conference on computer vision (ICCV), pp 1143–1151

  25. Lin TY, Chowdhury AR, Maji S (2015) Bilinear CNN models for fine-grained visual recognition. In: International conference on computer vision (ICCV), pp 1449–1457

  26. Zhang X, Xiong H, Zhou W, Tian Q (2016) Fused one-vs-all features with semantic alignments for fine-grained visual categorization. IEEE Trans Image Process (TIP) 25(2):878–892

  27. Zhang L, Yang Y, Wang M, Hong R, Nie L, Li X (2016) Detecting densely distributed graph patterns for fine grained image categorization. IEEE Trans Image Process (TIP) 25(2):553–565

  28. Zheng H, Fu J, Mei T, Luo J (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In: International conference on computer vision (ICCV), pp 5209–5217

  29. Liu J, Kanazawa A, Jacobs D, Belhumeur P (2012) Dog breed classification using part localization. In: European conference on computer vision, Springer, Berlin, pp 172–185

  30. Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV (2012) Cats and dogs. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, pp 3498–3505

  31. Khosla A, Jayadevaprakash N, Yao B, Fei-Fei L (2011) Novel dataset for fine-grained image categorization: Stanford dogs. In: Proc. CVPR workshop on fine-grained visual categorization (FGVC), vol 2, no 1

  32. Mulligan K, Rivas P (2019) Dog breed identification with a neural network over learned representations from the Xception CNN architecture. In: 21st international conference on artificial intelligence (ICAI 2019)

  33. Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV (2019) Autoaugment: learning augmentation strategies from data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 113–123

  34. Touvron H, Vedaldi A, Douze M, Jégou H (2019) Fixing the train-test resolution discrepancy. In: Advances in neural information processing systems, pp 8250–8260

  35. Kolesnikov A, Beyer L, Zhai X, Puigcerver J, Yung J, Gelly S, Houlsby N (2019) Large scale learning of general visual representations for transfer. arXiv preprint arXiv:1912.11370

  36. Lee J, Won T, Hong K (2020) Compounding the performance improvements of assembled techniques in a convolutional neural network. arXiv preprint arXiv:2001.06268

  37. Meena SD, Agilandeeswari L (2019) An efficient framework for animal breeds classification using semi-supervised learning and multi-part convolutional neural network (MP-CNN). IEEE Access 7:151783–151802

  38. Liu X, Xia T, Wang J, Yang Y, Zhou F, Lin Y (2016) Fully convolutional attention networks for fine-grained recognition. arXiv preprint arXiv:1603.06765

  39. Zheng H, Fu J, Mei T, Luo J (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE international conference on computer vision, pp 5209–5217

  40. Sun M, Yuan Y, Zhou F, Ding E (2018) Multi-attention multi-class constraint for fine-grained image recognition. In: Proceedings of the European conference on computer vision (ECCV), pp 805–821

  41. Dubey A, Gupta O, Guo P, Raskar R, Farrell R, Naik N (2018) Pairwise confusion for fine-grained visual classification. In: Proceedings of the European conference on computer vision (ECCV), pp 70–86

  42. Sun G, Cholakkal H, Khan S, Khan FS, Shao L (2019) Fine-grained recognition: accounting for subtle differences between similar classes. arXiv preprint arXiv:1912.06842

  43. Hu T, Qi H, Huang Q, Lu Y (2019) See better before looking closer: weakly supervised data augmentation network for fine-grained visual classification. arXiv preprint arXiv:1901.09891

  44. Zhuang P, Wang Y, Qiao Y (2020) Learning attentive pairwise interaction for fine-grained classification. arXiv preprint arXiv:2002.10191

  45. Guo J, Ma S, Guo S (2019) MAANet: multi-view aware attention networks for image super-resolution. arXiv preprint arXiv:1904.06252

  46. Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H (2019) Dual attention network for scene segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3146–3154

  47. Hu T, Yang P, Zhang C, Yu G, Mu Y, Snoek CG (2019) Attention-based multi-context guiding for few-shot semantic segmentation. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 8441–8448

  48. Zhang L, Nizampatnam S, Gangopadhyay A, Conde MV (2019) Multi-attention networks for temporal localization of video-level labels. arXiv preprint arXiv:1911.06866

  49. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  50. Yan X, Ai T, Yang M, Yin H (2019) A graph convolutional neural network for classification of building patterns using spatial vector data. ISPRS J Photogram Remote Sens 150:259–273

  51. Liu JE, An FP (2020) Image classification algorithm based on deep learning-kernel function. Sci Program 2020

  52. Huang C, Li H, Xie Y, Wu Q, Luo B (2017) PBC: polygon-based classifier for fine-grained categorization. IEEE Trans Multimed (TMM) 19(4):673–684

  53. Guérin J, Boots B (2018) Improving image clustering with multiple pretrained cnn feature extractors. arXiv preprint arXiv:1807.07760

  54. Long X, Gan C, De Melo G, Wu J, Liu X, Wen S (2018) Attention clusters: purely attention based local feature integration for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7834–7843

Author information

Corresponding author

Correspondence to Agilandeeswari Loganathan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Two important factors in deep learning models are the dataset split and dataset balancing. The proportion of training to test data plays an important role in computational time and complexity, besides influencing performance. Similarly, dataset balancing is an essential step in training. Hence, the first set of preliminary results with SC-MPEM analyses the effects of dataset splitting and data balancing.

The effect of different proportions of training data on the accuracy of the proposed system is evaluated on the benchmark animal datasets, and the results are presented in Table 10. The proposed model is validated with training proportions of 10%, 20%, 30%, 40% and 60%. SC-MPEM achieves good accuracy even with 10% training data, and the accuracy increases with the proportion of training data. However, the accuracy starts to saturate at around 40% of training data; beyond that point, additional training data has little impact on performance while increasing the computational cost and complexity. Hence, the proposed system is trained with 40% of the data. The following experiments therefore use a 40:40:20 split, where 40% is used for training, 40% for testing and the remaining 20% for validation. The effect of a balanced versus an imbalanced dataset is studied and the result is presented in Fig. 22. Following the previous experiment, the dataset proportion is maintained at 40:40:20 on all the benchmark datasets.
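For concreteness, a minimal sketch of how such a stratified 40:40:20 split could be produced with scikit-learn is given below; the function name and the arrays X and y are illustrative and not part of the original pipeline.

```python
from sklearn.model_selection import train_test_split

def split_40_40_20(X, y, seed=0):
    """Stratified 40:40:20 train/test/validation split (illustrative)."""
    # Hold out 40% of the data for training.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.40, stratify=y, random_state=seed)
    # Of the remaining 60%, two thirds (40% overall) go to testing
    # and one third (20% overall) to validation.
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, train_size=2 / 3, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_test, y_test), (X_val, y_val)
```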

Table 10 Effects of different proportions of training data on benchmark datasets
Fig. 22 Performance of balanced versus imbalanced dataset

It is inferred that an imbalanced dataset affects performance, and the gap between the balanced and imbalanced results is hard to ignore. In general, a balanced dataset is desirable for any classification problem, since an imbalanced one biases the model towards the majority class. Class imbalance is particularly common in multi-class classification, where several classes may have minimal data. Hence, we synthetically balanced the datasets using SMOTE [49]. However, a balanced dataset does not always perform better: forcing the classes to be balanced can discard valuable patterns in the data, so a large dataset is often preferable even if it is unbalanced. Balancing a large dataset with data augmentation techniques makes little difference to the overall results. Among the datasets, Snapshot Serengeti is the largest, so data augmentation had minimal effect on its performance compared with the other three datasets; the Snapshot Serengeti dataset was therefore not balanced, while the remaining datasets were. As with every application there is a trade-off, and the same holds for the performance metrics: for a balanced dataset, accuracy is the best metric, whereas for an imbalanced dataset, precision and recall are the appropriate measures.
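As an illustration, the standard SMOTE usage from the imbalanced-learn library is sketched below. Note that SMOTE operates on feature vectors rather than raw images, and the paper does not specify its exact configuration, so the parameters shown are library defaults.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# X: feature vectors of shape (n_samples, n_features); y: class labels.
# SMOTE synthesises new minority-class samples by interpolating between
# each minority sample and its k nearest minority-class neighbours.
smote = SMOTE(k_neighbors=5, random_state=0)
X_balanced, y_balanced = smote.fit_resample(X, y)

print("class counts before:", Counter(y))
print("class counts after: ", Counter(y_balanced))
```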

The second set of results discusses the details of transfer learning (see Table 11), in particular the layer from which the features are extracted, the NMI (Normalized Mutual Information) score and the time complexity. For estimating the NMI, we ensured that no image carries more than one label, since multi-label images make it difficult to judge whether the clustering grouped them correctly. To understand these three factors, we clustered the dataset using 5 different CNN architectures, 8 different clustering algorithms and various choices of layers, assessing the results by NMI score and time complexity. For simplicity, we used the default values for all hyper-parameters. Both KM and MKM were randomly initialized. Each experiment was run 10 times and the reported values are averages over these 10 runs. In Table 11, the layer names are unchanged from the architecture and the NMI score is given in bold; the time complexity of each clustering for the various layers is given below the NMI score. The best results are highlighted in bold italics.
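A minimal sketch of such a comparison is given below, assuming features holds the extracted CNN features and labels the ground-truth classes, and assuming KM, MKM, EM and AHC denote k-means, mini-batch k-means, Gaussian-mixture EM and agglomerative hierarchical clustering respectively (these expansions are inferred from the table abbreviations, not stated in the excerpt).

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import normalized_mutual_info_score

def average_nmi(features, labels, make_model, n_runs=10):
    """Average NMI of a clustering algorithm over several random restarts."""
    scores = []
    for seed in range(n_runs):
        pred = make_model(seed).fit_predict(features)
        scores.append(normalized_mutual_info_score(labels, pred))
    return float(np.mean(scores))

k = len(set(labels))  # number of ground-truth classes
candidates = {
    "KM":  lambda s: KMeans(n_clusters=k, random_state=s),
    "MKM": lambda s: MiniBatchKMeans(n_clusters=k, random_state=s),
    "EM":  lambda s: GaussianMixture(n_components=k, random_state=s),
    "AHC": lambda s: AgglomerativeClustering(n_clusters=k),  # deterministic
}
for name, make in candidates.items():
    print(name, average_nmi(features, labels, make))
```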

Table 11 Details of transfer learning—architectures, feature extraction layer, NMI score and clustering

From the table, it is inferred that features extracted from the penultimate layer of the Inception v3 network and clustered with EM clustering produce the state-of-the-art result. Although AHC had the lowest time complexity, we do not consider it because of its poor NMI score. The tenfold cross-validation also favours the Inception v3 network. The result of each model is depicted in Fig. 23.
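The best-performing combination from Table 11 could be reproduced roughly as follows. This is a sketch using Keras defaults, since the excerpt does not specify the exact pooling, covariance type or preprocessing used; the arrays images and n_classes are assumed to be provided.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import (InceptionV3,
                                                        preprocess_input)
from sklearn.mixture import GaussianMixture

# Penultimate-layer features: 2048-d global-average-pooled activations.
extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def penultimate_features(images):
    """images: float array of shape (n, 299, 299, 3) with values in [0, 255]."""
    return extractor.predict(preprocess_input(images.copy()), verbose=0)

features = penultimate_features(images)        # images assumed loaded
em = GaussianMixture(n_components=n_classes,   # n_classes assumed known
                     covariance_type="full", random_state=0)
cluster_ids = em.fit_predict(features)
```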

Fig. 23 Cross-validation results of the networks

The final set of results discusses the misclassifications and how they are tackled by MHKC. Figure 24 shows the training accuracy for the class retrained using MHKC. The highest score for the chosen images is 6.60 and the lowest is 1.96. The precision on the training dataset is found to be 100%, which is a good indicator of better testing accuracy. With MHKC we improved the accuracy on the misclassified horse images to 100%, and with a 100% precision–recall on the training data we achieved 99.97% on the testing data. This is far better than the accuracy obtained with TensorFlow, as depicted in Fig. 25. The feature vectors are pre-computed, so the accuracy of the classifier depends solely on its kernel functions.
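The excerpt does not define MHKC. Assuming it denotes a hybrid-kernel classifier applied to the pre-computed deep feature vectors, a minimal sketch with scikit-learn might combine two standard kernels as below; the function name, the mixing weight alpha and the kernel parameters are all hypothetical.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

def hybrid_kernel(A, B, alpha=0.5, gamma=1e-3):
    """Convex combination of a linear and an RBF kernel (illustrative)."""
    return alpha * linear_kernel(A, B) + (1 - alpha) * rbf_kernel(A, B, gamma=gamma)

# train_feats/test_feats: pre-computed deep features; *_labels: class ids.
clf = SVC(kernel="precomputed")
clf.fit(hybrid_kernel(train_feats, train_feats), train_labels)
pred = clf.predict(hybrid_kernel(test_feats, train_feats))  # rows: test samples
```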

Fig. 24 Training accuracy and precision–recall curve of the misclassified class retrained using MHKC

Fig. 25 Testing accuracy and precision–recall curve of the MHKC-retrained class

As additional information regarding the performance of the proposed model on the Snapshot Serengeti dataset, we present its confusion matrix in Fig. 26. Despite being the largest camera-trap dataset, Snapshot Serengeti has not been widely used for evaluation, partly because of its size and the lack of resources to train and test such a huge dataset [5]. To counter this problem, Gomez et al. [5] trained only 26 of the 40 mammalian classes. In fact, the original Snapshot Serengeti has 48 classes, of which 40 are mammalian species; the remaining 8 classes include humans, birds, rodents, etc. We intentionally excluded these 8 classes, as they are not the animal species of interest and including them would needlessly increase the computational burden. Thus, like Norouzzadeh et al. [6], we trained and tested only the 40 mammalian species.
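A sketch of how such a filtered confusion matrix could be computed is shown below; the class-id assignment and the arrays y_true and y_pred (assumed to be NumPy arrays of predicted and ground-truth labels) are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Keep only the 40 mammalian classes; here they are assumed to carry ids 0-39.
MAMMAL_IDS = list(range(40))
mask = np.isin(y_true, MAMMAL_IDS)

cm = confusion_matrix(y_true[mask], y_pred[mask], labels=MAMMAL_IDS)
per_class_recall = cm.diagonal() / cm.sum(axis=1)  # row-normalised diagonal
```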

Fig. 26 Confusion matrix of the Snapshot Serengeti dataset

About this article

Cite this article

Sundaram, D., Loganathan, A. A New Supervised Clustering Framework Using Multi Discriminative Parts and Expectation–Maximization Approach for a Fine-Grained Animal Breed Classification (SC-MPEM). Neural Process Lett 52, 727–766 (2020). https://doi.org/10.1007/s11063-020-10246-3

