Multimedia Tools and Applications, Volume 75, Issue 3, pp 1701–1720

Fine-grained object recognition in underwater visual data

  • C. Spampinato
  • S. Palazzo
  • P. H. Joalland
  • S. Paris
  • H. Glotin
  • K. Blanc
  • D. Lingrand
  • F. Precioso


In this paper we investigate the fine-grained object categorization problem of determining fish species in low-quality visual data (images and videos) recorded in real-life settings. We first describe a new annotated dataset of about 35,000 fish images (the MA-35K dataset), derived from the Fish4Knowledge project and covering ten fish species from the Eastern Indo-Pacific bio-geographic zone. We then use a label propagation method to transfer the labels from MA-35K to a set of 20 million fish images, in order to capture greater variability in fish appearance. The resulting annotated dataset, containing over one million annotations (AA-1M), was then manually checked to remove false positives as well as images in which fish occlude one another or are only partially visible. Finally, we randomly selected more than 30,000 fish images, distributed among ten fish species and extracted from about 400 ten-minute videos, and used these data (both images and videos) for the fish task of the LifeCLEF 2014 contest. Together with the release of this fine-grained visual dataset, we present two approaches for fish species classification, one for still images and one for videos. Both approaches achieved high classification performance (for some fish species, precision and recall were close to one) and outperformed state-of-the-art methods. In addition, although the dataset is unbalanced in the number of images per species, both methods (especially the one operating on still images) appear to be rather robust against the long-tail curse of data, showing their best performance on the less populated object classes.


Keywords: Object classification, Marine ecosystem analysis, Environmental monitoring



We thank the Ministère du Redressement Productif (DGCIS) for supporting the RAPID PHRASE project, and the BPI, PACA, TPM for the FUI14 SYCIE project.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • C. Spampinato (1)
  • S. Palazzo (1)
  • P. H. Joalland (2, 3)
  • S. Paris (2, 3, 4)
  • H. Glotin (2, 3)
  • K. Blanc (5)
  • D. Lingrand (5)
  • F. Precioso (5)

  1. Department of Electrical, Electronics and Computer Engineering, University of Catania, Catania, Italy
  2. Aix-Marseille Université, Marseille, France
  3. Université de Toulon, La Garde, France
  4. Institut Universitaire de France (IUF), Paris, France
  5. I3S, UMR UNS-CNRS 7271, University of Nice Sophia Antipolis, Nice, France
