Multimedia Tools and Applications, Volume 76, Issue 5, pp 7497–7517

Speeding up the multimedia feature extraction: a comparative study on the big data approach



The current explosion of multimedia data is significantly increasing the amount of potential knowledge. However, getting to the actual information requires novel content-based techniques, which in turn require time-consuming extraction of indexable features from the raw data. To deal with large datasets, this task needs to be parallelized. There are multiple approaches to choose from, each with its own benefits and drawbacks, and several parameters must be taken into consideration, for example the amount of available resources and the size and availability of the data. In this paper, we empirically evaluate and compare approaches based on Apache Hadoop, Apache Storm, Apache Spark, and Grid computing, employed to distribute the extraction task over an outsourced and distributed infrastructure.


Big data, Image feature extraction, Map Reduce, Apache Storm, Apache Spark, Grid computing



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Centro Singular de Investigación en Tecnoloxías da Información (CITIUS), Universidade de Santiago de Compostela, Rúa de Jenaro de la Fuente Domínguez, Santiago de Compostela, Spain
  2. Laboratory of Data Intensive Systems and Applications, Faculty of Informatics, Masaryk University, Brno, Czech Republic
