A survey on indexing techniques for big data: taxonomy and performance evaluation

  • Survey Paper
  • Knowledge and Information Systems

Abstract

The explosive growth in the volume, velocity, and diversity of data produced by mobile devices and cloud applications has contributed to the abundance of data, or ‘big data.’ Available solutions for efficient data storage and management cannot meet the needs of such heterogeneous data, the amount of which is continuously increasing. Existing indexing solutions become inefficient for retrieval and management as index size and seek time grow rapidly, so an optimized index scheme is required for big data. In real-world applications, the big data indexing problem in cloud computing is widespread in healthcare, enterprises, scientific experiments, and social networks. To date, diverse soft computing, machine learning, and other artificial intelligence techniques have been used to satisfy indexing requirements, yet no state-of-the-art survey in the literature investigates the performance and consequences of techniques for solving big data indexing issues as they enter cloud computing. The objective of this paper is to investigate and examine the existing indexing techniques for big data. A taxonomy of indexing techniques is developed to help researchers understand and select a technique as a basis for designing an indexing mechanism with reduced time and space consumption for BD-MCC. In this study, 48 indexing techniques are studied and compared based on 60 articles related to the topic. The performance of the indexing techniques is analyzed based on their characteristics and big data indexing requirements. The main contribution of this study is a taxonomy that categorizes indexing techniques by method: non-artificial intelligence, artificial intelligence, and collaborative artificial intelligence indexing methods. In addition, the significance of different procedures and their performance is analyzed, along with the limitations of each technique. In conclusion, several key future research topics with the potential to accelerate the progress and deployment of artificial intelligence-based cooperative indexing in BD-MCC are elaborated on.

Acknowledgments

The authors would like to thank the University of Malaya for the grant “Big Data and Mobile Cloud For Collaborative Experiments”, Project Number: P012C-13AFR, and the Malaysian Ministry of Higher Education for support under the University of Malaya High Impact Research Grant “Mobile Cloud Computing: Device and Connectivity”, Project Number: M.C/625/1/HIR/MOE/FCSIT/03.

Author information

Corresponding author

Correspondence to Shahaboddin Shamshirband.

About this article

Cite this article

Gani, A., Siddiqa, A., Shamshirband, S. et al. A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowl Inf Syst 46, 241–284 (2016). https://doi.org/10.1007/s10115-015-0830-y

  • DOI: https://doi.org/10.1007/s10115-015-0830-y
