An efficient parallel indexing structure for multi-dimensional big data using spark

The Journal of Supercomputing

Abstract

With the growing volume of data produced every day, indexing, storing, and retrieving huge amounts of data have become a common problem, especially for multi-dimensional big data. Although the R-tree has proved efficient for indexing multi-dimensional big data, it suffers from the curse of dimensionality: its performance decreases as the number of dimensions increases. Many researchers nevertheless continue to use the R-tree because it is the best-known tree-like structure for indexing multi-dimensional data. To overcome this problem, this paper proposes a new indexing structure called the Parallel Indexing System Structure based on Spark (ParISSS), an efficient system for indexing multi-dimensional big data. ParISSS introduces six types of computing nodes: the reception-node inserts and indexes data; the normal-node stores indexed data; the resolution-node distributes a reception-index to a normal-node; the representative-node receives queries from the user; and the reply-node and check-node send the results to the user. We also introduce the BR*-tree structure to improve the storing and searching processes. We present an extensive experimental evaluation that compares ParISSS with several indexing systems. The experimental results show that ParISSS outperforms the other indexing systems.
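
The curse of dimensionality mentioned in the abstract can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes data spread uniformly over the unit hypercube and asks how long each side of a cubic range query must be to cover a fixed fraction of the volume. The function name `query_side_length` is hypothetical. As the number of dimensions grows, the side length approaches 1, so such a query spans almost the full extent of every axis and is likely to intersect most bounding rectangles of a tree-based index.

```python
def query_side_length(volume_fraction: float, dims: int) -> float:
    """Side length of an axis-aligned hypercube that covers
    `volume_fraction` of the unit hypercube [0, 1]^dims.

    Assumption (illustrative only): data is uniform, so volume
    fraction equals the expected fraction of points selected.
    """
    if not 0.0 < volume_fraction <= 1.0:
        raise ValueError("volume_fraction must be in (0, 1]")
    if dims < 1:
        raise ValueError("dims must be a positive integer")
    # A cube with side s has volume s**dims, so s = fraction**(1/dims).
    return volume_fraction ** (1.0 / dims)


if __name__ == "__main__":
    # Side length needed to select 1% of the space at each dimensionality.
    for d in (2, 4, 8, 16, 32, 64):
        print(f"dims={d:3d}  side={query_side_length(0.01, d):.3f}")
```

Under these assumptions, a query selecting 1% of a 2-dimensional space covers 10% of each axis, whereas in 64 dimensions the same selectivity requires covering more than 90% of each axis.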

Author information

Corresponding author

Correspondence to Sally M. Elghamrawy.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Elmeiligy, M.A., Desouky, A.I.E. & Elghamrawy, S.M. An efficient parallel indexing structure for multi-dimensional big data using spark. J Supercomput 77, 11187–11214 (2021). https://doi.org/10.1007/s11227-021-03718-3

  • DOI: https://doi.org/10.1007/s11227-021-03718-3
