An efficient parallel indexing structure for multi-dimensional big data using spark

Elmeiligy, Manar A.; Desouky, Ali I. El; Elghamrawy, Sally M.

doi:10.1007/s11227-021-03718-3

Published: 22 March 2021

Volume 77, pages 11187–11214, (2021)

The Journal of Supercomputing

Manar A. Elmeiligy¹,
Ali I. El Desouky¹ &
Sally M. Elghamrawy ORCID: orcid.org/0000-0002-5430-390X²

Abstract

With data production increasing daily in recent years, indexing, storing, and retrieving huge amounts of data has become a common problem, especially for multi-dimensional big data. Although the R-tree has proved efficient for indexing multi-dimensional data, and many researchers continue to use it as the best-known tree-like structure for that purpose, it suffers from the curse of dimensionality: its performance decreases as the number of dimensions grows. To overcome these problems, this paper proposes a new indexing structure called Parallel Indexing System Structure based on Spark (ParISSS), an efficient system for indexing multi-dimensional big data. ParISSS introduces six types of computing nodes: the reception-node inserts and indexes data; the normal-node stores indexed data; the resolution-node distributes a reception-index to a normal-node; the representative-node receives queries from the user; and the reply-node and check-node send the results to the user. We also introduce the BR*-tree structure to improve the storing and searching processes. We present an extensive experimental evaluation of our system, comparing it with several indexing systems. The experimental results show that ParISSS outperforms the other indexing systems.
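
The curse of dimensionality mentioned in the abstract can be illustrated with a small calculation. The sketch below is not taken from the paper; it assumes points spread uniformly in a unit hypercube and uses a hypothetical helper, `query_side_length`, to compute how long each side of a cubic range query must be to cover a fixed fraction of the data. As the number of dimensions grows, the side length approaches 1, so a query of the same selectivity spans almost the full extent of every dimension and is likely to intersect most bounding rectangles of a tree index.

```python
def query_side_length(selectivity: float, dims: int) -> float:
    """Side length of a hypercube query covering `selectivity` of a unit hypercube.

    Assumption (not from the paper): data is uniform in [0, 1]^dims, so the
    covered fraction equals the query volume, side ** dims.
    """
    if not 0.0 < selectivity <= 1.0:
        raise ValueError("selectivity must be in (0, 1]")
    if dims < 1:
        raise ValueError("dims must be at least 1")
    # Solve side ** dims == selectivity for side.
    return selectivity ** (1.0 / dims)


if __name__ == "__main__":
    # A query that selects 1% of the data, evaluated at increasing dimensionality.
    for dims in (2, 10, 100):
        side = query_side_length(0.01, dims)
        print(f"dims={dims:>3}  side length per dimension={side:.3f}")
```

Under these assumptions, a 1% query needs a side of 0.1 in two dimensions but roughly 0.63 in ten dimensions, which is one way to see why pruning by bounding rectangles becomes less effective as dimensionality increases.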

Author information

Authors and Affiliations

Dept. of Computer Engineering & Systems, Mansoura University, Mansoura, Egypt
Manar A. Elmeiligy & Ali I. El Desouky
Department of Computer Engineering, MISR Higher Institute for Engineering & Technology, Scientific Research Group in Egypt (SRGE), IEEE Member, Mansoura, Egypt
Sally M. Elghamrawy

Corresponding author

Correspondence to Sally M. Elghamrawy.

About this article

Cite this article

Elmeiligy, M.A., Desouky, A.I.E. & Elghamrawy, S.M. An efficient parallel indexing structure for multi-dimensional big data using spark. J Supercomput 77, 11187–11214 (2021). https://doi.org/10.1007/s11227-021-03718-3

Accepted: 25 February 2021
Published: 22 March 2021
Issue Date: October 2021
DOI: https://doi.org/10.1007/s11227-021-03718-3
