Abstract
The volume of data produced each day has grown rapidly in recent years, making the indexing, storage, and retrieval of huge amounts of data a common problem, especially for multi-dimensional big data. The R-tree has proved efficient for indexing multi-dimensional big data and is the best-known tree-like structure for this purpose, so many researchers continue to use it in their studies. However, the R-tree suffers from the curse of dimensionality: its performance degrades as the number of dimensions increases. To overcome these problems, this paper proposes a new indexing structure called Parallel Indexing System Structure based on Spark (ParISSS), an efficient system for indexing multi-dimensional big data. ParISSS introduces six types of computing nodes: the reception-node inserts and indexes data; the normal-node stores indexed data; the resolution-node distributes a reception-index to a normal-node; the representative-node receives queries from the user; and the reply-node and check-node send the results to the user. We also introduce the BR*-tree structure to improve the storing and searching processes. We present an extensive experimental evaluation that compares ParISSS with several other indexing systems. The experimental results show that ParISSS outperforms these systems.
Cite this article
Elmeiligy, M.A., Desouky, A.I.E. & Elghamrawy, S.M. An efficient parallel indexing structure for multi-dimensional big data using spark. J Supercomput 77, 11187–11214 (2021). https://doi.org/10.1007/s11227-021-03718-3
DOI: https://doi.org/10.1007/s11227-021-03718-3