Abstract
A data stream is a massive, unbounded sequence of data elements generated continuously at a high rate. Stream databases raise new challenges for query processing, both because streaming data changes constantly over time and because users submit a wider range of queries than in traditional databases. In this paper, we propose a system architecture that includes components for distributed indexing of streaming data and for distributed processing of range queries on streaming data. Instead of creating one large, centralized B+Tree index, we create a set of small B+Tree indexes, one for every partition of the streaming data. We also design a distributed range search algorithm that each machine in a Spark cluster can use to process range queries independently on each partition of the streaming data. With the proposed architecture, both the indexing of streaming data and the querying of streaming data can be performed in a distributed and parallel manner. Through several experiments, we demonstrate that the proposed indexing method is scalable and efficient for processing range queries on streaming data compared with existing centralized B+Tree indexing methods; it can therefore be used in applications involving data streams with a large volume of data elements and a large number of range queries.
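The idea of one small index per partition, searched independently and then merged, can be illustrated with a minimal sketch. This is not the authors' implementation: a sorted Python list stands in for each small B+Tree, a thread pool stands in for the machines of a Spark cluster, and the round-robin partitioning and all names (`PartitionIndex`, `build_indexes`, `distributed_range_search`) are assumptions made only for illustration.

```python
from bisect import bisect_left, bisect_right, insort
from concurrent.futures import ThreadPoolExecutor


class PartitionIndex:
    """Stand-in for one small per-partition index: a sorted key list.

    The sorted list mimics the ordered leaf level of a B+Tree, which is
    what a range scan walks. It is not a real B+Tree.
    """

    def __init__(self):
        self.keys = []

    def insert(self, key):
        insort(self.keys, key)  # keep keys ordered on every insert

    def range_search(self, low, high):
        # Return all keys k with low <= k <= high.
        start = bisect_left(self.keys, low)
        end = bisect_right(self.keys, high)
        return self.keys[start:end]


def build_indexes(stream, num_partitions):
    # Hypothetical round-robin partitioning of the incoming elements.
    indexes = [PartitionIndex() for _ in range(num_partitions)]
    for position, key in enumerate(stream):
        indexes[position % num_partitions].insert(key)
    return indexes


def distributed_range_search(indexes, low, high):
    # Each partition is searched independently; the thread pool stands in
    # for cluster machines. Partial results are merged and sorted.
    with ThreadPoolExecutor(max_workers=len(indexes)) as pool:
        partials = pool.map(lambda idx: idx.range_search(low, high), indexes)
        merged = [key for part in partials for key in part]
    return sorted(merged)


stream = [17, 3, 42, 8, 25, 11, 30, 5]
indexes = build_indexes(stream, num_partitions=3)
result = distributed_range_search(indexes, 8, 30)
print(result)  # [8, 11, 17, 25, 30]
```

Because no partition's search depends on another's, the per-partition work can run in parallel; only the final merge needs the partial results together.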
Data availability
The source codes and datasets used in the paper are available from the first author on reasonable request.
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Author information
Contributions
All authors contributed to the study conception and design. The first draft of the manuscript was written by SS and MM and was then reviewed by AMR and AAS. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose. The authors declare no conflict of interest.
Ethical approval
This paper does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Safaee, S., Mirabi, M., Rahmani, A.M. et al. A distributed B+Tree indexing method for processing range queries over streaming data. Cluster Comput 27, 1251–1274 (2024). https://doi.org/10.1007/s10586-023-04015-9
DOI: https://doi.org/10.1007/s10586-023-04015-9