
A distributed B+Tree indexing method for processing range queries over streaming data

Cluster Computing

Abstract

A data stream is a massive, unbounded sequence of data elements generated continuously at a high rate. Stream databases raise new challenges for query processing, both because streaming data changes constantly over time and because users submit a wider range of queries than they do against traditional databases. In this paper, we propose a system architecture that includes components for both distributed indexing of streaming data and distributed processing of range queries on streaming data. Instead of creating one large, centralized B+Tree index structure, we create a set of small B+Tree indexes, one for every partition of the streaming data. We also design a distributed range search algorithm that each individual machine in a Spark cluster can use to independently process range queries on each partition of the streaming data. With the proposed architecture, both the indexing of streaming data and the querying over streaming data can be performed in a distributed and parallel manner. Through several experiments, we demonstrate that the proposed indexing method is scalable and efficient for processing range queries on streaming data compared with existing centralized B+Tree indexing methods, and that it can therefore be used for applications involving data streams with a large volume of data elements and a large number of range queries.
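The idea of one small B+Tree per partition, queried independently and then merged, can be illustrated with a minimal sketch. This is not the authors' algorithm or code: the names `Leaf`, `Inner`, `build_bplus_tree`, `range_search` and `distributed_range_query` are hypothetical, the tree is bulk-loaded from a finished partition rather than updated incrementally, and a local thread pool stands in for the machines of a Spark cluster.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


class Leaf:
    def __init__(self, items):
        self.keys = [k for k, _ in items]
        self.values = [v for _, v in items]
        self.next = None  # link to the right sibling leaf


class Inner:
    def __init__(self, children, seps):
        self.children = children
        self.seps = seps  # seps[i] is the smallest key under children[i + 1]


def min_key(node):
    # Smallest key stored in the subtree rooted at node.
    while isinstance(node, Inner):
        node = node.children[0]
    return node.keys[0]


def build_bplus_tree(items, fanout=4):
    # Assumption: bulk-load from sorted (key, value) pairs; minimum node
    # occupancy is not enforced, which keeps the sketch short.
    items = sorted(items, key=lambda kv: kv[0])
    if not items:
        return Leaf([])
    level = [Leaf(items[i:i + fanout]) for i in range(0, len(items), fanout)]
    for left, right in zip(level, level[1:]):
        left.next = right  # chain the leaves for sequential range scans
    while len(level) > 1:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [Inner(g, [min_key(c) for c in g[1:]]) for g in groups]
    return level[0]


def range_search(root, lo, hi):
    # Descend to the first leaf that may hold a key >= lo ...
    node = root
    while isinstance(node, Inner):
        node = node.children[bisect_left(node.seps, lo)]
    # ... then follow the leaf chain until a key exceeds hi.
    out = []
    while node is not None:
        start = bisect_left(node.keys, lo)
        for k, v in zip(node.keys[start:], node.values[start:]):
            if k > hi:
                return out
            out.append((k, v))
        node = node.next
    return out


def distributed_range_query(indexes, lo, hi):
    # Assumption: a thread pool imitates independent workers; in a cluster,
    # each partition's index would live on the machine holding the partition.
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda idx: range_search(idx, lo, hi), indexes))
    return sorted((r for part in parts for r in part), key=lambda kv: kv[0])


# Three hypothetical partitions of a stream, each indexed separately.
partitions = [
    [(5, "a"), (1, "b"), (9, "c")],
    [(3, "d"), (7, "e")],
    [(8, "f"), (2, "g"), (6, "h")],
]
indexes = [build_bplus_tree(p, fanout=2) for p in partitions]
result = distributed_range_query(indexes, 3, 7)
print(result)  # [(3, 'd'), (5, 'a'), (6, 'h'), (7, 'e')]
```

Because each partition has its own index, no worker needs to coordinate with another during the search; the only shared step is merging the per-partition results.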

Data availability

The source code and datasets used in the paper are available from the first author on reasonable request.

Notes

  1. https://grouplens.org/.

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author information

Contributions

All authors contributed to the study conception and design. The first draft of the manuscript was written by SS and MM and was then reviewed by AMR and AAS. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Meghdad Mirabi.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose. The authors declare no conflict of interest.

Ethical approval

This paper does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Safaee, S., Mirabi, M., Rahmani, A.M. et al. A distributed B+Tree indexing method for processing range queries over streaming data. Cluster Comput 27, 1251–1274 (2024). https://doi.org/10.1007/s10586-023-04015-9

  • DOI: https://doi.org/10.1007/s10586-023-04015-9
