Parallel strategy for multiple scan operations with data replication

Abstract

To support large-scale analytics for Web applications, the backend distributed data management system must provide access to massive data, which makes the scan operation a critical step. To improve scan performance, modern data management systems usually rely on simple partitioned parallelism: tables consist of several partitions, and each scan operation can access multiple partitions separately. This is a simple and effective solution for a single scan operation. In this paper, we consider managing multiple scan operations together, where the situation is no longer straightforward. To address the problem, we propose a parallel strategy that schedules batched scan operations together, going beyond simple partitioned parallelism. First, we use replication to increase parallelism and propose an effective load balancing strategy over replica nodes based on linear programming. Second, we propose an effective chunk-based scheduling algorithm for multi-threaded parallelism on each node, which guarantees that all threads have even workloads under a qualified cost model. Finally, we integrate our parallel scan strategy into an open-source distributed data management system. Experimental evaluation shows that our parallel scan strategy significantly improves the performance of scan operations.
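As a rough illustration of the chunk-based idea only, the sketch below splits several batched scans into fixed-size chunks and assigns them to threads with a greedy longest-first heuristic. This is not the algorithm evaluated in the paper: the cost model (cost equals row count), the function names `split_into_chunks` and `assign_chunks`, and the sample scan sizes are all assumptions made for the example.

```python
import heapq


def split_into_chunks(scans, chunk_rows):
    """Split each scan (name, row_count) into chunks of at most chunk_rows rows."""
    chunks = []
    for name, rows in scans:
        start = 0
        while start < rows:
            size = min(chunk_rows, rows - start)
            chunks.append((name, start, size))  # (scan, first row, row count)
            start += size
    return chunks


def assign_chunks(chunks, num_threads):
    """Greedy heuristic: hand the largest remaining chunk to the least-loaded thread.

    Assumed cost model: the cost of a chunk is its row count.
    """
    heap = [(0, t) for t in range(num_threads)]  # (current load, thread id)
    heapq.heapify(heap)
    plan = [[] for _ in range(num_threads)]
    for chunk in sorted(chunks, key=lambda c: c[2], reverse=True):
        load, t = heapq.heappop(heap)
        plan[t].append(chunk)
        heapq.heappush(heap, (load + chunk[2], t))
    loads = [sum(c[2] for c in thread_chunks) for thread_chunks in plan]
    return plan, loads


# Hypothetical batch of three scans with different sizes.
scans = [("scan_a", 1000), ("scan_b", 250), ("scan_c", 600)]
chunks = split_into_chunks(scans, chunk_rows=200)
plan, loads = assign_chunks(chunks, num_threads=4)
```

With this heuristic the gap between the most and least loaded thread never exceeds the size of the largest chunk, so smaller chunks give a more even split at the price of more scheduling decisions.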



Notes

  1. Similar results are also tested in [15].

  2. http://www.tpc.org/tpch/default.asp

References

  1. Apache. HBase. http://hbase.apache.org/

  2. Bal, H.E., Kaashoek, M.F., Tanenbaum, A.S., Jansen, J.: Replication techniques for speeding up parallel applications on distributed systems. Concurr. Pract. Exper. 4, 337–355 (1992)

  3. Bouganim, L., Florescu, D., Valduriez, P.: Dynamic load balancing in hierarchical parallel database systems. In: Proc. of the Int. Conf. on Very Large Data Bases (VLDB). Mumbai (1996)

  4. Bouganim, L., Florescu, D., Valduriez, P.: Load balancing for parallel query execution on NUMA multiprocessors. Distrib. Parallel Datab. 7(1), 99–121 (1999)

  5. Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., Gruber, R.: Bigtable: A distributed storage system for structured data. In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), pp. 205–218 (2006)

  6. Chen, M.-S., Yu, P.S., Wu, K.-L.: Scheduling and processor allocation for parallel execution of multi-join queries. In: Proceedings of the Eighth International Conference on Data Engineering, pp. 58–67. IEEE Computer Society, Washington, DC (1992)

  7. Cockshott, W.P.: Addressing mechanisms and persistent programming. Chapter 15 in Atkinson et al. (1988)

  8. DeWitt, D., Gray, J.: Parallel database systems: The future of high performance database processing. Commun. ACM 35(6) (1992)

  9. Du, J., Leung, J.Y.T.: Complexity of scheduling parallel task systems. SIAM J. Discret. Math. (1989)

  10. Ferhatosmanoglu, H., Tosun, A.S., Canahuate, G., Ramachandran, A.: Efficient parallel processing of range queries through replicated declustering. Distrib. Parallel Datab. 20(2), 117–147 (2006)

  11. Frikken, K., Atallah, M., Prabhakar, S., Safavi-Naini, R.: Optimal parallel I/O for range queries through replication. In: Proceedings of the 13th International Conference on Database and Expert Systems Applications (DEXA), pp. 669–678 (2002)

  12. Graefe, G.: Volcano-an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng. 6(1) (1994)

  13. IBM: DB2. Intra-partition parallelism. https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005323.html (2009)

  14. Johnson, R., Hardavellas, N., Pandis, I., Mancheril, N., Harizopoulos, S., Sabirli, K., Ailamaki, A., Falsafi, B.: To share or not to share? In: VLDB (2007)

  15. Krikellas, K., Cintra, M., Viglas, S.: Scheduling threads for intra-query parallelism on multicore processors. In: EDBT (2010)

  16. Krompass, S., Kuno, H., Dayal, U., Kemper, A.: Dynamic workload management for very large data warehouses: Juggling feathers and bowling balls. In: Proc. of the 33rd Intl. Conf. on Very Large Databases (VLDB), pp. 1105–1115 (2007)

  17. Kuo, T.-W., Wei, C.-H., Lam, K.-Y.: Real-time data access control on B-tree index structures. In: IEEE 15th International Conference on Data Engineering. Sydney (1999)

  18. Lee, R., Ding, X., Chen, F., Lu, Q., Zhang, X.: MCC-DB: Minimizing cache conflicts in multi-core processors for databases. PVLDB 2(1), 373–384 (2009)

  19. Lim, L., Wang, M., Vitter, J.S.: SASH: A self-adaptive histogram set for dynamically changing workloads. In: Proceedings of the 29th VLDB Conference. Berlin (2003)

  20. Microsoft: SQL Server parallelism enhancements. http://sqlmag.com/sql-server-2008/parallelism-enhancements-sql-server-2008 (2008)

  21. OceanBase. https://github.com/alibaba/oceanbase/

  22. Open Source DB. https://www.postgresql.org/

  23. Oracle Database 11g. Parallel execution. https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm (2007)

  24. Pan, C.S., Zymbler, M.L.: Encapsulation of partitioned parallelism into open-source database management systems. Program Comput. Softw. 41(6), 350–360 (2015)

  25. Percival, C.: Cache missing for fun and profit. In: Proc. of BSDCan 2005 (2005)

  26. Pivotal. GREENPLUM DB. http://greenplum.org/

  27. Qiao, L., Raman, V., Reiss, F., Haas, P.J., Lohman, G.M.: Main-memory scan sharing for multi-core CPUs. Proc. VLDB Endow. 1(1), 610–621 (2008)

  28. Rahm, E., Stöhr, T.: Analysis of parallel scan processing in parallel shared disk database systems. In: Proc. EURO-PAR Conf., LNCS, p. 966. Springer (1995)

  29. Ristau, B., Fettweis, G.: An optimization methodology for memory allocation and task scheduling in SoCs via linear programming. In: SAMOS, pp. 89–98 (2006)

  30. Sokolinsky, L.B.: Survey of architectures of parallel database system. Program Comput. Softw. 30(6), 337–346 (2004)

  31. Son, S.H.: Replicated data management in distributed database systems. ACM SIGMOD, vol. 17, issue 4, pp. 62–69. ACM, New York (1988)

  32. Tsafrir, D.: The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops). In: Proceedings of the 2007 Workshop on Experimental Computer Science (ExpCS '07), pp. 3–3. San Diego (2007)

  33. Valduriez, P.: Parallel Database Systems: Open Problems and New Issues. Distributed and Parallel Databases. Springer (1993)

Acknowledgments

This work is partially supported by the National Science Foundation of China under grant numbers 61702189, 61432006 and 61672232, and by the Youth Science and Technology "Yang Fan" Program of Shanghai under grant number 17YF1427800. Huiqi Hu is the corresponding author.

Author information

Corresponding author

Correspondence to Huiqi Hu.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Web and Big Data

Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao

About this article

Cite this article

Wei, X., Hu, H., Duan, H. et al. Parallel strategy for multiple scan operations with data replication. World Wide Web 22, 2561–2587 (2019). https://doi.org/10.1007/s11280-018-0625-7

Keywords

  • Parallel scan
  • Load balancing
  • Parallel scheduling
  • Distributed data management system