Correlation-aware partitioning for skewed range query optimization

World Wide Web

Abstract

Data partitioning is an effective way to reduce cost and improve query performance in large-scale Web data analytical applications. State-of-the-art partitioning approaches for range queries do not consider how data items are correlated under a given access pattern, especially a skewed one. This paper presents a correlation-aware partitioning model for skewed range queries. It formulates the partitioning optimization problem on continuous correlated data as a geometric step-curve fitting problem. We then prove that the optimal partitioning splits data at range query boundaries. On this basis, Range Boundary Based DP Partitioning is designed to find the optimal partition while significantly reducing computation cost compared with the baseline dynamic programming algorithm. Local is better than global. For efficiency, Bottom-up Merging Partitioning is further proposed, which builds the partitioning by bottom-up merging instead of searching. To evaluate the proposed approaches, experiments are conducted under skewed range query workloads on skewed and uniform datasets; the results show that the approaches improve the efficiency of data partitioning by hundreds of times.
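The abstract names two strategies, a dynamic program restricted to range query boundaries and a bottom-up merge, but does not state the cost model. The Python sketch below is therefore only an illustration under stated assumptions: keys are integers 0..n-1, queries are inclusive (lo, hi) ranges, and the cost of a partition is the squared error of approximating per-key access frequency by one constant step. All function names are hypothetical and are not taken from the paper.

```python
from itertools import accumulate

def access_frequency(n_keys, queries):
    """Per-key access counts for inclusive range queries (lo, hi)."""
    diff = [0] * (n_keys + 1)
    for lo, hi in queries:
        diff[lo] += 1
        diff[hi + 1] -= 1
    return list(accumulate(diff))[:n_keys]

def boundaries(n_keys, queries):
    """Candidate cut positions: where some query starts or has just ended."""
    cuts = {0, n_keys}
    for lo, hi in queries:
        cuts.update((lo, hi + 1))
    return sorted(cuts)

def make_sse(freq):
    """Return sse(a, b): squared error of fitting freq[a:b] by its mean.
    ASSUMED cost model, standing in for the paper's step-curve fitting cost."""
    s1 = [0] + list(accumulate(freq))
    s2 = [0] + list(accumulate(f * f for f in freq))
    def sse(a, b):
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)
    return sse

def total_error(freq, splits):
    sse = make_sse(freq)
    return sum(sse(a, b) for a, b in zip(splits, splits[1:]))

def dp_partition(freq, cuts, k):
    """Best split into at most k partitions, cutting only at `cuts`."""
    sse = make_sse(freq)
    m, inf = len(cuts), float("inf")
    # best[p][j]: minimal error covering cuts[0]..cuts[j] with exactly p partitions
    best = [[inf] * m for _ in range(k + 1)]
    back = [[0] * m for _ in range(k + 1)]
    best[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(1, m):
            for i in range(j):
                cost = best[p - 1][i] + sse(cuts[i], cuts[j])
                if cost < best[p][j]:
                    best[p][j], back[p][j] = cost, i
    p = min(range(1, k + 1), key=lambda q: best[q][m - 1])
    error, j, splits = best[p][m - 1], m - 1, [cuts[-1]]
    while p > 0:  # walk the back-pointers to recover the chosen cuts
        j = back[p][j]
        splits.append(cuts[j])
        p -= 1
    return splits[::-1], error

def bottom_up_merge(freq, cuts, k):
    """Greedy alternative: start from the finest boundary partitioning and
    repeatedly drop the interior cut whose removal adds the least error."""
    sse = make_sse(freq)
    cuts = list(cuts)
    def increase(i):
        return (sse(cuts[i - 1], cuts[i + 1])
                - sse(cuts[i - 1], cuts[i]) - sse(cuts[i], cuts[i + 1]))
    while len(cuts) - 1 > k:
        del cuts[min(range(1, len(cuts) - 1), key=increase)]
    return cuts

queries = [(2, 4), (2, 4), (3, 4), (7, 8)]   # skewed toy workload
freq = access_frequency(10, queries)
cuts = boundaries(10, queries)
print(dp_partition(freq, cuts, 2))
print(bottom_up_merge(freq, cuts, 3))
```

In this sketch the dynamic program examines only pairs of query boundaries rather than pairs of keys, which is where the saving over a key-level baseline would come from. The greedy merge avoids the search entirely, at the price of no optimality guarantee under this assumed cost.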

Acknowledgements

The authors sincerely thank the anonymous reviewers for their insightful comments and valuable suggestions, which substantially improved the quality of this paper. This work is funded by Chinese National Natural Science Fund Grants (61572250, 61362006, 61672176, 61763003), Jiangsu Province Science & Technology Research Grant (BE2014131), Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, Guangxi Natural Science Fund Grants (2016GXNSFAA380192) and Guangxi IBAYT Program (KY2016YB065).

Author information

Corresponding author

Correspondence to Wei Ge.

Cite this article

Ge, W., Li, X., Yuan, C. et al. Correlation-aware partitioning for skewed range query optimization. World Wide Web 22, 125–151 (2019). https://doi.org/10.1007/s11280-018-0547-4

  • DOI: https://doi.org/10.1007/s11280-018-0547-4
