Abstract
Data partitioning is an effective way to reduce cost and improve query performance in large-scale Web data analytical applications. State-of-the-art partitioning approaches for range queries do not consider the correlation of data under particular access patterns, especially skewed ones. This paper presents a correlation-aware partitioning model for skewed range queries. It formulates the partitioning optimization problem on continuous correlated data as a geometrical step curve fitting problem. We then prove that the optimal partitioning splits data on range query boundaries. On this basis, Range Boundary Based DP Partitioning is designed to derive the optimal partitioning while significantly reducing the computation cost compared to the baseline dynamic programming algorithm. Local is better than global. For efficiency, Bottom-up Merging Partitioning is further proposed, which improves partitioning by bottom-up merging instead of searching. To evaluate the proposed approaches, sets of experiments are conducted under skewed range query workloads on skewed and uniform datasets. The results show that the proposed approaches improve the efficiency of data partitioning by hundreds of times.
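The step curve formulation can be illustrated with a minimal sketch. It is not the paper's algorithm: the cost model (squared deviation of per-key access frequency from the partition mean), the inclusive integer key ranges, and the function names `access_counts`, `boundary_candidates` and `fit_step_curve` are all assumptions made for illustration. The sketch only shows the general idea of restricting a dynamic program to cut positions that lie on range query boundaries.

```python
from itertools import accumulate

def access_counts(n_keys, queries):
    """Per-key access frequency for inclusive ranges (lo, hi), via a difference array."""
    diff = [0] * (n_keys + 1)
    for lo, hi in queries:
        diff[lo] += 1
        diff[hi + 1] -= 1
    return list(accumulate(diff[:-1]))

def boundary_candidates(n_keys, queries):
    """Candidate cut positions: the data ends plus every range query boundary."""
    cuts = {0, n_keys}
    for lo, hi in queries:
        cuts.add(lo)        # a partition may start at the first key of a query
        cuts.add(hi + 1)    # or right after the last key of a query
    return sorted(cuts)

def fit_step_curve(freq, cuts, k):
    """Fit a step curve with at most k steps, cutting only at the given positions.

    Assumed cost: sum of squared deviations of freq from each partition's mean.
    Returns (cost, partition boundaries).
    """
    pre = [0] + list(accumulate(freq))
    pre2 = [0] + list(accumulate(f * f for f in freq))

    def sse(a, b):  # cost of one partition holding keys a .. b-1
        s = pre[b] - pre[a]
        return (pre2[b] - pre2[a]) - s * s / (b - a)

    m = len(cuts)
    inf = float("inf")
    # best[j][i]: minimal cost of covering keys [0, cuts[i]) with exactly j partitions
    best = [[inf] * m for _ in range(k + 1)]
    back = [[None] * m for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(1, m):
            for p in range(i):
                if best[j - 1][p] == inf:
                    continue
                cost = best[j - 1][p] + sse(cuts[p], cuts[i])
                if cost < best[j][i]:
                    best[j][i] = cost
                    back[j][i] = p

    # Choose the best partition count not exceeding k, then walk back the cuts.
    j = min(range(1, k + 1), key=lambda parts: best[parts][m - 1])
    total = best[j][m - 1]
    bounds, i = [], m - 1
    while j > 0:
        bounds.append(cuts[i])
        i = back[j][i]
        j -= 1
    bounds.append(0)
    return total, bounds[::-1]

# Hypothetical skewed workload over 10 keys: most queries hit keys 2..4.
queries = [(2, 4), (2, 4), (3, 4), (7, 8)]
freq = access_counts(10, queries)
cuts = boundary_candidates(10, queries)
print(fit_step_curve(freq, cuts, 3))
```

Because the candidate cuts are the query boundaries rather than every key position, the inner loops run over the number of distinct boundaries instead of the number of keys, which is where a saving over a key-level dynamic program would come from under these assumptions.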
Acknowledgements
The authors sincerely thank the anonymous reviewers for their insightful comments and valuable suggestions, which substantially improved the quality of this paper. This work is funded by Chinese National Natural Science Fund Grants (61572250, 61362006, 61672176, 61763003), Jiangsu Province Science & Technology Research Grant (BE2014131), Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, Guangxi Natural Science Fund Grants (2016GXNSFAA380192) and Guangxi IBAYT Program (KY2016YB065).
Cite this article
Ge, W., Li, X., Yuan, C. et al. Correlation-aware partitioning for skewed range query optimization. World Wide Web 22, 125–151 (2019). https://doi.org/10.1007/s11280-018-0547-4
DOI: https://doi.org/10.1007/s11280-018-0547-4