Locality-aware allocation of multi-dimensional correlated files on the cloud platform

Zhang, Xiaofei; Tong, Yongxin; Chen, Lei; Wang, Min; Feng, Shicong

doi:10.1007/s10619-014-7153-y

Published: 06 August 2014

Distributed and Parallel Databases, Volume 33, pages 353–380 (2015)

Abstract

The effective management of enormous data volumes on the Cloud platform has attracted considerable research effort. In this paper, we study the problem of allocating files with multi-dimensional correlations on the Cloud platform, such that files can be retrieved and processed more efficiently. Most prevailing Cloud file systems allocate data according to the principles of fault tolerance and availability, while inter-file correlations, i.e. the ways in which files relate to each other, are often neglected. In practice, data files are commonly correlated in various ways, and correlated files are likely to be involved in the same computation. This raises a new challenge: allocating files with multi-dimensional correlations while taking “subspace locality” into consideration to improve system throughput. We propose two allocation methods for multi-dimensional correlated files stored on the Cloud platform, such that I/O efficiency and data access locality are improved in the MapReduce processing paradigm without compromising the fault tolerance and availability properties of the underlying file system. Unlike the techniques proposed in [1,2], which quickly map the locations of the desired data for a given query \(\mathcal{Q}\), we focus on improving system throughput for batch jobs over correlated data files. We formulate the problem and study a series of solutions on HDFS [9]. Evaluations with real application scenarios demonstrate the effectiveness of our proposals: significant I/O and network costs are saved during data retrieval and processing. For batch OLAP jobs in particular, our solution yields a well-balanced workload among distributed computing nodes.
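To make the notion of locality along one feature subspace concrete, the following is a minimal sketch and not one of the two allocation methods proposed in the paper, which this preview does not describe. The file names, the two features (year, region), the node names, and the helper functions `colocate_by_dimension` and `nodes_touched` are all hypothetical, and replication is ignored.

```python
from collections import defaultdict


def colocate_by_dimension(files, dim, nodes):
    """Place all files sharing the same value in feature dimension `dim`
    on one node, spreading the resulting groups over `nodes` round-robin."""
    groups = defaultdict(list)
    for name, features in files.items():
        groups[features[dim]].append(name)
    placement = {}
    for i, value in enumerate(sorted(groups)):
        for name in groups[value]:
            placement[name] = nodes[i % len(nodes)]
    return placement


def nodes_touched(placement, job_files):
    """Return the set of nodes a job must read from."""
    return {placement[name] for name in job_files}


# Hypothetical files described by two features: (year, region).
files = {
    "f1": ("2013", "asia"),
    "f2": ("2013", "europe"),
    "f3": ("2014", "asia"),
    "f4": ("2014", "europe"),
}
nodes = ["node-a", "node-b"]

# Co-locate by year (dimension 0).
placement = colocate_by_dimension(files, dim=0, nodes=nodes)
year_job = nodes_touched(placement, ["f1", "f2"])    # all 2013 files
region_job = nodes_touched(placement, ["f1", "f3"])  # all asia files
print(len(year_job), len(region_job))  # prints: 1 2
```

In this toy setting, a job over one year reads from a single node, while a job over one region must contact both nodes. Placement that favours one dimension can therefore lose locality in another, which is the tension that allocation of multi-dimensional correlated files has to resolve.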

Notes

Small files are placed in the same block until the block is full.
We believe the closeness measurement is application dependent and treat it as a predefined metric.
We consider every partition group in \(\mathcal{P}_i\) to be equally important, as are the \(m\) different feature subspaces.
Two tasks are orthogonal if they are not performed on the same partition group of \(\mathcal{P}_b\).
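The first and last notes can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: files are assumed to arrive in a fixed order and never to span blocks, a new block is started as soon as the next file would not fit, and a task is modelled as a dictionary with a hypothetical `"group"` key naming the partition group of \(\mathcal{P}_b\) it operates on.

```python
def pack_small_files(files, block_size):
    """Pack (name, size) pairs into blocks in arrival order.

    A new block is opened whenever the next file would overflow
    the current one (assumption: files never span blocks).
    """
    blocks = []
    used = 0
    for name, size in files:
        if size > block_size:
            raise ValueError(f"{name} does not fit in a single block")
        if not blocks or used + size > block_size:
            blocks.append([])
            used = 0
        blocks[-1].append(name)
        used += size
    return blocks


def are_orthogonal(task_a, task_b):
    """Two tasks are orthogonal if they work on different partition groups."""
    return task_a["group"] != task_b["group"]


# Hypothetical sizes in MB with a 100 MB block.
small_files = [("a", 30), ("b", 50), ("c", 40), ("d", 10)]
blocks = pack_small_files(small_files, block_size=100)
print(blocks)  # prints: [['a', 'b'], ['c', 'd']]

t1 = {"name": "scan-1", "group": "g1"}
t2 = {"name": "scan-2", "group": "g2"}
print(are_orthogonal(t1, t2))  # prints: True
```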

References

Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In: Proceedings of VLDB Endow, pp. 922–933 (2009)
Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of EDBT, pp. 99–110 (2010)
Amazon Web Service. http://s3.amazonaws.com
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in MapReduce. In: Proceedings of SIGMOD, pp. 975–986 (2010)
Brunet, J., Tamayo, P., Golub, T.R., Mesirov, J.P.: Metagenes and molecular pattern discovery using matrix factorization. PNAS 101(12), 4164–4169 (2004)
Chen, Y., Wang, W., Du, X., Zhou, X.: Continuously monitoring the correlations of massive discrete streams. In: Proceedings of CIKM, pp. 1571–1576 (2011)
Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: Proceedings of NSDI, pp. 313–328 (2010)
Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Gerth, J., Talbot, J., Elmeleegy, K., Sears, R.: Online aggregation and continuous query support in MapReduce. In: Proceedings of SIGMOD, pp. 1115–1118 (2010)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of OSDI, pp. 137–150 (2004)
Dittrich, J., Quiané-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). In: Proceedings of VLDB Endow, pp. 515–529 (2010)
Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J.: CoHadoop: flexible data placement and its exploitation in Hadoop. In: Proceedings of VLDB Endow, pp. 575–585 (2011)
Ghemawat, S., Gobioff, H., Leung, S.: The Google file system. In: Proceedings of SOSP, pp. 29–43 (2003)
Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL querying of large RDF graphs. Proc. VLDB 4(11), 1123–1134 (2011)
IMDb. http://www.imdb.com/interfacesplain
Jiang, D., Ooi, B.C., Shi, L., Wu, S.: The performance of MapReduce: an in-depth study. In: Proceedings of VLDB Endow, pp. 472–483 (2010)
Jiang, D., Tung, A.K.H., Chen, G.: MAP-JOIN-REDUCE toward scalable and efficient data analysis on large clusters. IEEE Trans. Knowl. Data Eng. 23(9), 1299–1311 (2011)
Jolliffe, I.T.: Principal Component Analysis. Springer, New York (2002)
Lei, M., Vrbsky, S.V., Hong, X.: An on-line replication strategy to increase availability in Data Grids. J. Futur. Gener. Comput. Syst. 24(2), 85–98 (2008)
Lieberman, H., Selker, T.: Out of context: computer systems that adapt to, and learn from, context. IBM Syst. J. 39(3–4), 617–632 (2000)
Nehme, R., Bruno, N.: Automated partitioning design in parallel database systems. In: Proceedings of SIGMOD, pp. 1137–1148 (2011)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of SIGMOD, pp. 165–178 (2009)
Ranganathan, K., Iamnitchi, A., Foster, I.: Improving data availability through dynamic model-driven replication in large peer-to-peer communities. In: Proceedings of CCGRID, pp. 376–381 (2002)
Samet, H.: Foundations of Multi-dimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc. (2005)
Silberschatz, A., Korth, H., Sudarshan, S.: Database System Concepts, 5th edn. McGraw-Hill Inc. (2006)
Stonebraker, M., Abadi, D., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1), 64–71 (2010)
The Apache Software Foundation. Hadoop. http://hadoop.apache.org/
The Apache Software Foundation. HDFS architecture guide. https://hadoop.apache.org/hdfs/docs/current/hdfs_design.html
The Apache Software Foundation. Hive. http://hive.apache.org/
Wang, J., Wu, S., Gao, H., Li, J., Ooi, B.C.: Indexing multi-dimensional data in a cloud system. In: Proceedings of SIGMOD, pp. 591–602 (2010)
Wang, J., Jea, K.: A near-optimal database allocation for reducing the average waiting time in the grid computing environment. J. Inf. Sci. 179(21), 3772–3790 (2009)
Weil, S.A., Brandt, S.A., Miller, E.L., Long, D.D.E., Maltzahn, C.: Ceph: a scalable, high-performance distributed file system. In: Proceedings of OSDI, pp. 307–320 (2006)
Zhang, X., Ai, J., Wang, Z., Lu, J., Meng, X.: An efficient multi-dimensional index for cloud data management. In: Proceedings of CloudDB, pp. 17–24 (2009)

Author information

Authors and Affiliations

HKUST, Hong Kong, Hong Kong
Xiaofei Zhang, Yongxin Tong & Lei Chen
Google Research USA, New York, NY, USA
Min Wang
Miao Zhen Company, Beijing, People’s Republic of China
Shicong Feng

Corresponding author

Correspondence to Lei Chen.

About this article

Cite this article

Zhang, X., Tong, Y., Chen, L. et al. Locality-aware allocation of multi-dimensional correlated files on the cloud platform. Distrib Parallel Databases 33, 353–380 (2015). https://doi.org/10.1007/s10619-014-7153-y

Issue Date: September 2015