Locality-aware allocation of multi-dimensional correlated files on the cloud platform

Distributed and Parallel Databases

Abstract

The effective management of enormous data volumes on the cloud platform has attracted considerable research effort. In this paper, we study the problem of allocating files with multi-dimensional correlations on the cloud platform so that files can be retrieved and processed more efficiently. Most prevailing cloud file systems allocate data following the principles of fault tolerance and availability, while inter-file correlations, i.e., the ways in which files are related to one another, are often neglected. In practice, data files are commonly correlated in various ways, and correlated files are likely to be involved in the same computation. This raises a new challenge: allocating files with multi-dimensional correlations while taking “subspace locality” into account to improve system throughput. We propose two allocation methods for multi-dimensional correlated files stored on the cloud platform that improve I/O efficiency and data access locality in the MapReduce processing paradigm without compromising the fault tolerance and availability of the underlying file system. Unlike the techniques proposed in [1,2], which quickly locate the desired data for a given query \({\mathcal {Q}}\), we focus on improving system throughput for batch jobs over correlated data files. We formulate the problem and study a series of solutions on HDFS [9]. Evaluations on real application scenarios demonstrate the effectiveness of our proposals: significant I/O and network costs are saved during data retrieval and processing. For batch OLAP jobs in particular, our solution yields a well-balanced workload among distributed computing nodes.

Notes

  1. Small files are placed in the same block until the block is full.

  2. We believe the closeness measurement is application dependent and treat it as a predefined metric.

  3. We consider each partition group in \({\mathcal {P}}_i\) to be equally important, as are the \(m\) different feature subspaces.

  4. Two tasks are orthogonal if they are not performed on the same partition group of \({\mathcal {P}}_b\).
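
Notes 1 and 4 can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Block`, `pack_small_files` and `orthogonal`, the block capacity, and the representation of a task as a set of partition-group identifiers are all assumptions made here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A fixed-capacity storage block (capacity in bytes; hypothetical model)."""
    capacity: int
    files: list = field(default_factory=list)
    used: int = 0


def pack_small_files(files, capacity):
    """Place files into the current block until it is full (Note 1).

    `files` is an iterable of (name, size) pairs. A new block is opened
    only when the next file does not fit into the current one.
    """
    blocks = [Block(capacity)]
    for name, size in files:
        if size > capacity:
            raise ValueError(f"{name} is larger than one block")
        current = blocks[-1]
        if current.used + size > capacity:
            # The current block cannot hold this file, so start a new block.
            current = Block(capacity)
            blocks.append(current)
        current.files.append(name)
        current.used += size
    return blocks


def orthogonal(task_a, task_b):
    """Return True if two tasks share no partition group (Note 4).

    Assumption: each task is described by the set of partition-group
    identifiers it is performed on.
    """
    return set(task_a).isdisjoint(task_b)


if __name__ == "__main__":
    packed = pack_small_files([("a", 40), ("b", 50), ("c", 30), ("d", 70)], 100)
    for index, block in enumerate(packed):
        print(index, block.files, block.used)
    print(orthogonal({0}, {1}), orthogonal({0, 1}, {1}))
```

Under this model, orthogonal tasks touch disjoint partition groups and so do not read the same group.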

References

  1. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In: Proceedings of VLDB Endow, pp. 922–933 (2009)

  2. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of EDBT, pp. 99–110 (2010)

  3. Amazon Web Service. http://s3.amazonaws.com

  4. Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in MapReduce. In: Proceedings of SIGMOD, pp. 975–986 (2010)

  5. Brunet, J., Tamayo, P., Golub, T.R., Mesirov, J.P.: Metagenes and molecular pattern discovery using matrix factorization. PNAS 101(12), 4164–4169 (2004)

  6. Chen, Y., Wang, W., Du, X., Zhou, X.: Continuously monitoring the correlations of massive discrete streams. In: Proceedings of CIKM, pp. 1571–1576 (2011)

  7. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: Proceedings of NSDI, pp. 313–328 (2010)

  8. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Gerth, J., Talbot, J., Elmeleegy, K., Sears, R.: Online aggregation and continuous query support in MapReduce. In: Proceedings of SIGMOD, pp. 1115–1118 (2010)

  9. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of OSDI, pp. 137–150 (2004)

  10. Dittrich, J., Quiané-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). In: Proceedings of VLDB Endow, pp. 515–529 (2010)

  11. Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J.: CoHadoop: flexible data placement and its exploitation in Hadoop. In: Proceedings of VLDB Endow, pp. 575–585 (2011)

  12. Ghemawat, S., Gobioff, H., Leung, S.: The Google file system. In: Proceedings of SOSP, pp. 29–43 (2003)

  13. Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL querying of large RDF graphs. Proc. VLDB 4(11), 1123–1134 (2011)

  14. IMDb. http://www.imdb.com/interfacesplain

  15. Jiang, D., Ooi, B.C., Shi, L., Wu, S.: The performance of MapReduce: an in-depth study. In: Proceedings of VLDB Endow, pp. 472–483 (2010)

  16. Jiang, D., Tung, A.K.H., Chen, G.: MAP-JOIN-REDUCE toward scalable and efficient data analysis on large clusters. IEEE Trans. Knowl. Data Eng. 23(9), 1299–1311 (2011)

  17. Jolliffe, I.T.: Principal Component Analysis. Springer, New York (2002)

  18. Lei, M., Vrbsky, S.V., Hong, X.: An on-line replication strategy to increase availability in Data Grids. J. Futur. Gener. Comput. Syst. 24(2), 85–98 (2008)

  19. Lieberman, H., Selker, T.: Out of context: computer systems that adapt to, and learn from, context. IBM Syst. J. 39(3–4), 617–632 (2000)

  20. Nehme, R., Bruno, N.: Automated partitioning design in parallel database systems. In: Proceedings of SIGMOD, pp. 1137–1148 (2011)

  21. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of SIGMOD, pp. 165–178 (2009)

  22. Ranganathan, K., Iamnitchi, A., Foster, I.: Improving data availability through dynamic model-driven replication in large peer-to-peer communities. In: Proceedings of CCGRID, pp. 376–381 (2002)

  23. Samet, H.: Foundations of Multi-dimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc. (2005)

  24. Silberschatz, A., Korth, H., Sudarshan, S.: Database System Concepts, 5th edn. McGraw-Hill Inc. (2006)

  25. Stonebraker, M., Abadi, D., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1), 64–71 (2010)

  26. The Apache Software Foundation. Hadoop. http://hadoop.apache.org/

  27. The Apache Software Foundation. HDFS architecture guide. https://hadoop.apache.org/hdfs/docs/current/hdfs_design.html

  28. The Apache Software Foundation. Hive. http://hive.apache.org/

  29. Wang, J., Wu, S., Gao, H., Li, J., Ooi, B.C.: Indexing multi-dimensional data in a cloud system. In: Proceedings of SIGMOD, pp. 591–602 (2010)

  30. Wang, J., Jea, K.: A near-optimal database allocation for reducing the average waiting time in the grid computing environment. J. Inf. Sci. 179(21), 3772–3790 (2009)

  31. Weil, S.A., Brandt, S.A., Miller, E.L., Long, D.D.E., Maltzahn, C.: Ceph: a scalable, high-performance distributed file system. In: Proceedings of OSDI, pp. 307–320 (2006)

  32. Zhang, X., Ai, J., Wang, Z., Lu, J., Meng, X.: An efficient multi-dimensional index for cloud data management. In: Proceedings of CloudDB, pp. 17–24 (2009)

Author information

Corresponding author

Correspondence to Lei Chen.

About this article

Cite this article

Zhang, X., Tong, Y., Chen, L. et al. Locality-aware allocation of multi-dimensional correlated files on the cloud platform. Distrib Parallel Databases 33, 353–380 (2015). https://doi.org/10.1007/s10619-014-7153-y

  • DOI: https://doi.org/10.1007/s10619-014-7153-y
