Big Data Management in the Cloud: Evolution or Crossroad?

Part of the Communications in Computer and Information Science book series (CCIS, volume 613)

Abstract

In this paper, we aim to provide a synthetic and comprehensive state of the art of big data management in cloud environments. From this perspective, data management approaches based on parallel systems and cloud systems (e.g. MapReduce) are surveyed and compared with respect to software requirements (e.g. data independence, software reuse), high performance, scalability, elasticity, and data availability. For the proposed cloud systems, we discuss the evolution of their data manipulation languages and try to draw some lessons that should be exploited to ensure the viability of the next generation of large-scale data management systems for big data applications.

Keywords

  • Big data management
  • Data partitioning
  • Query processing and optimization
  • Parallel Relational Database Systems
  • High performance
  • Scalability
  • Cloud systems
  • Hadoop
  • MapReduce
  • Spark
  • Elasticity

Author information

Corresponding author

Correspondence to Abdelkader Hameurlain.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Hameurlain, A., Morvan, F. (2016). Big Data Management in the Cloud: Evolution or Crossroad?. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_2

  • DOI: https://doi.org/10.1007/978-3-319-34099-9_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-34098-2

  • Online ISBN: 978-3-319-34099-9

  • eBook Packages: Computer Science, Computer Science (R0)