Data Warehousing in Cloud Environments

Living reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7993-3_80623-1

Definition

Data warehousing was born in business information systems environments dominated by relational databases running on traditional servers. Later, the types of source data and source systems widened, and deployment environments increasingly included high-end massively parallel processing (MPP) systems. Today, data warehousing has joined the cloud computing wave, with data warehouse (DW) systems running on private, public, and hybrid clouds, based mainly on clusters of commodity machines. Cloud-based data warehouses employ components for cloud-based data storage, querying, and processing, often using file-based storage of complex, non-relational types of data. A widely used platform is Hadoop, an open-source implementation of Google’s MapReduce platform for scalable dataflow processing on commodity clusters, which was among the earliest systems for cloud data warehousing. While Hadoop is scalable, fault tolerant, and versatile, it is not...

Recommended Reading

  1. Abadi DJ. Data management in the cloud: limitations and opportunities. IEEE Data Eng Bull. 2009;32(1):3–12.
  2. Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB 2009;2(1):922–933. doi:10.14778/1687627.1687731.
  3. Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I. BlinkDB: queries with bounded errors and bounded response times on very large data. EuroSys 2013. doi:10.1145/2465351.2465355.
  4. Armbrust M, Xin RS, Lian C, et al. Spark SQL: relational data processing in Spark. SIGMOD 2015. doi:10.1145/2723372.2742797.
  5. Chan L. Presto: interacting with petabytes of data at Facebook. 2016. https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920. Accessed 28 Jun 2016.
  6. Dean J, Ghemawat S. MapReduce: a flexible data processing tool. CACM 2010;53(1):72–77. doi:10.1145/1629175.1629198.
  7. Gupta A, Agarwal D, Tan D, et al. Amazon Redshift and the case for simpler data warehouses. SIGMOD 2015. doi:10.1145/2723372.2742795.
  8. Liu X, Thomsen C, Pedersen TB. ETLMR: a highly scalable dimensional ETL framework based on MapReduce. DaWaK 2011. doi:10.1007/978-3-642-23544-3_8.
  9. Liu X, Thomsen C, Pedersen TB. CloudETL: scalable dimensional ETL for Hive. IDEAS 2014. doi:10.1145/2628194.2628249.
  10. Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig Latin: a not-so-foreign language for data processing. SIGMOD 2008. doi:10.1145/1376616.1376726.
  11. Özcan F, Hoa D, Beyer KS, Balmin A, Liu CJ, Li Y. Emerging trends in the enterprise analytics: connecting Hadoop and DB2 warehouse. SIGMOD 2011. doi:10.1145/1989323.1989446.
  12. Pavlo A, Paulson E, Rasin A, Abadi DJ, DeWitt DJ, Madden S, Stonebraker M. A comparison of approaches to large-scale data processing. SIGMOD 2009. doi:10.1145/1559845.1559865.
  13. Pike R, Dorward S, Griesemer R, Quinlan S. Interpreting the data: parallel analysis with Sawzall. Sci Program. 2005;13(4):277–298.
  14. Stonebraker M, Abadi D, DeWitt DJ, Madden S, Paulson E, Pavlo A, Rasin A. MapReduce and parallel DBMSs: friends or foes? CACM 2010;53(1):64–71. doi:10.1145/1629175.1629197.
  15. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, et al. Hive – a warehousing solution over a Map-Reduce framework. VLDB 2009. doi:10.14778/1687553.1687609.
  16. Xin R, Rosen J, Zaharia M, Franklin MJ, Shenker S, Stoica I. Shark: SQL and rich analytics at scale. SIGMOD 2013. doi:10.1145/2463676.2465288.
  17. Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. NSDI 2012.

Authors and Affiliations

  1. Department of Computer Science, Aalborg University, Aalborg, Denmark

Section editors and affiliations

  • Torben Bach Pedersen (1)
  • Stefano Rizzi (2)
  1. Department of Computer Science, Aalborg University, Aalborg, Denmark
  2. DISI, University of Bologna, Bologna, Italy