Web-Scale Analytics for BIG Data

  • Wolfgang Lehner
  • Kai-Uwe Sattler


Virtualization is a key concept for providing scalable and flexible computing environments. This chapter focuses on virtualization concepts in the context of data management tasks and reviews existing concepts and technologies that span multiple software layers.


Keywords: Query Processing, Relational Algebra, Functional Programming, Storage Node, Execution Plan



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Wolfgang Lehner (1)
  • Kai-Uwe Sattler (2)
  1. Dresden University of Technology, Dresden, Germany
  2. Ilmenau University of Technology, Ilmenau, Germany
