The VLDB Journal, Volume 21, Issue 5, pp 611–636

SCOPE: parallel databases meet MapReduce

  • Jingren Zhou (Email author)
  • Nicolas Bruno
  • Ming-Chuan Wu
  • Per-Ake Larson
  • Ronnie Chaiken
  • Darren Shakib
Special Issue Paper

Abstract

Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets, such as search logs, click streams, and web graph data. For cost and performance reasons, processing is typically done on large clusters of tens of thousands of commodity machines. Such massive data analysis on large clusters presents new opportunities and challenges for developing a highly scalable and efficient distributed computation system that is easy to program and supports complex system optimization to maximize performance and reliability. In this paper, we describe a distributed computation system, Structured Computations Optimized for Parallel Execution (Scope), targeted for this type of massive data analysis. Scope combines benefits from both traditional parallel databases and MapReduce execution engines to allow easy programmability and deliver massive scalability and high performance through advanced optimization. Similar to parallel databases, the system has a SQL-like declarative scripting language with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. An optimizer is responsible for converting scripts into efficient execution plans for the distributed computation engine. A physical execution plan consists of a directed acyclic graph of vertices. Execution of the plan is orchestrated by a job manager that schedules execution on available machines and provides fault tolerance and recovery, much like MapReduce systems. Scope is being used daily for a variety of data analysis and data mining applications over tens of thousands of machines at Microsoft, powering Bing, and other online services.
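The execution model in the abstract (a plan that is a directed acyclic graph of vertices, run by a job manager that schedules vertices and recovers from failures) can be illustrated with a small sketch. This is not Scope's job manager: it runs in a single process, it simulates a machine failure with a `RuntimeError`, and every name in it (`run_plan`, `extract`, `flaky_count`, `max_attempts`) is hypothetical and chosen only for illustration.

```python
from collections import deque


def run_plan(vertices, edges, max_attempts=3):
    """Run each vertex once all of its inputs are done, retrying on failure.

    vertices: dict mapping a vertex name to a callable that takes a list of
              input results (hypothetical interface, not Scope's).
    edges:    list of (producer, consumer) name pairs forming a DAG.
    """
    inputs = {name: [] for name in vertices}
    consumers = {name: [] for name in vertices}
    for producer, consumer in edges:
        inputs[consumer].append(producer)
        consumers[producer].append(consumer)

    # Number of unfinished inputs per vertex; zero means ready to schedule.
    pending = {name: len(inputs[name]) for name in vertices}
    ready = deque(name for name in vertices if pending[name] == 0)
    results = {}

    while ready:
        name = ready.popleft()
        args = [results[p] for p in inputs[name]]
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = vertices[name](args)
                break
            except RuntimeError:
                # Simulated failure: rerun the vertex, give up after max_attempts.
                if attempt == max_attempts:
                    raise
        for consumer in consumers[name]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)

    if len(results) != len(vertices):
        raise ValueError("plan contains a cycle")
    return results


attempts = {"count": 0}


def extract(_inputs):
    # Stand-in for reading a partition of a log.
    return ["a", "b", "a"]


def flaky_count(inputs):
    # Fails on its first attempt to mimic a lost machine, then succeeds.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated machine failure")
    rows = inputs[0]
    return {key: rows.count(key) for key in set(rows)}


plan_vertices = {"extract": extract, "count": flaky_count}
plan_edges = [("extract", "count")]
plan_results = run_plan(plan_vertices, plan_edges)
```

In this sketch the `count` vertex fails once and is rerun without rerunning `extract`, which is the general idea behind vertex-level recovery. A real system would also distribute vertices across machines and persist intermediate results.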

Keywords

SCOPE, Parallel databases, MapReduce, Distributed computation, Query optimization


Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Jingren Zhou (1), Email author
  • Nicolas Bruno (1)
  • Ming-Chuan Wu (1)
  • Per-Ake Larson (1)
  • Ronnie Chaiken (1)
  • Darren Shakib (1)
  1. Microsoft Corp., Redmond, USA
