Cluster Computing, Recursion and Datalog

  • Foto N. Afrati
  • Vinayak Borkar
  • Michael Carey
  • Neoklis Polyzotis
  • Jeffrey D. Ullman
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6702)

Abstract

The cluster-computing environment typified by Hadoop, the open-source implementation of map-reduce, is receiving serious attention as the way to execute queries and other operations on very large-scale data. Datalog execution presents several unusual issues for this environment. We discuss the best way to execute a round of seminaive evaluation on a computing cluster using map-reduce. Using transitive closure as an example, we examine the cost of executing recursions in several different ways. Recursive processes such as evaluation of a recursive Datalog program do not fit the key map-reduce assumption that tasks deliver output only when they are completed. As a result, the resilience under compute-node failure that is a key element of the map-reduce framework is not supported for recursive programs. We discuss extensions to this framework that are suitable for executing recursive Datalog programs on very large-scale data in a way that allows progress to continue after node failures, without restarting the entire job.
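
For readers unfamiliar with seminaive evaluation, the following is a minimal single-machine sketch of the idea applied to transitive closure. It is not the cluster algorithm of the paper: the function name, the in-memory sets, and the choice of the right-linear rule `path(x, z) :- path(x, y), edge(y, z)` are assumptions made for illustration only. Each round joins only the facts discovered in the previous round (the delta) with the edge relation, rather than rejoining the whole `path` relation.

```python
def transitive_closure_seminaive(edges):
    """Illustrative seminaive evaluation (hypothetical helper) of:
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    Returns the set of path facts and the number of rounds executed.
    """
    # Index the edge relation by source node so the join is a lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    path = set(edges)    # all facts derived so far
    delta = set(edges)   # facts that were new in the previous round
    rounds = 0
    while delta:
        rounds += 1
        # Join only the delta with edge, not the full path relation.
        derived = {(x, z) for (x, y) in delta for z in successors.get(y, ())}
        # Keep only facts not seen before; they form the next delta.
        delta = derived - path
        path |= delta
    return path, rounds


if __name__ == "__main__":
    closure, rounds = transitive_closure_seminaive({(1, 2), (2, 3), (3, 4)})
    print(sorted(closure), rounds)
```

In a map-reduce setting each iteration of the loop would correspond to one round of distributed joins, which is where the communication cost and node-failure questions raised in the abstract arise.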

Keywords

Hash Function, Communication Cost, Transitive Closure, Node Failure, Cluster Computing
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Foto N. Afrati (1)
  • Vinayak Borkar (2)
  • Michael Carey (2)
  • Neoklis Polyzotis (3)
  • Jeffrey D. Ullman (4)

  1. National Technical University of Athens, Greece
  2. UC Irvine, USA
  3. UC Santa Cruz, USA
  4. Stanford University, USA
