Knowledge and Information Systems, Volume 41, Issue 2, pp 531–557

Tuple MapReduce and Pangool: an associated implementation

  • Pedro Ferrera
  • Ivan De Prado
  • Eric Palacios
  • Jose Luis Fernandez-Marquez
  • Giovanna Di Marzo Serugendo
Regular Paper

Abstract

This paper presents Tuple MapReduce, a new foundational model that extends MapReduce with the notion of tuples. Tuple MapReduce bridges the gap between the low-level constructs provided by MapReduce and the higher-level needs of programmers, such as compound records, sorting, or joins. The paper also presents Pangool, an open-source framework that implements Tuple MapReduce. Pangool eases the design and implementation of MapReduce-based applications and increases their flexibility while maintaining Hadoop’s performance. Additionally, the paper provides: pseudocode for relational joins, rollup, and the PageRank algorithm; a Pangool code example; benchmark results comparing Pangool with existing approaches; reports from industry users of Pangool; and a description of a distributed database that exploits Pangool. These results show that Tuple MapReduce can serve as a direct, better-suited replacement for the MapReduce model in current implementations without modifying key system fundamentals.

Keywords

MapReduce, Hadoop, Big Data, Distributed systems, Scalability

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Pedro Ferrera (1)
  • Ivan De Prado (1)
  • Eric Palacios (1)
  • Jose Luis Fernandez-Marquez (2)
  • Giovanna Di Marzo Serugendo (2)
  1. Datasalt Systems S.L., Barcelona, Spain
  2. CUI, University of Geneva, Carouge, Switzerland
