Tuple MapReduce and Pangool: an associated implementation

Ferrera, Pedro; De Prado, Ivan; Palacios, Eric; Fernandez-Marquez, Jose Luis; Di Marzo Serugendo, Giovanna

doi:10.1007/s10115-013-0705-z

Regular Paper
Published: 24 December 2013

Volume 41, pages 531–557 (2014)

Knowledge and Information Systems

Pedro Ferrera¹,
Ivan De Prado¹,
Eric Palacios¹,
Jose Luis Fernandez-Marquez² &
Giovanna Di Marzo Serugendo²

Abstract

This paper presents Tuple MapReduce, a new foundational model that extends MapReduce with the notion of tuples. Tuple MapReduce bridges the gap between the low-level constructs provided by MapReduce and the higher-level needs of programmers, such as compound records, sorting, and joins. The paper also presents Pangool, an open-source framework that implements Tuple MapReduce. Pangool eases the design and implementation of MapReduce-based applications and increases their flexibility while maintaining Hadoop’s performance. Additionally, the paper provides pseudo-code for relational joins, rollup, and the PageRank algorithm; a Pangool code example; benchmark results comparing Pangool with existing approaches; reports from industry users of Pangool; and a description of a distributed database that exploits Pangool. These results show that Tuple MapReduce can serve as a direct, better-suited replacement for the MapReduce model in current implementations without modifying key system fundamentals.
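
To make the idea of working with compound records and sorting more concrete, the following is a minimal single-process Python sketch. It is neither Pangool's API nor the paper's formal definition. The helper `tuple_map_reduce`, the reducer `latest_visits`, and the field names `url`, `date`, and `visits` are all hypothetical, and the choice to group by one set of fields while ordering each group by additional fields is an assumption made for illustration only.

```python
from itertools import groupby


def tuple_map_reduce(records, group_by, sort_by, reduce_fn):
    """Simulate one job over dict-based "tuples" (illustrative only).

    records   -- iterable of dicts, one per tuple
    group_by  -- field names that define a group
    sort_by   -- extra field names that order tuples inside each group
    reduce_fn -- called as reduce_fn(group_key, tuples_in_order); returns a list
    """
    def group_key(record):
        return tuple(record[f] for f in group_by)

    def sort_key(record):
        return tuple(record[f] for f in list(group_by) + list(sort_by))

    # Sorting by the group fields first and the extra fields second stands in
    # for the shuffle-and-sort phase of a distributed run.
    ordered = sorted(records, key=sort_key)

    output = []
    for key, group in groupby(ordered, key=group_key):
        output.extend(reduce_fn(key, list(group)))
    return output


def latest_visits(key, tuples):
    # Tuples arrive ordered by date, so the last one is the most recent.
    last = tuples[-1]
    return [{"url": key[0], "date": last["date"], "visits": last["visits"]}]


# Hypothetical input records.
records = [
    {"url": "a.com", "date": "2013-03-02", "visits": 5},
    {"url": "b.com", "date": "2013-03-01", "visits": 7},
    {"url": "a.com", "date": "2013-03-01", "visits": 3},
]

result = tuple_map_reduce(records, ["url"], ["date"], latest_visits)
```

In this sketch the reducer never has to sort or parse a composite key by hand: it receives named fields in a known order, which is the kind of convenience the abstract attributes to a tuple-based model.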


Author information

Authors and Affiliations

Datasalt Systems S.L., Barcelona, Spain
Pedro Ferrera, Ivan De Prado & Eric Palacios
CUI, University of Geneva, 1227 Carouge, Switzerland
Jose Luis Fernandez-Marquez & Giovanna Di Marzo Serugendo

Corresponding author

Correspondence to Jose Luis Fernandez-Marquez.

About this article

Cite this article

Ferrera, P., De Prado, I., Palacios, E. et al. Tuple MapReduce and Pangool: an associated implementation. Knowl Inf Syst 41, 531–557 (2014). https://doi.org/10.1007/s10115-013-0705-z

Received: 20 March 2013
Revised: 13 September 2013
Accepted: 17 October 2013
Published: 24 December 2013
Issue Date: November 2014
DOI: https://doi.org/10.1007/s10115-013-0705-z
