Chrum: The Tool for Convenient Generation of Apache Oozie Workflows

Dendek, Piotr Jan; Czeczko, Artur; Fedoryszak, Mateusz; Kawa, Adam; Wendykier, Piotr; Bolikowski, Łukasz

doi:10.1007/978-3-319-04714-0_12

Part of the book series: Studies in Computational Intelligence (SCI, volume 541)

Abstract

Conducting research in an efficient, repeatable, evaluable, and also convenient (in terms of development) way has always been a challenge. To satisfy those requirements in the long term while minimizing the costs of the software engineering process, one has to follow a certain set of guidelines. This article describes such guidelines, based on the research environment called Content Analysis System (CoAnSys), created at the Center for Open Science (CeON). In addition to best practices for working in the Apache Hadoop environment, a tool for convenient generation of Apache Oozie workflows is presented.

Author information

Authors and Affiliations

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warszawa, Poland
Piotr Jan Dendek, Artur Czeczko, Mateusz Fedoryszak, Adam Kawa, Piotr Wendykier & Łukasz Bolikowski

Corresponding authors

Correspondence to Piotr Jan Dendek, Artur Czeczko, Mateusz Fedoryszak, Adam Kawa, Piotr Wendykier or Łukasz Bolikowski.

Editor information

Editors and Affiliations

Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Robert Bembenik
Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Łukasz Skonieczny
Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Henryk Rybiński
Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Marzena Kryszkiewicz
Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw, Warsaw, Poland
Marek Niezgódka

About this chapter

Cite this chapter

Dendek, P.J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P., Bolikowski, Ł. (2014). Chrum: The Tool for Convenient Generation of Apache Oozie Workflows. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds) Intelligent Tools for Building a Scientific Information Platform: From Research to Implementation. Studies in Computational Intelligence, vol 541. Springer, Cham. https://doi.org/10.1007/978-3-319-04714-0_12

DOI: https://doi.org/10.1007/978-3-319-04714-0_12
Published: 27 February 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04713-3
Online ISBN: 978-3-319-04714-0
eBook Packages: Engineering, Engineering (R0)
