Chrum: The Tool for Convenient Generation of Apache Oozie Workflows

  • Piotr Jan Dendek
  • Artur Czeczko
  • Mateusz Fedoryszak
  • Adam Kawa
  • Piotr Wendykier
  • Łukasz Bolikowski
Part of the Studies in Computational Intelligence book series (SCI, volume 541)


Conducting research in an efficient, repeatable, evaluable, and convenient (in terms of development) way has always been a challenge. To satisfy these requirements in the long term while minimizing the cost of the software engineering process, one has to follow a certain set of guidelines. This article describes such guidelines, based on the research environment called the Content Analysis System (CoAnSys), created at the Center for Open Science (CeON). In addition to best practices for working in the Apache Hadoop environment, a tool for convenient generation of Apache Oozie workflows is presented.


Hadoop, Research environment, Big data, CoAnSys, Text mining


Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warszawa, Poland
