Abstract
Conducting research in a way that is efficient, repeatable, and evaluable, yet convenient from a development standpoint, has always been a challenge. To satisfy these requirements in the long term while minimizing the costs of the software engineering process, one has to follow a certain set of guidelines. This article describes such guidelines, based on the research environment called Content Analysis System (CoAnSys), created at the Center for Open Science (CeON). In addition to best practices for working in the Apache Hadoop environment, a tool for convenient generation of Apache Oozie workflows is presented.
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Dendek, P.J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P., Bolikowski, Ł. (2014). Chrum: The Tool for Convenient Generation of Apache Oozie Workflows. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds) Intelligent Tools for Building a Scientific Information Platform: From Research to Implementation. Studies in Computational Intelligence, vol 541. Springer, Cham. https://doi.org/10.1007/978-3-319-04714-0_12
DOI: https://doi.org/10.1007/978-3-319-04714-0_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04713-3
Online ISBN: 978-3-319-04714-0
eBook Packages: Engineering (R0)