Abstract
Biological data analysis relies on complex pipelines that clean, integrate, and summarize data before presenting the results to a user. Such a pipeline typically combines many independent tools. During development, the pipeline must be tuned to find the tools and parameters that work well with a particular dataset. However, as the dataset size increases, the pipeline execution time also increases, and parameter tuning becomes impractical. No current biological data analysis framework enables analysts to interactively tune the parameters of biological analysis pipelines for large-scale datasets. We present Mario, a system that quickly updates pipeline output data when pipeline parameters are changed. It combines reservoir sampling, fine-grained caching of derived datasets, and an iterative data-parallel processing model. We demonstrate the usability of our approach through a biological use case, and experimentally evaluate the latency, throughput, and resource usage of the Mario system. Mario is open-sourced at bdps.cs.uit.no/code/Mario.
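To illustrate the reservoir sampling mentioned above, the following is a minimal sketch of the classic uniform variant (often called Algorithm R), which keeps a fixed-size random sample of a stream in a single pass. The abstract does not say which variant Mario uses, so the function name `reservoir_sample`, its parameters, and the choice of the uniform variant are assumptions made for illustration only; this is not Mario's implementation.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration; uses one pass and O(k) memory.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an existing
            # one only if the slot falls inside the reservoir, which
            # happens with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 record indices from a stream of 1,000,000 records.
    print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

A sample of this kind lets a pipeline stage run on a small, representative subset of the input, which is one way a system could return approximate output quickly while parameters are being tuned.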
Copyright information
© 2014 Springer International Publishing Switzerland
Cite this paper
Ernstsen, M., Kjærner-Semb, E., Willassen, N.P., Bongo, L.A. (2014). Mario: Interactive Tuning of Biological Analysis Pipelines Using Iterative Processing. In: Lopes, L., et al. Euro-Par 2014: Parallel Processing Workshops. Euro-Par 2014. Lecture Notes in Computer Science, vol 8805. Springer, Cham. https://doi.org/10.1007/978-3-319-14325-5_23
DOI: https://doi.org/10.1007/978-3-319-14325-5_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14324-8
Online ISBN: 978-3-319-14325-5
eBook Packages: Computer Science (R0)