Mario: Interactive Tuning of Biological Analysis Pipelines Using Iterative Processing

Ernstsen, Martin; Kjærner-Semb, Erik; Willassen, Nils Peder; Bongo, Lars Ailo

doi:10.1007/978-3-319-14325-5_23


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8805)

Included in the following conference series:

European Conference on Parallel Processing


Abstract

Biological data analysis relies on complex pipelines for cleaning, integrating, and summarizing data before presenting the results to a user. Such a pipeline usually combines many independent tools. During development, the pipeline must be tuned to find the tools and parameters that work well with a particular dataset. However, as the dataset size increases, the pipeline execution time also increases and parameter tuning becomes impractical. No current biological data analysis framework enables analysts to interactively tune the parameters of biological analysis pipelines for large-scale datasets. We present Mario, a system that quickly updates pipeline output data when pipeline parameters are changed. It combines reservoir sampling, fine-grained caching of derived datasets, and an iterative data-parallel processing model. We demonstrate the usability of our approach through a biological use case, and experimentally evaluate the latency, throughput, and resource usage of the Mario system. Mario is open-sourced at bdps.cs.uit.no/code/Mario.
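
As a rough illustration of two of the techniques named in the abstract, the sketch below combines reservoir sampling with caching of derived datasets keyed by stage parameters. It is not taken from the Mario source code. The names `reservoir_sample`, `StageCache`, `filter_min_length`, `count_prefix`, and `run_pipeline` are hypothetical, the sampler is the textbook uniform reservoir algorithm (Algorithm R), and the in-memory dictionary only stands in for whatever storage a real system would use.

```python
import random
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Uniformly sample up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


class StageCache:
    """Cache of stage outputs keyed by (stage name, parameters, input key)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple, object] = {}
        self.misses = 0  # number of times a stage actually executed

    def run(self, name: str, params: Tuple, input_key, fn: Callable, data):
        key = (name, params, input_key)
        if key not in self._store:
            self.misses += 1
            self._store[key] = fn(data, *params)
        # Return the key so a downstream stage can use it as its input key.
        return self._store[key], key


# Two hypothetical pipeline stages operating on sequence reads.
def filter_min_length(reads: List[str], min_len: int) -> List[str]:
    return [r for r in reads if len(r) >= min_len]


def count_prefix(reads: List[str], prefix: str) -> int:
    return sum(1 for r in reads if r.startswith(prefix))


def run_pipeline(cache: StageCache, sample: List[str], min_len: int, prefix: str) -> int:
    filtered, fkey = cache.run("filter", (min_len,), "sample", filter_min_length, sample)
    count, _ = cache.run("count", (prefix,), fkey, count_prefix, filtered)
    return count


if __name__ == "__main__":
    reads = ["A" * n for n in range(1, 101)]
    sample = reservoir_sample(reads, k=10, seed=1)
    cache = StageCache()
    run_pipeline(cache, sample, min_len=5, prefix="A")
    # Changing only the downstream parameter reuses the cached filter output.
    run_pipeline(cache, sample, min_len=5, prefix="C")
    print("stage executions:", cache.misses)
```

In this sketch, changing a downstream parameter re-executes only the stages whose cache key changed, and running on a fixed-size sample keeps each re-execution small regardless of the full dataset size.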




Author information

Authors and Affiliations

Dept. of Computer Science and Center for Bioinformatics, University of Tromsø, Norway
Martin Ernstsen & Lars Ailo Bongo
Dept. of Chemistry and Center for Bioinformatics, University of Tromsø, Norway
Erik Kjærner-Semb & Nils Peder Willassen


Editor information

Editors and Affiliations

CRACS/INESC-TEC and FCUP, University of Porto, Rua do Campo Alegre, 1021, 4169-007, Porto, Portugal
Luís Lopes
Vilnius University, 08663, Vilnius, Lithuania
Julius Žilinskas
Inria Rennes - Bretagne Atlantique, 35042, Rennes, France
Alexandru Costan
Inria, Campus Universitaire de Beaulieu, 35042, Rennes, France
Roberto G. Cascella
MTA SZTAKI, Budapest, Hungary
Gabor Kecskemeti
LaBRI, Inria, France
Emmanuel Jeannot
University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
Mario Cannataro
University of Pisa, Italy
Laura Ricci
Faculty of Computer Science, University of Vienna, Wien, Austria
Siegfried Benkner
Universitat Politècnica de València, Spain
Salvador Petit
ISISLab - Dipartimento di Informatica, Università di Salerno, Italy
Vittorio Scarano
High Performance Computing Center Stuttgart (HLRS), University of Stuttgart, 70550, Stuttgart, Germany
José Gracia
Vienna University of Technology, 1040, Vienna, Austria
Sascha Hunold
Tennessee Tech University and Oak Ridge National Laboratory, 38505, Cookeville, TN, USA
Stephen L. Scott
RWTH Aachen University, Aachen, Germany
Stefan Lankes
Department of Informatics and Mathematics, University of Passau, Germany
Christian Lengauer
Universidad Carlos III de Madrid, 28911, Leganés, Spain
Jesus Carretero
TU München, 85747, Garching bei München, Germany
Jens Breitbart
TU Vienna, 1040, Vienna, Austria
Michael Alexander


About this paper

Cite this paper

Ernstsen, M., Kjærner-Semb, E., Willassen, N.P., Bongo, L.A. (2014). Mario: Interactive Tuning of Biological Analysis Pipelines Using Iterative Processing. In: Lopes, L., et al. Euro-Par 2014: Parallel Processing Workshops. Euro-Par 2014. Lecture Notes in Computer Science, vol 8805. Springer, Cham. https://doi.org/10.1007/978-3-319-14325-5_23


DOI: https://doi.org/10.1007/978-3-319-14325-5_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14324-8
Online ISBN: 978-3-319-14325-5
eBook Packages: Computer Science, Computer Science (R0)
