Mario: Interactive Tuning of Biological Analysis Pipelines Using Iterative Processing

  • Martin Ernstsen
  • Erik Kjærner-Semb
  • Nils Peder Willassen
  • Lars Ailo Bongo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8805)

Abstract

Biological data analysis relies on complex pipelines for cleaning, integrating, and summarizing data before presenting the results to a user. Such a pipeline is usually implemented by combining many independent tools. During development, the pipeline must be tuned to find the tools and parameters that work well with a particular dataset. However, as the dataset size increases, the pipeline execution time also increases and parameter tuning becomes impractical. No current biological data analysis framework enables analysts to interactively tune the parameters of biological analysis pipelines for large-scale datasets. We present Mario, a system that quickly updates pipeline output data when pipeline parameters are changed. It combines reservoir sampling, fine-grained caching of derived datasets, and an iterative data-parallel processing model. We demonstrate the usability of our approach through a biological use case, and experimentally evaluate the latency, throughput, and resource usage of the Mario system. Mario is open-sourced at bdps.cs.uit.no/code/Mario.
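
The abstract names reservoir sampling as one of Mario's building blocks but does not say which variant is used. As a minimal sketch of the general technique, assuming the classic uniform variant (often called Algorithm R), the Python below keeps a fixed-size random sample of a stream of unknown length in a single pass. The function name `reservoir_sample` and the `read_N` record names are hypothetical and are not taken from the Mario code base.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from stream (Algorithm R).

    Illustrative only: this is the textbook uniform variant, not
    necessarily the sampling scheme implemented in Mario.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an old one
            # only if that slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: sample 5 records from a stream of 100,000.
records = (f"read_{n}" for n in range(100_000))
sample = reservoir_sample(records, k=5, seed=1)
```

Because the sample size is fixed regardless of input size, a pipeline stage can be re-run on the sample in roughly constant time while parameters are tuned, which is the property that makes sampling attractive for interactive use.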

Keywords

Iterative processing; interactive processing; biological data analysis; parameter tuning; provenance

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Martin Ernstsen (1)
  • Erik Kjærner-Semb (2)
  • Nils Peder Willassen (2)
  • Lars Ailo Bongo (1)
  1. Dept. of Computer Science and Center for Bioinformatics, University of Tromsø, Norway
  2. Dept. of Chemistry and Center for Bioinformatics, University of Tromsø, Norway
