Skip to main content

Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines

  • Conference paper
  • First Online:
Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2014)

Abstract

Biological data analysis is typically implemented using a deep pipeline that combines a wide array of tools and databases. These pipelines must scale to very large datasets, and consequently require parallel and distributed computing. It is therefore important to choose a hardware platform and underlying data management and processing systems well suited for processing large datasets. There are many infrastructure systems for such data-intensive computing. However, in our experience, most biological data analysis pipelines do not leverage these systems.

We give an overview of data-intensive computing infrastructure systems, and describe how we have leveraged these for: (i) scalable fault-tolerant computing for large-scale biological data; (ii) incremental updates to reduce the resource usage required to update large-scale compendium; and (iii) interactive data analysis and exploration. We provide lessons learned and describe problems we have encountered during development and deployment. We also provide a literature survey on the use of data-intensive computing systems for biological data processing. Our results show how unmodified biological data analysis tools can benefit from infrastructure systems for data-intensive computing.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Kahn, S.D.: On the Future of Genomic Data. Science (80-) 331, 728–729 (2011)

    Google Scholar 

  2. Diao, Y., Roy, A., Bloom, T.: Building Highly-Optimized, Low-Latency Pipelines for Genomic Data Analysis. In: 7th Biennial Conference on Innovative Data Systems Research (CIDR 2015), Asilomar, CA, USA (2015)

    Google Scholar 

  3. Clarke, L., Zheng-Bradley, X., Smith, R., Kulesha, E., Xiao, C., Toneva, I., Vaughan, B., Preuss, D., Leinonen, R., Shumway, M., Sherry, S., Flicek, P.: The 1000 Genomes Project: data management and community access. Nat. Methods 9, 459–462 (2012)

    Article  Google Scholar 

  4. Fernández-Suárez, X.M., Rigden, D.J., Galperin, M.Y.: The 2014 Nucleic Acids Research Database Issue and an updated NAR online Molecular Biology Database Collection. Nucleic Acids Res. 42 (2014)

    Google Scholar 

  5. Benson, G.: Editorial: Nucleic Acids Research annual Web Server Issue in 2014. Nucleic Acids Res. 42, W1–W2 (2014)

    Google Scholar 

  6. Goecks, J., Nekrutenko, A., Taylor, J.: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, R86 (2010)

    Google Scholar 

  7. Oinn, T., Addis, M., Ferris, J., Marvin, D., Senger, M., Greenwood, M., Carver, T., Glover, K., Pocock, M.R., Wipat, A., Li, P.: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20, 3045–3054 (2004)

    Article  Google Scholar 

  8. Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM 53, 72 (2010)

    Article  Google Scholar 

  9. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proc. of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association (2012)

    Google Scholar 

  10. Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A.J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J.Y.H., Zhang, J.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5 (2004)

    Google Scholar 

  11. Blankenberg, D., Von Kuster, G., Bouvier, E., Baker, D., Afgan, E., Stoler, N., Taylor, J., Nekrutenko, A.: Dissemination of scientific software with Galaxy ToolShed. Genome Biol. 15, 403 (2014)

    Article  Google Scholar 

  12. Open Grid Scheduler, http://gridscheduler.sourceforge.net/

  13. Hadoop homepage, http://hadoop.apache.org/

  14. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop Distributed File System. In: 26th Symposium on Mass Storage Systems and Technologies. IEEE (2010)

    Google Scholar 

  15. Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google file system. ACM SIGOPS Operating Systems Review, 29 (2003)

    Google Scholar 

  16. MountableHDFS, http://wiki.apache.org/hadoop/MountableHDFS

  17. Taylor, R.C.: An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics 11 (2010)

    Google Scholar 

  18. Apache HBase, http://hbase.apache.org/

  19. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: BigTable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26, 1–26 (2008)

    Article  Google Scholar 

  20. Apache Spark, https://spark.apache.org/

  21. Gates, A.F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S.M., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of Map-Reduce: the Pig experience. In: Proc. of the VLDB Endowment, pp. 1414–1425 (2009)

    Google Scholar 

  22. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. In: Proc. of VLDB Endowment, pp. 1626–1629 (2009)

    Google Scholar 

  23. Cascading, http://www.cascading.org/

  24. Lakshman, A., Malik, P.: Cassandra: a decentralized structured storage system. ACM SIGOPS Oper. Syst. Rev. 44, 35 (2010)

    Article  Google Scholar 

  25. Impala, http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html

  26. Apache Drill, http://incubator.apache.org/drill/

  27. Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: interactive analysis of web-scale datasets. In: Proc. VLDB Endow., pp. 330–339 (2010)

    Google Scholar 

  28. Storm, https://storm.incubator.apache.org/

  29. Mahout homepage, https://mahout.apache.org/

  30. Pireddu, L., Leo, S., Soranzo, N., Zanetti, G.: A Hadoop-Galaxy adapter for user-friendly and scalable data-intensive bioinformatics in Galaxy. In: Proc. of 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 184–191 (2014)

    Google Scholar 

  31. Wong, A.K., Park, C.Y., Greene, C.S., Bongo, L.A., Guan, Y., Troyanskaya, O.G.: IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res. 40, W484–W490 (2012)

    Google Scholar 

  32. Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Muertter, R.N., Holko, M., Ayanbule, O., Yefanov, A., Soboleva, A.: NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res. 39, D1005–D1010 (2010)

    Google Scholar 

  33. Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: Proc.of the 8th USENIX Conference on Networked Systems Design and Implementation. USENIX Association (2011)

    Google Scholar 

  34. Pedersen, E., Willassen, N.P., Bongo, L.A.: Transparent incremental updates for Genomics Data Analysis Pipelines. In: an Mey, D., Alexander, M., Bientinesi, P., Cannataro, M., Clauss, C., Costan, A., Kecskemeti, G., Morin, C., Ricci, L., Sahuquillo, J., Schulz, M., Scarano, V., Scott, S.L., Weidendorfer, J. (eds.) Euro-Par 2013. LNCS, vol. 8374, pp. 311–320. Springer, Heidelberg (2014)

    Chapter  Google Scholar 

  35. Pedersen, E., Raknes, I.A., Ernstsen, M., Bongo, L.A.: Integrating Data-Intensive Computing Systems with Biological Data Processing Frameworks. In: Euromicro Conference on Parallel, Distributed and Network-Based Processing (2015)

    Google Scholar 

  36. Magrane, M., Consortium, U.: UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford). 2011, bar009 (2011)

    Google Scholar 

  37. Ernstsen, M., Kjærner-Semb, E., Willassen, N.P., Bongo, L.A.: Mario: Interactive tuning of biological analysis pipelines using iterative processing. In: Lopes, L., et al. (eds.) Euro-Par 2014, Part I. LNCS, vol. 8805, pp. 263–274. Springer, Heidelberg (2014)

    Google Scholar 

  38. Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams. In: Proc. of Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 423–438. ACM Press (2013)

    Google Scholar 

  39. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)

    Article  Google Scholar 

  40. Killcoyne, S., del Sol, A.: FIGG: simulating populations of whole genome sequences for heterogeneous data analyses. BMC Bioinformatics 15, 149 (2014)

    Article  Google Scholar 

  41. Azure: Microsoft’s Cloud Platform, http://azure.microsoft.com/en-us/

  42. O’Connor, B.D., Merriman, B., Nelson, S.F.: SeqWare Query Engine: storing and searching sequence data in the cloud. BMC Bioinformatics 11(Suppl. 1), S2 (2010)

    Google Scholar 

  43. Roberts, A., Feng, H., Pachter, L.: Fragment assignment in the cloud with eXpress-D. BMC Bioinformatics 14, 358 (2013)

    Article  Google Scholar 

  44. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proc. of Operating Systems Design & Implementation. USENIX (2004)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Lars Ailo Bongo .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Bongo, L.A., Pedersen, E., Ernstsen, M. (2015). Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines. In: DI Serio, C., Liò, P., Nonis, A., Tagliaferri, R. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2014. Lecture Notes in Computer Science(), vol 8623. Springer, Cham. https://doi.org/10.1007/978-3-319-24462-4_22

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-24462-4_22

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24461-7

  • Online ISBN: 978-3-319-24462-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics