Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines

Bongo, Lars Ailo; Pedersen, Edvard; Ernstsen, Martin

doi:10.1007/978-3-319-24462-4_22

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8623)

Included in the following conference series:

International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics

Abstract

Biological data analysis is typically implemented using a deep pipeline that combines a wide array of tools and databases. These pipelines must scale to very large datasets, and consequently require parallel and distributed computing. It is therefore important to choose a hardware platform and underlying data management and processing systems well suited for processing large datasets. There are many infrastructure systems for such data-intensive computing. However, in our experience, most biological data analysis pipelines do not leverage these systems.

We give an overview of data-intensive computing infrastructure systems, and describe how we have leveraged these for: (i) scalable fault-tolerant computing for large-scale biological data; (ii) incremental updates to reduce the resource usage required to update a large-scale compendium; and (iii) interactive data analysis and exploration. We provide lessons learned and describe problems we have encountered during development and deployment. We also provide a literature survey on the use of data-intensive computing systems for biological data processing. Our results show how unmodified biological data analysis tools can benefit from infrastructure systems for data-intensive computing.

Author information

Authors and Affiliations

Department of Computer Science and Center for Bioinformatics, University of Tromsø, Tromsø, Norway
Lars Ailo Bongo & Edvard Pedersen
Now at Kongsberg Satellite Services AS, Tromsø, Norway
Martin Ernstsen

Corresponding author

Correspondence to Lars Ailo Bongo.

Editor information

Editors and Affiliations

CUSSB, University "Vita-Salute" San Raffaele, Milano, Italy
Clelia Di Serio
The Computer Laboratory, University of Cambridge, Cambridge, United Kingdom
Pietro Liò
CUSSB, Università Vita-Salute San Raffaele, Milano, Italy
Alessandro Nonis
Dipartimento di Informatica, Università degli Studi di Salerno, Fisciano, Salerno, Italy
Roberto Tagliaferri

About this paper

Cite this paper

Bongo, L.A., Pedersen, E., Ernstsen, M. (2015). Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines. In: Di Serio, C., Liò, P., Nonis, A., Tagliaferri, R. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2014. Lecture Notes in Computer Science, vol 8623. Springer, Cham. https://doi.org/10.1007/978-3-319-24462-4_22

DOI: https://doi.org/10.1007/978-3-319-24462-4_22
Published: 18 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24461-7
Online ISBN: 978-3-319-24462-4
eBook Packages: Computer Science, Computer Science (R0)
