Transparent Incremental Updates for Genomics Data Analysis Pipelines

Pedersen, Edvard; Willassen, Nils Peder; Bongo, Lars Ailo

doi:10.1007/978-3-642-54420-0_31

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8374)

Included in the following conference series: European Conference on Parallel Processing

Abstract

A large, up-to-date compendium of integrated genomic data is often required for biological data analysis. Such a compendium can be tens of terabytes in size and often needs frequent updates with new experimental data or metadata. Updating a compendium manually is cumbersome, requires a lot of unnecessary computation, and may introduce errors or inconsistencies. We propose a transparent, file-based approach for adding incremental update capabilities to unmodified genomics data analysis tools and pipeline workflow managers. This approach is implemented in the GeStore system. We evaluate GeStore using a real-world genomics compendium. Our results show that it is easy to add incremental updates to genomics data processing pipelines, and that incremental updates can reduce computation time enough to make it practical to maintain large-scale, up-to-date genomics compendia on small clusters.
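
The general idea of a file-based incremental update can be illustrated with a minimal sketch. It is not GeStore's implementation; the FASTA input format and the function names `parse_fasta`, `incremental_input`, and `to_fasta` are assumptions made for illustration. The sketch compares two versions of an input file and writes a smaller file holding only the new or changed records, which an unmodified tool could then process in place of the full input.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first whitespace-separated token of the header.
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def incremental_input(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
    """Keep only the records that were added or changed since the old version."""
    return {rid: seq for rid, seq in new.items() if old.get(rid) != seq}


def to_fasta(records: Dict[str, str]) -> str:
    """Serialize records back to FASTA so an unmodified tool can read them."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq in records.items())


if __name__ == "__main__":
    old_version = parse_fasta(">a\nACGT\n>b\nGGCC\n")
    new_version = parse_fasta(">a\nACGT\n>b\nGGCA\n>c\nTTTT\n")
    # Only records "b" (changed) and "c" (added) end up in the reduced input.
    print(to_fasta(incremental_input(old_version, new_version)), end="")
```

A real system would also need to handle deleted records and merge the partial output with earlier results; the sketch leaves both out.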

Author information

Authors and Affiliations

Department of Computer Science, University of Tromsø, Norway
Edvard Pedersen & Lars Ailo Bongo
Department of Chemistry, University of Tromsø, Norway
Nils Peder Willassen
Centre for Bioinformatics, University of Tromsø, Norway
Edvard Pedersen, Nils Peder Willassen & Lars Ailo Bongo

Editor information

Editors and Affiliations

Rechen- und Kommunikationszentrum, RWTH Aachen, Seffenter Weg 23, 52074, Aachen, Germany
Dieter an Mey
TU Vienna, 1040, Vienna, Austria
Michael Alexander
RWTH Aachen University, Seffenter Weg 23, 52074, Aachen, Germany
Paolo Bientinesi & Carsten Clauss
University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
Mario Cannataro
Inria Rennes - Bretagne Atlantique, 35042, Rennes, France
Alexandru Costan & Christine Morin
University of Innsbruck, 6020, Innsbruck, Austria
Gabor Kecskemeti
Department of Computer Science, University of Pisa, 56126, Pisa, Italy
Laura Ricci
Universitat Politècnica de València, 46022, València, Spain
Julio Sahuquillo
LLNL, USA
Martin Schulz
Dipartimento di Informatica, Università di Salerno, 84084, Salerno, Italy
Vittorio Scarano
Tennessee Tech University and Oak Ridge National Laboratory, 38505, Cookeville, TN, USA
Stephen L. Scott
Technische Universität München, 80333, Munich, Germany
Josef Weidendorfer

About this paper

Cite this paper

Pedersen, E., Willassen, N.P., Bongo, L.A. (2014). Transparent Incremental Updates for Genomics Data Analysis Pipelines. In: an Mey, D., et al. Euro-Par 2013: Parallel Processing Workshops. Euro-Par 2013. Lecture Notes in Computer Science, vol 8374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54420-0_31

DOI: https://doi.org/10.1007/978-3-642-54420-0_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54419-4
Online ISBN: 978-3-642-54420-0
eBook Packages: Computer Science, Computer Science (R0)
