Abstract
Biological data analysis often requires a large, up-to-date compendium of integrated genomic data. Such a compendium can be tens of terabytes in size and must frequently be updated with new experimental data or metadata. Updating a compendium manually is cumbersome, requires a lot of unnecessary computation, and may introduce errors or inconsistencies. We propose a transparent, file-based approach for adding incremental update capabilities to unmodified genomics data analysis tools and pipeline workflow managers. The approach is implemented in the GeStore system. We evaluate GeStore using a real-world genomics compendium. Our results show that it is easy to add incremental updates to genomics data processing pipelines, and that incremental updates can reduce computation time enough to make it practical to maintain large-scale, up-to-date genomics compendia on small clusters.
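The general idea of a file-based incremental update can be illustrated with a minimal sketch. It is not GeStore's implementation: the tab-separated record format, the function names (`parse_records`, `incremental_input`, `merge_results`, `analyze`), and the assumption that the tool processes each record independently are all hypothetical and chosen only for illustration. The sketch builds a reduced input holding only the records that changed between two versions of a file, runs an unmodified tool on it, and merges the output with the previous results.

```python
from typing import Dict, Set, Tuple


def parse_records(text: str) -> Dict[str, str]:
    """Parse 'key<TAB>value' lines into a dict (a stand-in for a real genomics format)."""
    records: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        records[key] = value
    return records


def incremental_input(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[Dict[str, str], Set[str]]:
    """Return (records added or changed since `old`, keys deleted since `old`)."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    deleted = set(old) - set(new)
    return changed, deleted


def merge_results(
    old_results: Dict[str, int], delta_results: Dict[str, int], deleted: Set[str]
) -> Dict[str, int]:
    """Drop results for deleted records, then overlay results for changed records."""
    merged = {k: v for k, v in old_results.items() if k not in deleted}
    merged.update(delta_results)
    return merged


def analyze(records: Dict[str, str]) -> Dict[str, int]:
    """Hypothetical unmodified tool; assumed to treat each record independently."""
    return {key: len(value) for key, value in records.items()}


old = parse_records("a\tACGT\nb\tGG\nc\tTTT\n")
new = parse_records("a\tACGT\nb\tGGC\nd\tA\n")

changed, deleted = incremental_input(old, new)
incremental = merge_results(analyze(old), analyze(changed), deleted)

# Under the independence assumption, the incremental result equals a full rerun.
print(incremental == analyze(new))
```

The independence assumption matters: a tool whose output for one record depends on other records would need a different way of generating the reduced input and merging the results.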
Keywords
- Nucleic Acid Research
- Hadoop Distributed File System
- Incremental Computation
- Incremental Update
- Genomic Data Analysis
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pedersen, E., Willassen, N.P., Bongo, L.A. (2014). Transparent Incremental Updates for Genomics Data Analysis Pipelines. In: an Mey, D., et al. Euro-Par 2013: Parallel Processing Workshops. Euro-Par 2013. Lecture Notes in Computer Science, vol 8374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54420-0_31
DOI: https://doi.org/10.1007/978-3-642-54420-0_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54419-4
Online ISBN: 978-3-642-54420-0
eBook Packages: Computer Science (R0)