Evaluating Genomic Big Data Operations on SciDB and Spark

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10360)


We are developing a new, holistic data management system for genomics that provides high-level abstractions for querying large genomic datasets. We designed the system to rely on existing data management engines for low-level data access. This design can be adapted to two different kinds of data engines: the family of scientific databases (among them, SciDB) and the broader family of generic platforms (among them, Spark). The trade-offs are not obvious: scientific databases are expected to outperform generic platforms on operations that exploit features embedded in their specialized design, whereas generic platforms are expected to outperform scientific databases on general-purpose operations.

In this paper, we compare our SciDB and Spark implementations at work on genomic abstractions. As a benchmark, we use four typical genomic operations that stem from the concrete requirements of our project and are encoded in both SciDB and Spark. We discuss their common aspects and differences, focusing on how genomic regions and operations can be expressed using SciDB arrays. We then compare the performance and scalability of the two implementations over datasets consisting of billions of genomic regions.



The authors would like to thank the SciDB support team for help during Simone Cattani’s thesis [7] and for comments at his seminar, given at SciDB on July 19, 2016. This work is supported by the ERC Advanced Grant GeCo (Data-Driven Genomic Computing).


  1. Anonymous paper: Accelerating bioinformatics research with new software for big data to knowledge (BD2K). Paradigm4 (2015)
  2. Anonymous paper: SciDB MAC Storage Explained. Paradigm4 (2015)
  3.
  4.
  5. Bertoni, M., Ceri, S., Kaitoua, A., Pinoli, P.: Evaluating cloud frameworks on genomic applications. In: IEEE-Big Data Conference, pp. 193–202 (2015)
  6. Brown, P.G.: Overview of SciDB: large scale array storage, processing and analysis. In: Proceedings of ACM-SIGMOD, pp. 963–968 (2010)
  7. Cattani, S.: Genomic Computing with SciDB, a Data Management System for Scientific Computations. Master Thesis, Politecnico di Milano, July 2016
  8. Chawda, B., et al.: Processing interval joins on map-reduce. In: Proceedings of EDBT, pp. 463–474 (2014)
  9. Edelkamp, S., Sulewski, D., Yucel, C.: Perfect hashing for state space exploration on the GPU. In: Proceedings of ICAPS, pp. 57–64 (2010)
  10. ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414), 57–74 (2012)
  11. Kaitoua, A., Ceri, S., Bertoni, M., Pinoli, P.: Framework for supporting genomic operations. IEEE-TC (2016). doi: 10.1109/TC.2016.2603980
  12. Masseroli, M., et al.: GenoMetric Query Language: A novel approach to large-scale genomic data management. Bioinformatics (2015). doi: 10.1093/bioinformatics/btv048
  13. Masseroli, M., Kaitoua, A., Pinoli, P., Ceri, S.: Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying. Methods (2016). doi: 10.1016/j.ymeth.2016.09.002
  14. Spangenberg, N., Roth, M., Franczyk, B.: Evaluating new approaches of big data analytics frameworks. In: Abramowicz, W. (ed.) BIS 2015. LNBIP, vol. 208, pp. 28–37. Springer, Cham (2015). doi: 10.1007/978-3-319-19027-3_3
  15. Stonebraker, M., Brown, P., Poliakov, A., Raman, S.: The architecture of SciDB. In: Bayard Cushing, J., French, J., Bowers, S. (eds.) SSDBM 2011. LNCS, vol. 6809, pp. 1–16. Springer, Heidelberg (2011). doi: 10.1007/978-3-642-22351-8_1
  16. Stonebraker, M., et al.: SciDB: a database management system for applications with complex analytics. Comput. Sci. Eng. 15(3), 54–62 (2013)
  17. Xin, R., et al.: Shark: SQL and rich analytics at scale. In: Proceedings of ACM-SIGMOD, June 2013
  18. Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of NSDI, pp. 15–28 (2012)
  19. Zaharia, M., et al.: Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of SOSP, November 2013

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Dip. Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy
