Ibidas: Querying Flexible Data Structures to Explore Heterogeneous Bioinformatics Data

  • Marc Hulsman
  • Jan J. Bot
  • Arjen P. de Vries
  • Marcel J. T. Reinders
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7970)

Abstract

Nowadays, bioinformatics requires the handling of large and diverse datasets. Analyzing these data often demands significant custom scripting, because differences in input/output formats across data sources and algorithms limit the reuse of code. This recurring need to write data-handling code significantly hinders fast data exploration.

We argue that this problem cannot be solved by data integration and standardization alone. We propose that the integration-analysis chain is missing a link: a query solution that can operate on diversely structured data throughout the whole bioinformatics workflow, rather than only on data available in the data sources. We describe how a simple concept (shared 'dimensions') allows such a query language to be constructed, enabling it to handle flat, nested, and multi-dimensional data. As a result, one can operate in a unified way on the outputs of algorithms and the contents of files and databases, directly structuring the data in a format suitable for further analysis. These ideas have been implemented in a prototype system called Ibidas. To retain flexibility, it is directly integrated into a scripting language. We show how this framework enables the reuse of common data operations in different problem settings and for different data interfaces, thereby speeding up data exploration.
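As an illustration only, the following plain-Python sketch shows one possible reading of the shared 'dimensions' idea: each field records the names of the dimensions it spans, so a single filter along one dimension can be applied consistently to flat and multi-dimensional fields alike. This is not the Ibidas API; the `Field` class, the `filter_dim` function, and the toy gene and sample values are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical container: values plus the names of the dimensions they span."""
    dims: tuple    # e.g. ("genes",) or ("genes", "samples")
    values: list   # nested lists, one nesting level per dimension


def filter_dim(fields, dim, keep):
    """Apply a boolean mask along `dim` to every field that carries that dimension.

    Fields that do not span `dim` are returned unchanged.
    """
    def select(values, dims):
        if not dims:
            return values  # reached a scalar element
        if dims[0] == dim:
            return [select(v, dims[1:]) for v, k in zip(values, keep) if k]
        return [select(v, dims[1:]) for v in values]

    return {name: Field(f.dims, select(f.values, f.dims))
            for name, f in fields.items()}


# Toy data (made up): a flat gene list, a flat sample list,
# and a two-dimensional expression matrix sharing both dimensions.
fields = {
    "gene": Field(("genes",), ["A", "B", "C"]),
    "sample": Field(("samples",), ["s1", "s2"]),
    "expr": Field(("genes", "samples"), [[1.0, 2.0], [0.1, 0.2], [5.0, 6.0]]),
}

# One mask over the shared "genes" dimension filters both "gene" and "expr".
high = [max(row) > 1.0 for row in fields["expr"].values]
by_gene = filter_dim(fields, "genes", high)

# The same operation works along the "samples" dimension.
by_sample = filter_dim(fields, "samples", [False, True])
```

In this sketch, the dimension names do the work that custom alignment code would otherwise do: the caller never states which fields need filtering, only which dimension the mask belongs to.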

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Marc Hulsman (1)
  • Jan J. Bot (1)
  • Arjen P. de Vries (1, 3)
  • Marcel J. T. Reinders (1, 2)

  1. Delft Bioinformatics Lab, Delft University of Technology, The Netherlands
  2. Netherlands Bioinformatics Centre (NBIC), The Netherlands
  3. Centrum Wiskunde & Informatica (CWI), The Netherlands
