Ibidas: Querying Flexible Data Structures to Explore Heterogeneous Bioinformatics Data

Hulsman, Marc; Bot, Jan J.; de Vries, Arjen P.; Reinders, Marcel J. T.

doi:10.1007/978-3-642-39437-9_2

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 7970)

Included in the following conference series:

International Conference on Data Integration in the Life Sciences

Abstract

Bioinformatics now requires the handling of large and diverse datasets. Analyzing these data often demands significant custom scripting, because code reuse is limited by differences in input/output formats across both data sources and algorithms. This recurring need to write data-handling code significantly hinders fast data exploration.

We argue that this problem cannot be solved by data integration and standardization alone. We propose that the integration-analysis chain misses a link: a query solution that can operate on diversely structured data throughout the whole bioinformatics workflow, rather than only on data available in the data sources. We describe how a simple concept (shared 'dimensions') allows such a query language to be constructed, enabling it to handle flat, nested, and multi-dimensional data. As a result, one can operate in a unified way on the outputs of algorithms and the contents of files and databases, directly structuring the data in a format suitable for further analysis. These ideas have been implemented in a prototype system called Ibidas. To retain flexibility, it is directly integrated into a scripting language. We show how this framework enables the reuse of common data operations in different problem settings and for different data interfaces, thereby speeding up data exploration.
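To make the idea of shared dimensions more concrete, the following is a minimal sketch in plain Python. It is not the Ibidas API, which the abstract does not show. The names `Field` and `filter_shared` and the sample gene data are hypothetical, chosen only for illustration. The sketch assumes that each field records the dimensions its values are laid out along, so that one operation can be applied consistently to flat and nested fields that share a dimension.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Field:
    """Hypothetical container: a named column plus the dimensions it spans."""
    name: str
    dims: Tuple[str, ...]   # one entry per nesting level, outermost first
    values: List[Any]       # nested lists, one level per dimension


def filter_shared(fields: List[Field], by: Field,
                  predicate: Callable[[Any], bool]) -> List[Field]:
    """Filter along the outermost dimension of `by`.

    Every field whose outermost dimension matches is filtered with the
    same mask; fields that do not share the dimension are left untouched.
    """
    dim = by.dims[0]
    keep = [predicate(v) for v in by.values]
    result = []
    for f in fields:
        if f.dims and f.dims[0] == dim:
            kept = [v for v, k in zip(f.values, keep) if k]
            result.append(Field(f.name, f.dims, kept))
        else:
            result.append(f)
    return result


# Illustrative data (made up): two flat fields and one nested field share
# the "genes" dimension; "organism" has no dimension at all.
names = Field("name", ("genes",), ["geneA", "geneB", "geneC"])
scores = Field("score", ("genes",), [0.9, 0.2, 0.7])
terms = Field("terms", ("genes", "terms"), [["t1", "t2"], [], ["t3"]])
organism = Field("organism", (), ["yeast"])

filtered = filter_shared([names, scores, terms, organism],
                         scores, lambda s: s > 0.5)
for f in filtered:
    print(f.name, f.values)
```

In this sketch, a single filter on `score` also trims `name` and the nested `terms` field, because all three declare the same outer dimension, while `organism` passes through unchanged. A real system would need more machinery (type checking, broadcasting, query planning), none of which is modeled here.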

Author information

Authors and Affiliations

Delft Bioinformatics Lab, Delft University of Technology, The Netherlands
Marc Hulsman, Jan J. Bot, Arjen P. de Vries & Marcel J. T. Reinders
Netherlands Bioinformatics Centre (NBIC), The Netherlands
Marcel J. T. Reinders
Centrum Wiskunde & Informatica (CWI), The Netherlands
Arjen P. de Vries

Editor information

Editors and Affiliations

Department of Computer Science and Applied Statistics, University of New Brunswick, E2L 4L5, Saint John, NB, Canada
Christopher J. O. Baker
Department of Computer Science, Concordia University, H3G 1M8, Montreal, QC, Canada
Greg Butler
Ontario Cancer Institute, University of Toronto, M5G 1L7, Toronto, ON, Canada
Igor Jurisica

About this paper

Cite this paper

Hulsman, M., Bot, J.J., de Vries, A.P., Reinders, M.J.T. (2013). Ibidas: Querying Flexible Data Structures to Explore Heterogeneous Bioinformatics Data. In: Baker, C.J.O., Butler, G., Jurisica, I. (eds) Data Integration in the Life Sciences. DILS 2013. Lecture Notes in Computer Science, vol 7970. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39437-9_2

DOI: https://doi.org/10.1007/978-3-642-39437-9_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39436-2
Online ISBN: 978-3-642-39437-9
eBook Packages: Computer Science, Computer Science (R0)
