Abstract
Nowadays, bioinformatics requires the handling of large and diverse datasets. Analyzing these data often demands significant custom scripting, because code reuse is limited by differences in input/output formats across both data sources and algorithms. This recurring need to write data-handling code significantly hinders fast data exploration.
We argue that this problem cannot be solved by data integration and standardization alone. We propose that the integration-analysis chain misses a link: a query solution that can operate on diversely structured data throughout the whole bioinformatics workflow, rather than only on data available in the data sources. We describe how a simple concept (shared 'dimensions') allows such a query language to be constructed, enabling it to handle flat, nested and multi-dimensional data. As a result, one can operate in a unified way on the outputs of algorithms and the contents of files and databases, directly structuring the data in a format suitable for further analysis. These ideas have been implemented in a prototype system called Ibidas. To retain flexibility, it is directly integrated into a scripting language. We show how this framework enables the reuse of common data operations in different problem settings and for different data interfaces, thereby speeding up data exploration.
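To make the idea of shared dimensions concrete, the following is a minimal Python sketch. It is not the Ibidas API: the `Slice` class and the `map_elements` and `combine` helpers, as well as the gene and term values, are hypothetical names and data introduced only for illustration. The sketch assumes that each piece of data carries a tuple of named dimensions, and that two pieces of data can be combined when the dimensions of one are a prefix of the dimensions of the other.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Slice:
    """Hypothetical container: nested list data annotated with dimension names."""
    name: str
    dims: Tuple[str, ...]  # one name per nesting level of `data`
    data: Any


def map_elements(s: Slice, fn: Callable[[Any], Any]) -> Slice:
    """Apply fn to every innermost element, whatever the nesting depth."""
    def recurse(value: Any, depth: int) -> Any:
        if depth == 0:
            return fn(value)
        return [recurse(v, depth - 1) for v in value]

    return Slice(s.name, s.dims, recurse(s.data, len(s.dims)))


def combine(a: Slice, b: Slice, fn: Callable[[Any, Any], Any]) -> Slice:
    """Combine a and b elementwise along their shared dimensions.

    Assumption of this sketch: b.dims must be a prefix of a.dims.
    Elements of b are repeated along the remaining dimensions of a.
    """
    if a.dims[:len(b.dims)] != b.dims:
        raise ValueError("b.dims must be a prefix of a.dims")

    def recurse(x: Any, y: Any, depth: int) -> Any:
        if depth == len(a.dims):
            return fn(x, y)
        if depth < len(b.dims):
            # Shared dimension: walk both structures in step.
            return [recurse(xi, yi, depth + 1) for xi, yi in zip(x, y)]
        # Dimension only present in a: reuse the same y for each element.
        return [recurse(xi, y, depth + 1) for xi in x]

    return Slice(a.name, a.dims, recurse(a.data, b.data, 0))


# Flat data: one value per gene (made-up example values).
genes = Slice("gene", ("genes",), ["BRCA1", "TP53"])
# Nested data: a variable-length list of terms per gene.
terms = Slice("term", ("genes", "terms"),
              [["dna repair", "cell cycle"], ["apoptosis"]])

# The same operation works on flat and nested data alike.
upper_genes = map_elements(genes, str.lower)
upper_terms = map_elements(terms, str.upper)

# The shared "genes" dimension lets flat and nested data be combined.
labelled = combine(terms, genes, lambda t, g: f"{g}:{t}")
```

In this sketch the caller never writes a loop over the nesting structure: the dimension names determine how elements are matched, which is the kind of reuse across differently structured data that the abstract describes.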
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hulsman, M., Bot, J.J., de Vries, A.P., Reinders, M.J.T. (2013). Ibidas: Querying Flexible Data Structures to Explore Heterogeneous Bioinformatics Data. In: Baker, C.J.O., Butler, G., Jurisica, I. (eds) Data Integration in the Life Sciences. DILS 2013. Lecture Notes in Computer Science, vol 7970. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39437-9_2
DOI: https://doi.org/10.1007/978-3-642-39437-9_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39436-2
Online ISBN: 978-3-642-39437-9
eBook Packages: Computer Science, Computer Science (R0)