Integration of Data on Substance Properties Using Big Data Technologies and Domain-Specific Ontologies
A new technology for storing and categorizing heterogeneous data on the properties of matter is proposed. The availability of large volumes of heterogeneous data from a variety of sources justifies the use of Apache Spark, one of the popular toolkits for Big Data processing. Its role in the proposed technology is to manage an extensive data warehouse of text files in JSON format. The first stage of the technology converts primary resources (relational databases, digital archives, Web portals, etc.) to a standardized JSON document form. The advantages of the JSON format are the ability to store data and metadata within a single text document, readability for both humans and computers, and support for the hierarchical structures needed to represent complex and irregular data. Such data structures arise from the possible expansion of the subject area: new types of materials, an expanded nomenclature of properties, and so on. For the semantic integration of resources converted to the JSON format, a repository of subject-oriented ontologies is used. The search for data in the JSON document store is implemented through a combination of SPARQL and SQL queries. The former, addressed to the ontology repository, let the user view and search for matching and related concepts. The latter, addressed to the JSON document sets, retrieve the required data from the document bodies using the capabilities of Apache Spark SQL. The efficiency of the developed technology is tested on problems of thermophysical data integration, which are characterized by a complex logical structure.
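The two-step search described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: a plain dictionary stands in for the ontology repository (queried with SPARQL in the paper), and a Python loop stands in for the Apache Spark SQL query over the JSON store. All field names (`metadata`, `data`, `points`), substance labels, and numbers are hypothetical placeholders, not measured values.

```python
import json

# Hypothetical JSON documents: each keeps metadata and data in one text
# document. Field names and numbers are placeholders, not measurements.
RAW_DOCS = [
    """{"metadata": {"source": "db-A", "substance": "S1"},
        "data": {"property": "thermal conductivity", "unit": "W/(m*K)",
                 "points": [{"T": 300, "value": 1.0},
                            {"T": 400, "value": 2.0}]}}""",
    """{"metadata": {"source": "archive-B", "substance": "S2"},
        "data": {"property": "viscosity", "unit": "Pa*s",
                 "points": [{"T": 300, "P": 0.1, "value": 3.0}]}}""",
    """{"metadata": {"source": "portal-C", "substance": "S1"},
        "data": {"property": "density", "unit": "kg/m^3", "points": []}}""",
]
DOCS = [json.loads(text) for text in RAW_DOCS]

# Stand-in for the ontology repository: maps a broader concept to the
# narrower property names it covers (assumed mapping, for illustration).
ONTOLOGY = {
    "transport property": ["thermal conductivity", "viscosity"],
}


def related_terms(concept):
    """Step 1: resolve a concept to itself plus its related terms."""
    return [concept] + ONTOLOGY.get(concept, [])


def find_points(docs, concept):
    """Step 2: pull matching data points out of the document bodies."""
    terms = set(related_terms(concept))
    rows = []
    for doc in docs:
        data = doc["data"]
        if data["property"] in terms:
            # Points may carry extra keys (e.g. pressure "P"); only the
            # keys shared by all documents are read here.
            for point in data.get("points", []):
                rows.append((doc["metadata"]["substance"],
                             data["property"],
                             point["T"],
                             point["value"]))
    return rows


rows = find_points(DOCS, "transport property")
for row in rows:
    print(row)
```

Searching for the broader concept returns points from both the thermal conductivity and the viscosity documents, which is the kind of semantic linking the ontology layer is meant to provide; in a Spark-based store the second step would be an SQL query over the loaded JSON rather than a loop.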
Keywords: Thermophysical properties, Semi-structured data, JSON format, Ontology
The work is supported by the Russian Science Foundation, grant 14-50-00124.