Abstract
This paper first presents DANKE, a data and knowledge management platform that allows users to submit keyword queries to a centralized database. DANKE uses a knowledge graph to provide a semantic view of the centralized database in a vocabulary familiar to the users. The paper then describes DANKE-U, a specialized module that enables DANKE to handle unstructured data, including scientific and engineering documents. Lastly, the paper presents a real use case from the oil and gas industry, involving technical/scientific documents.
Keywords
- Database Keyword Search
- Knowledge Graphs
- Data Integration
- Engineering Data
- Scientific Data
- Unstructured Data
1 Introduction
This paper first presents DANKE, a data and knowledge management platform that allows users, without technical training, to submit keyword queries to a centralized database. DANKE’s primary motivation is to democratize access to data and documents, originally dispersed across different data sources, without requiring users to write scripts or depend on the development of applications that provide forms for querying this data.
DANKE helps construct a centralized database through a data integration process with the following major steps: (i) data extraction from the original data sources; (ii) transformation and enrichment of these data; (iii) loading the data into the centralized database; (iv) indexing the data. DANKE uses a knowledge graph to provide a semantic view of the centralized database in a vocabulary familiar to the users. The knowledge graph is the basis of the search engine, which matches the keywords users submit with the concepts of the graph, and uses the matches to automatically compile an SQL (or SPARQL) query on the centralized database.
The paper then describes DANKE-U, which enables DANKE to handle unstructured data, including scientific and engineering documents. DANKE-U is a collection of components that process unstructured data, extracting metadata, thumbnails, texts, images, named entities, and tables. With the help of this extension, DANKE’s users can seamlessly search unstructured documents and relate them to structured data.
Lastly, the paper presents a real use case from the oil and gas industry, where DANKE is applied to technical/scientific documents and structured data about subsea production systems, which play a critical role in offshore oil and gas extraction. In this context, technical/scientific documents are challenging to interpret: they may contain data organized in non-conventional tables that merge cells (between rows and columns), or they may take the form of engineering drawings, among other formats. Thus, off-the-shelf tools are of limited help in analyzing such documents.
The paper is organized as follows. Section 2 contains a very brief discussion on related work. Section 3 introduces DANKE. Section 4 outlines DANKE-U. Section 5 covers a use case. Finally, Section 6 contains the conclusions.
2 Related Work
This section briefly covers related work in areas directly connected to data integration and database keyword search.
Data Integration. Classic data integration is usually divided into the major sub-problems of data retrieval, data fusion, schema alignment, and entity linkage [3]. The progress the data integration community has made in addressing these challenges in the context of big data integration was explored in [5]. The ties between machine learning and data integration were discussed in [4, 20].
In the context of engineering data, Nguyen et al. [17] described the development of a framework for data integration to optimize the remote operations of offshore wind farms. Espíndola et al. [6] presented an approach that integrates data from mixed/augmented reality tools and embedded intelligent maintenance systems to support operators/technicians during maintenance tasks, providing easier access to, and better comprehension of, information from different systems. Urbina Coronado et al. [22] described how to integrate data from machine tools with production data collected by a Manufacturing Execution System (MES) to monitor process output, consumable usage, and operator productivity.
Database Keyword Search Engines. Early relational keyword-based query processing tools [1, 18, 19] explored the foreign/primary keys declared in the relational schema to compile a keyword-based query into an SQL query with a minimal set of join clauses based on the notion of candidate networks.
State-of-the-art content-based image retrieval strategies [9, 13] assume that high-dimensional vectors represent images, created using Deep Learning techniques. Alternatively, the tool might transform both the text and the images (or, in fact, any other media as well) into a single high-dimensional vector, as in cross-modal retrieval techniques [2, 25].
Tautkute et al. [21] proposed a multimodal search engine in the fashion domain that retrieves items aesthetically similar to a query composed of image and textual inputs. Vo et al. [23] addressed a similar problem, where a query is specified in the form of an image and a text description of desired modifications to the input image. Yu et al. [24] introduced a multimodal model for understanding commerce topics, where the content can be an image, a text, or a combination of both.
DANKE [12] is an evolution of earlier tools [8, 10, 11]. DANKE-U enables DANKE to handle unstructured data, including scientific and engineering documents, as mentioned in the introduction.
3 A Brief Overview of DANKE
A keyword query in DANKE is just a list of terms, or keywords, that the user wants to search the database for, and may include reserved terms, such as “<”, “>”, “between”. In relational jargon, an answer to a keyword query is formatted as a table whose columns (or column names) contain the keyword matches. The answer may be the result of joining several database tables, that is, an answer to a keyword query does not need to be constructed out of a single table.
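As a minimal sketch of how such a keyword query might be separated into plain keywords and filters built from reserved terms, consider the Python fragment below. The function name `parse_keyword_query`, the tuple layout of the filters, and the assumption that "between" is followed by exactly two values are illustrative choices, not DANKE's actual syntax or implementation.

```python
RESERVED_COMPARISONS = {"<", ">"}  # reserved terms followed by a single value


def parse_keyword_query(query):
    """Split a keyword query into plain keywords and filters (hypothetical format)."""
    tokens = query.split()
    keywords, filters = [], []
    i = 0
    while i < len(tokens):
        term = tokens[i].lower()
        if term in RESERVED_COMPARISONS and i + 1 < len(tokens):
            # e.g. "> 4" becomes (">", "4")
            filters.append((term, tokens[i + 1]))
            i += 2
        elif term == "between" and i + 2 < len(tokens):
            # Assumed form: "between <low> <high>"
            filters.append(("between", tokens[i + 1], tokens[i + 2]))
            i += 3
        else:
            keywords.append(tokens[i])
            i += 1
    return keywords, filters


keywords, filters = parse_keyword_query("structure diameter > 4")
```

In this sketch a filter is kept separate from the keywords, so that a later stage could attach it to whichever property the preceding keyword matches.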
DANKE has three main components (see Fig. 1): Storage Module; Preparation Module; and Knowledge Extraction Module.
The Storage Module houses a relational (or RDF) database, constructed from various data sources. The database is described by a knowledge graph (KG), which is independent of the data model of the underlying database. The Storage Module also holds data indices required to support keyword search and other services.
The Preparation Module provides tools for creating the knowledge graph and for constructing and updating the centralized database through a pipeline responsible for a typical data integration process. This includes collecting data from data sources, transforming and enriching data, ingesting data into the database, and indexing data. This module allows the pipeline to be customized as needed for each case. The knowledge graph KG is defined by de-normalizing the schemas of the underlying databases, enriching KG with metadata that help interpret the data, and indicating which properties will have their values indexed.
The Knowledge Extraction Module uses the technology described in [7, 8, 10] and explores the knowledge graph and the data indices to compile a keyword query into an SQL (or SPARQL) query that returns data that best match the keywords. It features an algorithm [8] that accepts as input a keyword query \(Q_K\), the knowledge graph KG, and the indices, and: (1) finds matches with the keywords in \(Q_K\); (2) creates a conceptual query \(Q_C\) by exploring the keyword matches found and KG; (3) compiles \(Q_C\) into an SQL (or SPARQL) query \(Q_S\), which is then executed.
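The three steps can be illustrated with a deliberately small Python sketch. It is not the algorithm of [8]: the dictionaries `KG` and `VALUE_INDEX`, the greedy longest-phrase matching, and the restriction to a single table without joins are all simplifying assumptions introduced here for illustration.

```python
# Hypothetical knowledge graph: class label -> table name and property -> column.
KG = {
    "document": {"table": "Document", "properties": {"type": "type", "title": "title"}},
}
# Hypothetical value index: indexed value -> (class label, property label).
VALUE_INDEX = {"technical drawing": ("document", "type")}


def classify(phrase):
    """Return the kind of match for a phrase, or None if nothing matches."""
    if phrase in VALUE_INDEX:
        return "value"
    if phrase in KG:
        return "class"
    if any(phrase in c["properties"] for c in KG.values()):
        return "property"
    return None


def find_matches(keyword_query):
    """Step 1: greedily match the longest word sequences first."""
    words = keyword_query.lower().split()
    matches, i = [], 0
    while i < len(words):
        step = 1  # an unmatched word is skipped
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            kind = classify(phrase)
            if kind:
                matches.append((kind, phrase))
                step = j - i
                break
        i += step
    return matches


def build_conceptual_query(matches):
    """Step 2: derive the target class, projected properties, and value filters."""
    qc = {"class": None, "select": [], "filters": []}
    for kind, text in matches:
        if kind == "class":
            qc["class"] = text
        elif kind == "property":
            qc["select"].append(text)
        else:
            cls, prop = VALUE_INDEX[text]
            qc["class"] = qc["class"] or cls
            qc["filters"].append((prop, text))
    return qc


def compile_to_sql(qc):
    """Step 3: compile the conceptual query into a parameterized SQL string."""
    if qc["class"] is None:
        raise ValueError("no class could be inferred from the keywords")
    cls = KG[qc["class"]]
    columns = [cls["properties"][p] for p in qc["select"]] or ["*"]
    sql = f"SELECT {', '.join(columns)} FROM {cls['table']}"
    where = " AND ".join(f"{cls['properties'][p]} = ?" for p, _ in qc["filters"])
    if where:
        sql += f" WHERE {where}"
    return sql, [value for _, value in qc["filters"]]


sql, params = compile_to_sql(build_conceptual_query(find_matches("document type technical drawing")))
```

A real conceptual query would also have to choose join paths between classes in KG; that part is omitted from this sketch.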
4 DANKE-U
To handle unstructured data, such as technical/scientific documents, DANKE creates pipelines using DANKE-U, as illustrated in Fig. 2. DANKE-U comprises a set of components that process unstructured data, extracting metadata, thumbnails, texts, images, named entities and tables, for example.
A pipeline encompasses several forms of text document processing, automatically extracting information and enriching the database with new data and relationships. If it is necessary to search through the textual content of the file, the pipeline must include text indexing, enabling the search engine to match keywords with the textual content of the file.
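To make the role of text indexing concrete, the following Python sketch builds a simple inverted index over document texts and answers conjunctive keyword lookups. The function names and the tokenization rule (which keeps dotted identifiers such as "yyyy.xxxx" as one token) are assumptions for illustration; the source does not specify which indexing technology DANKE uses.

```python
import re
from collections import defaultdict


def build_text_index(documents):
    """Map each lowercased token to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for raw in re.findall(r"[\w.\-]+", text.lower()):
            token = raw.strip(".-")  # drop sentence punctuation around a token
            if token:
                index[token].add(doc_id)
    return index


def search(index, keywords):
    """Return the ids of documents that contain every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


docs = {
    1: "Riser diagram yyyy.xxxx, revision B.",
    2: "Inspection report for riser.",
}
index = build_text_index(docs)
hits = search(index, ["riser", "yyyy.xxxx"])
```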
In relational mode, documents and the associated data are recorded in a table, including their identification, file path, thumbnail, textual content, and metadata such as title, author, date, and type. Furthermore, other tables that relate documents to additional information may also be used.
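A possible relational layout is sketched below using Python's built-in `sqlite3` module. The column names follow the attributes listed above, but the exact names, the types, and the `DocumentMention` relationship table are assumptions made for illustration, not DANKE's actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Document (
    id        INTEGER PRIMARY KEY,
    file_path TEXT NOT NULL,
    thumbnail BLOB,
    content   TEXT,   -- extracted textual content
    title     TEXT,
    author    TEXT,
    date      TEXT,
    type      TEXT
);
-- Hypothetical table relating documents to additional information.
CREATE TABLE DocumentMention (
    document_id INTEGER REFERENCES Document(id),
    entity_type TEXT,
    entity_id   TEXT,
    PRIMARY KEY (document_id, entity_type, entity_id)
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO Document (id, file_path, title, type) VALUES (?, ?, ?, ?)",
    (1, "/docs/a.pdf", "Layout", "technical drawing"),
)
conn.execute("INSERT INTO DocumentMention VALUES (?, ?, ?)", (1, "structure", "001.00024"))
```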
An example of text document processing is extracting attribute values from specific entities using advanced table processing techniques facilitated by the “Table Extraction” component of DANKE-U. This component can be tailored to extract data from various types of tables, as illustrated on the right-hand side of Fig. 2, and executes the processing in three stages: (i) Collect and Settings: reads a folder containing PDF documents and a configuration file defining the metadata and tables to be investigated in the documents. The configuration determines domain-specific structures (e.g., regular expressions that identify entity and attribute names) used to select tables and attributes for extraction, along with formatting options for each attribute in the output. (ii) Extractor: extracts the attributes and tables defined in the previous stage. The process utilizes the Camelot library, which is capable of extracting tables with column and line separators (lattice, illustrated in Figs. 4 and 5 of the appendix) as well as those without clear separators (stream, illustrated in Fig. 6 of the appendix). From the Camelot output, the component identifies and extracts the attributes based on the configuration. (iii) Cleaner and Formatter: transforms the extracted data into tabular format, which can be stored using spreadsheets, CSV, or database tables. The user can personalize and configure the output format through the configuration file.
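The first two stages can be sketched in Python as follows. The layout of `CONFIG`, the regular expressions, and the function names are hypothetical; only the `camelot.read_pdf` call, with its `pages` and `flavor` arguments and the `df` attribute of each returned table, belongs to the Camelot library, and it is imported lazily so that the remainder of the sketch runs without it.

```python
import re

# Hypothetical configuration: how to select tables and which attributes to extract.
CONFIG = {
    "table_pattern": r"structure",  # a table is selected if some header cell matches
    "attributes": {
        "structure_id": {"pattern": r"^structure(\s+id)?$", "format": str},
        "diameter": {"pattern": r"^diam(eter)?\b", "format": float},
    },
}


def read_tables(pdf_path, flavor="lattice"):
    """Read raw tables with Camelot; use flavor="stream" for tables without ruling lines."""
    import camelot  # third-party package, imported only when this function is called
    tables = camelot.read_pdf(pdf_path, pages="all", flavor=flavor)
    return [table.df.values.tolist() for table in tables]


def extract_attributes(rows, config=CONFIG):
    """Pick the configured columns from one raw table; the first row is assumed to be the header."""
    header, body = rows[0], rows[1:]
    if not any(re.search(config["table_pattern"], cell, re.I) for cell in header):
        return []  # not a table of interest
    columns = {}
    for name, spec in config["attributes"].items():
        for idx, cell in enumerate(header):
            if re.search(spec["pattern"], cell.strip(), re.I):
                columns[name] = (idx, spec["format"])
                break
    return [
        {name: fmt(row[idx].strip()) for name, (idx, fmt) in columns.items()}
        for row in body
    ]


raw = [["Structure", "Diameter (in)", "Weight"], ["001.00024", "4", "12.5"]]
records = extract_attributes(raw)
```

The resulting list of dictionaries corresponds to the input of the third stage, which would write it out as a spreadsheet, CSV file, or database table.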
Another relevant example of text processing is Named Entity Recognition. Assuming the closed-world hypothesis, where there is a known set of values representing an entity, this technique can identify which documents mention a specific entity in their text. Hence, it is possible to relate documents to entities, storing this relationship in the database.
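Under the closed-world assumption, this kind of entity linking reduces to looking up known identifiers in each document text. The Python sketch below shows one way to do so; the function name `link_entities` and the word-boundary rule are assumptions for illustration rather than the actual “Entity Extraction” component.

```python
import re


def link_entities(known_ids, documents):
    """Return sorted (entity_id, document_id) pairs for each known id mentioned in a text."""
    links = set()
    for entity_id in known_ids:
        # Match the identifier as a whole token, not as part of a longer one.
        pattern = re.compile(r"(?<!\w)" + re.escape(entity_id) + r"(?!\w)")
        for doc_id, text in documents.items():
            if pattern.search(text):
                links.add((entity_id, doc_id))
    return sorted(links)


known = ["001.00021", "001.00024"]
texts = {
    10: "Layout of structure 001.00024 and riser.",
    11: "Report on 001.000245 only.",
    12: "Structures 001.00021, 001.00024.",
}
links = link_entities(known, texts)
```

Each resulting pair can be stored as a row in a relationship table that connects the entity table to the document table.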
5 Use Case
This section introduces the domain in which DANKE is employed to handle scientific and technical documents, illustrating the approach with real-life cases in subsea production systems of offshore oil and gas extraction.
Subsea production systems are composed of increasingly complex structures to meet the growing demands of large deep-water oil fields. Moreover, maintaining the physical integrity of several subsea components faces an additional challenge from their remote and hard-to-access locations. Nonetheless, tracking their operational health is crucial to guarantee environmental safety, reduce the high costs of an unplanned operation shutdown, and meet regulatory demands.
To accomplish these tasks, subsea experts need efficient access to consistent and up-to-date information about equipment designs, layouts and locations, inspection reports, maintenance activities, and other daily-generated data. However, throughout the life cycle of the subsea system, a large variety of data are generated and stored in various places, creating a complex environment. In this context, accessing, searching, navigating, and integrating these data becomes critical for decision-making, yet these actions are often costly and require substantial time from experts, technicians, and engineers.
Figure 3 illustrates a fragment of the knowledge graph that represents the subsea domain, encompassing: (i) a riser, a conductor pipe connected to offshore production platforms and the subsea flowlines, manifolds, and wellhead; (ii) a structure detailing the specific attributes for which a riser is designed, such as volume, diameter, and weight; and (iii) documents containing engineering and scientific data related to the risers and structures. Figure 3 also displays the resulting relational database, featuring tables to store data about risers, structures, documents, and the relationships among them. The appendix contains three illustrations of tables with a layout commonly found in PDFs in a real-world scenario; however, for confidentiality reasons, the values of the structures are fictitious.
The following use cases illustrate the approaches used in this domain for searching the documents.
Use Case 1 - A Keyword Query Over Document Metadata. Suppose the user enters the keyword query “document type technical drawing”. DANKE compiles this keyword query into an SQL query involving the table “Document” and its attribute “type”, filtering for the attribute value “technical drawing”, and responds with documents of the type technical drawing. Document metadata may have been automatically extracted from the PDF file using the “Metadata Extraction” component of DANKE-U, or it could have been manually entered by a specialist using an Electronic Document Management (EDM) system.
Use Case 2 - A Keyword Query Over Document Text. Suppose the user enters the keyword query “document diagram yyyy.xxxx”. DANKE compiles this keyword query into an SQL query to find documents of type “diagram” that contain the text “yyyy.xxxx”. To locate the text “yyyy.xxxx” within diagrams, represented as images in PDFs, it is necessary to extract the text from the images using OCR, a task performed by the “Text Extraction” component of DANKE-U. Additionally, the pipeline created in DANKE’s Preparation Module must store and index the text extracted from the PDF.
Use Case 3 - A Keyword Query Relating Data and Documents. Suppose the user enters the keyword query “structure document diagram yyyy.xxxx”. DANKE compiles this keyword query into an SQL query similar to the previous example but includes the relationships between structures and documents. To establish this relationship, document texts must be processed to identify those that mention structure identifiers. This task is accomplished by the “Entity Extraction” component of DANKE-U. Using the structure identifiers, previously known and stored in the table “Structure”, the component “Entity Extraction” processes the document texts and returns a table associating the structure identifiers with the document identifiers.
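The shape of the resulting SQL query can be sketched with Python's built-in `sqlite3` module. The table and column names, the sample rows, and the use of `LIKE` as a stand-in for a text index are all assumptions made for illustration; the query that DANKE actually generates is not given in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Structure (id TEXT PRIMARY KEY);
CREATE TABLE Document (id INTEGER PRIMARY KEY, type TEXT, content TEXT);
CREATE TABLE StructureDocument (structure_id TEXT, document_id INTEGER);
INSERT INTO Structure VALUES ('001.00024');
INSERT INTO Document VALUES (1, 'diagram', 'tag yyyy.xxxx near 001.00024');
INSERT INTO Document VALUES (2, 'report', 'yyyy.xxxx inspection');
INSERT INTO StructureDocument VALUES ('001.00024', 1);
""")

# Join structures to documents through the relationship table, then filter
# by document type and by the text extracted from the document.
QUERY = """
SELECT s.id AS structure, d.id AS document, d.type
FROM Structure s
JOIN StructureDocument sd ON sd.structure_id = s.id
JOIN Document d ON d.id = sd.document_id
WHERE d.type = ? AND d.content LIKE ?
"""
rows = conn.execute(QUERY, ("diagram", "%yyyy.xxxx%")).fetchall()
```

Only the diagram is returned in this sketch: the report also contains the text but is filtered out by its type and has no associated structure.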
Use Case 4 - A Keyword Query Involving Attribute Values Defined in Documents. Suppose the user enters the keyword query “riser xxxx-yyyy structure diameter”. DANKE compiles this keyword query into an SQL query that retrieves the diameter of the structures for the riser “xxxx-yyyy”. Some attributes of the structure are not originally present in a structured form (e.g., in a relational database) and had to be extracted through a sophisticated process that interprets various table formats found in documents. This extraction is performed by the “Table Extraction” component of DANKE-U, customized to process complex tables. Figures 4, 5 and 6 in the appendix illustrate a real-life scenario. However, for confidentiality reasons, the values presented in the tables are fictitious. The classical format of a table places each attribute in a separate cell, as in Fig. 4. In this example, the figure shows a table with diameter 4 for the structure identified as 001.00024. However, in some cases, when the attribute value is the same for different structures, the attribute value is presented only once by merging cells. Figure 5 presents a slightly more complex table to process, as it contains the same diameter value (4) in a merged cell, for three different structures: 001.00021, 001.00022, and 001.00023. Figure 6 depicts a table with a different layout but also containing the diameter value 4 of structure 001.00024. Furthermore, a page break may occur within the same table.
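One simple way to handle a merged cell such as the one in Fig. 5 is to propagate its value to the rows that the extraction left empty. The Python sketch below assumes that the extractor places the merged value in the first row of the span and leaves the remaining rows blank; both this assumption and the function name are illustrative and do not describe the actual “Table Extraction” component.

```python
def fill_merged_cells(rows, column):
    """Copy a merged cell's value down to the following rows where the cell is empty."""
    filled, last = [], None
    for row in rows:
        row = list(row)  # work on a copy; the input rows are left unchanged
        if row[column].strip():
            last = row[column]
        elif last is not None:
            row[column] = last
        filled.append(row)
    return filled


extracted = [
    ["001.00021", "4"],
    ["001.00022", ""],
    ["001.00023", ""],
]
completed = fill_merged_cells(extracted, column=1)
```

If the extractor instead places the value in the middle of the merged span, the rows above it would also need filling, which this sketch does not cover.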
6 Conclusions and Directions for Future Work
The paper first presented the overall architecture of DANKE, a data and knowledge extraction platform based on a search engine. Then, it described DANKE-U, which enables DANKE to handle unstructured data, including scientific and engineering documents. Finally, the paper presented a use case from the oil and gas industry, involving technical/scientific documents, processed using DANKE.
DANKE is currently being equipped with a Natural Language (NL) interface, constructed with the help of a Large Language Model (LLM), to help users locate data using NL questions, covering language constructs not supported by keyword queries [14, 15, 16]. The final goal is to explore the combination of controllable database keyword search with LLM freedom to create a trustworthy application that generates reliable answers.
As for future work, the application area involves a large variety of documents, including complete engineering drawings. This demand calls for the development of other specialized sets of components to be included in DANKE-U.
References
1. Bergamaschi, S., Guerra, F., Interlandi, M., Trillo-Lado, R., Velegrakis, Y.: Combining user and database perspective for solving keyword queries over relational databases. Inf. Syst. 55, 1–19 (2016). https://doi.org/10.1016/j.is.2015.07.005
2. Costa Pereira, J., et al.: On the role of correlation and abstraction in cross-modal multimedia retrieval. Trans. Pattern Anal. Mach. Intell. 36(3), 521–535 (2014). https://doi.org/10.1109/TPAMI.2013.142
3. Doan, A., Halevy, A.Y., Ives, Z.G.: Principles of Data Integration, 1st edn. Morgan Kaufmann, San Francisco (2012). https://doi.org/10.1016/C2011-0-06130-6
4. Dong, X.L., Rekatsinas, T.: Data integration and machine learning: a natural synergy. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1645–1650. SIGMOD 2018, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3183713.3197387
5. Dong, X.L., Srivastava, D.: Big data integration. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 1245–1248 (2013). https://doi.org/10.1109/ICDE.2013.6544914
6. Espíndola, D.B., Fumagalli, L., Garetti, M., Pereira, C.E., Botelho, S.S., Ventura Henriques, R.: A model-based approach for data integration to improve maintenance management by mixed reality. Comput. Ind. 64(4), 376–391 (2013). https://doi.org/10.1016/j.compind.2013.01.002
7. García, G.M.: A Keyword-based Query Processing Method for Datasets with Schemas. Ph.D. thesis, Thesis presented to the Graduate Program in Informatics, PUC-Rio (2020)
8. García, G.M., Izquierdo, Y.T., Menendez, E., Dartayre, F., Casanova, M.A.: RDF keyword-based query technology meets a real-world dataset. In: Proceedings of the International Conference on Extending Database Technology, pp. 656–667. OpenProceedings.org (2017). https://doi.org/10.5441/002/edbt.2017.86
9. Hameed, I.M., Abdulhussain, S.H., Mahmmod, B.M.: Content-based image retrieval: a review of recent trends. Cogent Eng. 8(1), 1927469 (2021). https://doi.org/10.1080/23311916.2021.1927469
10. Izquierdo, Y.T., García, G.M., Menendez, E.S., Casanova, M.A., Dartayre, F., Levy, C.H.: QUIOW: a keyword-based query processing tool for RDF datasets and relational databases. In: Hartmann, S., Ma, H., Hameurlain, A., Pernul, G., Wagner, R.R. (eds.) DEXA 2018. LNCS, vol. 11030, pp. 259–269. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98812-2_22
11. Izquierdo, Y.T., et al.: Keyword search over schema-less RDF datasets by SPARQL query compilation. Inf. Syst. 102, 101814 (2021). https://doi.org/10.1016/j.is.2021.101814
12. Izquierdo, Y.T., et al.: A platform for keyword search and its application for covid-19 pandemic data. J. Inf. Data Manage. 12(5) (2021). https://doi.org/10.5753/jidm.2021.1904, https://sol.sbc.org.br/journals/index.php/jidm/article/view/1904
13. Li, X., Yang, J., Ma, J.: Recent developments of content-based image retrieval (CBIR). Neurocomputing 452, 675–689 (2021). https://doi.org/10.1016/j.neucom.2020.07.139
14. Nascimento, E.R., et al.: Text-to-SQL meets the real-world. In: (Accepted to the 26th International Conference on Enterprise Information Systems) (2024)
15. Nascimento, E.R., et al.: A family of natural language interfaces for databases based on chatGPT and langchain (short paper). In: Companion Proceedings of the 42nd International Conference on Conceptual Modeling: Posters and Demos co-located with ER 2023, Lisbon, Portugal, November 06-09, 2023. CEUR Workshop Proceedings, vol. 3618 (2023). https://ceur-ws.org/Vol-3618/pd_paper_1.pdf
16. Nascimento, E.R., et al.: My database user is a large language model. In: (Accepted to the 26th International Conference on Enterprise Information Systems) (2024)
17. Nguyen, T.H., Prinz, A., Friiső, T., Nossum, R., Tyapin, I.: A framework for data integration of offshore wind farms. Renew. Energy 60, 150–161 (2013). https://doi.org/10.1016/j.renene.2013.05.002
18. de Oliveira, P., da Silva, A., de Moura, E.: Ranking candidate networks of relations to improve keyword search over relational databases. In: 2015 IEEE 31st International Conference on Data Engineering, pp. 399–410 (2015). https://doi.org/10.1109/ICDE.2015.7113301
19. Ramada, M.S., Silva, J.C., Leitão-Júnior, P.S.: From keywords to relational database content: a semantic mapping method. Inf. Syst. 88, 101460 (2020). https://doi.org/10.1016/j.is.2019.101460
20. Stonebraker, M., Ilyas, I.F.: Data integration: the current status and the way forward. IEEE Data Eng. Bull. 41, 3–9 (2018). https://api.semanticscholar.org/CorpusID:49407081
21. Tautkute, I., Trzciński, T., Skorupa, A.P., Brocki, Ł, Marasek, K.: Deepstyle: multimodal search engine for fashion and interior design. IEEE Access 7, 84613–84628 (2019). https://doi.org/10.1109/ACCESS.2019.2923552
22. Urbina Coronado, P.D., Lynn, R., Louhichi, W., Parto, M., Wescoat, E., Kurfess, T.: Part data integration in the shop floor digital twin: mobile and cloud technologies to enable a manufacturing execution system. J. Manuf. Syst. 48, 25–33 (2018). https://doi.org/10.1016/j.jmsy.2018.02.002. Special Issue on Smart Manufacturing
23. Vo, N., Jiang, L., Sun, C., Murphy, K., Li, L.J., Fei-Fei, L., Hays, J.: Composing text and image for image retrieval-an empirical odyssey. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6439–6448 (2019). https://doi.org/10.1109/CVPR.2019.00660
24. Yu, L., et al.: Commercemm: large-scale commerce multimodal representation learning with omni retrieval. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4433–4442 (2022). https://doi.org/10.1145/3534678.3539151
25. Zeng, D., Yu, Y., Oyama, K.: Deep triplet neural networks with cluster-cca for audio-visual cross-modal retrieval. ACM Trans. Multimedia Comput. Commun. Appl. 16(3), 1–23 (2020). https://doi.org/10.1145/3387164
Acknowledgements
This work was partly funded by FAPERJ under grant E-26/202.818/2017; by CAPES under grants 88881.310592-2018/01, 88881.134081/2016-01, and 88882.164913/2010-01; by CNPq under grant 302303/2017-0; and by Libra Consortium (Petrobras, Shell Brasil, Total Energies, CNOOC, CNPC, and PPSA) within the ANP R&D levy as a commitment to research and development investments.
A Appendix
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this paper
Cite this paper
Lemos, M. et al. (2024). A Technical/Scientific Document Management Platform. In: Rehm, G., Dietze, S., Schimmler, S., Krüger, F. (eds) Natural Scientific Language Processing and Research Knowledge Graphs. NSLP 2024. Lecture Notes in Computer Science, vol 14770. Springer, Cham. https://doi.org/10.1007/978-3-031-65794-8_9
DOI: https://doi.org/10.1007/978-3-031-65794-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-65793-1
Online ISBN: 978-3-031-65794-8
eBook Packages: Computer Science, Computer Science (R0)