1 Introduction

This paper first presents DANKE, a data and knowledge management platform that allows users, without technical training, to submit keyword queries to a centralized database. DANKE’s primary motivation is to democratize access to data and documents, originally dispersed across different data sources, without requiring users to write scripts or depend on the development of applications that provide forms for querying this data.

DANKE helps construct a centralized database through a data integration process with the following major steps: (i) data extraction from the original data sources; (ii) transformation and enrichment of these data; (iii) loading the data into the centralized database; (iv) indexing the data. DANKE uses a knowledge graph to provide a semantic view of the centralized database in a vocabulary familiar to the users. The knowledge graph is the basis of the search engine, which matches the keywords users submit with the concepts of the graph and uses these matches to automatically compile an SQL (or SPARQL) query over the centralized database.

The paper then describes DANKE-U, which enables DANKE to handle unstructured data, including scientific and engineering documents. DANKE-U is a collection of components that process unstructured data, extracting metadata, thumbnails, texts, images, named entities, and tables. With the help of this extension, DANKE’s users can seamlessly search unstructured documents and relate them to structured data.

Lastly, the paper presents a real use case from the oil and gas industry, where DANKE is applied to technical/scientific documents and structured data about subsea production systems, which play a critical role in offshore oil and gas extraction. In this context, technical/scientific documents are challenging to interpret: they may contain data organized in non-conventional tables with cells merged across rows and columns, or they may take the form of engineering drawings, among other formats. Thus, off-the-shelf tools are of limited help to analyze such documents.

The paper is organized as follows. Section 2 contains a very brief discussion on related work. Section 3 introduces DANKE. Section 4 outlines DANKE-U. Section 5 covers a use case. Finally, Sect. 6 contains the conclusions.

2 Related Work

This section briefly covers related work in areas directly connected to data integration and database keyword search.

Data Integration. Classic data integration is usually divided into the major sub-problems of data retrieval, data fusion, schema alignment, and entity linkage [3]. The progress the data integration community has made in addressing these challenges in the context of big data integration was explored in [5]. The ties between machine learning and data integration were discussed in [4, 20].

In the context of engineering data, Nguyen et al. [17] described the development of a framework for data integration to optimize the remote operations of offshore wind farms. Espinola et al. [6] presented an approach that integrates data from mixed/augmented reality tools and embedded intelligent maintenance systems to support operators/technicians during maintenance tasks, providing easier access to and comprehension of information from different systems. Urbina Coronado et al. [22] described how to integrate data from machine tools with production data collected by a Manufacturing Execution System (MES) to monitor process output, consumable usage, and operator productivity.

Database Keyword Search Engines. Early relational keyword-based query processing tools [1, 18, 19] explored the foreign/primary keys declared in the relational schema to compile a keyword-based query into an SQL query with a minimal set of join clauses based on the notion of candidate networks.

State-of-the-art content-based image retrieval strategies [9, 13] assume that images are represented by high-dimensional vectors created using Deep Learning techniques. Alternatively, a tool might map both the text and the images (or, in fact, any other media as well) into a single high-dimensional vector space, as in cross-modal retrieval techniques [2, 25].

Tautkute et al. [21] proposed a multimodal search engine in the fashion domain that retrieves items aesthetically similar to a query composed of image and textual inputs. Vo et al. [23] addressed a similar problem, where a query is specified in the form of an image and a text description of desired modifications to the input image. Yu et al. [24] introduced a multimodal model for understanding commerce topics, where the content can be an image, a text, or a combination of both.

DANKE [12] is an evolution of earlier tools [8, 10, 11]. DANKE-U enables DANKE to handle unstructured data, including scientific and engineering documents, as mentioned in the introduction.

3 A Brief Overview of DANKE

A keyword query in DANKE is just a list of terms, or keywords, that the user wants to search the database for, and may include reserved terms, such as “<”, “>”, “between”. In relational jargon, an answer to a keyword query is formatted as a table whose columns (or column names) contain the keyword matches. The answer may be the result of joining several database tables, that is, an answer to a keyword query does not need to be constructed out of a single table.
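
As a rough illustration of how such a query might be tokenized, the Python sketch below separates plain keywords from reserved terms; it is illustrative only, since the paper does not detail DANKE's query parsing, and the function name is hypothetical.

RESERVED = {"<", ">", "between"}

def parse_keyword_query(query: str):
    # split the query into plain keywords and reserved (operator-like) terms
    terms = query.lower().split()
    keywords = [t for t in terms if t not in RESERVED]
    operators = [t for t in terms if t in RESERVED]
    return keywords, operators

print(parse_keyword_query("riser diameter > 4"))
# (['riser', 'diameter', '4'], ['>'])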

DANKE has three main components (see Fig. 1): Storage Module; Preparation Module; and Knowledge Extraction Module.

Fig. 1. Architectural Overview of the DANKE Platform

The Storage Module houses a relational (or RDF) database, constructed from various data sources. The database is described by a knowledge graph (KG), which is independent of the data model of the underlying database. The Storage Module also holds data indices required to support keyword search and other services.

The Preparation Module provides tools for creating the knowledge graph and for constructing and updating the centralized database through a pipeline that implements a typical data integration process: collecting data from the data sources, transforming and enriching the data, ingesting the data into the database, and indexing the data. The module allows the pipeline to be customized to the needs of each use case. The knowledge graph KG is defined by de-normalizing the schemas of the underlying databases, enriching KG with metadata that help interpret the data, and indicating which properties will have their values indexed.
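
A minimal sketch of such a pipeline is shown below; the step names mirror the description above, but the functions and the record are hypothetical, not DANKE's actual interfaces.

def collect():
    # read rows from the original data sources (stubbed with one record)
    return [{"structure": "001.00024", "diameter": "4"}]

def enrich(rows):
    # transform/enrich, e.g., cast attribute values to proper types
    return [{**r, "diameter": float(r["diameter"])} for r in rows]

def ingest(rows):
    # load the data into the centralized database (stub)
    print("ingesting", rows)
    return rows

def index(rows):
    # register indexable property values for keyword search (stub)
    print("indexing", rows)
    return rows

rows = collect()
for step in (enrich, ingest, index):
    rows = step(rows)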

The Knowledge Extraction Module uses the technology described in [7, 8, 10] and explores the knowledge graph and the data indices to compile a keyword query into an SQL (or SPARQL) query that returns data that best match the keywords. It features an algorithm [8] that accepts as input a keyword query \(Q_K\), the knowledge graph KG, and the indices, and: (1) finds matches with the keywords in \(Q_K\); (2) creates a conceptual query \(Q_C\) by exploring the keyword matches found and KG; (3) compiles \(Q_C\) into an SQL (or SPARQL) query \(Q_S\), which is then executed.
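
The toy Python sketch below mimics steps (1)-(3) for a single-table case; the knowledge graph, the matching, and the SQL generation are drastically simplified with respect to the algorithm in [8], and all names are assumptions.

KG = {  # class -> {attribute: (table, column)}; a toy stand-in for the knowledge graph
    "Document": {"type": ("Document", "type"), "title": ("Document", "title")},
}

def find_matches(keywords):
    # (1) match keywords against class and attribute names in the KG
    classes = [c for c in KG if c.lower() in keywords]
    attrs = [(c, a) for c in classes for a in KG[c] if a in keywords]
    names = {c.lower() for c in KG} | {a for c in KG for a in KG[c]}
    values = [k for k in keywords if k not in names]  # remaining keywords match data values
    return classes, attrs, values

def compile_to_sql(classes, attrs, values):
    # (2) connect the matches into a conceptual query and (3) translate it to SQL
    table, column = KG[classes[0]][attrs[0][1]]
    value = " ".join(values)
    return f"SELECT * FROM {table} WHERE {column} = '{value}'"

print(compile_to_sql(*find_matches(["document", "type", "technical", "drawing"])))
# SELECT * FROM Document WHERE type = 'technical drawing'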

4 DANKE-U

To handle unstructured data, such as technical/scientific documents, DANKE creates pipelines using DANKE-U, as illustrated in Fig. 2. DANKE-U comprises a set of components that process unstructured data, extracting metadata, thumbnails, texts, images, named entities and tables, for example.

A pipeline encompasses several forms of text document processing, automatically extracting information and enriching the database with new data and relationships. If the textual content of a file must be searchable, the pipeline must include a text indexing step, which enables the search engine to match keywords against that content.
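
As one possible realization of this indexing step, the sketch below uses SQLite's FTS5 full-text index as a stand-in for whatever indexing backend a concrete DANKE deployment actually uses (it assumes an SQLite build with FTS5 enabled; the document identifier and text are fictitious).

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE doc_text USING fts5(doc_id, content)")
con.execute("INSERT INTO doc_text VALUES ('D1', 'diagram yyyy.xxxx rev. 2')")
hits = con.execute(
    "SELECT doc_id FROM doc_text WHERE doc_text MATCH ?", ('"yyyy.xxxx"',)
).fetchall()
print(hits)   # [('D1',)]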

In relational mode, documents and the associated data are recorded in a table, including their identification, file path, thumbnail, textual content, and metadata such as title, author, date, and type. Furthermore, other tables that relate documents to additional information may also be used.

An example of text document processing is extracting attribute values of specific entities using the advanced table processing techniques provided by the “Table Extraction” component of DANKE-U. This component can be tailored to extract data from various types of tables, as illustrated on the right-hand side of Fig. 2, and executes the processing in three stages: (i) Collect and Settings: reads a folder containing PDF documents and a configuration file defining the metadata and tables to be investigated in the documents. The configuration determines domain-specific structures (e.g., regular expressions that identify entity and attribute names) used to select the tables and attributes to extract, along with formatting options for each attribute in the output. (ii) Extractor: extracts the attributes and tables defined in the previous stage. The process uses the Camelot library, which is capable of extracting tables with column and row separators (lattice tables, illustrated in Figs. 4 and 5 of the appendix) as well as tables without clear separators (stream tables, illustrated in Fig. 6 of the appendix). From the Camelot output, the component identifies and extracts the attributes based on the configuration. (iii) Cleaner and Formatter: transforms the extracted data into tabular format, which can be stored as spreadsheets, CSV files, or database tables. The user can customize the output format through the configuration file.
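
A minimal sketch of stage (ii) with Camelot might look as follows; the file name, page selection, and post-processing are assumptions, and the real component wraps such calls in the configuration-driven flow described above.

import camelot

# ruled tables ("lattice") and tables without clear separators ("stream")
lattice_tables = camelot.read_pdf("datasheet.pdf", pages="all", flavor="lattice")
stream_tables = camelot.read_pdf("datasheet.pdf", pages="all", flavor="stream")

for t in lattice_tables:
    df = t.df                        # raw cell grid as a pandas DataFrame
    # stage (iii) would clean/format df before storing it
    df.to_csv(f"table_page_{t.page}.csv", index=False)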

Another relevant example of text processing is Named Entity Recognition. Assuming the closed-world hypothesis, where there is a known set of values representing an entity, this technique can identify which documents mention a specific entity in their text. Hence, it is possible to relate documents to entities, storing this relationship in the database.
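
Under this closed-world hypothesis, entity recognition can be as simple as matching a dictionary of known identifiers against the document text, as in the sketch below (the identifiers are the fictitious ones used in the appendix, and the function name is hypothetical).

import re

KNOWN_STRUCTURES = {"001.00021", "001.00022", "001.00023", "001.00024"}

def link_documents(doc_texts):
    # return (structure_id, document_id) pairs for a relationship table
    links = []
    for doc_id, text in doc_texts.items():
        for sid in KNOWN_STRUCTURES:
            if re.search(re.escape(sid), text):
                links.append((sid, doc_id))
    return links

print(link_documents({"D7": "Inspection report for structure 001.00024"}))
# [('001.00024', 'D7')]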

Fig. 2. Architectural Overview of the DANKE-U Module

Fig. 3. Extracting data from tables

5 Use Case

This section introduces the domain in which DANKE is employed to handle scientific and technical documents, illustrating the approach with real-life cases in subsea production systems of offshore oil and gas extraction.

Subsea production systems are composed of increasingly complex structures to meet the growing demands of large deep-water oil fields. Moreover, maintaining the physical integrity of several subsea components faces an additional challenge from their remote and hard-to-access locations. Nonetheless, tracking their operational health is crucial to guarantee environmental safety, reduce the high costs of an unplanned operation shutdown, and meet regulatory demands.

To accomplish these tasks, subsea experts need efficient access to consistent and up-to-date information about equipment designs, layouts and locations, inspection reports, maintenance activities, and other daily-generated data. However, throughout the life cycle of the subsea system, a large variety of data are generated and stored in various places, creating a complex environment. In this context, accessing, searching, navigating, and integrating this data becomes critical for decision-making. However, these actions are often costly and require substantial time from experts, technicians, and engineers.

Figure 3 illustrates a fragment of the knowledge graph that represents the subsea domain, encompassing: (i) a riser, a conductor pipe connected to offshore production platforms and the subsea flowlines, manifolds, and wellhead; (ii) a structure detailing the specific attributes for which a riser is designed, such as volume, diameter, and weight; and (iii) documents containing engineering and scientific data related to the risers and structures. Figure 3 also displays the resulting relational database, featuring tables to store data about risers, structures, documents, and the relationships among them. The appendix contains three illustrations of tables with a layout commonly found in PDFs in a real-world scenario; however, for confidentiality reasons, the values of the structures are fictitious.

The following use cases illustrate the approaches used in this domain for searching the documents.

Use Case 1 - A Keyword Query Over Document Metadata. Suppose the user enters the keyword query “document type technical drawing”. DANKE compiles this keyword query into an SQL query involving the table “Document” and its attribute “type”, filtering for the attribute value “technical drawing”, and responds with documents of the type technical drawing. Document metadata may have been automatically extracted from the PDF file using the “Metadata Extraction” component of DANKE-U, or it could have been manually entered by a specialist using an Electronic Document Management (EDM) system.
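
A plausible rendition of the compiled query, executed here against a toy in-memory database, is sketched below; the table and column names follow the description in this section but are assumptions about the actual schema and about DANKE's output.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Document (id TEXT, path TEXT, type TEXT, title TEXT)")
con.execute("INSERT INTO Document VALUES "
            "('D1', '/docs/d1.pdf', 'technical drawing', 'Riser GA drawing')")
print(con.execute(
    "SELECT id, title FROM Document WHERE type = 'technical drawing'"
).fetchall())   # [('D1', 'Riser GA drawing')]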

Use Case 2 - A Keyword Query Over Document Text. Suppose the user enters the keyword query “document diagram yyyy.xxxx”. DANKE compiles this keyword query into an SQL query to find documents of type “diagram” that contain the text “yyyy.xxxx”. To locate the text “yyyy.xxxx” within diagrams, represented as images in PDFs, it is necessary to extract the text from the images using OCR, a task performed by the “Text Extraction” component of DANKE-U. Additionally, the pipeline created in DANKE’s Preparation Module must store and index the text extracted from the PDF.
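
One way to perform this OCR step is sketched below with pdf2image and pytesseract; both libraries are assumptions about tooling, since the paper does not say which OCR engine the “Text Extraction” component uses, and the file name is fictitious.

from pdf2image import convert_from_path   # requires the poppler utilities
import pytesseract                        # requires the Tesseract OCR engine

def ocr_pdf(path):
    pages = convert_from_path(path, dpi=300)   # render each PDF page as an image
    return "\n".join(pytesseract.image_to_string(p) for p in pages)

text = ocr_pdf("diagram_document.pdf")
if "yyyy.xxxx" in text:
    print("document mentions yyyy.xxxx")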

Use Case 3 - A Keyword Query Relating Data and Documents. Suppose the user enters the keyword query “structure document diagram yyyy.xxxx”. DANKE compiles this keyword query into an SQL query similar to the previous example but includes the relationships between structures and documents. To establish this relationship, document texts must be processed to identify those that mention structure identifiers. This task is accomplished by the “Entity Extraction” component of DANKE-U. Using the structure identifiers, previously known and stored in the table “Structure”, the component “Entity Extraction” processes the document texts and returns a table associating the structure identifiers with the document identifiers.
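
The compiled query would then join documents, structures, and the extracted relationship table, roughly as in the sketch below (the schema and names are again assumptions, not DANKE's actual output).

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Structure (id TEXT PRIMARY KEY);
CREATE TABLE Document (id TEXT PRIMARY KEY, type TEXT, text TEXT);
CREATE TABLE Structure_Document (structure_id TEXT, document_id TEXT);
""")
query = """
SELECT s.id, d.id FROM Structure s
JOIN Structure_Document sd ON sd.structure_id = s.id
JOIN Document d            ON d.id = sd.document_id
WHERE d.type = 'diagram' AND d.text LIKE '%yyyy.xxxx%'
"""
print(con.execute(query).fetchall())   # [] on this empty toy database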

Use Case 4 - A Keyword Query Involving Attribute Values Defined in Documents. Suppose the user enters the keyword query “riser xxxx-yyyy structure diameter”. DANKE compiles this keyword query into an SQL query that retrieves the diameter of the structures of the riser “xxxx-yyyy”. Some attributes of the structures are not originally available in structured form (e.g., in a relational database) and had to be extracted through a sophisticated process that interprets the various table formats found in the documents. This extraction is performed by the “Table Extraction” component of DANKE-U, customized to process complex tables. Figures 4, 5 and 6 in the appendix illustrate a real-life scenario; however, for confidentiality reasons, the values presented in the tables are fictitious. The classical table format places each attribute value in a separate cell, as in Fig. 4. In this example, the figure shows a table with diameter 4 for the structure identified as 001.00024. However, in some cases, when an attribute value is the same for different structures, the value appears only once, in a merged cell. Figure 5 presents a slightly more complex table to process, as it contains the same diameter value (4) in a merged cell for three different structures: 001.00021, 001.00022, and 001.00023. Figure 6 depicts a table with a different layout, but also containing the diameter value 4 for structure 001.00024. Furthermore, a page break may occur within the same table.
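
For merged cells, an extractor typically sees the shared value once and empty cells for the remaining rows; a simple forward fill, as sketched below with pandas, recovers the intended per-row values (the column layout is illustrative, and real datasheets vary, as Figs. 4-6 show).

import pandas as pd

df = pd.DataFrame({
    "structure": ["001.00021", "001.00022", "001.00023"],
    "diameter":  ["4", "", ""],   # "4" appears once, in a cell merged across three rows
})
df["diameter"] = df["diameter"].replace("", pd.NA).ffill()
print(df)
#    structure diameter
# 0  001.00021        4
# 1  001.00022        4
# 2  001.00023        4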

6 Conclusions and Directions for Future Work

The paper first presented the overall architecture of DANKE, a data and knowledge management platform built around a search engine. It then described DANKE-U, which enables DANKE to handle unstructured data, including scientific and engineering documents. Finally, the paper presented a use case from the oil and gas industry, involving technical/scientific documents processed with DANKE.

DANKE is currently being equipped with a Natural Language (NL) interface, constructed with the help of a Large Language Model (LLM), to help users locate data using NL questions that cover language constructs not supported by keyword queries [14,15,16]. The final goal is to combine the controllability of database keyword search with the flexibility of LLMs to create a trustworthy application that generates reliable answers.

As for future work, the application area involves a large variety of documents, including complete engineering drawings. This demand calls for the development of additional specialized components to be included in DANKE-U.