Introduction

The ease of access to digital information resources has driven an exponential growth of unstructured information (natural language with no predefined data model). Such information is mainly contained in data sources such as journals, books, technical reports, and social networks, to mention a few. Many of these data sources are written in Spanish, the third most used language on the Internet [35]. It is therefore essential to process this content to exploit its information and knowledge in cutting-edge software applications [33]. To take advantage of the knowledge inside information resources, the Web of Data and knowledge graphs (KGs) provide a data model that allows machines to exchange and use content in a structured format, so that less human intervention is required for configuring and running algorithms [16] (e.g., for data integration and/or inferring facts).

A KG is a data structure intended to accumulate and convey knowledge, whose nodes represent real-world entities and whose edges represent diverse relations between these entities [16]. Several KG solutions have been made publicly available, for example, YAGO [15], DBpedia [22], BabelNet [31], and Wikidata [41]. Such KGs were published under the Linked Data principles [13], which keep the information free of ambiguity; this is particularly useful for obtaining precise or personalized results in tasks such as information retrieval [38] and question answering [7], to mention a few. With a well-defined scheme, as supported by ontologies, KGs can be used to infer knowledge about named entities (people, places, organizations, among others) through rules, classes, and relationships. Moreover, ontologies support querying information similarly to how a human would, moving through the nodes and obtaining implicit information through the edges of the graph [16].

The generation of KGs from unstructured text is a complicated process that involves the detection of knowledge elements (named entities and their relations), generally through natural language processing (NLP). In this sense, several strategies have been developed to detect such elements in English text [3, 6, 9, 24, 38, 40], linking them with resources from an existing KG in domains such as medicine [43] and education [3, 5], to mention a few.

Some solutions have recently been proposed for Spanish to extract and link entities [1], and to generically explore structural relations [40]. However, as far as we know, a complete solution has not been developed to construct KGs from unstructured text in Spanish. Processing Spanish poses its own difficulties in terms of linguistic variations and ambiguities, and structures developed for other languages do not transfer directly to this scenario. Thus, a model for data representation from Spanish text is required. This model would allow, among other things, the integration of and quick access to information, the enrichment of the original text with metadata, and the exploitation of underlying knowledge implicit in the text.

In this paper, we propose a method to design and construct KGs from unstructured text in Spanish. We use contextual information to generate lexical patterns that help capture the meaning of words in order to extract named entities and semantic relations (taxonomic, non-taxonomic, equivalence, and composition relations), which are later linked with DBpedia resources. The idea is to build a network that describes the content of a set of documents by defining a collection of fixed relations and properties, enriched with information from external KGs. We evaluate the performance of the proposed method by measuring the number of elements extracted from the input text and then assessing their correctness through measures such as precision, recall, and F-1 [36]. The results demonstrate the feasibility of the proposed method for extracting triples from three datasets in the general, computer science, and financial news domains. We also demonstrate competitive results by comparing our method against an existing approach from the literature (albeit one focused on English text).

The main contributions of this paper are:

  • The definition of a set of lexical patterns to identify knowledge components from unstructured text in Spanish.

  • A way to enrich entity resources from existing KGs, used for providing additional context to the extracted components.

  • A pipeline of stages to construct KGs from unstructured text in Spanish, defining an OWL scheme for interconnecting semantic relations with the identified entities.

This paper is organized as follows. “Background and related work” section presents background concepts and the related work. We describe our proposed method in “Proposed method” section. Next, the experiments and results are presented in “Experiments and results” section. Finally, in “Conclusions” section, we present the conclusions and future work.

Background and related work

Extracting and representing information from unstructured text is a challenging task, where an underlying document model is required to fulfill the needs of information by people and applications. In recent years, document models for content representation based on KGs have been proposed, whose purpose is to store knowledge in the form of named entities associated with each other through a semantic relation or property. KGs are very useful for information retrieval tasks [38], data visualization [20], and unification of resources from different information sources [43].

Fig. 1 Proposed method for KG construction

According to the resource description framework (RDF) model,Footnote 1 KGs are composed of a set of triples, each with the structure <subject, predicate, object>. The interaction between the subject and the object can be seen as a semantic relation (see the work of Karim et al. [19] for a formal definition of an RDF triple and an RDF graph). Rather than having such components extracted manually by a domain expert, the NLP field plays a central role in the automatic extraction of knowledge from text. NLP covers two main tasks here: the extraction of named entities (e.g., persons, locations, or organizations) and the extraction of their semantic relations. Named entities are commonly extracted by varied techniques based on machine learning [5], ontology matching [38], online services such as DBpedia Spotlight [27], a combination of techniques [14], and so on. On the other hand, the relations between entities are commonly extracted by strategies that encompass the use of syntactic patterns [8], machine learning [10], and semantic role labeling [12]. However, a step forward in creating a KG is the association of such components with resources from an existing KG and with underlying ontologies of the semantic web. The complete process of extracting entities and their relations, and linking them to KG resources, is known as relation extraction and linking (REL). Thus, REL can be used for the semantic enrichment of text or for creating a graph with semantic and unique components. According to the way semantic relations are extracted from text, we identified three groups of REL approaches:

  • Pattern-based. This kind of approach is defined in terms of constructing grammatical structures that describe semantic relations in text. For example, entities are often defined by nouns and their relation by a verb, producing a pattern like noun-verb-noun (NN-VB-NN), where a text sentence exhibiting such a pattern is deemed a semantic relation. Pattern-based approaches typically associate entities with resources through entity linking tools (e.g., DBpedia Spotlight) and the relations with properties of a KB by using established dictionaries and rules [5], with the support of ontology matching techniques [38]. Examples of this strategy are the approaches relying on OpenIE [3, 24, 26] for the extraction of relations not attached to a single domain, where the verb is the central part of a relation. However, OpenIE tools are computationally costly, and domain ontologies are not always available.

  • Event-based. Approaches in this strategy exploit units of information regarding particular events or situations in terms of their actors and the action relating them. It is similar to pattern-based but involves NLP techniques for describing logical expressions in the form of axioms that lead to valid implications. Linking of relations is performed by defining mappings of particular (actor) roles with specific ontology properties [9, 38, 40].

  • Distant supervision. This kind of approach employs information from a KG to train a machine learning algorithm so that statements similar to those stored in the KG are found in text. Thus, relation extraction is transformed into a classification task [29]. The association of properties is performed directly in the classification step, with the KG data used for training algorithms such as a multi-class logistic regression classifier [29] or neural networks [6].

All three groups of approaches use NLP tools to pre-process plain text and obtain features later used in the segmentation or extraction of components (for English). Adapting existing works to other languages may therefore be difficult, since it requires changing and preparing diverse resources such as dictionaries, rules, models, and libraries, which is a complicated and time-consuming task.

Rather than applying a full transformation of text into RDF triples, some approaches focus on the construction of KGs by addressing only the extraction and linking of named entities [1]. A generic relation is then defined and used to associate a document with such entities.

Although there are works extracting KGs from structured sources in languages such as Italian [4], Chinese [18, 44], or French [23], extracting entities and relations from unstructured text and linking them to a KG has been scarcely studied for Spanish. Vossen et al. [40] proposed a strategy to read news articles in various languages and detect events, particularly situations regarding what happened, who is involved, where, and when. They identify such relations through NLP tasks comprising named entity linking and event and semantic role detection, later mapping the results to a defined model. Although such work is mainly focused on event-centric information (building descriptions for a resource acting as an event), it is also important to cover entity-centric relations (our goal), where entities are directly connected through properties from an ontology.

Proposed method

The proposed method for KG construction from unstructured text in Spanish is presented in Fig. 1. It is composed of four main stages: text preprocessing, knowledge extraction (entities and semantic relations), component linking (with DBpedia resources and ontologies), and representation of triples (following the Linked Data principles). We describe the stages of the proposed method in the following subsections, starting with the types of components it extracts.

Knowledge components

The following elements of information can be extracted through the proposed method, either by lexical patterns or by an NLP parser:

  • Named entities. A named entity is a word (or multiword expression) representing a real-world thing or concept. Named entities are the core components discussed in the text, and thus it is essential to obtain as many of them as possible. Named entities are represented by tuples of the form <sf, offset, type>, where sf is the surface form (literal value) of the named entity, offset refers to its position in the text, and type refers to its class (e.g., person, place, or company), commonly determined by a NER parser (a minimal sketch of this representation is given after this list).

  • Taxonomic relations (instantiation). This kind of relation classifies a specific concept as part of a more general concept or indicates if a named entity is an individual of a class. For example, <FridaKahlo, rdf:type, Artista/Artist>,Footnote 2 where Frida Kahlo is an instance of the Artist class. As previously mentioned, semantic relations in this document are defined by a tuple of the form <s,p,o>, where s refers to the subject, p is the predicate, and o the object.

  • Equivalence relations. These relations establish the equality or equivalence between two expressions that are apparently different. For example, infancia/childhood is a synonym of niñez/childhood (in English, both concepts are related to childhood).

  • Composition (structural) relations. These relations describe how a concept or system of concepts can be broken down into parts or subsystems. For example, <puerta de embarque/gate, isPartOf, Aeropuerto/Airport>, where Airport is composed of gates and other elements.

  • Non-taxonomic relations. Those semantic relations in which two or more entities are linked in a non-hierarchical structure (e.g., an action, event in time, or location in space). For example, <FridaKahlo, nacidaEn/wasBornIn, Coyoacan> is used to indicate that a person (Frida Kahlo) was born in a place called Coyoacan.
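The following minimal sketch illustrates how these two tuple types can be represented in Python; the field and class names are our own choices for illustration, not those of the original implementation:

    from collections import namedtuple

    # <sf, offset, type>: surface form, position in the text, and entity class
    NamedEntity = namedtuple("NamedEntity", ["sf", "offset", "type"])

    # <s, p, o>: subject, predicate, and object of a semantic relation
    SemanticRelation = namedtuple("SemanticRelation", ["s", "p", "o"])

    frida = NamedEntity(sf="Frida Kahlo", offset=0, type="PERSON")
    relation = SemanticRelation(s=frida, p="fue una", o="pintora mexicana")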

Preparation

The first stage of the method aims to obtain features from unstructured text to be later used to identify knowledge components (named entities/semantic relations).

Preprocessing

The input text must be preprocessed to obtain underlying features regarding the role of words in a sentence. The tasks performed are as follows (a minimal sketch of this stage follows the list):

  • Text filtering. The method takes as input plain textFootnote 3 in Spanish (UTF-8 charset format). Moreover, special characters are eliminated.

  • Sentence segmentation. The text is segmented into sentences through an NLP parser. This task is performed because such grammatical structures help to define complete and coherent ideas (with subject, predicate, and object) from which to extract entities and relations. Additionally, sentences containing fewer than five words are removed. We defined this word window because shorter sentences (short texts) present problems in the detection of the knowledge components, particularly for extracting semantic relations.

  • Grammatical tagging. The grammatical category of each word in the segmented sentences is obtained using a grammatical parser. Such labels/categories (and words) are used to generate lexical patterns that define the structure of semantic relations between entities or concepts (often denoted by nouns, determiners, and other grammatical components described later in this document).
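A minimal sketch of this preparation stage, assuming NLTK with its Spanish sentence model and a unigram tagger trained on the CESS-ESP corpus (the actual implementation combines NLTK and Stanford CoreNLP and may differ in its filters and models), is:

    import re
    import nltk
    from nltk.corpus import cess_esp

    # One-time downloads: nltk.download('punkt'); nltk.download('cess_esp')

    def prepare(text, min_words=5):
        # Text filtering: drop special characters, keep words and basic punctuation
        text = re.sub(r"[^\w\s.,;:()-]", " ", text)
        # Sentence segmentation with the Spanish Punkt model
        sentences = nltk.sent_tokenize(text, language="spanish")
        # Discard sentences with fewer than five words
        sentences = [s for s in sentences if len(s.split()) >= min_words]
        # Grammatical tagging with EAGLES-style tags (e.g. 'ncfs000', 'vmis3s0')
        tagger = nltk.UnigramTagger(cess_esp.tagged_sents())
        return [tagger.tag(nltk.word_tokenize(s, language="spanish")) for s in sentences]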

Pattern generation

According to Bower [2], structures that define different types of semantic relations can be obtained by analyzing the text. For example, if there is a sentence: “an X has the parts A, B and C”, a structural relation (meronymic or part-whole relation) can be identified. One way to find such information is through lexical patterns. Thus, the extraction of knowledge components is based on lexical patterns that were defined in two phases. First, taxonomic, non-taxonomic, and some linguistic patterns (defined by Mora et al. [30] and by Hojas-Mazo et al. [17]) were adapted for obtaining relations and specialized entities. Second, diverse sentences were organized regarding their components (e.g., named entity, equivalence relation). Next, applying a grammatical tagger over the sentences makes it possible to generalize tags for obtaining patterns that describe each knowledge component.

The obtained patterns are defined in Tables 1 and 2 for entities and relations, respectively, where \(E_i\) means entity, TR means taxonomic relation, ER means equivalence relation, SR means structural relation, and NR means non-taxonomic relation. In addition, we used the EAGLES tag set recommendations for Spanish [21] for the morphosyntactic annotation of lexicons and corpora. However, for readability, we use contracted labels to refer to POS tags; for example, v stands for tags such as VA00000, VAIC000, and VAIP1S0, which correspond to verb forms. Table 3 shows the set of labels used in the defined patterns, and a sketch of this tag simplification is given below.

Table 1 Patterns used for the identification of entities in Spanish
Table 2 Patterns used for the identification of relations in Spanish
Table 3 Labels used in the pattern definition
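For illustration, one possible way to collapse detailed EAGLES tags into short pattern labels is sketched below; the short label names are assumptions for this example, while the authoritative set is the one in Table 3:

    def simplify_tag(eagles_tag):
        # Collapse a detailed EAGLES tag (e.g. 'VAIP1S0' or 'NCFS000') into a short label
        prefixes = {
            "v": "v",      # any verb form (VA*, VM*, VS*)
            "n": "n",      # common or proper noun (NC*, NP*)
            "a": "adj",    # adjective
            "d": "det",    # determiner
            "s": "prep",   # preposition (SP*)
        }
        return prefixes.get(eagles_tag[0].lower(), "other")

    # A lexical pattern is then a sequence of short labels, for instance a simple
    # relation between two entities joined by a verb:
    NR_PATTERN = ["n", "v", "n"]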

Knowledge extraction

This step extracts the named entities and semantic relations from text. For such purpose, it takes the input sentences, their tags identified by the grammatical tagging, and the lexical patterns defined in the previous step. The following subsections describe how the components can be obtained.

Knowledge components

The knowledge components are extracted from text as follows:

  • Named Entities. A two-step strategy to get named entities was followed:

    • Through a NER parser. A named entity recognition (NER) parser is used to identify named entities from the input sentences. The output of this step is a set of entity tuples with the components defined previously.

    • Through grammatical categories and patterns. Lexical patterns are used to recognize named entities with more than one word. In this sense, we assume that words tagged as simple nouns, compound nouns, and specialized terms are entities, which may also contain adjectives, determiners, and/or prepositions. For example, the text sequence Magdalena Carmen Frida Kahlo CalderonFootnote 4 can be described with the lexical pattern <Noun Noun Noun Noun Noun>.

  • Semantic relations. Semantic relations are obtained by a strategy based on pattern matching. The strategy uses the named entities identified from text to extract the semantic relations, where the verb (and supporting words) acts as the core of the predicate joining such named entities.

Pattern matching

The extraction of the knowledge components is supported by a matching process over the defined patterns. Thus, the set of patterns is iterated to find occurrences within the tagged text. The process is defined in Algorithm 1, and a Python sketch of it is given after the algorithm. The algorithm takes as input a sentence (plain text in Spanish) and obtains its POS tags (line 3), which are used for obtaining entities through a NER parser and the input entity patterns (line 4). Only if the sentence contains two or more entities are their possible relations obtained (line 6), where the relation matcher identifies the elements of the semantic relation according to the previously defined patterns.

Algorithm 1
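The following self-contained sketch mirrors the structure of Algorithm 1; the sliding-window matcher and the MISC entity type are simplifications for illustration rather than the exact implementation:

    def match_sequence(tags, pattern):
        # Return (start, end) spans where the tag sequence matches the pattern
        return [(i, i + len(pattern))
                for i in range(len(tags) - len(pattern) + 1)
                if tags[i:i + len(pattern)] == pattern]

    def extract_knowledge(tagged_sentence, entity_patterns, relation_patterns, ner_entities):
        # tagged_sentence: list of (word, short_tag) pairs; ner_entities: tuples from the NER parser
        words = [w for w, _ in tagged_sentence]
        tags = [t for _, t in tagged_sentence]
        entities = list(ner_entities)
        for pat in entity_patterns:                      # pattern-based entities (line 4)
            for start, end in match_sequence(tags, pat):
                entities.append((" ".join(words[start:end]), start, "MISC"))
        relations = []
        if len(entities) >= 2:                           # relations only with two or more entities (line 6)
            for pat in relation_patterns:
                relations += [" ".join(words[s:e]) for s, e in match_sequence(tags, pat)]
        return entities, relations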

Association of components with external repositories (linking)

This section covers the mapping of named entities and relations to Linked Data resources to finally create a KG with the information extracted from text.

Named entity linking

This step is aimed at providing an association of the named entities with resources from DBpedia (denoted by the prefix dbr:) through SPARQL queries. The idea is to associate the surface form of the entity with its corresponding uniform resource identifier (URI) by matching the available label of the queried resources. Additionally, RDF triples associated with the selected resource are retrieved as enriching information for the KG. Listing 1 presents two SPARQL query examples. In the left query, the literal value is used to retrieve the URI (dbr:Frida_Kahlo) through the property rdfs:label. Such URI is then used in the query on the right through the property dcterms:subject to retrieve additional information of the entity.

Listing 1
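Along the lines described above, the two queries can be sketched roughly as follows (prefix declarations are shown once for brevity, and the exact filters and language tags used in the implementation may differ):

    PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dbr:     <http://dbpedia.org/resource/>

    # Query 1: surface form -> resource URI via rdfs:label
    SELECT ?uri WHERE { ?uri rdfs:label "Frida Kahlo"@es . }

    # Query 2: resource URI -> enriching information via dcterms:subject
    SELECT ?topic WHERE { dbr:Frida_Kahlo dcterms:subject ?topic . }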

Note that in case of ambiguity (two or more resources from the KG sharing the same surface form), the following strategy is used (a sketch of this cascade is given after the list):

  • String matching. The exact text from the entity is used to retrieve the resource (through the rdfs:label property).

  • Type. The type of the entity (retrieved by the NER parser) is matched against the type (rdf:type) of the resource.

  • Other. The first result is selected if more than one resource meets the criteria.
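A minimal sketch of this disambiguation cascade; the candidate tuples and the mapping of NER types to DBpedia classes are assumptions about the interface rather than the exact implementation:

    def disambiguate(surface_form, ner_type, candidates):
        # Pick one resource for an ambiguous surface form.
        # candidates: list of (uri, label, rdf_types) tuples already retrieved via SPARQL.
        # 1. Exact string match against rdfs:label
        exact = [c for c in candidates if c[1] == surface_form]
        pool = exact if exact else candidates
        # 2. Match the NER type against the resource's rdf:type values
        typed = [c for c in pool if ner_type in c[2]]
        if typed:
            return typed[0][0]
        # 3. Otherwise, fall back to the first result
        return pool[0][0] if pool else None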

Semantic enrichment

In addition to the association of entity resources, this step aims to retrieve information available in an external KG to provide additional context to such entities, thus enriching the original text. The semantic enrichment stage is composed of two steps:

  • Query preparation and submission. This step takes as input a set of properties used in query templates that are later submitted to the DBpedia endpointFootnote 5 (as query 2 in Listing 1). Some of these properties are shown in Table 4; they are retrieved for all the identified entities in the input text. Although other properties can be used, the focus of this work is on taxonomy, type, and equivalence relations, as defined in the Linked Data recommendations [13].

  • Result gathering. This step collects and filters the output triples. Only resources from DBpedia and WikidataFootnote 6 are accepted (in the case of owl:sameAs equivalences), along with those described by the rdfs and owl vocabularies.

Table 4 Properties used in the enrichment process

The steps for the semantic enrichment are described in Algorithm 2, and a compact sketch is given after it. The algorithm takes as input a set of named entities and a set of queries (as defined by the properties in Table 4). Next, the entities are iterated to query and disambiguate their respective DBpedia resource (line 3). Only if a resource is available for the entity does the algorithm continue by submitting the queries and gathering the resulting enriching triples (line 6). Note that the association of resources is restricted to the mentioned properties so that the graph is enriched consistently, without ambiguous information; entities with no associated resource are discarded.

Algorithm 2
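A compact sketch of this enrichment loop is given below. It assumes that the Table 4 properties include at least rdf:type, dcterms:subject, and owl:sameAs, and it uses the SPARQLWrapper library against the public DBpedia endpoint for brevity (the reported implementation relies on RDFlib):

    from SPARQLWrapper import SPARQLWrapper, JSON

    PROPERTIES = ["rdf:type", "dcterms:subject", "owl:sameAs"]  # assumed subset of Table 4

    QUERY_TEMPLATE = """
    PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX owl:     <http://www.w3.org/2002/07/owl#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?o WHERE {{ <{uri}> {prop} ?o . }}
    """

    def enrich(entity_uris, endpoint="https://dbpedia.org/sparql"):
        sparql = SPARQLWrapper(endpoint)
        sparql.setReturnFormat(JSON)
        triples = []
        for uri in entity_uris:                                  # iterate the linked entities
            for prop in PROPERTIES:                              # one query template per property
                sparql.setQuery(QUERY_TEMPLATE.format(uri=uri, prop=prop))
                for b in sparql.query().convert()["results"]["bindings"]:
                    obj = b["o"]["value"]
                    # Result gathering: keep only DBpedia and Wikidata resources
                    if obj.startswith(("http://dbpedia.org/", "http://www.wikidata.org/")):
                        triples.append((uri, prop, obj))
        return triples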

Representation

This final step covers the construction of the KG according to the knowledge components extracted in the previous tasks. This step is divided into three subtasks as described in the following subsections.

Scheme specification (property generation)

At this point, the extracted entities have been associated with resources (from DBpedia and Wikidata), and their enriching information is connected to them through already defined properties from the DBpedia, RDFS, and OWL vocabularies (obtained through queries). However, connecting the elements of a semantic relation is a complicated task due to heterogeneous data, sparse descriptive ontologies, and the integration of knowledge components. Thus, we implemented a strategy for crafting properties from the predicate of a semantic relation and the identified resources, defined by the following features:

  • Notation. A camelCase notationFootnote 7 is used to join the words in the predicate. In this way, we create an identifier for the generation of a new property.

  • Domain and range. The property is defined by axioms indicating its formalization. The type (class) of the identified resources in the subject and object of the semantic relation is used to define the rdfs:domain and rdfs:range, respectively. Note that, by definition, an RDF triple must have a resource as its subject (not a literal value).

  • Type. If the semantic relation has KG resources associated to subject and object, the property is defined as owl:ObjectProperty, otherwise it is defined as owl:DatatypeProperty.

The property is then formalized using OWL, preserving the original value of the predicate through the rdfs:label property. A property example crafted from the semantic relation (Frida Kahlo; fue una; pintora mexicana) is presented in Listing 2 using the Turtle syntax.Footnote 8

Listing 2
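A property crafted from this relation would look roughly like the following Turtle fragment; the textprop prefix URI and the chosen domain class are illustrative assumptions, and since the object here is a literal value the property is modeled as a datatype property:

    @prefix textprop: <http://example.org/textprop/> .
    @prefix owl:      <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .
    @prefix dbo:      <http://dbpedia.org/ontology/> .

    textprop:fueUna a owl:DatatypeProperty ;   # the object is a literal value
        rdfs:label  "fue una"@es ;             # original predicate text is preserved
        rdfs:domain dbo:Person ;               # class of the subject resource (dbr:Frida_Kahlo)
        rdfs:range  xsd:string .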

Triple crafting and visualization

This subsection covers the last two tasks in the representation:

  • Triple crafting. Once the scheme for the set of entities and properties has been generated, it is possible to build the RDF triples that comprise the KG. The idea is to include the entities, the relations, and the enriched information associated with DBpedia resources as triples.

  • Visualization. Finally, the generated KG is stored as an RDF file and also exported to a file that can be visualized as an image.

The entire process for the representation and enrichment of text as a KG is specified in Algorithm 3. The algorithm takes as input a set of sentences in Spanish. Then, it obtains named entities (\(\textsc {E}_{s}\)) and semantic relations (\(\textsc {Rel}_{s}\)). Next, the named entities are associated with their respective resources from DBpedia, where enriched triples related to the entities are also obtained (\(\textsc {Ten}\)). Furthermore, for each semantic relation, the elements of an RDF triple are identified, associating the subject (sub) and object (obj) with the resources (and their labels) already obtained in the previous step and defining the property (prop) with the crafting strategy previously mentioned. Assuming the input sentences correspond to a single document, the generated RDF file represents the context and knowledge of the topics described in that document.

Algorithm 3
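A condensed sketch of this representation stage using rdflib is shown below; the textprop namespace and the shape of the inputs (already extracted relations, a surface-form-to-URI map, and the enriching triples) are assumptions for illustration, not the exact interfaces of the implementation:

    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import RDFS

    TEXTPROP = Namespace("http://example.org/textprop/")  # assumed prefix

    def to_camel_case(predicate):
        words = predicate.split()
        return words[0].lower() + "".join(w.capitalize() for w in words[1:])

    def build_graph(relations, entity_uris, enriching_triples):
        # relations: (subject, predicate, object) strings
        # entity_uris: surface form -> DBpedia URI
        # enriching_triples: (s, p, o) URI triples gathered during enrichment
        g = Graph()
        for s, p, o in relations:
            subj = URIRef(entity_uris[s])                     # the subject must be a resource
            prop = TEXTPROP[to_camel_case(p)]                 # crafted property identifier
            g.add((prop, RDFS.label, Literal(p, lang="es")))  # keep the original predicate text
            # Object: linked resource if available, literal value otherwise
            obj = URIRef(entity_uris[o]) if o in entity_uris else Literal(o, lang="es")
            g.add((subj, prop, obj))
        for s, p, o in enriching_triples:                     # enriched DBpedia statements
            g.add((URIRef(s), URIRef(p), URIRef(o)))
        return g

    # build_graph(...).serialize(destination="document_kg.ttl", format="turtle")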

Listing 3 shows a fragment of the KG constructed from the running example. Particularly, textprop:fueUna/textprop:wasA corresponds to a property extracted from text where the value is pintora mexicana/Mexican painter. Additionally, in the enriched information, dcterms:subject adds topical information and the property rdf:type associates the resource dbr:Frida_Kahlo with three different resources (Person, Artist, and Thing).

Listing 3
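Based on this description, the fragment for the running example looks approximately as follows; the textprop prefix and the specific category resource are illustrative assumptions:

    @prefix dbr:      <http://dbpedia.org/resource/> .
    @prefix dbc:      <http://dbpedia.org/resource/Category:> .
    @prefix dbo:      <http://dbpedia.org/ontology/> .
    @prefix dcterms:  <http://purl.org/dc/terms/> .
    @prefix owl:      <http://www.w3.org/2002/07/owl#> .
    @prefix textprop: <http://example.org/textprop/> .

    dbr:Frida_Kahlo
        textprop:fueUna "pintora mexicana"@es ;     # property extracted from the text
        dcterms:subject dbc:Mexican_painters ;      # topical enrichment (example category)
        a dbo:Person , dbo:Artist , owl:Thing .     # types retrieved during enrichment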

Experiments and results

This section presents the experiments carried out to evaluate the performance of the proposal. Thus, the number of components extracted, the represented RDF triples, and their accuracy are evaluated. Moreover, the performance of the triple representation is compared against an existing approach from the literature. Additionally, the validity of the data is verified through Protégé.Footnote 9 Before presenting the experimental scenarios, the datasets and metrics used are introduced.

Table 5 Dataset features

Datasets

Validating the results of the extraction of components for the construction of a KG is a complicated task due to the limited availability of gold standard datasets, influenced by the type of knowledge components, the domain (news, medicine, etc.), and the language (generally focused on English). Thus, we used three datasets with the following features:

  • Computer Science (CS). This dataset is composed of 19 documents with a total of 250 sentences written in Spanish. The documents correspond to abstracts of general domain articles and specialized texts in computer science published by researchers from the Autonomous University of Tamaulipas (obtained with a Google Scholar filter). Three computer science domain experts manually identified a set of 270 entities and 166 triples from the sentences. We considered entities that correspond to classes, places, and people, as well as those linked to resources in DBpedia. This dataset is provided online.Footnote 10

  • Wikipedia-based (KI). This dataset, proposed in [20], is composed of 100 sentences in English randomly selected from Wikipedia. Each sentence in the dataset comes with its corresponding triples, with resources linked to DBpedia, for a total of 132 RDF triples. The sentences were translated into Spanish.

  • Financial News (FN). This dataset is composed of 1050 sentences in the domain of financial news in Spanish, collected from specialized media in Spain. This dataset is provided online.Footnote 11

While the first dataset is used for testing the extraction of entities and semantic relations through an a posteriori analysis (evaluated by judges after extraction), the second and third datasets were used to evaluate the extraction and association of RDF triples from text. A brief description of the datasets is presented in Table 5, where CS, G, and F are for the Computer Science, General, and Financial domain, respectively, and ‘–’ means no info available.

Metrics

The experiments considered traditional information retrieval metrics, namely precision, recall, and F-measure, described as follows (a small computation sketch is given after the list):

  • Precision evaluates the capability of the method to exclude those elements that do not represent true entities or semantic relations. It is defined in Eq. (1).

    $$\begin{aligned} \mathrm{precision} = \frac{|\mathrm{relevantElements} \cap \mathrm{retrievedElements}|}{|\mathrm{retrievedElements}|} \end{aligned}$$
    (1)
  • Recall evaluates the ratio of relevant entities or semantic relations extracted. The recall is defined in Eq. (2).

    $$\begin{aligned} \mathrm{recall} = \frac{|\mathrm{relevantElements} \cap \mathrm{retrievedElements}|}{|\mathrm{relevantElements}|} \end{aligned}$$
    (2)
  • F-Measure is a harmonic average between precision and recall. It is defined in Eq. (3).

    $$\begin{aligned} \mathrm{F}\text {-}\mathrm{measure} = \frac{(1+\beta ^2)(\mathrm{precision} \cdot \mathrm{recall})}{(\beta ^2 \cdot \mathrm{precision}+\mathrm{recall})} \end{aligned}$$
    (3)

    where commonly \(\beta =1\) to define the F-1 measure.
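For completeness, a small Python sketch computing these measures over sets of extracted and gold-standard elements:

    def precision_recall_f1(retrieved, relevant, beta=1.0):
        # Precision, recall and F-measure over sets of elements (Eqs. 1-3)
        retrieved, relevant = set(retrieved), set(relevant)
        true_positives = len(retrieved & relevant)
        precision = true_positives / len(retrieved) if retrieved else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
        return precision, recall, f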

Implementation details

In addition to the algorithms and descriptions presented in the “Proposed method” section, this section presents some implementation details for the components of the proposed method. Python (version 3.5) was used for the implementation of the modules. The following libraries and frameworks were used:

  • Preparation: The NLTK toolkitFootnote 12 was used for natural language processing tasks (Tokenize, POS-Tagging).

  • Knowledge extraction: Stanford CoreNLPFootnote 13 was used to identify entities in Spanish with the Named Entity Recognition tool together with the lexical patterns.

  • Linking: The library RDFlibFootnote 14 for Python was used for querying the DBpedia repository.

  • Representation: The RDF scheme was built using Apache Jena.Footnote 15 The graph was visualized using the GraphViz Python library.Footnote 16

  • Visualization. Protégé was used for visualizing the represented data.

In terms of hardware (and OS), all the experiments were executed on a computer with the following features: Intel Core i-5 CPU (2.4 GHz), 8 GB RAM, 64 bits, and Windows 10 as the operating system.

Quantitative analysis

This experiment evaluates the degree of extraction of components by the proposed method. The scenario for this experiment is as follows:

  • Datasets. Sentences from the three datasets were used for this experiment.

  • Evaluation of triples. The elements in each RDF triple were connected to resources in DBpedia. However, an RDF triple is complete only if an identifier can be defined for each of its elements (subject, predicate, object).

The results are presented in Table 6, where NER-T and NER-P refer to the named entities extracted through an NLP tool and through the defined patterns, respectively; EL refers to the entity linking process (named entities associated with resources in DBpedia), SR refers to the extracted semantic relations, and finally the number of constructed RDF triples is given.

Table 6 Results of the extracted components

We noted that the NER tool presented difficulties in identifying entities in the Spanish texts of the CS dataset. This is because the models used work better in more general domains. In contrast, when lexical patterns are added, the identification of specific entities improves. For example, entities composed of several words, such as “Museo Nacional de Historia Natural/National Museum of Natural History”, can be identified. The idea is to increase the number of extractions that can be useful for associating resources with the subject or object of a triple identified in the text. In the case of the KI dataset, a larger number of entities was extracted by NER-T, mainly because it contains general domain sentences from Wikipedia. However, we also noted a reduction in the number of entities obtained by NER-P, because they had already been captured by NER-T and because of variations introduced by the translation from English to Spanish. In the case of the FN dataset, the extraction of elements demonstrated that the strategy is able to obtain elements of information in another domain. Compared to the other datasets, the system obtained more triples from FN (relative to the proportion of sentences in the three datasets). This is due to the number of linked entities and the ratio of words per sentence, which shows that the patterns work better with shorter sentences, since longer sentences contain more lexical elements (e.g., sentence complements joined with prepositions) that require more robust patterns to model them.

Accuracy analysis

In this section, the performance of the proposed strategy is evaluated through the manual assessment of the extracted components. Thus, an a posteriori revision by human evaluators was performed for this purpose.

Entity linking assessment

In this experiment, the named entities extracted from text and their corresponding resources from DBpedia were analyzed. The experts determined whether the DBpedia resource refers to the entity or concept found in the text according to its context. The scenario for this experiment is as follows:

  • Dataset. Due to the high number of extracted components (see Table 6), it is time-consuming and not straightforward to analyze the total number of named entities extracted from text. For this reason, 24 sentences were randomly selected from the CS dataset.

  • Evaluation. The context of the named entity helps the reviewer determine whether the linked resource is correct. Moreover, the evaluation considered only complete matches between the text and the label of the DBpedia resource.

The results are presented in Table 7, where EL-T and EL-P refer to the entity linking considering entities extracted by a tool (NER-T) and patterns (NER-P) respectively, and EL-TP considers linking using both NER strategies. Note that only the precision is presented because the complete number of correct extractions is not known in advance.

Table 7 Results of precision for entity extraction and linking
Table 8 Evaluation of extracted semantic relations

The results obtained by the EL-TP configuration demonstrate an increased number of components that can be associated with resources from DBpedia. Some of the entities were retrieved by both strategies, but the type provided by the NER tool helps to disambiguate the correct resource associated with the entity from the text.

Semantic relation assessment

This experiment evaluates the semantic relations extracted through the defined pattern matching process. In this sense, the scenario for the analysis is as follows:

  • Dataset. 100 sentences from the CS dataset were randomly selected.

  • Evaluators. Three human evaluators manually identified the semantic relations from the randomly selected sentences, and then compared and assessed each of the automatically extracted semantic relations. The evaluators (judges) are native Spanish speakers with knowledge of Semantic Web standards, and belong to a postgraduate Computer Science program. The same judge profile applies to the following experiments.

It is worth mentioning that a semantic relation is considered correct if its three elements are identified. Additionally, it is verified whether each of these elements coherently conveys the idea of the corresponding sentence (i.e., the relation maps the information of the sentence). The results are presented in Table 8, where SR refers to the total number of semantic relations, followed by the numbers of correct, incorrect, and identified relations.

The extraction of semantic relations is an essential step for the creation of RDF triples. The number and quality of such components directly impact the association of resources and properties from a semantic and pragmatic perspective. That is, errors in the semantic relations are inherited by further steps of the KG construction. The performance of relation extraction approaches in Spanish is still improving, lying in the precision range of 0.59–0.80 for data in the general and news domains [39, 45]. Thus, we consider our results to be in an intermediate range that can be applied in the next steps of the process.

Assessment of RDF triples

This step evaluates the performance of the represented RDF triples according to correct and incorrect elements. We consider the following conditions:

  • Dataset. We used the whole CS dataset.

  • Evaluators. Similarly to the previous scenario, three human experts manually assessed RDF triple elements.

An RDF triple is deemed correct if its elements are coherent with respect to the original sentence (and semantic relation), if the subject and object entities point to the correct DBpedia resources, and if the generated property is appropriate for the predicate. The results of the evaluation are presented in Table 9.

Table 9 Evaluation of RDF triple elements

Based on the results of Table 9, and given that the sentences in the CS dataset are contained within 19 documents, we present the Precision, Recall, and F-1 values by document (and by element) in Fig. 2. In terms of the RDF triple elements, the subject obtained the best performance across all measures. We attribute this behavior to the fact that the extraction starts by setting an entity as the subject and later connects the remaining elements, which are sometimes missing or incomplete (mainly the object). For example, from the sentence Frida Kahlo was a self-portrait painter, a semantic relation can be wasA(Frida Kahlo, self-portrait painter). However, in the RDF triple, the object may miss the word painter, producing an incoherent idea. Much of the ability to preserve coherent ideas depends on the correct extraction of the semantic relation, which delimits the elements that can be represented and propagates possible errors from early to late steps of the process.

Fig. 2 Results of RDF triples evaluation

Performance comparison

Although there is a lack of techniques for KG extraction from plain text in Spanish, our goal in this experiment is to compare the performance of our proposed strategy against an existing approach from the literature. In this regard, we considered the approach proposed by Kertkeidkachorn et al. [20] (K&I), adapting the following scenario:

  • Dataset. The KI dataset is used.

  • Entity linking. Entities are linked to DBpedia resources (through SPARQL queries). In the case of K&I, they used DBpedia Spotlight.

  • Property linking. While the K&I approach is based on property mapping (mapping predicates to DBpedia properties), the proposed approach follows a property generation strategy. Moreover, the experiment considered only RDF triples with object properties.

  • Evaluation. Considering the difference in language between the KI dataset and our proposal, the extracted triples were manually evaluated to analyze how well they match the elements of the original RDF triples in the dataset. In this way, we avoid penalizing identifier differences between the original and extracted resources caused by language-dependent labels.

The results of the comparison are presented in Table 10, where the evaluation measures and the number of extracted triples (|T|) are displayed for K&I and our proposed approach. Although the conditions for extraction and representation differ slightly between the two approaches (particularly the property mapping and the language), the purpose was to analyze the performance of our proposal with respect to an existing approach from the literature. Our proposal achieved better performance due to its focus on precision, but produced fewer extractions than the K&I approach.

Table 10 Results of comparing the RDF triple extractors

KG visualization

The syntactic validity of the generated KG was verified by loading the RDF file containing all the obtained triples and the scheme specification into the Protégé framework. An example of a KG loaded with Protégé is depicted in Fig. 3, visualizing the classes and instances generated from the extracted RDF triples for the running example sentence (used for illustrative purposes).

Fig. 3 RDF data visualization in Protégé

Regarding the KG visualization, Fig. 4 shows a KG obtained from the text Frida Kahlo fue una pintora mexicana/Frida Kahlo was a Mexican painter.

Fig. 4 The constructed KG

Discussion

Although the proposed method is designed for a general domain (according to the defined patterns), the description of new concepts and patterns can support the extraction of components regarding a particular domain of information. According to the extracted elements and results, the following aspects can be observed:

  • More than one RDF triple can be extracted from a single sentence. For example, given the sentence John Genzale was born in Queens, New York, where he first lived in East Elmhurst, a list of some possible RDF triples is presented in Listing 4. For visual and space reasons, not all descriptions from the sentence are represented in the example, where the qualifier (first) is missing and can only be defined by a reification model [24], which is not within the focus of this work.

Listing 4
  • Retrieving too many DBpedia statements for enrichment could cause the KG to lose the focus and context of the processed documents.

  • Most of the errors are detected at the last stage of triple extraction, because the outputs of the early stages are combined there. Thus, the final stage depends on the correctness of the tasks in the text-processing pipeline.

  • This work is focused on lexical patterns to establish the semantic valence of words and describe their behavior in a sentence. Such lexical patterns represent taxonomic, equivalence, and structural relations. For example, our method can detect taxonomic relations described by hyponym/hypernym relation predicates between two resources, as in the triple \(<\texttt {dbr:Rohtak, isA/esUna, dbr:Ciudad}>\). However, although the use of lexical patterns allows the proposed method to achieve higher precision than the K&I approach, it also leads to fewer triples and lower recall because of its focus on particular types of relations. This issue is known as the out-of-vocabulary (OOV) problem: the system does not consider what is not covered by the patterns. One option to address it is to expand the list of patterns according to the domain and language [25]. Another option is deep learning with word embeddings, creating a representation of character sequences that allows predicting strings [37, 42]. However, such aspects will be considered in future work.

  • Even though the patterns defined in this work (Tables 1, 2) could also be applied to many sentences in English, in some cases the Spanish language requires lexical items in a specific order to follow correct grammar. Such is the case of the order of adjectives and nouns in a sentence: in Spanish, an adjective usually goes after the noun, whereas in English it goes before it (for example, “carro rojo” in Spanish corresponds to “red car” in English).

  • Another challenge in the Spanish language is that the subject noun may be omitted from a sentence (ellipsis). For example, in the sentence “estuve en Paris/I was in Paris”, there is no explicit noun in the Spanish version; this type of sentence cannot be processed by our proposal because the patterns always start with a determiner or a noun.

  • While we focused on a prototype implementation, the tagging and linking modules (for entities and relations) can be replaced with more robust and/or precise strategies. For example, disambiguation methods depending on the characteristics of the KB (e.g., hierarchy, types, mentions) [25], and property linking strategies based on machine learning [32], to mention a few.

  • Regarding the computational complexity, the representation strategy works in polynomial time. Given a text (dataset), its set of sentences (n) is determined. Next, for each sentence, its set of entities (m) is determined, and for each entity, a set of queries (p) is performed. From the above, the complexity of the algorithm is \(O(n \times m \times p)\). In the worst case, each word in a sentence is an entity, and for each of the entities, the set of queries to DBpedia is performed.

  • Regarding the visualization, Fig. 4 shows a graph created with GraphViz for an example entity, which was done only to briefly represent the data obtained. On the other hand, higher dimensional graphs can be navigated with Protégé, which can be useful in, for example, the task of automatic ontology learning. However, for the next version of the system, we will explore Web of Data tools or strategies that would help us to visualize large KGs. For example, KG-visual [11] or LDViz [28], to mention a few.

It is worth mentioning that a characterization of entities and relations would be helpful for a fair comparison between approaches. For example, Rosales et al. [34] provide a categorization of named entities so that EL systems can be compared according to the type of entities they extract. Such an aspect may benefit the extraction and evaluation of semantic relations as well as the assessment of RDF triples. Moreover, there are different descriptions of semantic relations according to, for example, the number of elements (binary, n-ary) and the type of relation (e.g., composition, inheritance, aggregation, and association). However, a categorization of semantic web relations is still required, considering several factors such as open relations for real-world objects, different languages and linguistic features, and relations attached to certain types of entities, to mention a few. Moreover, a way to map such relations to properties and vocabularies using well-defined standards (RDFS, OWL) is still needed [25].

Conclusions

This paper presented a method for constructing knowledge graphs from unstructured text in Spanish. The method consists of four phases: preprocessing the text, extracting entities and semantic relations, linking, and representing RDF triples. One of the most critical phases is the extraction of entities and semantic relations through existing NLP frameworks and the matching of lexical patterns. Such patterns take into account the occurrence of nouns followed by other grammatical elements, resulting in the extraction of simple entities, multiword entities (with a sequence of nouns and concepts), specialized terms, among others. The proposed method is capable of identifying and extracting non-taxonomic, taxonomic, equivalence, and structural relations.

The experiments evaluated the extraction of KG components linked to DBpedia, verifying the number of extracted elements and their validity in terms of Precision, Recall, and F-1 measures. The results demonstrated the capabilities of the proposed method to integrate NLP tasks in a pipeline for the generation of KGs, and also that it outperforms an adapted version of an approach that processes text in English. We also highlight the following findings:

  • Processing Spanish text is complicated due to variations in the language regarding the core words that define the knowledge components. For example, words describing verb tenses and noun variations (noun phrases, gender, grammatical number) hinder the lexical pattern generation.

  • The use of patterns for identifying entities and semantic relations is useful for matching such components in a general domain. Thus, we can replace the whole set of patterns, or add more patterns to the existing list, if the input text is in another language or information domain. This option allows the strategy to remain general for processing text in no particular domain, or to be adapted to a particular one.

  • We found the need to categorize semantic relations (both for English and Spanish) to support the evaluation of approaches with certain information requirements.

As future work, we plan to test different extraction component alternatives to get a pipeline composed of more precise tools and thus, improve the accuracy of the proposed strategy. We also want to propose a categorization of semantic relations to create a Spanish benchmark to evaluate this and similar approaches. Finally, we plan to use the extracted information (KG) in a knowledge-based system that supports tasks such as information retrieval, classification, and question answering.