RDF-F: RDF Datatype inFerring Framework

In the context of RDF document matching/integration, the datatype information attached to literal objects is an important aspect to analyze in order to better determine similar RDF documents. In this paper, we present an RDF Datatype inFerring Framework, called RDF-F, which provides two independent datatype inference processes: 1) a four-step process consisting of (i) a predicate information analysis (i.e., deducing the datatype from an existing range property), (ii) an analysis of the object value itself by a pattern-matching process (i.e., recognizing the object lexical space), (iii) a semantic analysis of the predicate name and its context, and (iv) a generalization of Numeric and Binary datatypes to ensure the integration; and 2) a non-ambiguous lexical-space-matching process, where datatypes are inferred after modifying the representation of literal values to follow new lexical spaces. We evaluated the performance and the accuracy of both processes with datasets from DBpedia. Results show that the execution time of both processes is linear and that their accuracy can reach up to 97.10% and 99.30%, respectively.


Introduction
One of the main benefits offered by the Semantic Web initiative is the increased support of data sharing and the description of real resources on the Web, by defining standard data representation models such as RDF, the Resource Description Framework. Particularly, heterogeneous RDF documents can express similar concepts using different vocabularies. Hence, many efforts focus on describing the similarity between concepts, properties, and relations to support RDF document matching/integration [1,3,22].
Indeed, RDF describes resources as triples $\langle subject, predicate, object \rangle$, where subjects, predicates, and objects are all resources identified by IRIs. Objects can also be literals (e.g., a number, a string), which can be annotated with optional type information, called datatype. A datatype is a classification of data; RDF adopts its datatypes from XML Schema [25]. There are two classes of datatypes: simple and complex. Simple datatypes can be primitive (e.g., boolean, float), derived (e.g., long and int, derived from decimal), or user defined, the latter being built from primitive and derived datatypes by constraining some of their properties (e.g., range, precision, length, format). Complex datatypes contain elements defined as either simple or complex datatypes.
The W3C Recommendation (proposed in [24]) points out the importance of datatype annotations to detect entailments between objects that have the same datatype, but a different value representation. For example, if we consider two distinct triples containing the objects "20.000" and "20.0", these objects are considered as different, because of the missing datatype. However, if they were annotated as "20.000"^^xml:decimal and "20.0"^^xml:decimal, then one can conclude that both objects are identical. Works on XML Schema matching proved that the presence of datatype information, constraints, and annotations on an object improves the similarity between two documents (up to 14%) [2]. Moreover, recent studies in the context of XML/RDF document matching have performed an analysis of datatypes to increase the compatibility/integration among data [9,18,23,29]. However, a huge quantity of RDF documents is incomplete or inconsistent in terms of datatypes [15,27]. Hence, when datatypes are missing, datatype inference emerges as a new challenge in order to obtain more accurate RDF document matching results.
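The effect of the annotation can be sketched in a few lines of Python. This is only an illustration, assuming that the decimal value space can be approximated by Python's `decimal.Decimal`; the function names `plain_equal` and `decimal_equal` are hypothetical and do not belong to any RDF library.

```python
from decimal import Decimal


def plain_equal(a: str, b: str) -> bool:
    """Compare two untyped literals: only their character strings are available."""
    return a == b


def decimal_equal(a: str, b: str) -> bool:
    """Compare two literals annotated as decimal by mapping them to values first."""
    return Decimal(a) == Decimal(b)


print(plain_equal("20.000", "20.0"))    # False
print(decimal_equal("20.000", "20.0"))  # True
```

Without the datatype, only the string comparison is possible, which is why the two objects are treated as different.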
Two categories of approaches have been proposed in the literature to cope with datatype inference:
• Several theoretical studies were conducted in the context of XML Schema Definition (XSD) [6,7,14]. They mainly infer simple datatypes through a hierarchy among the candidate datatypes obtained by a pattern-matching process on the format of the values, i.e., the characters that make a datatype unique, which is called the lexical space according to the W3C Recommendation [25]. These works consider a limited number of simple datatypes (e.g., date, decimal, integer, boolean, and string) and choose the most specific datatype among the candidates. However, datatypes such as gYear (e.g., 1999) cannot be determined (since such a value is identified as an integer). Other theoretical studies in the context of programming languages and OWL have focused on inferring complex datatypes through axioms, assigned operations, and inference rules [11,16,26], without considering simple datatypes.
• Many tools available on the Web infer datatypes mainly by a pattern-matching process over the lexical spaces. As their inference criteria are unknown, each tool provides different datatypes for the same XML data.
Thus, in the context of RDF document matching/integration, current works are not suitable mainly for three reasons:
1. Hierarchy-based methods [6,7,14] cannot infer all simple datatypes, since they consider a reduced set of datatypes and there are intersections between datatype lexical spaces (e.g., 1999 can be an integer or a gYear according to the lexical space of both W3C datatypes);
2. Complex datatype inference methods [4,5,11,16,31] cannot be applied to simple datatypes, since in RDF, a simple datatype is an atomic value associated with a predicate; and
3. Tools available on the Web [12,21,32] have unknown criteria of inference, providing different datatypes for the same data.
In a previous work, we proposed a framework that considers, in addition to the lexical space analysis, the analysis of the predicate information related to the object [10]. It consists of four steps: (i) analysis of predicate information, such as the range property that defines and qualifies the type of the object value; (ii) analysis of the lexical space of the object value, by a pattern-matching process; (iii) semantic analysis of the predicate and its semantic context, which consists in identifying related words or synonyms that can disambiguate two datatypes with similar lexical spaces; and (iv) generalization of Numeric and Binary datatypes, to ensure a possible integration among RDF documents. Although our previous approach was able to detect many cases, some are still undetected. For example, according to this proposal, string, decimal, and base64Binary are candidate datatypes for the literal value "1". Moreover, only primitive datatypes were considered in that study, although derived datatypes are also part of simple datatypes; thus, the inference is incomplete for simple datatypes. For that reason, we extend our datatype inference framework, called RDF Datatype inFerring Framework (RDF-F), by proposing a new process based on the modification of the existing literal values through new non-ambiguous lexical spaces, as an alternative to the four-step process, to infer simple datatypes (primitive and derived). We focus on eliminating the ambiguity among lexical space representations, due to the high performance obtained in our previous study.
In summary, this study has the following contributions:
1. New lexical space representations based on the ones proposed by W3C;
2. A modification of the Apache Jena sources to support the new lexical representation of datatypes and to keep the interoperability among existing Semantic Web services; and
3. An exhaustive validation of our framework through more experiments for the Semantic Web.
Additionally, we detail our proposal presented in [10], by adding more technical information. This paper is organized as follows: Sect. 2 presents a motivating scenario to illustrate the importance of datatypes. Section 3 surveys the related literature. Definitions and RDF terminologies are presented in Sect. 4. Section 5 describes our inference approach. A complexity analysis is presented in Sect. 6. Section 7 shows the experiments to evaluate the accuracy and performance of our approach. Finally, we present conclusions in Sect. 8.

Motivating Scenario
In order to illustrate the importance of datatype information in RDF document matching, we consider a scenario in which we show the need to integrate three RDF documents with similar concepts (resources), but based on different vocabularies. Figure 1 shows three concepts from three different RDF documents. Figures 1a and 1b describe the concept Light Switch, with a property (predicate) isLight, whose datatype is boolean. However, they are represented with different lexical spaces: binary lexical space with a value 1 in Fig. 1a and string lexical space with a value true in Fig. 1b. In both cases, the isLight property expresses the state of the light switch (i.e., turned on or turned off). Figure 1c shows the concept Light Bulb, with a property Light, whose datatype is float, and a property weight with datatype double. For their integration, it is necessary to analyze the information of related concept properties. Intuitively, considering the datatype information, one can say that:
1. Both Light Switch concepts from Fig. 1a
When the datatype information is missing and the integration is made only based on literals, several problems may emerge related to the ambiguity of properties. Contrary to our intuition, concepts in Fig. 1a, b are incompatible because of the use of different lexical spaces (i.e., value 1 is not compatible with the value true, which can be considered as a string datatype instead of boolean). Moreover, the integration of concept Light Switch from Fig. 1a
Thus, an approach capable of inferring the datatype from the existing information is needed when one needs to match or integrate RDF documents.
In the following section, we survey existing works on datatype inference. We highlight their limitations and discuss their possible applications on RDF document matching/integration.

Related Work
To the best of our knowledge, no prior work manages simple datatype inference for RDF documents. However, datatype inference has been addressed by theoretical approaches in other contexts, such as XML Schema Definition (XSD) [6,7,14], programming languages [4,5,11,16,31], and OWL [13,19,26,28]. Moreover, we have reviewed some tools available on the Web for XSD inference. The following sections describe the theoretical approaches and the reviewed tools.

Theoretical Approaches
Grouping similar solutions, we classify the existing works into three groups: hierarchy-based approaches, where datatypes are inferred using hierarchies over candidate datatypes obtained by a matching process of the lexical spaces; function-based approaches, where axioms, operations, and constructions are used to infer datatypes in the context of programming languages; and knowledge-based approaches, which use probabilities among candidate datatypes and external services to provide an analysis of local information. We describe the groups in the following sections.

Hierarchy-Based Approaches
In the inference of XSD from XML documents, the authors in [14] reduce the datatypes to a small set of values (date, decimal, integer, boolean, and string). They propose a hierarchy between the reduced datatypes according to the lexical spaces of the W3C Recommendation (see Fig. 2). The proposal returns the most specific datatype that subsumes the candidate datatypes obtained by the pattern matching of the values. However, a gYear value is reduced to integer, which is incorrect. In the same context, the author of [6,7] proposes an inference method based on a hierarchy applied to a set of candidate datatypes. This set contains some derived datatypes of the numeric group, such as nonNegativeInteger, unsignedInteger, and unsignedShort. In this case, the smallest datatype is chosen. For example, for a literal value 1999, whose datatype is gYear, the smallest among the candidate datatypes is unsignedShort, according to the hierarchy shown in Fig. 3. Table 1 shows the lexical spaces of simple datatypes according to the W3C.

Function-Based Approaches
In the context of programming languages, the authors in [11] focus on inferring complex datatypes, modeling them as a collection of constructor, destructor, and coercion functions. Other works [16,31] also use axioms and pattern matching over the constructors of the datatype during the inference process. In [4,5], operations and a syntax associated with datatypes are analyzed to infer complex datatypes. Simple datatypes such as date and integer are mainly inferred by a pattern-matching process of the value format using the lexical spaces. However, simple datatypes whose lexical spaces intersect, such as gYear and integer, cannot be inferred using this pattern-matching process.

Knowledge-Based Approaches
In the context of OWL, the authors in [26] propose a rule-based method to heuristically generate datatype information by exploiting axioms in a knowledge base. They assign datatype probabilities to the assertions. In the domain of health care, the work presented in [28] proposes a datatype recognition approach (inference type) by associating a weight with each predicate, using support vector machines and by building a dictionary to map instances. For [20], the Semantic Web needs an incremental and distributed inference method due to the large size of ontologies. The authors use a parallel and distributed process (MapReduce) to "reduce" the "map" of new inference rules. The authors in [19] state that DBpedia only provides 63.7% of datatype information. Hence, they propose an approach to discover complex datatypes in RDF datasets by grouping entities according to the similarity between incoming and outgoing properties. They also use a hierarchical clustering and the confidence of types for an entity. Although the use of knowledge and inference rules can infer datatypes where specific information is known (e.g., type of properties, knowledge database), RDF data are not always available with their respective ontology, which makes the task of detecting rules impossible. In [13], the authors analyze two types of predicates: object property (semantic type, e.g., dbr:Barack_Obama) and datatype property (syntactic type, e.g., xsd:string). They propose an approach to infer the semantic type of string literals using Stanford CoreNLP for word detection, to identify the principal term, and the UMBC semantic similarity service to discover the semantic class. However, a semantic type is not always related to the same datatype, since it depends on the datatype defined in the structure. For example, the same data can be expressed as a string or integer according to two different ontologies.

Fig. 3 Hierarchical structure to infer datatypes. The most general datatype is string, while the most specific one is unsignedByte [6]

Tools
Several tools that generate XSD from XML documents are provided in the literature to infer the type of data from existing values (lexical spaces), such as XMLgrid [32], FreeFormatted [12], and XmlSchemaInference by Microsoft [21]. However, they do not share a standard process to infer datatypes. For example, the attributes weight and isLight from the following XML document extracted from Fig. 1 have different inferred datatypes according to these three tools:
• XMLgrid infers weight as double and isLight as int;
• FreeFormatted infers weight as float and isLight as byte;
• XmlSchemaInference infers weight as decimal and isLight as unsignedByte.
Without known criteria, these existing tools cannot be adopted to properly infer the datatypes of RDF resources. Table 2 summarizes the existing approaches according to our criteria. Note that only our previous work satisfies all the defined requirements. However, derived datatypes were not considered.
The following sections describe some RDF terminologies and definitions in order to formalize our datatype inference framework before presenting our proposal in detail.

RDF Terminologies and Definitions
RDF is the common format to describe resources that represent the abstraction of an entity (document, abstract concept, person, company, etc.) in the real world. RDF uses IRIs, blank nodes, and literal nodes as elements to build triples and provide relationships among resources.
The RDF Schema (RDFS) is a set of classes with certain properties (vocabulary), which are extensions of the basic RDF vocabulary [8]. RDFS defines properties to better describe resources. For example, the rdfs:domain property designates the type of subject that can be associated with a predicate and the rdfs:range property designates the type of object. The Semantic Web proposes an implicit representation of the datatype property in the literal object as a description of the value (e.g., "value"^^xml:string). Definition 1 presents the formal definition of a simple datatype according to W3C [17].
Definition 1 Simple datatype (dt): In RDF, a simple datatype, denoted as dt, is characterized by: (i) a value space, denoted as VS(dt), which is a nonempty set of distinct valid values; (ii) a lexical space, denoted as LS(dt), which is a non-empty set of Unicode strings; and (iii) a total mapping from the lexical space to the value space, denoted as L2V(dt) [17].
The datatype boolean from Fig. 1a has the following characteristics:
Table 3 presents several sets of RDF elements that we use in our formal approach description in the following sections.
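As a rough illustration of Definition 1, a simple datatype can be modeled as a record holding its three components. The sketch below uses the W3C boolean datatype, whose lexical space is the finite set {true, false, 1, 0}; the class name `SimpleDatatype` and its field names are hypothetical, and a finite set is only workable because boolean is so small (most lexical spaces are infinite and would need a pattern instead).

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class SimpleDatatype:
    """A datatype dt as the triple (VS(dt), LS(dt), L2V(dt)) of Definition 1."""
    name: str
    value_space: FrozenSet[object]    # VS(dt): the valid values
    lexical_space: FrozenSet[str]     # LS(dt): the Unicode strings that denote them
    l2v: Callable[[str], object]      # L2V(dt): lexical-to-value mapping


BOOLEAN = SimpleDatatype(
    name="xsd:boolean",
    value_space=frozenset({True, False}),
    lexical_space=frozenset({"true", "false", "1", "0"}),
    l2v=lambda lexical: lexical in ("true", "1"),
)

# The mapping is total: every lexical form maps to a value in the value space.
for lexical in sorted(BOOLEAN.lexical_space):
    print(lexical, "->", BOOLEAN.l2v(lexical))
```

Under this model, the values 1 (Fig. 1a) and true (Fig. 1b) map to the same element of the value space once both are known to be boolean.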
As mentioned before, RDF links resources through IRIs, blank nodes, and literal nodes, building atomic structures called triples or statements. A triple is defined as follows.
Definition 2 Triple (t) [30]: A triple is defined as an atomic structure consisting of a 3-tuple with a subject (s), a predicate (p), and an object (o), denoted as $t : \langle s, p, o \rangle$, where:
• $s \in I \cup BN$ represents the subject to be described;
• p is a property defined as an IRI in the form namespace prefix : property name, where namespace prefix is a local identifier of the IRI in which the property (property name) is defined.
The predicate (p) is also known as the property of the triple.
The example presented in Fig. 1 underlines four triples with different RDF resources, properties, and literals, as follows: In the following section, we describe our datatype inference framework.

RDF-F: Our Inference Process Approach
Our datatype inference approach relies on two independent processes: (i) a four-step analysis, where the literal values and the predicates related to them are analyzed syntactically and semantically, and (ii) a non-ambiguous lexical space matching, where the initial literal values are modified according to new lexical space representations and inferred by a matching process. Figure 4 shows the two processes of our inference framework. The input of our framework is an RDF Description, which can be represented in different serialization formats (such as RDF/XML, Turtle, N3), together with the user parameters (i.e., type of process, inference steps, and their order). The output is an RDF Description with its respective inferred datatypes.

RDF-F: Four-Step Process
This process considers the annotations on the predicate, the specific format of literal object values, the semantic context of the predicate, and the generalization of datatype for Numeric and Binary groups. Each step can be applied independently and in different orders according to user parameters.
A description of each step is presented as follows.

Predicate Information Analysis (Step 1)
In a triple $t : \langle s, p, o \rangle$, the predicate p establishes the relationship between the subject s and the object o, making the object value o a characteristic of s. Information (properties) such as rdfs:domain and rdfs:range can be associated with each predicate to determine the type of subject and object, respectively. To deduce the simple datatype of a particular literal object, we propose to inspect the property rdfs:range, when it exists. We formally describe Step 1 with the following definitions and rule.
Definition 3 Predicate Information (PI): Given a triple $t : \langle s, p, o \rangle$, the Predicate Information is a function, denoted as $PI(t)$, that returns a set of triples $t_i$, where:
• $p_i$ is an RDF defined property $\in \{$rdfs:type, rdfs:label, rdfs:range$\}$;
• $o_i$ is the object of $t_i$.
Table 4 shows the set of triples returned by the function Predicate Information, when applied on the property dbp:weight presented in Fig. 1c.
Definition 4 Predicate Range Information (PRI): Given a triple $t : \langle s, p, o \rangle$, the Predicate Range Information is a function, denoted as $PRI(t)$, that returns the value associated with the rdfs:range property, defined as:
$$PRI(t) = \begin{cases} o_i & \text{if } \exists\, t_i \in PI(t) \text{ whose property } p_i \text{ is rdfs:range} \\ \text{null} & \text{otherwise} \end{cases}$$
Applying Definition 4 to the set of Predicate Information (PI) of the property dbp:weight (see Table 4), the Predicate Range Information function returns the value xsd:double.
Definition 5 IsAvailable (IA): Given a predicate p, IsAvailable is a boolean function, denoted as $IA(p)$, that verifies if p is an IRI available on the Web.
Using the three previous definitions, we formalize our first inference rule. As an input, the algorithm receives the triple $t : \langle s, p, o \rangle$, whose object datatype is to be determined. If the IRI representing the predicate exists (line 1, Definition 5), the link is explored to extract all available information as triples (line 2, Definition 3). For example, if dbp:weight (more specifically, http://www.dbpedia.org/ontology/weight) is the predicate, we can get the list of triples shown in Table 4, where each row represents a triple. If the property rdfs:range appears among these triples, then its associated object value, which is the datatype, is returned (lines 3 to 5). Otherwise, an unknown datatype is returned (line 7, Definition 4).
As shown in Fig. 1c, the output of the algorithm will be the datatype xsd:double.
This algorithm examines external information and is independent of the query language. Rule 1 can be implemented as a simple SPARQL query, such as: where ?subject ?predicate ?literal are the triple to be analyzed; ?predicate is the analyzed predicate; and ?datatype is the returned result.
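The logic of Step 1 can be sketched in Python over an in-memory list of triples. This is a minimal sketch under stated assumptions, not the authors' implementation: the list `PREDICATE_DESCRIPTIONS` is toy data standing in for what dereferencing the predicate IRI might return, the subject `ex:LightBulb` is hypothetical, and availability (Definition 5) is approximated by checking whether any description exists locally instead of performing a Web lookup.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
ALLOWED_PROPERTIES = {"rdfs:type", "rdfs:label", "rdfs:range"}

# Toy data (assumed): triples describing the predicate dbp:weight.
PREDICATE_DESCRIPTIONS: List[Triple] = [
    ("dbp:weight", "rdfs:label", "weight"),
    ("dbp:weight", "rdfs:range", "xsd:double"),
]


def is_available(predicate: str, store: List[Triple]) -> bool:
    """IA(p), approximated locally: is there any description of the predicate?"""
    return any(subject == predicate for subject, _, _ in store)


def predicate_information(predicate: str, store: List[Triple]) -> List[Triple]:
    """PI(t): triples about the predicate restricted to the allowed properties."""
    return [t for t in store if t[0] == predicate and t[1] in ALLOWED_PROPERTIES]


def predicate_range_information(predicate: str, store: List[Triple]) -> Optional[str]:
    """PRI(t): the object of the rdfs:range triple, or None when there is none."""
    for _, prop, obj in predicate_information(predicate, store):
        if prop == "rdfs:range":
            return obj
    return None


def rule1(triple: Triple, store: List[Triple]) -> Optional[str]:
    """Return the datatype declared by rdfs:range, or None for 'unknown'."""
    _, predicate, _ = triple
    if not is_available(predicate, store):
        return None
    return predicate_range_information(predicate, store)


print(rule1(("ex:LightBulb", "dbp:weight", "1.5"), PREDICATE_DESCRIPTIONS))  # xsd:double
```

Returning `None` for the unknown case keeps the step composable: a caller can fall through to the next step whenever no range is declared.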
The following step analyzes the lexical space of a literal object in order to infer its datatype.

Datatype Lexical Space Analysis (Step 2)
According to Definition 1, a datatype is a 3-tuple consisting of: (i) a set of distinct valid values, called value space; (ii) a set of lexical representations, called lexical space; and (iii) a total mapping from the lexical space to the value space. In some cases, the datatype can be inferred from its lexical space, when it is uniquely formatted (e.g., value 1999-05-31 matches the format CCYY-MM-DD, which is the lexical space of datatype date). However, in other cases (such as boolean, gYear, decimal, double, float, integer, base64Binary, and hexBinary), the lexical spaces of datatypes have common characteristics, leading to a certain ambiguity (e.g., value 1999 matches the lexical spaces of both gYear and float; see Table 1). Figure 5 illustrates graphically the lexical space intersections of W3C simple datatypes (primitives and integer).
To compare the datatype lexical spaces with the literal values, we establish an order based on the lexical spaces intersections (from a general lexical space to a specific one).
To analyze the lexical spaces, we propose the following definition. Based on this definition, we formally define our second inference rule as follows. Rule 2 analyzes the number of possible datatypes of a literal object value. The order to analyze the lexical space of each datatype is established by the lexical space intersections. In all cases, the datatype string is a candidate datatype, since it has the most general lexical space (see Fig. 5); if the number of candidate datatypes is one, then the only datatype, which is string, is returned. If the number of candidate datatypes is two, then the other datatype is returned. Otherwise, we have an ambiguous case and any datatype, different from string, can be provided. Hence, the inference process remains incomplete due to the ambiguous cases and further analysis is needed.
A pseudo-code following the definition of Rule 2 is proposed in Algorithm 2. The algorithm receives a triple $t : \langle s, p, o \rangle$ and returns the datatype that can be associated with the object. An initial list is initialized with the datatype string, because any object value is a string (line 1). According to the lexical spaces defined by the W3C (see Table 1), the list of candidate datatypes is generated by a pattern-matching process (line 2 in Algorithm 2, Definition 6) following the order obtained from the lexical space intersections. If the number of candidate datatypes is more than 2, we are under an ambiguous case, since the literal value matches several datatype lexical spaces (lines 3-4 of Algorithm 2). If we have only string as a candidate datatype, then this is the returned information (line 7 of Algorithm 2). If we get two candidate datatypes, one of them is the string datatype and the other one is the datatype returned for the object value (line 9 of Algorithm 2).
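The counting logic of Rule 2 can be sketched as follows. This is an illustration only: the regular expressions are simplified approximations of a small subset of the W3C lexical spaces (for instance, time zones are ignored), the order of `ORDERED_PATTERNS` is an assumption and not the order derived from Fig. 5, and the function names are hypothetical.

```python
import re
from typing import List, Optional

# Simplified lexical spaces for a subset of datatypes (assumed patterns).
ORDERED_PATTERNS = [
    ("xsd:decimal", r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)"),
    ("xsd:integer", r"[+-]?[0-9]+"),
    ("xsd:gYear", r"-?[0-9]{4,}"),
    ("xsd:boolean", r"true|false|1|0"),
    ("xsd:date", r"-?[0-9]{4,}-[0-9]{2}-[0-9]{2}"),
]


def candidate_datatypes(value: str) -> List[str]:
    """Collect every datatype whose lexical space matches the whole value."""
    candidates = ["xsd:string"]  # any literal value is a valid string
    for name, pattern in ORDERED_PATTERNS:
        if re.fullmatch(pattern, value):
            candidates.append(name)
    return candidates


def rule2(value: str) -> Optional[str]:
    """Return the inferred datatype, or None when the case is ambiguous."""
    candidates = candidate_datatypes(value)
    if len(candidates) > 2:
        return None              # several lexical spaces match: ambiguous
    if len(candidates) == 1:
        return "xsd:string"      # only string matched
    return candidates[1]         # string plus exactly one other datatype


print(rule2("1999-05-31"))    # xsd:date
print(rule2("Light Switch"))  # xsd:string
print(rule2("1999"))          # None
```

With these patterns, the value 1999 matches decimal, integer, and gYear at once, which reproduces the ambiguous case that the following steps try to resolve.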
The following step analyzes semantically the predicate of the literal object through the definition of context rules.

Predicate Semantic Analysis (Step 3)
In the presence of ambiguous cases (unknown), a semantic analysis of the predicate needs to be performed. The predicate name can define the context of the information in a scenario where the data are consistent. Regarding the W3C datatype lexical spaces, the datatypes boolean, gYear, decimal, double, float, integer, base64Binary, and hexBinary are ambiguous. However, the ambiguity of boolean, gYear, and integer, in some specific scenarios, can be resolved by examining the context of its predicate according to a knowledge base.
For example, the predicate dbp:dateOfBirth has the context date, and then it is possible to assume gYear as the datatype; the predicate dbp:era has the context period and the datatype assigned can be integer; however, for predicate dbp:salary, it is possible to assign datatypes decimal, double, or float; the ambiguous case persists.
In order to describe our inference process in this step, we formalize a knowledge base as follows:
Definition 7 Knowledge Base (KB): A knowledge base (thesaurus, taxonomies, and ontologies) provides a framework to organize entities (words/expressions, generic concepts, etc.) into a semantic space, which is capable of capturing meaning. Our knowledge base has the following defined functions:
• Similarity (sim): Given two entities n and m, Similarity is a function, denoted as $sim(n, m)$, that returns the value of the relation between both entities: $sim(n, m) =$ a relation value $\in [0, 1]$ between n and m according to KB.
• IsPlural (IP): Given an entity n, IsPlural is a function, denoted as $IP(n)$, that returns True if the entity n is plural: $IP(n) =$ True if n is plural according to KB; False otherwise.
• IsCondition (IC): Given an entity n, IsCondition is a function, denoted as $IC(n)$, that returns True if the entity n is a condition: $IC(n) =$ True if n is a condition according to KB; False otherwise.
In this scenario, our knowledge base is reduced to the relations among words.
The semantic context is formalized, based on the knowledge base, as follows:
Definition 8 Context (ct): A context is a related word (synonym), which clarifies or generalizes the domain of a word. It is associated with a relation value according to a Knowledge Base. A context is denoted as a 3-tuple $ct : \langle w, y, v \rangle$, where w is a word; y is a related word of w; and v is a relation value between w and y, $sim(w, y) \in [0, 1]$.
Definition 9 Set of contexts (CT): Given a word w, its set of contexts is defined as $CT = \{ct_i \mid ct_i : \langle w, y_i, v_i \rangle \text{ is a context of } w\}$.
Definition 10 Predicate Context (PC): Given a triple $t : \langle s, p, o \rangle$ and a threshold h, Predicate Context is a function, denoted as $PC(t, h)$, that returns a set of contexts defined as:
The context can determine the datatype for some literal objects through a semantic analysis, and we assume three scenarios for an ambiguous case:
• If date is in the context (e.g., $\langle word, date, 0.5 \rangle$, with $h = 0.5$) and the literal value is a number (e.g., 1999), then the datatype is gYear because gYear (1999) is a part of datatype date (1999-05-31);
• If period is in the context (e.g., $\langle word, period, 0.5 \rangle$, with $h = 0.5$) and the literal value is a number (e.g., 3 months), then the datatype is integer because it is about quantity;
• However, if the context is date, the word from which we obtain the context cannot be plural, since plural words express quantities. Thus, in this case the word is related to the datatype integer according to our scenarios.
Definition 11 generalizes our scenarios to assign a datatype to a literal object, according to the context of its corresponding predicate name. In addition, to determine a datatype as boolean, we assume that a word is defined as a condition in a knowledge base (e.g., WordNet).
Using the previous definitions, we formally define our third inference rule. Rule 3 returns the datatype of the object value when a defined context associated with the predicate exists. If that is not the case, we are still under an ambiguous case. Note that Rule 3 is proposed for a scenario where the data are consistent with the W3C Recommendations (e.g., self-descriptive names).
Algorithm 3 is a pseudo-code of our semantic analysis step. The algorithm receives the triple $t : \langle s, p, o \rangle$ to be analyzed. For the analysis of the predicate name, an external service is required in order to obtain the synonyms of the predicate name, called contexts (line 3 in Algorithm 3). If more than one defined context is available in the set of contexts (Definition 10), the algorithm returns the one which has the highest similarity value (line 20 in Algorithm 3). An unknown datatype is returned if no defined context is present.
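One possible reading of the three scenarios and of the boolean assumption is sketched below. It is not Algorithm 3 itself: the dictionaries `CONTEXTS`, `PLURAL_WORDS`, and `CONDITION_WORDS` are a hypothetical toy knowledge base replacing the external service, the similarity values are invented for illustration, and the predicate name is assumed to be already stripped of its namespace prefix.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy knowledge base: predicate name -> [(related word, similarity)].
CONTEXTS: Dict[str, List[Tuple[str, float]]] = {
    "dateOfBirth": [("date", 0.9)],
    "era": [("period", 0.8), ("date", 0.6)],
    "years": [("date", 0.7)],
    "salary": [("money", 0.7)],
}
PLURAL_WORDS = {"years"}          # stands in for IP(n)
CONDITION_WORDS = {"isLight"}     # stands in for IC(n)
DEFINED_CONTEXTS = {"date", "period"}


def predicate_context(name: str, threshold: float) -> List[Tuple[str, float]]:
    """PC(t, h): contexts of the predicate name whose similarity reaches h."""
    return [(word, value) for word, value in CONTEXTS.get(name, []) if value >= threshold]


def rule3(name: str, literal: str, threshold: float = 0.5) -> Optional[str]:
    """Return a datatype from the predicate context, or None if still ambiguous."""
    if name in CONDITION_WORDS and literal in {"0", "1", "true", "false"}:
        return "xsd:boolean"
    if not (literal.isascii() and literal.isdigit()):
        return None               # the scenarios only cover numeric literals
    defined = [c for c in predicate_context(name, threshold) if c[0] in DEFINED_CONTEXTS]
    if not defined:
        return None               # no defined context: ambiguity persists
    best_word, _ = max(defined, key=lambda c: c[1])  # highest similarity wins
    if best_word == "date" and name not in PLURAL_WORDS:
        return "xsd:gYear"
    return "xsd:integer"          # period context, or a plural word with date context


print(rule3("dateOfBirth", "1999"))  # xsd:gYear
print(rule3("era", "3"))             # xsd:integer
print(rule3("salary", "1000"))       # None
```

The `salary` case returns `None` because its only context is neither date nor period, which corresponds to the persisting ambiguity among decimal, double, and float.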
The following step describes the generalization method for literal values that are part of Numeric and Binary groups.

Generalization of Numeric and Binary Groups (Step 4)
As an alternative to disambiguate the datatypes decimal, double, float, integer, base64Binary, and hexBinary, we propose two groups of datatypes: Numeric and Binary. In each group, we define a total order among the datatypes by considering lexical space intersection (see Fig. 5). According to these groups, we return the most general datatype, if all candidate datatypes belong only to one of these two groups.
Definition 12 Generalization (G): Given a literal object o, the set of its candidate datatypes is reduced by the function Generalization, defined as:
Note that datatype string is always part of candidate datatypes. We formally define our fourth inference rule as follows.
Rule 4 Datatype Generalization: Given a triple $t : \langle s, p, o \rangle$, in which $o \in L$, the datatype of o is determined as follows:
However, we can have a case where an object value has decimal and base64Binary as candidate datatypes because of similar value representations, and our inference approach cannot determine the most appropriate datatype.
Algorithm 4 is a pseudo-code of a possible implementation of Rule 4. The algorithm receives the triple $t : \langle s, p, o \rangle$ to be analyzed. The list of candidate datatypes is reduced by removing specific datatypes and keeping the most general ones (decimal and base64Binary) (line 2 in Algorithm 4). If the list of candidate datatypes has only one value, the datatype is string (line 4 in Algorithm 4); however, if there are two, the datatype is the second one (line 6 in Algorithm 4), since the first one is always string. If there are more than two datatypes, the ambiguity persists and this step is not able to produce a result.
The first process of RDF-F improves the datatype analysis for RDF matching/integration by complying with the identified requirements (see Sect. 3): (i) it uses local available information, such as the predicate value in Step 1 and Step 3 and the datatype lexical space in Step 2, as well as external available information, such as the predicate information in Step 1 and the predicate context in Step 3; and (ii) it is objective and complete for the Semantic Web, since all simple datatypes available in the most common Semantic Web databases are considered.
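The reduction described for Algorithm 4 can be sketched as below. The sketch assumes that the candidate list always starts with string and that decimal and base64Binary are the most general members of the Numeric and Binary groups, as stated above; the mapping `MOST_GENERAL` and the function names are hypothetical, and the full total order of each group (Fig. 5) is not modeled.

```python
from typing import Dict, List, Optional

# Each datatype of a group is mapped to the most general datatype of that group.
MOST_GENERAL: Dict[str, str] = {
    "xsd:decimal": "xsd:decimal",
    "xsd:double": "xsd:decimal",
    "xsd:float": "xsd:decimal",
    "xsd:integer": "xsd:decimal",
    "xsd:base64Binary": "xsd:base64Binary",
    "xsd:hexBinary": "xsd:base64Binary",
}


def generalize(candidates: List[str]) -> List[str]:
    """G: replace grouped datatypes by their most general one, without duplicates."""
    reduced: List[str] = []
    for datatype in candidates:
        general = MOST_GENERAL.get(datatype, datatype)
        if general not in reduced:
            reduced.append(general)
    return reduced


def rule4(candidates: List[str]) -> Optional[str]:
    """Return the generalized datatype, or None when the ambiguity persists."""
    reduced = generalize(candidates)
    if len(reduced) == 1:
        return "xsd:string"
    if len(reduced) == 2:
        return reduced[1]         # the first candidate is always string
    return None


print(rule4(["xsd:string", "xsd:integer", "xsd:decimal", "xsd:float"]))  # xsd:decimal
print(rule4(["xsd:string", "xsd:decimal", "xsd:base64Binary"]))          # None
```

The second call shows the case mentioned in Rule 4: candidates from both groups survive the reduction, so no single datatype can be chosen.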
In the following section, we present our second process where new lexical space representations, for simple datatypes based on the W3C, are proposed.

RDF-F: Non-ambiguous Lexical-Space-Matching Process
Even after the first process of RDF-F is performed, some datatypes may remain unknown. However, the inference process can be reduced to a simple matching between lexical spaces and the format of literal values if we associate a different representation with the lexical space of each simple datatype. To achieve this, we extend the W3C definitions to provide such representations. Table 5 summarizes the set of representations using regular expressions. For instance, float values have a lexical representation consisting of the letter "f" before a mantissa followed, optionally, by the character "E" or "e", followed by an exponent. The exponent must be an integer. The mantissa must be a decimal number. The representations for exponent and mantissa must follow the lexical rules for integer and decimal. If the "E" or "e" and the following exponent are omitted, an exponent value of 0 is assumed.
The special values positive and negative infinity and not-a-number have lexical representations INF, -INF, and NaN, respectively. Lexical representations for zero may take a positive or negative sign (e.g., f-1E4, f1267.43233E12, f12, f-0, f0). The corresponding regular expression of the datatype float representation is: f[+-]?([0-9]*[.])?(E|e)?[0-9]+. Using this simple solution, the lexical spaces become unique, and thus we can infer the simple datatypes by a lexical space matching. However, changing the lexical spaces requires changing the processing engine (e.g., Jena) in order to provide compatibility between the previous W3C lexical spaces and the new proposed ones.
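The prefixed float representation can be illustrated with a short Python sketch. The pattern below is written from the prose description (the letter "f", an optional sign, a decimal mantissa, and an optional integer exponent) rather than copied from Table 5, so it should be read as an assumption; the names PREFIXED_FLOAT, is_prefixed_float, and prefixed_float_value are hypothetical, and the special values INF, -INF, and NaN are deliberately left out.

```python
import re

# Assumed pattern: "f", optional sign, decimal mantissa,
# then optionally "E" or "e" followed by an integer exponent.
PREFIXED_FLOAT = re.compile(
    r"f[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[Ee][+-]?[0-9]+)?"
)


def is_prefixed_float(literal):
    """True if the whole literal follows the assumed prefixed float format."""
    return PREFIXED_FLOAT.fullmatch(literal) is not None


def prefixed_float_value(literal):
    """Strip the "f" prefix and parse the remainder as a Python float."""
    if not is_prefixed_float(literal):
        raise ValueError(f"not a prefixed float literal: {literal!r}")
    return float(literal[1:])
```

With this pattern, the example literals f-1E4, f1267.43233E12, f12, f-0, and f0 are accepted, while an unprefixed value such as 12 is rejected, which is what makes the lexical space non-ambiguous with respect to the other numeric datatypes.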
In the following section, we describe a complexity analysis of our framework.

Complexity Analysis
As the framework relies on two independent processes, two different temporal complexity analyses need to be performed. A complexity analysis of the first process of our inference approach indicates a linear order performance in terms of the number of triples (O(n)).
• In Step 1, the predicate information of each triple is extracted to search for the rdfs:range property. Since the number of properties associated with the predicate of each triple (Definition 3) is constant, the execution order of this step is O(n).
• In Step 2, for each triple a pattern matching is executed for all simple datatypes (finite number of executions); thus, it is of linear order (O(n)).
• In Step 3, for each triple, its set of contexts is extracted to determine the best related work (in a constant time); thus, its time complexity is also O(n).
• Finally, Step 4 reduces the finite set of candidate datatypes (generalization) in a linear order (O(n)).
As the four steps are executed sequentially, the whole first datatype inference process exhibits a linear order complexity, O(n). The second process of our approach also exhibits a linear order performance in terms of the number of triples (O(n)). This process is similar to Step 2 of the four-step process: each triple is analyzed by a pattern matching for all simple datatypes (finite number of executions).
The following section evaluates the accuracy and demonstrates the linear order performance of our proposal.

Experimental Evaluation
To evaluate and validate our inference approach, an online prototype system, called RDF2rRDF, 5 was developed using PHP and Java. Figure 6 shows the graphic user interface of the prototype, where the processes of inference can be selected according to user preferences.
For our four-step inference process, contexts in Step 3 were implemented using the semantic similarity service UMBC. 6 Also, we used WordNet 7 to recognize if a word is plural, assuming that every word has a root lemma where the default plurality is singular. Additionally, we assumed in our implementation that a word is a condition if it has the prefix "is" or "has". All these assumptions compose our knowledge base.
For our non-ambiguous lexical-space-matching process, we modified the Jena sources in order to support the new lexical space representations. 8 The idea is to provide compatibility between the previous lexical spaces and the new proposed ones through the modification of the Jena sources. New designs can adopt the proposed lexical spaces without losing their properties in existing services. Additionally, we implemented a tool that automatically modifies the RDF data according to the new lexical space representations. RDF data have to be annotated with their respective datatypes to produce consistent documents.
Different Semantic Web databases are currently available on the Web (e.g., DBpedia, WordNet, GeoLinked data). However, they do not provide enough variety of datatypes. (WordNet considers only strings, and GeoLinked data consider complex datatypes.) In the DBpedia dataset, only nine datatypes are present (integer, gYear, date, gMonthDay, float, nonNegative, double, Integer, and decimal). Consequently, we chose DBpedia as the dataset to perform our experiments.
Experiments were carried out on a MacBook Pro, 2.2 GHz Intel Core(TM) i7 with 16.00 GB, running macOS Sierra and using a Sun JDK 1.7 programming environment.
Our prototype was used to perform a large battery of experiments to evaluate the accuracy and the performance (execution time) of our approach in comparison with the related work. To do so, we considered two datasets:

Accuracy Evaluation
To evaluate the accuracy of our approach, we calculated the F-score, based on the Recall (R) and Precision (PR). These criteria are commonly adopted in information retrieval and are calculated as follows: where Valid is the number of correctly inferred datatypes; Invalid is the number of wrongly inferred datatypes; and Ambiguous is the number of datatypes not inferred by our inference approach. For our four-step process, in Case 1, we evaluated the accuracy and performance of each step, all the combinations (Step 1 + Step 2, Step 1 + Step 3, Step 2 + Step 3, Step 1 + Step 4, Step 2 + Step 4, Step 1 + Step 2 + Step 4), and the whole inference process. The order of the whole inference process was established starting from a general solution (Step 1), which can be applied to all simple datatypes, and ending with specific solutions for particular cases (Step 3 and Step 4). Extra experiments were performed in order to evaluate the accuracy with respect to the existing approaches, and to measure the behavior of the process when some datatypes are available in the RDF data. In Case 2, we only evaluated the whole four-step process, since the aim was to evaluate the execution time when having a high number of triples.
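The accuracy criteria introduced at the start of this subsection can be sketched in Python. Because the formulas themselves are not reproduced in this text, the definitions below are an assumption based on the usual information-retrieval forms and on the descriptions of Valid, Invalid, and Ambiguous: Precision as Valid / (Valid + Invalid), Recall as Valid / (Valid + Ambiguous), and the F-score as their harmonic mean. The function names are hypothetical.

```python
# Assumed definitions; the original formulas are not available in this text.
def precision(valid, invalid):
    """Share of inferred datatypes that are correct (assumed form)."""
    return valid / (valid + invalid)


def recall(valid, ambiguous):
    """Share of datatypes that were inferred correctly rather than left ambiguous (assumed form)."""
    return valid / (valid + ambiguous)


def f_score(valid, invalid, ambiguous):
    """Harmonic mean of Precision and Recall."""
    pr = precision(valid, invalid)
    r = recall(valid, ambiguous)
    return 2 * pr * r / (pr + r)
```

The sketch does not guard against empty inputs (for example, no inferred datatypes at all), which would need explicit handling in a real evaluation script.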
For our non-ambiguous lexical space process, the literal values of Case 1 were modified according to the new lexical space representations. To do so, we used the developed tool available on our online prototype. As a high number of triples are available in Case 2, we used this dataset to evaluate the performance of this inference process.
Step 2 + Step 4) show that Step 1 plays an important role during the inference process due to the lowest F-score value obtained when this step is not considered.
The best F-score was obtained with the whole inference process; however, the Precision decreased from 99.89% (Step 1) to 97.71% because of Step 3 and Step 4 (Precision 95.20% and 89.60%, respectively). Table 7 shows the Precision, Recall, and F-score for each datatype available in Case 1. In this table, the datatype date was not correctly inferred 7 times; however, according to the W3C Recommendation, its lexical space representation is unique and the datatype can be inferred by a simple lexical space matching; regarding the data, these 7 cases have the format YY-MM-DD instead of CCYY-MM-DD, which is the cause of the incorrect inferences (inconsistencies of the data).
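The effect of the two-digit year format can be illustrated with a simplified pattern for the date lexical space. The sketch below is an assumption-laden illustration rather than the matcher used in the prototype: XSD_DATE and matches_date_lexical_space are hypothetical names, the pattern only checks the shape of the literal (CCYY-MM-DD with an optional time zone) without validating month or day ranges, and the sample values are made up.

```python
import re

# Simplified shape of the xsd:date lexical space: optional "-", at least four
# year digits, "-MM-DD", and an optional time zone ("Z" or "+hh:mm" / "-hh:mm").
XSD_DATE = re.compile(r"-?[0-9]{4,}-[0-9]{2}-[0-9]{2}(?:Z|[+-][0-9]{2}:[0-9]{2})?")


def matches_date_lexical_space(literal):
    """True if the whole literal has the CCYY-MM-DD shape (ranges are not checked)."""
    return XSD_DATE.fullmatch(literal) is not None


# Hypothetical sample values: the second one uses the YY-MM-DD format.
well_formed = matches_date_lexical_space("1985-03-07")
two_digit_year = matches_date_lexical_space("85-03-07")
```

A literal in the YY-MM-DD format fails the match, so a purely lexical inference falls back to another candidate datatype, which is consistent with the 7 incorrect inferences reported for date.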
In Case 2, the Precision decreased to 76.01% due to the noise and inconsistencies of the DBpedia datasets [27] (e.g., dbo:deathDate should have the datatype property date, but in the queried datasets, it was set as gYear).
Test 2: We also evaluated the accuracy of our four-step process in comparison with alternative methods and tools, namely Xstruct [14], XMLgrid [32], FreeFormatted [12], and XMLMicrosoft [21]. Since these works infer datatypes in XML documents, we transformed all literal nodes to XML format by using the value and its relation. Table 8 shows the accuracy results obtained for Case 1. Note that our process has the best Precision and F-score. Our Recall is lower than that of the other tools because we consider a larger number of datatypes, and thus, there are more ambiguous cases (lexical space intersections).
Test 3: For Case 1, we performed an extra experiment to measure the behavior of our four-step inference process when part of the datatypes is missing (25%, 50%, and 75%). Table 9 shows the results obtained for this experiment. Precision, Recall, and F-score were measured with respect to the number of missing datatypes. Since each document has at most two occurrences of the same predicate, the results did not increase significantly. However, when the same predicate occurs a large number of times, the known datatype of a literal node is propagated to all the literal nodes associated with that predicate, leading to a better and easier inference.

Non-ambiguous Lexical-Space-Matching Process
Test 4: In Table 10, almost all simple datatypes were inferred with a high Precision and Recall (100.00% in both cases) for Case 1. However, due to the inconsistency of the data, the datatype date was considered as string in 7 cases, where the lexical space representation did not match

Performance Evaluation
To evaluate the performance of our inference processes, we measured the average time of 10 executions for each test.

Four-Step Inference Process
Test 5: Table 11 shows the results obtained in the performance evaluation of our four-step inference process. In Case 1, the execution time of Step 1 was greater than that of Step 2, because the use of external calls increased the execution time. However, the execution time of Step 1 + Step 2 was similar to that of Step 1, since Step 1 works as a filter of triples and leaves less analysis for Step 2. Step 3 has the greatest execution time, since it depends on an external service. Step 4 depends on the list of candidate datatypes; thus, its execution time is greater than that of Step 2 due to the extra operations needed to reduce the set of datatypes (generalization).
Test 6: Additionally, we implemented in Step 1 and Step 3 the use of cache to store predicate information and predicate contexts, respectively (see Table 11, column 3). This cache is reused for the subsequent analysis of triples, since the same predicates appear in different triples. In Case 1, the use of cache in Step 1 reduced the execution time by more than 65% and made the execution time of Step 1 + Step 2 less than those of Step 1 and Step 2 separately. The cache in the whole inference approach represented an improvement of more than 70% in performance and an average of 157 × 10^-7 s per triple. Moreover, for more than 16 million triples (Case 2), the execution time remained in the order of seconds (59.28 s) and the average execution time per triple was reduced to 35 × 10^-7 s. We presume that in Case 2 the majority of triples were inferred by Step 1, which uses cache. Figure 7 shows the execution time with respect to the number of triples. The performance obtained confirms the linearity of our inference approach.
Note that the use of cache keeps the function stable for a high number of triples because of the finite number of predicates available in the DBpedia database.
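The caching idea used in Step 1 and Step 3 can be sketched as a dictionary keyed by predicate. This is only an illustration under stated assumptions: make_cached_lookup is a hypothetical helper, fake_fetch stands in for the external call that retrieves predicate information (no real service is queried), and the triples and the returned range are made-up sample data.

```python
def make_cached_lookup(fetch_range):
    """Wrap an expensive per-predicate lookup so each predicate is fetched once."""
    cache = {}

    def lookup(predicate):
        if predicate not in cache:
            cache[predicate] = fetch_range(predicate)  # external call on first use only
        return cache[predicate]

    return lookup, cache


# Stand-in for the external service; it records every call it receives.
calls = []


def fake_fetch(predicate):
    calls.append(predicate)
    return {"dbo:birthDate": "date"}.get(predicate)  # None when no range is known


lookup, cache = make_cached_lookup(fake_fetch)

# Hypothetical triples: two of them share the same predicate.
triples = [
    ("s1", "dbo:birthDate", "1985-03-07"),
    ("s2", "dbo:birthDate", "1990-01-01"),
    ("s3", "dbo:name", "Ann"),
]
inferred = [lookup(p) for _, p, _ in triples]
```

The number of external calls is bounded by the number of distinct predicates rather than by the number of triples, which is why a finite predicate vocabulary keeps the per-triple cost low as the dataset grows.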

Non-ambiguous Lexical-Space-Matching Process
Test 7: As lexical space matching is performed while parsing the RDF data, we compared the parsing time of the original Jena framework with that of the modified version. Table 12 shows the execution times for Case 1. We observed a minimal increase from the original Jena source to the modified one (11.192 s and 11.955 s, respectively).
Fig. 7 Execution time of the four-step inference process
Test 8: In Case 2, we measured the parsing time with respect to the number of triples. Figure 8 confirms the linearity of our non-ambiguous lexical-space-matching process. This test demonstrates that the modification of the Jena source has an insignificant impact on the total parsing performance.

Discussion and Comparison
According to our experiments for the Semantic Web, the four-step inference process outperforms existing inference tools in terms of Precision and F-score. We obtained an F-score of up to 97.10%. We suggest using the four steps in a particular order, starting from a general solution (Step 1), which can be applied to all datatypes, and ending with a specific one for particular cases (Step 4). Following this order, we obtained the best results during experimentation.

Since Step 2 showed a high accuracy and performance during experimentation, we worked on the lexical spaces in order to improve the results. The non-ambiguous lexical-space-matching process behaves better than the four-step inference process in accuracy (99.30% of F-score) and performance (11.955 s), but it demands the modification of engines that manage RDF data as triples, as well as the modification of the RDF data itself to support the new lexical space representations. Table 13 summarizes the results obtained by the accuracy and performance evaluations. Note that our second inference process outperforms the one proposed in [10] in both evaluations.
In Sect. 3, we identified a set of comparison criteria to evaluate the existing works according to the boundaries of this study. Table 14 shows the criteria satisfied by our inference processes. External information is not needed for our non-ambiguous lexical-space-matching process, since only the value representation is used (local data). This process can be applied to any context where datatypes are considered, such as XML/XSD and databases. However, the modification of the data and its respective parser is needed. We modified Jena, one of the most widely used RDF processing engines. This modification allows compatibility among lexical spaces, by accepting literal values that follow the new lexical space representations. For example, the literal values 3.1416 and f3.1416 are equivalent since both formats match the lexical space representations of float and the values themselves are equal. With this process, we demonstrate the feasibility of an appropriate approach when non-ambiguous lexical spaces are available. We proposed simple lexical space modifications, but more sophisticated proposals need to be devised.

Conclusions
In this paper, we investigated the issue of datatype inference for RDF document matching/integration. We proposed an RDF Datatype inFerring Framework based on two independent processes: 1) four-step inference, consisting of (i) the analysis of the predicate information associated with the object value, (ii) the analysis of the lexical space of the value itself, (iii) the semantic analysis of the predicate name, and (iv) the generalization of datatypes; and 2) non-ambiguous lexical space matching, where literal values can be used to infer datatypes through a matching process, following new lexical space representations. We evaluated the accuracy and performance of our inference processes with DBpedia datasets (DBpedia person data). Results show that the inference approach increases the F-score up to 97.10% with our four-step process, where no modification of the RDF data is required, and up to 99.30% with our non-ambiguous lexical space process for data aligned to the new lexical space representations. We modified the Jena engine to support the new lexical spaces and to provide compatibility among existing tools.
We are currently working on extending this work to include complex datatypes. We also plan to evaluate our approach with other databases of Semantic Web initiatives.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.