Evaluation of Identification Method of Corresponding Numerical Attributes in Heterogeneous Databases Based on Instances

This paper experimentally evaluates the instance-based schema matching method for the attributes storing numerical data. This method uses data distributions and correlations between two attributes. Experimental results show that this method could find appropriate attribute correspondences even for the attributes of different kinds of tables in databases.


Introduction
Recently, databases become very popular and usually used in almost all of information systems. As databases are independently developed according to the purpose and place of use, it causes data heterogeneity. It is often required to integrate multiple databases by such as business integration. It is, however, difficult to integrate databases because of the data heterogeneity. Identifying corresponding attributes in heterogeneous databases is important in integrating and using them. The problem of this identification is widely known as a schema matching problem [1].
As this problem is one of the important problems in database integration, many studies have been reported [2][3][4][5]. Schema information, such as attribute names, data types, and schema structure, is generally used to identify correspondences between two attributes. However, schemas are often not defined in a unified format, and it is often difficult to apply methods using schema information. In such a case, the schema information cannot directly be used. It is difficult to identify the correspondence by the method using schema information.
When schema information is not worth to use, a method using instances is an alternative approach to identify corresponding attributes [3][4][5][6]. As instances contain the actual features of the attributes, the corresponding attributes can be identified using instances.
We have proposed a schema matching method for numerical attributes [8]. The method proposed uses data distributions and correlations. Although the preliminary experiments showed that this method was effective in identifying the same attributes, only a kind of tables was used in the experiments [8]. It is not clear whether this method can identify the similar attributes of different kinds of tables. This paper evaluates this method using different kinds of tables to show the effectivity of the method. The experimental results show that this method correctly identifies the attributes of different types of tables.
The remainder of the paper is organized follows. Section 2 describes related works. Section 3 explains the instance-based schema matching method applicable to the numerical attributes and shows the preliminary experimental results. This method is evaluated in Sect. 4. Section 5 gives some considerations. Finally, Sect. 6 concludes the paper.

Related Works
Partyka et al. proposed a schema matching algorithm called TSim, which uses Normalized Google Distance (NGD) [4]. This method analyzes instances and calculates the entropy-based distribution (EBD). As TSim performs semantic comparison, it is possible to eliminate syntactic  [5]. For the numeric type and the mixed type attribute, regular expression is used to analyze appearance patterns of characters and identify correspondences between attributes. For the alphabetic type attribute, Google Similarity is used to calculate the semantic similarity between attributes. Their proposed method could identify 1-1 matches with high accuracy despite using not entire instances but sampling.
Nozaki et al. have proposed a semantic schema matching for string attributes considering word vectors [7]. They used Word2Vec [9]. Word2Vec can vectorize words in hundreds dimensions and calculate similarities between words considering their semantics. Attribute vectors are calculated from appearance ratio of their instances and vectors of words contained in their instances and are used for calculating similarity score. The experimental results show their method could identify the corresponding attribute regardless of the attribute name. As this method can be applied to only string attribute, it needs to consider numerical attributes.

Schema Matching Method Using Data Distribution and Correlation
This section explains the method for finding the numerical attribute and briefly shows preliminary experimental results [8].

Schema Matching Method Based on Data Distribution and Correlation
The schema matching method [8] compares the data distribution and the correlation of two attributes to identify whether they match or not. This method is composed of four phases: identification using a probability-probability plot (P-P) plot, identification using a quantile-quantile plot (Q-Q) plot, comparison of two identification results of previous two phases, and verification of the comparison result.
Here, the P-P plot and the Q-Q plot are probability plots. The P-P plot is the graph plotting cumulative distribution function of two data against each other. It is used for evaluating graphically how closely two data are. The Q-Q plot is the graph plotting quantiles of two data each other. It can compares two probability distributions. Missing values of each attribute are firstly deleted. Then, each phase is conducted according to the flow shown in Fig. 1.

Identification Using a P-P Plot
This section describes the calculation of the distance between attributes using a P-P plot. The empirical distribution function of an attribute s of a table S is defined as F n (x) , and the empirical distribution function of an attribute t of a table T is defined as G n (x) . The points plotted by a P-P plot are represented by Eq. (1).
If F n (x) and G n (x) are the same distribution, these points are plotted on the line. The distances between y = x and points obtained from s and t are calculated, and the mean value is calculated to use as the measure of the correspondence of the attributes. The equation calculating the mean value of the distances is shown in Eq. (2).
Here, n indicates the number of data of attribute s, and X i indicates the data included in attribute s. By calculating d p-p for all combinations, all distances between the attributes from tables S and T can be obtained. Using the distances Flow of the instance-based schema matching method calculated between the attributes, the order of the attributes of the table T is determined in ascending order of the distance for each attribute of the table S. Here, the combination of attributes whose distance is 0.3 or more are regarded as different combination and deleted from the ranking. Next, the correspondence of attributes is uniquely determined according to the following steps.
1. The attribute of the table T with the 1st rank is corresponded to each attribute of the table S. 2. If there is no duplicate correspondence, the combination is adopted. 3. If there are duplicate correspondences, the distances are compared between the combinations of attributes whose correspondences are duplicated. 4. The combination with the smallest distance is adopted, and the other combinations correspond to the next ranked attributes. If an attribute in S has no next ranked attribute, it is corresponded to no attribute. 5. If there are no more duplicate correspondence, the combinations are adopted. If there are duplicate correspondences, return to step 3.
As a result, the combinations of attributes between S and T without duplicate correspondences are obtained. After identifying correspondences between attributes of S and T, this process is executed again with replacing the source table and the target table. As a result, the combinations of attributes having the same correspondence in both identification results are obtained as the same attribute.

Identification Using a Q-Q Plot
This section describes the calculation of the distance between attributes using a Q-Q plot. The number of attribute s is obtained as n s , and the point normalized to [0, 1] for each axis is defined as Eq. (3).
are the same distribution, these points are plotted on the line y = ax + b . The distances between the regression line y = ax + b and points obtained from s and t are calculated, and the mean value is calculated to use as the measure of the correspondence of the attributes. The equation calculating the mean value of the distances is shown in Eq. (4).
By calculating d q-q for all combinations, all distances between the attributes from tables S and T can be obtained. Using the distances calculated between the attributes, the order of the attributes of the table T is determined in ascending order of the distance for each attribute of the table S. Next, the combinations of attributes are uniquely determined according to the same comparison method described in a P-P plot. After identifying correspondences between attributes of S and T, this process is executed again with replacing the source table and the target table. As a result, the combination of attributes having the same correspondence in both identifications is obtained as the same attribute.

Comparison of Two Identification Results
The comparison result is determined using the identification result of the P-P plot and the Q-Q plot. The identification result obtained from the P-P plot is defined as Result P = {(s 1 , p 1 ), (s 2 , p 2 ), … , (s n , p n )} , and the identification result obtained from the Q-Q plot is defined as 1. The correlation coefficient array R si = {r si1 , r si2 , … , r sim } of attribute s i is calculated to the attributes included in S to which corresponding attributes are already determined. Here, m is the number of attributes included in S whose correspondences are already determined. 2. The correlation coefficient array R pi = {r pi1 , r pi2 , … , r pim } of attribute p i is calculated to the m attributes included in T to which corresponding attributes are already determined. 3. The distance dis pi between s i and p i is calculated as the sum of the differences between the correlation coefficients of attributes, as described in Eq. (5).

3
4. The distance dis qi is calculated by conducting step 2 and step 3 using identification result obtained from a Q-Q plot. 5. In case of dis pi ≤ dis qi , p i is determined to the correspond to s i , and in case of dis pi > dis qi , q i is determined to correspond to s i .
From the steps described above, the appropriate corresponding attribute is selected using the identification results obtained from the P-P plot and the Q-Q plot.

Verification of Comparison Result
Finally, the method verifies whether the comparison result between the table S and the table Q is appropriate. The attributes corresponding to each other between the tables are considered to have similar correlations with other attributes in their tables. Therefore, when the attribute s in the table S and the attribute t in the table T correspond, the verification is conducted by using the correlation as follows: 1. The correlation array between an attribute s i and other attributes in the table S is defined as R si , and the correlation array between the attribute t i and other attributes in the table T is defined as R ti . At this time, these correlation arrays are transformed to R ′ si and R ′ ti depending on their correlation values r as described in Eq. (6).

Using the transformed correlation arrays R ′
si and R ′ ti , the distance between the attribute s and t are calculated as described in Eq. (7).
Here, r si ∈ R � si , and r ti ∈ R � t i . 3. If the obtained dis i is less than a threshold, it is determined that s i corresponds to t i . If it exceeds the threshold, they do not correspond each other, and the attribute s i is determined to have no corresponding attribute. The threshold is set to (the number of attribute combinations − 1).

Experimental Results for a Kind of Tables
The previous work evaluated the proposed method using a kind of tables [8]. These tables are included in the dataset named "World Happiness," which is described in Sect. 4.1.
The proposed method could identify the attributes indicating the same concept, but differing in the distributions [8]. The proposed method determined no corresponding attributes if two tables do not have the same attributes [8]. Thus, it is useful for reducing the human burden in the schema matching.

Dataset
Datasets used in the experiments are explained.

Identification of the Same Datasets
Here, we try to identify the attributes of the same datasets, attributes having different numbers of digits, and different attributes. Tables   Identifications are conducted Tables 3, 4, and 5, respectively. In each table, "Phase 1" ("Phase 2," respectively) shows the result of identification using a P-P (Q-Q) plot, and "Phase 3" ("Phase 4," respectively) shows the result of comparison of identification results (verification of comparison result). The alphabets in Tables 3, 4, 5 correspond to the IDs in Table 1. The left column is the attribute of the source table, and the others are the corresponding attributes of the target table in the result of each phase. The    H  H  H  H  H  I  I  A result of Phase 4, the verification result phase, is the result of the proposed method. The results of Employee Compensation Plan were correctly identified.

Identification of Attributes with Different Numbers of Digits
The attributes of A to E in Employee Compensation Plan 2018, which are half of included attributes, were also multiplied by 100, and this table is compared with Employee Compensation Plan 2017. The results are shown in Table 6. If the number of digits is different, it could not be identified in the P-P plot phase, but could be identified in the Q-Q plot phase. The reason for this result is that the Q-Q plot compares distributions without differences of parameters, mean values and variances. Furthermore, the normalization of data in the Q-Q plot phase was also effective.

Accuracy Evaluation
Two sub-tables are created by randomly selecting eleven attributes from the Adult dataset. The attribute is not be selected twice or more in a sub-table. Some attributes may not occur in both sub-tables. These sub-tables were populated with instances randomly taken from the selected attribute group. The evaluation metrics are precision, recall and F-measure shown in Eq.

Consideration
In Table 3, Table 4, and

Conclusion
This paper evaluated a schema matching method for identifying numerical attributes based on data distribution and correlation. The result of experiments indicated that the attributes expressing the same concept can correctly be identified even if the number of digits were different. It was also shown that recall of the proposed method achieved 100%. Some results could not identify attributes correctly. The future work contains the improvement of accuracy. It is necessary to consider the method of combining the P-P plot and the Q-Q plot and modify the threshold value for verification. Treating numerical attributes without correlation is also included in future work.

Data availability statement
The data used in the experiments are openly available in the Web site of Kaggle [10] at https:// www. kaggle. com/ and the UCI Machine Learning Repository [11] at http:// archi ve. ics. uci. edu/ ml/ index. php.

Conflict of interest
The authors declare they have no conflicts of interest. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.