1 Introduction

Recently, databases become very popular and usually used in almost all of information systems. As databases are independently developed according to the purpose and place of use, it causes data heterogeneity. It is often required to integrate multiple databases by such as business integration. It is, however, difficult to integrate databases because of the data heterogeneity. Identifying corresponding attributes in heterogeneous databases is important in integrating and using them. The problem of this identification is widely known as a schema matching problem [1].

As this problem is one of the important problems in database integration, many studies have been reported [2,3,4,5]. Schema information, such as attribute names, data types, and schema structure, is generally used to identify correspondences between two attributes. However, schemas are often not defined in a unified format, and it is often difficult to apply methods using schema information. In such a case, the schema information cannot directly be used. It is difficult to identify the correspondence by the method using schema information.

When schema information is not worth to use, a method using instances is an alternative approach to identify corresponding attributes [3,4,5,6]. As instances contain the actual features of the attributes, the corresponding attributes can be identified using instances.

We have proposed a schema matching method for numerical attributes [8]. The method proposed uses data distributions and correlations. Although the preliminary experiments showed that this method was effective in identifying the same attributes, only a kind of tables was used in the experiments [8]. It is not clear whether this method can identify the similar attributes of different kinds of tables.

This paper evaluates this method using different kinds of tables to show the effectivity of the method. The experimental results show that this method correctly identifies the attributes of different types of tables.

The remainder of the paper is organized follows. Section 2 describes related works. Section 3 explains the instance-based schema matching method applicable to the numerical attributes and shows the preliminary experimental results. This method is evaluated in Sect. 4. Section 5 gives some considerations. Finally, Sect. 6 concludes the paper.

2 Related Works

Partyka et al. proposed a schema matching algorithm called TSim, which uses Normalized Google Distance (NGD) [4]. This method analyzes instances and calculates the entropy-based distribution (EBD). As TSim performs semantic comparison, it is possible to eliminate syntactic dependencies of instances and it is possible to identify attributes correctly. The evaluation result indicates that TSim is effective against N-gram, which syntactically analyzes instances.

Mehdi et al. proposed an instance-based schema matching using Google Similarity and regular expression [5]. For the numeric type and the mixed type attribute, regular expression is used to analyze appearance patterns of characters and identify correspondences between attributes. For the alphabetic type attribute, Google Similarity is used to calculate the semantic similarity between attributes. Their proposed method could identify 1–1 matches with high accuracy despite using not entire instances but sampling.

Nozaki et al. have proposed a semantic schema matching for string attributes considering word vectors [7]. They used Word2Vec [9]. Word2Vec can vectorize words in hundreds dimensions and calculate similarities between words considering their semantics. Attribute vectors are calculated from appearance ratio of their instances and vectors of words contained in their instances and are used for calculating similarity score. The experimental results show their method could identify the corresponding attribute regardless of the attribute name. As this method can be applied to only string attribute, it needs to consider numerical attributes.

3 Schema Matching Method Using Data Distribution and Correlation

This section explains the method for finding the numerical attribute and briefly shows preliminary experimental results [8].

3.1 Schema Matching Method Based on Data Distribution and Correlation

The schema matching method [8] compares the data distribution and the correlation of two attributes to identify whether they match or not. This method is composed of four phases: identification using a probability–probability plot (P–P) plot, identification using a quantile–quantile plot (Q–Q) plot, comparison of two identification results of previous two phases, and verification of the comparison result. Here, the P–P plot and the Q–Q plot are probability plots. The P–P plot is the graph plotting cumulative distribution function of two data against each other. It is used for evaluating graphically how closely two data are. The Q–Q plot is the graph plotting quantiles of two data each other. It can compares two probability distributions. Missing values of each attribute are firstly deleted. Then, each phase is conducted according to the flow shown in Fig. 1.

Fig. 1
figure 1

Flow of the instance-based schema matching method

3.1.1 Identification Using a P–P Plot

This section describes the calculation of the distance between attributes using a P–P plot. The empirical distribution function of an attribute s of a table S is defined as \(F_{n}(x)\), and the empirical distribution function of an attribute t of a table T is defined as \(G_{n}(x)\). The points plotted by a P–P plot are represented by Eq. (1).

$$\begin{aligned} (F_{n}(x), G_{n}(x)), x \in s \end{aligned}$$
(1)

If \(F_{n}(x)\) and \(G_{n}(x)\) are the same distribution, these points are plotted on the line. The distances between \(y=x\) and points obtained from s and t are calculated, and the mean value is calculated to use as the measure of the correspondence of the attributes. The equation calculating the mean value of the distances is shown in Eq. (2).

$$\begin{aligned} d_{p \text {-} p} = \frac{1}{n}\sum ^{n}_{i = 1}\frac{|F_{n}(X_{i}) - G_{n}(X_{i})|}{\sqrt{2}} \end{aligned}$$
(2)

Here, n indicates the number of data of attribute s, and \(X_{i}\) indicates the data included in attribute s. By calculating \(d_{p\text {-}p}\) for all combinations, all distances between the attributes from tables S and T can be obtained. Using the distances calculated between the attributes, the order of the attributes of the table T is determined in ascending order of the distance for each attribute of the table S. Here, the combination of attributes whose distance is 0.3 or more are regarded as different combination and deleted from the ranking.

Next, the correspondence of attributes is uniquely determined according to the following steps.

  1. 1.

    The attribute of the table T with the 1st rank is corresponded to each attribute of the table S.

  2. 2.

    If there is no duplicate correspondence, the combination is adopted.

  3. 3.

    If there are duplicate correspondences, the distances are compared between the combinations of attributes whose correspondences are duplicated.

  4. 4.

    The combination with the smallest distance is adopted, and the other combinations correspond to the next ranked attributes. If an attribute in S has no next ranked attribute, it is corresponded to no attribute.

  5. 5.

    If there are no more duplicate correspondence, the combinations are adopted. If there are duplicate correspondences, return to step 3.

As a result, the combinations of attributes between S and T without duplicate correspondences are obtained.

After identifying correspondences between attributes of S and T, this process is executed again with replacing the source table and the target table. As a result, the combinations of attributes having the same correspondence in both identification results are obtained as the same attribute.

3.1.2 Identification Using a Q–Q Plot

This section describes the calculation of the distance between attributes using a Q–Q plot. The number of attribute s is obtained as \(n_s\), and the point normalized to [0, 1] for each axis is defined as Eq. (3).

$$\begin{aligned} (F_{n}^{\prime -1}(p_{i}), G_{n}^{\prime -1}(p_{i})), p_{i} = \frac{1}{n_{s}}, \frac{2}{n_{s}}, \ldots , \frac{n_{s}}{n_{s}} \end{aligned}$$
(3)

If \(F_{n}^{\prime -1}(p_{i})\) and \(G_{n}^{\prime -1}(p_{i})\) are the same distribution, these points are plotted on the line \(y=ax+b\). The distances between the regression line \(y=ax+b\) and points obtained from s and t are calculated, and the mean value is calculated to use as the measure of the correspondence of the attributes. The equation calculating the mean value of the distances is shown in Eq. (4).

$$\begin{aligned} d_{q\text {-}q} = \frac{1}{n_{s}}\sum ^{n_{s}}_{i = 1}\frac{|a \times F_{n}^{\prime -1}(p_{i}) - G_{n}^{\prime -1}(p_{i}) + b|}{\sqrt{a^{2} + 1}} \end{aligned}$$
(4)

By calculating \(d_{q\text {-}q}\) for all combinations, all distances between the attributes from tables S and T can be obtained. Using the distances calculated between the attributes, the order of the attributes of the table T is determined in ascending order of the distance for each attribute of the table S.

Next, the combinations of attributes are uniquely determined according to the same comparison method described in a P–P plot. After identifying correspondences between attributes of S and T, this process is executed again with replacing the source table and the target table. As a result, the combination of attributes having the same correspondence in both identifications is obtained as the same attribute.

3.1.3 Comparison of Two Identification Results

The comparison result is determined using the identification result of the P–P plot and the Q–Q plot. The identification result obtained from the P–P plot is defined as \(Result_{P}\) \(= \{(s_{1}, p_{1}),\) \((s_{2}, p_{2}),\) \(\ldots ,\)  \((s_{n}, p_{n})\}\), and the identification result obtained from the Q–Q plot is defined as \(Result_{Q}\) \(= \{(s_{1}, q_{1}),\) \((s_{2}, q_{2}),\) \(\ldots ,\)  \((s_{n}, q_{n})\}\). Here, \(s_{i}\) indicates an attribute included in the table S, and \(p_{i}\) and \(q_{i}\) indicate two of the attributes included in the table T. The comparison result is determined as follows.

  • Same identification results: If the comparison results are the same (\(p_{i} = q_{i}\)), the corresponding attribute to \(s_{i}\) is determined to \(p_{i} ( =q_{i})\).

  • One identification result: If only one identification result is obtained, \(p_{i}\) or \(q_{i}\) is determined as the corresponding attribute to \(s_{i}\).

  • No identification result: If no identification result is obtained, the corresponding attribute to \(s_{i}\) is no attribute.

  • Different comparison results: If the identification results obtained from the P–P plot and the Q–Q plot are different (\(p_{i} \ne q_{i}\)), the corresponding attribute is determined by comparing the correlation coefficient with other attributes in their tables. The comparison method is as follows.

    1. 1.

      The correlation coefficient array \(R_{si}\) \(= \{r_{si1},\) \(r_{si2},\) \(\ldots ,\)  \(r_{sim}\}\) of attribute \(s_{i}\) is calculated to the attributes included in S to which corresponding attributes are already determined. Here, m is the number of attributes included in S whose correspondences are already determined.

    2. 2.

      The correlation coefficient array \(R_{pi}\) \(= \{r_{pi1},\) \(r_{pi2},\) \(\ldots ,\)  \(r_{pim}\}\) of attribute \(p_{i}\) is calculated to the m attributes included in T to which corresponding attributes are already determined.

    3. 3.

      The distance \(dis_{pi}\) between \(s_{i}\) and \(p_{i}\) is calculated as the sum of the differences between the correlation coefficients of attributes, as described in Eq. (5).

      $$\begin{aligned} dis_{pi} = \sum ^{m}_{j=1}|r_{sij}-r_{pij}| \end{aligned}$$
      (5)
    4. 4.

      The distance \(dis_{qi}\) is calculated by conducting step 2 and step 3 using identification result obtained from a Q–Q plot.

    5. 5.

      In case of \(dis_{pi} \le dis_{qi}\), \(p_{i}\) is determined to the correspond to \(s_{i}\), and in case of \(dis_{pi} > dis_{qi}\), \(q_{i}\) is determined to correspond to \(s_{i}\).

From the steps described above, the appropriate corresponding attribute is selected using the identification results obtained from the P–P plot and the Q–Q plot.

3.1.4 Verification of Comparison Result

Finally, the method verifies whether the comparison result between the table S and the table Q is appropriate. The attributes corresponding to each other between the tables are considered to have similar correlations with other attributes in their tables. Therefore, when the attribute s in the table S and the attribute t in the table T correspond, the verification is conducted by using the correlation as follows:

  1. 1.

    The correlation array between an attribute \(s_{i}\) and other attributes in the table S is defined as \(R_{si}\), and the correlation array between the attribute \(t_{i}\) and other attributes in the table T is defined as \(R_{ti}\). At this time, these correlation arrays are transformed to \(R_{si}^{\prime }\) and \(R_{ti}^{\prime }\) depending on their correlation values r as described in Eq. (6).

    $$\begin{aligned} {\left\{ \begin{array}{ll} positive\,strong\,correlation = 2 &{} 0.7 \le r \\ positive\,weak\,correlation = 1 &{} 0.4 \le r< 0.7 \\ no\,correlation = 0 &{} -0.4< r< 0.4 \\ negative\,weak\,correlation = -1 &{} -0.7 < r \le -0.4 \\ negative\,strong\,correlation = -2 &{} r \le -0.7 \\ \end{array}\right. } \end{aligned}$$
    (6)
  2. 2.

    Using the transformed correlation arrays \(R_{si}^{\prime }\) and \(R_{ti}^{\prime }\), the distance between the attribute s and t are calculated as described in Eq. (7).

    $$\begin{aligned} dis_{i} = \sum ^{}_{}|r_{si}-r_{ti}| \end{aligned}$$
    (7)

    Here, \(r_{si} \in R_{si}^{\prime }\), and \(r_{ti} \in R_{t_i}^{\prime }\).

  3. 3.

    If the obtained \(dis_{i}\) is less than a threshold, it is determined that \(s_{i}\) corresponds to \(t_{i}\). If it exceeds the threshold, they do not correspond each other, and the attribute \(s_{i}\) is determined to have no corresponding attribute. The threshold is set to (the number of attribute combinations − 1).

With the above procedure, it is verified whether the comparison result is correct. The verification result becomes the identification result of numerical attributes of the tables S and T.

3.2 Experimental Results for a Kind of Tables

The previous work evaluated the proposed method using a kind of tables [8]. These tables are included in the dataset named “World Happiness,” which is described in Sect. 4.1.

The proposed method could identify the attributes indicating the same concept, but differing in the distributions [8]. The proposed method determined no corresponding attributes if two tables do not have the same attributes [8]. Thus, it is useful for reducing the human burden in the schema matching.

4 Experiments

4.1 Dataset

Datasets used in the experiments are explained.

  • Employee Compensation Plan [10]: This is the dataset of employee compensation plans in San Jose city. The 2016, 2017, and 2018 datasets are used. These contain thirteen attributes, and 7,897, 8,200, and 8,249 records, respectively. The numerical attributes are shown in Table 1. Similar attributes and IDs are similarly shown.

  • World Happiness [10]: This dataset is a landmark survey of the state of global happiness. The 2015, 2016, and 2017 surveys are used. These contain 12, 13, and 12 attributes, and 158, 157, and 155 records, respectively. The numerical attributes are shown in Table 2, and similar attributes are shown on the same row. IDs are assigned to the attributes. IDs are used in showing identification results.

  • Adult [11]: This is a census survey dataset containing 32,561 records and fourteen attributes. Eight attributes contain string data, while six attributes contain numerical data.

Table 1 Numerical attributes in employee compensation plan
Table 2 Numerical attributes in world happiness

4.2 Identification of the Same Datasets

Here, we try to identify the attributes of the same datasets, attributes having different numbers of digits, and different attributes.

4.2.1 Identification of Attributes of Similar Tables

Identifications are conducted between Employee Compensation Plan

2016 and 2017, 2017 and 2018, and 2018 and 2016. The results are shown in Tables 3, 4, and 5, respectively. In each table, “Phase 1” (“Phase 2,” respectively) shows the result of identification using a P–P (Q–Q) plot, and “Phase 3” (“Phase 4,” respectively) shows the result of comparison of identification results (verification of comparison result). The alphabets in Tables 3, 4, 5 correspond to the IDs in Table 1. The left column is the attribute of the source table, and the others are the corresponding attributes of the target table in the result of each phase. The result of Phase 4, the verification result phase, is the result of the proposed method.

The results of Employee Compensation Plan were correctly identified.

Table 3 Result of identification between 2016 and 2017 in employee compensation plan
Table 4 Result of identification between 2017 and 2018 in employee compensation plan
Table 5 Result of identification between 2018 and 2016 in employee compensation plan

4.2.2 Identification of Attributes with Different Numbers of Digits

The attributes of A to E in Employee Compensation Plan 2018, which are half of included attributes, were also multiplied by 100, and this table is compared with Employee Compensation Plan 2017. The results are shown in Table 6. If the number of digits is different, it could not be identified in the P–P plot phase, but could be identified in the Q–Q plot phase. The reason for this result is that the Q–Q plot compares distributions without differences of parameters, mean values and variances. Furthermore, the normalization of data in the Q–Q plot phase was also effective.

Table 6 Result of identification between 2017 and modified 2018 in employee compensation plan

4.2.3 Identification of Different Attributes

Table 7 shows a comparison result between the 2016 dataset of World Happiness and the 2016 dataset of Employee Compensation Plan. This result indicates that it is possible to show there is no corresponding attribute if two tables do not have the same attributes.

Table 7 Result of identification between 2016 of world happiness and 2018 of employee compensation plan

4.3 Accuracy Evaluation

Two sub-tables are created by randomly selecting eleven attributes from the Adult dataset. The attribute is not be selected twice or more in a sub-table. Some attributes may not occur in both sub-tables. These sub-tables were populated with instances randomly taken from the selected attribute group.

The evaluation metrics are precision, recall and F-measure shown in Eq. (8), Eq. (9) and Eq. (10), respectively, where True Positive (TP) (False Positive (FP), respectively) is the set of the matching (unmatched) attributes correctly (incorrectly) determined to match and True Negative (TN) (False Negative (FN), respectively) is the set of the unmatched (matching) attributes correctly (incorrectly) determined not to match.

$$\begin{aligned} Precision= & {} \frac{|TP|}{|TP| + |FP|} \end{aligned}$$
(8)
$$\begin{aligned} Recall= & {} \frac{|TP|}{|TP| + |FN|} \end{aligned}$$
(9)
$$\begin{aligned} F\text {-}measure= & {} \frac{2 * Precision * Recall}{Precision + Recall} \end{aligned}$$
(10)

As the result of experiment, precision, recall, and F-measure obtained are 92%, 100%, and 95%, respectively. The obtained values are over 90%. Recall attained 100%.

5 Consideration

In Table 3, Table 4, and Table 5, all results are correctly identified. These results obtained in Sect. 4.2.1 showed that each phase properly worked.

From Sect. 4.2.2, the proposed method could identify the attributes that indicate the same concept, but that differ in the distributions due to the different units. This is due to the normalization of the data and the characteristics of the Q–Q plot as described in Sect. 3.2.

The results of Sect. 4.2.3 indicated that the proposed method determined no corresponding attributes if two tables do not have the same attributes. Thus, it is useful for reducing the human burden in the schema matching.

The results of Sect. 4.3 indicated that the proposed method attains very high precision, recall, and F-measure. As for recall, 100% was attained. The proposed method may reduce the schema matching cost.

6 Conclusion

This paper evaluated a schema matching method for identifying numerical attributes based on data distribution and correlation. The result of experiments indicated that the attributes expressing the same concept can correctly be identified even if the number of digits were different. It was also shown that recall of the proposed method achieved 100%.

Some results could not identify attributes correctly. The future work contains the improvement of accuracy. It is necessary to consider the method of combining the P–P plot and the Q–Q plot and modify the threshold value for verification. Treating numerical attributes without correlation is also included in future work.