A Deeper Analysis of the Hierarchical Clustering and Set Unionability-Based Data Union Method


Data are now an extremely valuable resource. They come from many sources: the government of a country, an organization, a company, or an individual. Their content also varies: a dataset could be about primary education in the U.K., medical care in the U.S., or agriculture in Vietnam. It is reasonable to assume that some of these datasets cover the same topic and have the same, or at least similar, structures. Unioning such datasets into more meaningful ones is beneficial: the unioned datasets contain the collective information of their sources, and users and scientists do not have to spend a lot of time searching for and combining the datasets themselves. In this paper, we propose a data union method for JSON-format data based on hierarchical clustering and Set Unionability. We also perform experiments to evaluate the method and demonstrate its feasibility.





