A Novel Clustering-Based Approach to Schema Matching

  • Jin Pei
  • Jun Hong
  • David Bell
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4243)


Schema matching is a critical step in integrating data from multiple heterogeneous data sources. This paper presents a new approach to schema matching, based on two observations. First, it is easier to find attribute correspondences between schemas that are contextually similar. Second, the attribute correspondences found between these schemas can be used to help find new attribute correspondences between other schemas. Motivated by these observations, we propose a novel clustering-based approach to schema matching. First, we cluster schemas on the basis of their contextual similarity. Second, we cluster the attributes of schemas that fall in the same schema cluster to find attribute correspondences between these schemas. Third, we cluster attributes across different schema clusters, using statistical information gleaned from the existing attribute clusters, to find attribute correspondences between more schemas. We apply a fast clustering algorithm, K-Means, to all three clustering tasks. We have evaluated our approach in the context of integrating information from multiple web interfaces, and the results show that it is effective.
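The first of the three steps, clustering schemas by contextual similarity with K-Means, can be illustrated with a minimal sketch. The abstract does not say how contextual similarity is measured, so everything concrete here is an assumption made for illustration: schemas are represented as bags of words drawn from their attribute labels, similarity is the cosine between unit-length vectors, centroids are seeded deterministically by a farthest-first rule, and the names `vectorize`, `kmeans`, and `cluster_schemas` are hypothetical. The example schemas are invented, not taken from the paper's evaluation.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def vectorize(schema, vocab):
    """Bag-of-words vector over the words in a schema's attribute labels (assumed representation)."""
    counts = Counter(word for label in schema for word in label.lower().split())
    return normalize([float(counts[term]) for term in vocab])


def cosine(u, v):
    """Cosine similarity; a plain dot product because both inputs are unit length."""
    return sum(a * b for a, b in zip(u, v))


def kmeans(vectors, k, iterations=20):
    """K-Means with cosine similarity and normalized centroids; returns one cluster label per vector."""
    # Assumed seeding: start from the first vector, then repeatedly add the
    # vector that is least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))

    assignment = None
    for _ in range(iterations):
        # Assignment step: each vector joins its most similar centroid.
        new_assignment = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: each centroid becomes the normalized mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = normalize([sum(col) / len(members) for col in zip(*members)])
    return assignment


def cluster_schemas(schemas, k):
    """Cluster schemas (lists of attribute labels) into k groups."""
    vocab = sorted({word for schema in schemas for label in schema for word in label.lower().split()})
    return kmeans([vectorize(schema, vocab) for schema in schemas], k)


# Invented query-interface schemas: two book-search forms and two flight-search forms.
schemas = [
    ["title", "author", "isbn"],
    ["book title", "author name", "publisher"],
    ["departure city", "arrival city", "departure date"],
    ["from city", "to city", "return date"],
]
labels = cluster_schemas(schemas, k=2)
```

On these four schemas the two book forms end up in one cluster and the two flight forms in the other. Under the approach described above, attribute clustering would then run within each schema cluster, and a further pass would cluster attributes across schema clusters; neither of those steps is sketched here.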


Schema Match, Schema Cluster, Attribute Cluster, Document Cluster, Query Interface
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jin Pei (1)
  • Jun Hong (1)
  • David Bell (1)
  1. School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, UK
