A Technique for Extracting Sub-source Similarities from Information Sources Having Different Formats

  • Published in: World Wide Web

Abstract

In this paper we propose a semi-automatic technique for deriving the degree of similarity between two portions of heterogeneous information sources (hereafter, sub-sources). The technique consists of two phases: the first selects the most promising pairs of sub-sources, and the second computes the similarity degree for each promising pair. We show that detecting sub-source similarities is a special case of the more general problem of Scheme Match, and one that is particularly interesting for semi-structured information sources. In addition, we present a real example case to clarify the technique, a set of experiments conducted to verify the quality of its results, a discussion of its computational complexity, and its classification in the context of the related literature. Finally, we discuss some possible applications that can benefit from the derived similarities.
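The two-phase structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say how promising pairs are selected or how the similarity degree is computed, so the sketch assumes each sub-source is just a set of element names, uses Jaccard overlap as a stand-in filter in the first phase, and leaves the second-phase measure as a pluggable function. All names (`select_promising_pairs`, `sub_source_similarities`, the `threshold` parameter) are hypothetical.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_promising_pairs(subs_a, subs_b, threshold=0.3):
    """Phase 1 (assumed filter): keep pairs whose element names overlap enough."""
    return [
        (name_a, name_b)
        for (name_a, elems_a), (name_b, elems_b)
        in product(subs_a.items(), subs_b.items())
        if jaccard(elems_a, elems_b) >= threshold
    ]


def sub_source_similarities(subs_a, subs_b, threshold=0.3, measure=jaccard):
    """Phase 2: score only the pairs that survived phase 1.

    `measure` is a placeholder for whatever similarity degree is wanted;
    Jaccard is used by default purely for illustration.
    """
    pairs = select_promising_pairs(subs_a, subs_b, threshold)
    return {
        (name_a, name_b): measure(subs_a[name_a], subs_b[name_b])
        for name_a, name_b in pairs
    }


if __name__ == "__main__":
    # Toy sub-sources from two hypothetical information sources.
    source_1 = {
        "customer": {"name", "address", "phone"},
        "order": {"id", "date", "total"},
    }
    source_2 = {
        "client": {"name", "address", "email"},
        "invoice": {"number", "date", "amount"},
    }
    print(sub_source_similarities(source_1, source_2))
```

Splitting the work this way means the more expensive measure is evaluated only on the pairs that pass the cheap filter, which is the point of having a selection phase before the scoring phase.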



Cite this article

Rosaci, D., Terracina, G. & Ursino, D. A Technique for Extracting Sub-source Similarities from Information Sources Having Different Formats. World Wide Web 6, 375–399 (2003). https://doi.org/10.1023/A:1025614005307
