A Technique for Extracting Sub-source Similarities from Information Sources Having Different Formats

Rosaci, Domenico; Terracina, Giorgio; Ursino, Domenico

doi:10.1023/A:1025614005307

Published: December 2003

Volume 6, pages 375–399, (2003)

World Wide Web

Domenico Rosaci¹,
Giorgio Terracina² &
Domenico Ursino¹

Abstract

In this paper we propose a semi-automatic technique for deriving the similarity degree between two portions of heterogeneous information sources (hereafter, sub-sources). The proposed technique consists of two phases: the first selects the most promising pairs of sub-sources, whereas the second computes the similarity degree of each promising pair. We show that the detection of sub-source similarities is a special case (and a very interesting one for semi-structured information sources) of the more general problem of Scheme Match. In addition, we present a real example case to clarify the proposed technique, a set of experiments we have conducted to verify the quality of its results, a discussion of its computational complexity, and its classification in the context of the related literature. Finally, we discuss some possible applications that can benefit from the derived similarities.

Author information

Authors and Affiliations

DIMET, Università “Mediterranea” di Reggio Calabria, Via Graziella, Località Feo di Vito, 89060, Reggio Calabria, Italy
Domenico Rosaci & Domenico Ursino
Dipartimento di Matematica, Università della Calabria, Via P. Bucci, 87036, Rende (CS), Italy
Giorgio Terracina

About this article

Cite this article

Rosaci, D., Terracina, G. & Ursino, D. A Technique for Extracting Sub-source Similarities from Information Sources Having Different Formats. World Wide Web 6, 375–399 (2003). https://doi.org/10.1023/A:1025614005307

Issue Date: December 2003
DOI: https://doi.org/10.1023/A:1025614005307
