Abstract
Multi-source information systems, such as data warehouses, are built on a set of heterogeneous and distributed data sources. The relevant information is extracted from these sources, cleaned, transformed and then integrated. Comparing two data sources may reveal different kinds of heterogeneity: at the intensional level, conflicts relate to the structure of the data; at the extensional level, they relate to the data instances. The process of detecting and resolving conflicts at the extensional level is known as data cleaning. In this paper, we focus on the problem of differences in terminology and propose a solution based on linguistic knowledge provided by a domain ontology. This approach is well suited to application domains in which data is intensively classified, such as medicine or pharmacology. The main idea is to automatically generate correspondence assertions between instances of objects. The user can parameterize this generation process by defining a level of accuracy expressed using the domain ontology.
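To make the main idea concrete, here is a minimal sketch in Python. The abstract does not specify the algorithm, so everything below is an assumption made for illustration: the toy is-a taxonomy `PARENT`, the helper names, and the reading of the "level of accuracy" as a minimum depth that the lowest common ancestor of two terms must reach in the ontology. It should not be taken as the method described in the paper.

```python
# Hypothetical toy is-a taxonomy (child -> parent), for illustration only.
PARENT = {
    "aspirin": "analgesic",
    "paracetamol": "analgesic",
    "analgesic": "drug",
    "amoxicillin": "antibiotic",
    "antibiotic": "drug",
}


def ancestors(term):
    """Return the chain [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(term):
    """Number of is-a links between the term and the root of its chain."""
    return len(ancestors(term)) - 1


def lowest_common_ancestor(a, b):
    """Most specific concept that subsumes both a and b, or None."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):
        if candidate in seen:
            return candidate
    return None


def correspondence_assertions(source_a, source_b, min_depth):
    """Pair values from two sources whose lowest common ancestor lies at
    depth >= min_depth (an assumed stand-in for the accuracy level)."""
    assertions = []
    for a in source_a:
        for b in source_b:
            lca = lowest_common_ancestor(a, b)
            if lca is not None and depth(lca) >= min_depth:
                assertions.append((a, b, lca))
    return assertions


if __name__ == "__main__":
    values_a = ["aspirin", "amoxicillin"]
    values_b = ["paracetamol", "antibiotic"]
    for assertion in correspondence_assertions(values_a, values_b, min_depth=1):
        print(assertion)
```

In this sketch, raising `min_depth` demands a more specific shared concept and so yields fewer assertions, while `min_depth=0` accepts any two terms that share a root.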
This work has been partly funded by the French Government in the framework of the REANIMATIC project.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kedad, Z., Métais, E. (2002). Ontology-Based Data Cleaning. In: Andersson, B., Bergholtz, M., Johannesson, P. (eds) Natural Language Processing and Information Systems. NLDB 2002. Lecture Notes in Computer Science, vol 2553. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36271-1_12
DOI: https://doi.org/10.1007/3-540-36271-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00307-6
Online ISBN: 978-3-540-36271-5
eBook Packages: Springer Book Archive