Ontology-Based Data Cleaning

Kedad, Zoubida; Métais, Elisabeth

doi:10.1007/3-540-36271-1_12


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2553)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems


Abstract

Multi-source information systems, such as data warehouses, are composed of a set of heterogeneous and distributed data sources. The relevant information is extracted from these sources, cleaned, transformed and then integrated. Comparing two different data sources may reveal different kinds of heterogeneity: at the intensional level, the conflicts relate to the structure of the data; at the extensional level, they relate to the instances of the data. The process of detecting and resolving conflicts at the extensional level is known as data cleaning. In this paper, we focus on the problem of differences in terminology and propose a solution based on linguistic knowledge provided by a domain ontology. This approach is well suited to application domains in which data are intensively classified, such as medicine or pharmacology. The main idea is to automatically generate correspondence assertions between instances of objects. The user can parameterize this generation process by defining a level of accuracy expressed using the domain ontology.
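To make the idea of ontology-driven correspondence assertions concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the domain ontology is reduced to a simple is-a taxonomy and that the "level of accuracy" is a depth in that taxonomy (root = 0): two instance values are asserted to correspond when their ancestors at the chosen depth coincide. The taxonomy contents and the names `PARENT`, `path_to_root`, `ancestor_at_depth` and `correspondences` are hypothetical and introduced only for illustration.

```python
from itertools import product

# Hypothetical is-a taxonomy (child -> parent); illustrative only,
# not taken from the paper.
PARENT = {
    "drug": None,
    "analgesic": "drug",
    "antibiotic": "drug",
    "paracetamol": "analgesic",
    "ibuprofen": "analgesic",
    "amoxicillin": "antibiotic",
}


def path_to_root(term):
    """Return [term, parent, ..., root], or [] if the term is not in the taxonomy."""
    if term not in PARENT:
        return []
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def ancestor_at_depth(term, depth):
    """Return the ancestor of `term` at `depth` (root = 0).

    Returns None when the term is unknown or sits above the requested depth.
    """
    root_first = path_to_root(term)[::-1]
    return root_first[depth] if depth < len(root_first) else None


def correspondences(source_a, source_b, depth):
    """Generate (value_a, value_b, shared_concept) assertions.

    `depth` plays the role of the user-chosen level of accuracy (an
    assumption of this sketch): a larger depth demands a more specific
    shared concept, so fewer assertions are generated.
    """
    assertions = []
    for a, b in product(source_a, source_b):
        concept = ancestor_at_depth(a, depth)
        if concept is not None and concept == ancestor_at_depth(b, depth):
            assertions.append((a, b, concept))
    return assertions


if __name__ == "__main__":
    source_a = ["paracetamol", "amoxicillin"]
    source_b = ["ibuprofen", "paracetamol"]
    for level in (0, 1, 2):
        print(level, correspondences(source_a, source_b, level))
```

In this sketch, lowering the depth makes matching more permissive (at depth 0 every known value falls under the root concept), while raising it restricts assertions to values that share a more specific concept. Values missing from the taxonomy never match, which is one possible design choice among several.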

This work has been partly funded by the French Government in the framework of the REANIMATIC project.



Author information

Authors and Affiliations

Laboratoire PRiSM, Université de Versailles 45, avenue des Etats-Unis, 78035, Versailles Cedex, France
Zoubida Kedad
Laboratoire Cedric, CNAM, 192 rue Saint Martin, 75141, Paris cedex 3, France
Elisabeth Métais


Editor information

Editors and Affiliations

Department of Computer and Systems Sciences, Royal Institute of Technology, Forum 100, 16440, Kista, Sweden
Birger Andersson, Maria Bergholtz & Paul Johannesson


About this paper

Cite this paper

Kedad, Z., Métais, E. (2002). Ontology-Based Data Cleaning. In: Andersson, B., Bergholtz, M., Johannesson, P. (eds) Natural Language Processing and Information Systems. NLDB 2002. Lecture Notes in Computer Science, vol 2553. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36271-1_12


DOI: https://doi.org/10.1007/3-540-36271-1_12
Published: 28 February 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00307-6
Online ISBN: 978-3-540-36271-5
eBook Packages: Springer Book Archive
