Ontology-Based Data Cleaning

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2553)

Abstract

Multi-source information systems, such as data warehouses, are composed of a set of heterogeneous and distributed data sources. The relevant information is extracted from these sources, cleaned, transformed, and then integrated. Comparing two different data sources may reveal different kinds of heterogeneity: at the intensional level, the conflicts relate to the structure of the data; at the extensional level, they relate to the instances of the data. The process of detecting and resolving conflicts at the extensional level is known as data cleaning. In this paper, we focus on the problem of differences in terminology and propose a solution based on linguistic knowledge provided by a domain ontology. This approach is well suited to application domains that classify data intensively, such as medicine or pharmacology. The main idea is to automatically generate correspondence assertions between instances of objects. The user can parametrize this generation process by defining a level of accuracy expressed using the domain ontology.
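
The idea of generating correspondence assertions from a domain ontology can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not describe. It assumes a toy is-a hierarchy (the hypothetical `PARENT` table) and interprets the user-defined level of accuracy as a maximum number of generalization steps (`max_steps`, also an assumption). All function names are illustrative.

```python
from itertools import product

# Hypothetical toy ontology: each term points to its parent concept (is-a link).
PARENT = {
    "aspirin": "analgesic",
    "paracetamol": "analgesic",
    "analgesic": "drug",
    "amoxicillin": "antibiotic",
    "antibiotic": "drug",
}


def ancestors(term):
    """Return [term, parent, grandparent, ...] by following is-a links."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def correspondence(a, b, max_steps):
    """Return the closest shared concept of a and b, or None.

    The concept is accepted only if both values reach it in at most
    max_steps generalization steps (the assumed accuracy parameter).
    """
    chain_b = ancestors(b)
    for steps_a, concept in enumerate(ancestors(a)):
        if concept in chain_b:
            steps_b = chain_b.index(concept)
            return concept if max(steps_a, steps_b) <= max_steps else None
    return None


def generate_assertions(source1, source2, max_steps):
    """Pair every value of source1 with every value of source2 and keep
    the pairs that the ontology relates within the accuracy bound."""
    assertions = []
    for a, b in product(source1, source2):
        concept = correspondence(a, b, max_steps)
        if concept is not None:
            assertions.append((a, b, concept))
    return assertions


source1 = ["aspirin", "amoxicillin"]
source2 = ["paracetamol", "antibiotic"]
strict = generate_assertions(source1, source2, max_steps=1)
loose = generate_assertions(source1, source2, max_steps=2)
```

Under these assumptions, a strict setting (`max_steps=1`) relates only values that are at most one step from a shared concept, while a looser setting (`max_steps=2`) also relates values that merely share the broad concept `drug`.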

This work has been partly funded by the French Government in the framework of the REANIMATIC project.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kedad, Z., Métais, E. (2002). Ontology-Based Data Cleaning. In: Andersson, B., Bergholtz, M., Johannesson, P. (eds) Natural Language Processing and Information Systems. NLDB 2002. Lecture Notes in Computer Science, vol 2553. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36271-1_12

  • DOI: https://doi.org/10.1007/3-540-36271-1_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00307-6

  • Online ISBN: 978-3-540-36271-5

  • eBook Packages: Springer Book Archive
