Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3762)

Abstract

Real data are often affected by errors and inconsistencies. Many of these arise because schemas cannot represent a sufficiently wide range of constraints. Data cleaning is the process of identifying and, where possible, correcting the data quality problems that affect the data. Cleaning data requires gathering knowledge about the domain to which the data refer. However, existing data cleaning techniques still access this knowledge as a fragmented collection of heterogeneous rules and ad hoc data transformations. Furthermore, data cleaning methodologies have not yet been proposed for an important class of data: those based on the semistructured XML data model. In this paper we introduce the OXC framework, which offers a methodology for XML data cleaning based on a uniform representation of domain knowledge through an ontology. We describe how to define XML-related data quality metrics based on our domain knowledge representation, and we define several metrics related to the completeness dimension of data quality.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Milano, D., Scannapieco, M., Catarci, T. (2005). Using Ontologies for XML Data Cleaning. In: Meersman, R., Tari, Z., Herrero, P. (eds) On the Move to Meaningful Internet Systems 2005: OTM 2005 Workshops. OTM 2005. Lecture Notes in Computer Science, vol 3762. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11575863_75

  • DOI: https://doi.org/10.1007/11575863_75

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29739-0

  • Online ISBN: 978-3-540-32132-3

  • eBook Packages: Computer Science, Computer Science (R0)
