Abstract
Real data is often affected by errors and inconsistencies, many of which arise because schemas cannot express a sufficiently wide range of constraints. Data cleaning is the process of identifying and, where possible, correcting data quality problems. Cleaning data requires gathering knowledge about the domain to which the data refer. However, existing data cleaning techniques still access this knowledge as a fragmented collection of heterogeneous rules and ad hoc data transformations. Furthermore, data cleaning methodologies have not yet been proposed for an important class of data, namely data based on the semistructured XML data model. In this paper we introduce the OXC framework, which offers a methodology for XML data cleaning based on a uniform representation of domain knowledge through an ontology. We describe how to define XML-related data quality metrics based on our domain knowledge representation, and we define several metrics related to the completeness dimension of data quality.
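As a rough illustration of what a completeness metric over XML data might look like, the sketch below computes the fraction of expected property values that are actually present in a document. It is not the definition used in the OXC framework, which this abstract does not give. The EXPECTED dictionary is a hypothetical stand-in for an ontology, and the element names (people, person, name, email, birthdate) and the completeness function are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an ontology: each concept (element tag) maps to
# the properties (child tags) that the domain says an instance should have.
EXPECTED = {"person": ["name", "email", "birthdate"]}


def completeness(xml_text, expected=EXPECTED):
    """Return the fraction of expected property values that are present and non-empty."""
    root = ET.fromstring(xml_text)
    present = 0
    total = 0
    for concept, props in expected.items():
        # Visit every element whose tag matches the concept name.
        for node in root.iter(concept):
            for prop in props:
                total += 1
                child = node.find(prop)
                # A value counts only if the child exists and has non-blank text.
                if child is not None and (child.text or "").strip():
                    present += 1
    # With nothing expected, treat the document as trivially complete.
    return present / total if total else 1.0


doc = """<people>
  <person><name>Ada</name><email>ada@example.org</email><birthdate>1815-12-10</birthdate></person>
  <person><name>Bob</name><email></email></person>
</people>"""

score = completeness(doc)
```

Here the first person has all three expected values, while the second has only a name (the email is empty and the birthdate is missing), so four of the six expected values are present. Treating an empty element as missing is a design choice of this sketch; a real metric would need to state how null or default values are handled.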
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Milano, D., Scannapieco, M., Catarci, T. (2005). Using Ontologies for XML Data Cleaning. In: Meersman, R., Tari, Z., Herrero, P. (eds) On the Move to Meaningful Internet Systems 2005: OTM 2005 Workshops. OTM 2005. Lecture Notes in Computer Science, vol 3762. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11575863_75
DOI: https://doi.org/10.1007/11575863_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29739-0
Online ISBN: 978-3-540-32132-3