Using Ontologies for XML Data Cleaning

Milano, Diego; Scannapieco, Monica; Catarci, Tiziana

doi:10.1007/11575863_75

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3762)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

Real data is often affected by errors and inconsistencies. Many of them arise because schemas cannot represent a sufficiently wide range of constraints. Data cleaning is the process of identifying and possibly correcting data quality problems that affect the data. Cleaning data requires gathering knowledge about the domain to which the data refer. However, existing data cleaning techniques still access this knowledge as a fragmented collection of heterogeneous rules and ad hoc data transformations. Furthermore, data cleaning methodologies for an important class of data, those based on the semistructured XML data model, have not yet been proposed. In this paper we introduce the OXC framework, which offers a methodology for XML data cleaning based on a uniform representation of domain knowledge through an ontology. We describe how to define XML-related data quality metrics based on our domain knowledge representation, and define several metrics related to the completeness data quality dimension.
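
To make the idea of a completeness metric over XML data more concrete, the following is a minimal sketch in Python. It is not the OXC definition, which this page does not reproduce. It assumes that domain knowledge can be approximated by a mapping from element types to the properties they are expected to carry. The names EXPECTED, element_completeness, and document_completeness, as well as the sample document, are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain knowledge: for each element type, the child
# properties the domain says it should carry. This dict is only an
# illustrative stand-in for an ontology, not the OXC representation.
EXPECTED = {"person": {"name", "birthdate", "email"}}


def element_completeness(elem, expected):
    """Fraction of expected child properties that are present and non-empty."""
    required = expected.get(elem.tag)
    if not required:
        # Nothing is expected of this element type, so treat it as complete.
        return 1.0
    present = {
        child.tag
        for child in elem
        if child.tag in required and (child.text or "").strip()
    }
    return len(present) / len(required)


def document_completeness(xml_text, expected):
    """Average completeness over all elements whose type appears in `expected`."""
    root = ET.fromstring(xml_text)
    scores = [
        element_completeness(e, expected)
        for e in root.iter()
        if e.tag in expected
    ]
    return sum(scores) / len(scores) if scores else 1.0


# Made-up sample: the second person lacks a birthdate and has an empty email.
SAMPLE = """<people>
  <person><name>Ada</name><birthdate>1815-12-10</birthdate><email>ada@example.org</email></person>
  <person><name>Bob</name><email></email></person>
</people>"""

score = document_completeness(SAMPLE, EXPECTED)
```

In this sketch an empty element counts as missing, so the second person scores one out of three and the document average falls below 1. Whether empty values, optional properties, or nested structure should count toward completeness is a design choice that a real metric would take from the domain ontology rather than from a hard-coded dictionary.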

Author information

Authors and Affiliations

Dipartimento di Informatica e Sistemistica, Università degli Studi di Roma “La Sapienza”, Via Salaria 113, Roma, Italy
Diego Milano, Monica Scannapieco & Tiziana Catarci

Editor information

Editors and Affiliations

STARLab, Vrije Universiteit Brussel (VUB), Bldg G/10, Pleinlaan 2, 1050, Brussels, Belgium
Robert Meersman
School of Computer Science and Information Technology, RMIT University, Bld 10.10, 376-392 Swanston Street, 3001, Melbourne, VIC, Australia
Zahir Tari
Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo S/N, 28660, Boadilla del Monte, Madrid, Spain
Pilar Herrero

About this paper

Cite this paper

Milano, D., Scannapieco, M., Catarci, T. (2005). Using Ontologies for XML Data Cleaning. In: Meersman, R., Tari, Z., Herrero, P. (eds) On the Move to Meaningful Internet Systems 2005: OTM 2005 Workshops. OTM 2005. Lecture Notes in Computer Science, vol 3762. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11575863_75

DOI: https://doi.org/10.1007/11575863_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29739-0
Online ISBN: 978-3-540-32132-3
eBook Packages: Computer Science, Computer Science (R0)