Abstract
It is commonly accepted that ‘real world data are dirty’ [141]. Dirty data include errors at different levels, ranging from typographical mistakes and variations in single attributes, missing attribute values, and values that are swapped between attributes, all the way to completely missing attributes, or attributes that were wrongly matched across different database schemas. In order to achieve good quality data matching and deduplication, data pre-processing in the form of cleaning and standardisation of the input data is a crucial first step in the data matching process. Various techniques for achieving this goal have been developed, including rule-based systems that require manual development of standardisation rules, and systems based on statistical learning that can exploit large reference databases to automate the cleaning and standardisation process. This chapter covers the issues and challenges involved in data pre-processing, and it provides an overview of the different techniques that have been developed to improve the quality of the data used for data matching or deduplication.
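As a minimal sketch of the kind of rule-based standardisation the abstract mentions, the following example normalises a single address attribute using a small hand-written abbreviation table. All rules and field conventions here are illustrative assumptions, not taken from the chapter; a real rule-based system would rely on a much larger, manually curated rule set.

```python
import re

# Hypothetical abbreviation rules; a production system would use a far
# larger, manually developed table, as the chapter discusses.
ABBREVIATIONS = {
    "st": "street",
    "rd": "road",
    "ave": "avenue",
}

def standardise(value: str) -> str:
    """Clean and standardise one address attribute value."""
    # Lowercase and replace punctuation with spaces so that
    # 'St.' and 'st' compare equal after tokenisation.
    value = value.lower()
    value = re.sub(r"[^\w\s]", " ", value)
    # Expand known abbreviations token by token.
    tokens = [ABBREVIATIONS.get(t, t) for t in value.split()]
    return " ".join(tokens)

print(standardise("42 Main St."))  # -> "42 main street"
```

Standardising both input records in this way before comparison means that superficial variations (case, punctuation, abbreviations) no longer prevent otherwise identical values from matching.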
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this chapter
Christen, P. (2012). Data Pre-Processing. In: Data Matching. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31164-2_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31163-5
Online ISBN: 978-3-642-31164-2
eBook Packages: Computer Science, Computer Science (R0)