Data Pre-Processing

  • Peter Christen
Part of the Data-Centric Systems and Applications book series (DCSA)


It is commonly accepted that ‘real world data are dirty’ [141]. Dirty data include errors at different levels, ranging from typographical mistakes and variations in single attributes, missing attribute values, and values swapped between attributes, all the way to completely missing attributes, or attributes that were wrongly matched across different database schemas. In order to achieve good quality data matching and deduplication, data pre-processing in the form of cleaning and standardisation of the input data is a crucial first step in the data matching process. Various techniques for achieving this goal have been developed, including rule-based systems that require manual development of standardisation rules, and statistical learning based systems that can exploit large reference databases to automate the cleaning and standardisation process. This chapter covers the issues and challenges involved in data pre-processing, and it provides an overview of the different techniques that have been developed to improve the quality of the data used for data matching or deduplication.
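The rule-based approach mentioned above can be illustrated with a minimal sketch: a hand-crafted correction table maps common abbreviations and variants onto canonical forms, and each raw attribute value is lower-cased, stripped of punctuation, tokenised, and rewritten token by token. The function name `standardise` and the `CORRECTIONS` table are illustrative assumptions, not part of the chapter; real systems use far larger rule sets.

```python
import re

# Hypothetical correction table; production systems use much larger,
# domain-specific rule sets (e.g. for names and postal addresses).
CORRECTIONS = {
    "st": "street",
    "rd": "road",
    "ave": "avenue",
    "apt": "apartment",
}

def standardise(value: str) -> str:
    """Return a cleaned, standardised version of a raw attribute value."""
    value = value.lower()
    value = re.sub(r"[^\w\s]", " ", value)          # strip punctuation
    tokens = value.split()                          # tokenise on whitespace
    tokens = [CORRECTIONS.get(t, t) for t in tokens]  # apply rules per token
    return " ".join(tokens)

print(standardise("42 Main St., Apt. 7"))  # -> 42 main street apartment 7
```

Statistical learning based systems replace the manual table with models (for example, hidden Markov models trained on reference databases) that segment and label tokens automatically.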



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Peter Christen
  1. Research School of Computer Science, The Australian National University, Canberra, Australia
