Abstract
It is commonly accepted that ‘real world data are dirty’ [141]. Dirty data include errors at different levels, ranging from typographical mistakes and variations in single attributes, missing attribute values, and values that are swapped between attributes, all the way to completely missing attributes, or attributes that were wrongly matched across different database schemas. In order to achieve good quality data matching and deduplication, data pre-processing in the form of cleaning and standardisation of the input data is a crucial first step in the data matching process. Various techniques for achieving this goal have been developed, including rule-based systems that require manual development of standardisation rules, and systems based on statistical learning that can exploit large reference databases to automate the cleaning and standardisation process. This chapter covers the issues and challenges involved in data pre-processing, and it provides an overview of the different techniques that have been developed to improve the quality of the data used for data matching or deduplication.
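As a minimal sketch of the kind of rule-based standardisation the abstract mentions, the following example normalises a single address attribute using a small hand-written abbreviation table. All rules and field conventions here are illustrative assumptions, not taken from the chapter; a real rule-based system would rely on a much larger, manually curated rule set.

```python
import re

# Hypothetical abbreviation rules; a production system would use a far
# larger, manually developed table, as the chapter discusses.
ABBREVIATIONS = {
    "st": "street",
    "rd": "road",
    "ave": "avenue",
}

def standardise(value: str) -> str:
    """Clean and standardise one address attribute value."""
    # Lowercase and replace punctuation with spaces so that
    # 'St.' and 'st' compare equal after tokenisation.
    value = value.lower()
    value = re.sub(r"[^\w\s]", " ", value)
    # Expand known abbreviations token by token.
    tokens = [ABBREVIATIONS.get(t, t) for t in value.split()]
    return " ".join(tokens)

print(standardise("42 Main St."))  # -> "42 main street"
```

Standardising both input records in this way before comparison means that superficial variations (case, punctuation, abbreviations) no longer prevent otherwise identical values from matching.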
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this chapter
Christen, P. (2012). Data Pre-Processing. In: Data Matching. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31164-2_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31163-5
Online ISBN: 978-3-642-31164-2
eBook Packages: Computer Science, Computer Science (R0)