
Data Pre-Processing


Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

It is commonly accepted that ‘real world data are dirty’ [141]. Dirty data include errors at different levels, ranging from typographical mistakes and variations in single attributes, through missing attribute values and values that are swapped between attributes, to completely missing attributes or attributes that were wrongly matched across different database schemas. To achieve good-quality data matching and deduplication, data pre-processing in the form of cleaning and standardisation of the input data is a crucial first step in the data matching process. Various techniques for achieving this goal have been developed, including rule-based systems that require manual development of standardisation rules, and statistical learning based systems that can exploit large reference databases to automate the cleaning and standardisation process. This chapter covers the issues and challenges involved in data pre-processing and provides an overview of the techniques that have been developed to improve the quality of the data used for data matching or deduplication.
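As a rough illustration of the rule-based approach mentioned in the abstract, the following minimal Python sketch cleans an attribute value and then applies manually written standardisation rules. The function names `clean` and `standardise`, and the entries in the `ABBREVIATIONS` table, are hypothetical choices made for this example; they are not taken from the chapter, and a real system would need a far larger, domain-specific rule set.

```python
import re

# Hypothetical, manually written standardisation rules (illustrative only,
# not taken from the chapter): map common abbreviations to a full form.
ABBREVIATIONS = {
    "st": "street",
    "rd": "road",
    "ave": "avenue",
}


def clean(value: str) -> str:
    """Lowercase the value, replace punctuation with spaces,
    and collapse runs of whitespace into single spaces."""
    value = value.lower()
    value = re.sub(r"[^\w\s]", " ", value)
    return re.sub(r"\s+", " ", value).strip()


def standardise(value: str) -> str:
    """Clean the value, then expand each token found in the rule table."""
    tokens = clean(value).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(standardise("  42 Main St., "))
    print(standardise("Miller  RD"))
```

A lookup table like this is easy to audit but has to be maintained by hand, which is the cost the abstract attributes to rule-based systems; the statistical learning based alternative would instead derive such mappings from reference data.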


Notes

  1. See: http://www.google.com/press/zeitgeist.html.

  2. See: http://www.thinkbabynames.com/meaning/0/Amelia.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Christen, P. (2012). Data Pre-Processing. In: Data Matching. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31164-2_3

  • DOI: https://doi.org/10.1007/978-3-642-31164-2_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31163-5

  • Online ISBN: 978-3-642-31164-2

  • eBook Packages: Computer Science, Computer Science (R0)
