It is commonly accepted that ‘real world data are dirty’. Dirty data include errors at different levels, ranging from typographical mistakes and variations in single attributes, missing attribute values, and values that are swapped between attributes, all the way to completely missing attributes, or attributes that were wrongly matched across different database schemas. In order to achieve good quality data matching and deduplication, data pre-processing in the form of cleaning and standardisation of the input data is a crucial first step in the data matching process. Various techniques for achieving this goal have been developed, including rule-based systems that require manual development of standardisation rules, and statistical learning based systems that can exploit large reference databases to automate the cleaning and standardisation process. This chapter covers the issues and challenges involved in data pre-processing, and it provides an overview of the different techniques that have been developed to improve the quality of the data used for data matching or deduplication.
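As a minimal sketch of the rule-based standardisation idea mentioned above, the following Python snippet cleans a single attribute value and rewrites common variants to a canonical form. The specific rules (the `SUBSTITUTIONS` table and the cleaning steps) are illustrative assumptions, not rules taken from this chapter.

```python
import re

# Illustrative substitution rules (assumed examples): map common
# abbreviations and spelling variants to one canonical form.
SUBSTITUTIONS = {
    "st": "street",
    "rd": "road",
    "ave": "avenue",
}

def standardise(value: str) -> str:
    """Apply simple rule-based cleaning to a single attribute value."""
    # Lower-case and strip surrounding whitespace.
    cleaned = value.strip().lower()
    # Replace punctuation characters with spaces, keeping letters,
    # digits, underscores and whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    # Collapse runs of whitespace into single spaces.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Rewrite each token to its canonical form where a rule exists.
    tokens = [SUBSTITUTIONS.get(tok, tok) for tok in cleaned.split()]
    return " ".join(tokens)

print(standardise("42 Main St."))   # -> 42 main street
```

A real rule-based standardisation system would hold many such rules, typically organised per attribute type (names, addresses, dates), whereas statistical learning based approaches derive the rewriting behaviour from large reference databases instead of hand-written tables.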