Abstract
The naive approach to matching two databases, comparing each record from one database with all records from the other, has quadratic computational complexity. This approach is clearly not feasible for today’s large databases, which contain many millions or even billions of records. Not only would the number of record pair comparisons be huge, but the proportion of true matches among them would also be very small, because the number of matches grows only linearly with the size of the databases to be matched while the number of record pair comparisons grows quadratically. Techniques are therefore required that reduce the number of record pairs that are compared, by generating only candidate record pairs that are likely to refer to true matches. This process has traditionally been referred to as blocking, while more generally it is known as indexing. Various indexing techniques for data matching have been developed in the past decade by researchers from different fields. This chapter covers the issues that need to be considered in order to achieve efficient indexing, provides an overview of the major indexing techniques proposed, and includes a comparative evaluation of these techniques.
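To make the contrast between naive comparison and blocking concrete, the following is a minimal sketch of traditional blocking, not the chapter's own implementation. The record layout, the field names (`id`, `surname`, `postcode`), the toy data, and the `blocking_key` definition are all assumptions chosen for illustration. Records are grouped by a blocking key, and only records that share a key value are paired as candidates.

```python
from collections import defaultdict


def blocking_key(record):
    # Hypothetical blocking key: first letter of the surname plus the postcode.
    return (record["surname"][:1].lower(), record["postcode"])


def candidate_pairs(db_a, db_b, key=blocking_key):
    """Yield (id_a, id_b) for records from the two databases that share a key value."""
    # Index the second database by blocking key value.
    blocks_b = defaultdict(list)
    for rec in db_b:
        blocks_b[key(rec)].append(rec["id"])
    # Pair each record in the first database only with records in its block.
    for rec in db_a:
        for id_b in blocks_b.get(key(rec), ()):
            yield rec["id"], id_b


# Toy data, invented for this example.
db_a = [
    {"id": "a1", "surname": "Smith", "postcode": "2600"},
    {"id": "a2", "surname": "Miller", "postcode": "2602"},
    {"id": "a3", "surname": "Nguyen", "postcode": "2617"},
]
db_b = [
    {"id": "b1", "surname": "Smyth", "postcode": "2600"},
    {"id": "b2", "surname": "Miller", "postcode": "2602"},
    {"id": "b3", "surname": "Myller", "postcode": "2602"},
    {"id": "b4", "surname": "Jones", "postcode": "2905"},
]

naive_count = len(db_a) * len(db_b)          # every record against every record
pairs = sorted(candidate_pairs(db_a, db_b))  # only records sharing a block

print(naive_count, len(pairs))
```

On this toy input the naive approach would compare 12 record pairs, while the sketch generates 3 candidate pairs. The choice of blocking key is a trade-off: a stricter key yields fewer candidates but risks missing true matches whose key values differ, for example because of a typo in the first letter of a surname.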
Notes
- 1.
- 2. Available from: https://sourceforge.net/projects/febrl/
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Christen, P. (2012). Indexing. In: Data Matching. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31164-2_4
DOI: https://doi.org/10.1007/978-3-642-31164-2_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31163-5
Online ISBN: 978-3-642-31164-2
eBook Packages: Computer Science (R0)