Abstract
In many data mining projects, information from multiple data sources needs to be integrated, combined or linked to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. Most of the time the linkage process is hampered by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today's large data collections is increasingly difficult with traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach that is implemented transparently to the user, and a data set generator that allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.
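To make the problem concrete, the following is a minimal sketch of linking two data sets that share no unique entity identifier. It is not Febrl's API or its probabilistic method: the field names, the `blocking_key`, `similarity` and `link` functions, the averaged score and the 0.85 threshold are all assumptions chosen for illustration. It shows only the general idea of indexing (blocking) to limit comparisons, followed by approximate string comparison.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(record):
    # Hypothetical blocking key: postcode plus first letter of the surname.
    return (record["postcode"], record["surname"][:1].lower())


def similarity(a, b):
    # Approximate string similarity in [0, 1] using the standard library.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(records_a, records_b, threshold=0.85):
    # Index data set B by blocking key so that only records falling
    # into the same block are compared with each other.
    blocks = defaultdict(list)
    for rec in records_b:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for rec_a in records_a:
        for rec_b in blocks.get(blocking_key(rec_a), []):
            # Illustrative score: plain average of two field similarities.
            score = (similarity(rec_a["given_name"], rec_b["given_name"])
                     + similarity(rec_a["surname"], rec_b["surname"])) / 2
            if score >= threshold:
                matches.append((rec_a["id"], rec_b["id"], score))
    return matches


# Made-up example records; neither data set carries a shared identifier.
patients = [
    {"id": "a1", "given_name": "Katherine", "surname": "Miller", "postcode": "2600"},
    {"id": "a2", "given_name": "Mary", "surname": "Smith", "postcode": "2601"},
]
customers = [
    {"id": "b1", "given_name": "Catherine", "surname": "Miller", "postcode": "2600"},
    {"id": "b2", "given_name": "Marie", "surname": "Smyth", "postcode": "2913"},
]

matches = link(patients, customers)
```

Blocking trades recall for speed: the second pair is never compared because the postcodes differ, which is the kind of loss that more careful indexing methods aim to reduce.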
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Christen, P., Churches, T., Hegland, M. (2004). Febrl – A Parallel Open Source Data Linkage System. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_75
DOI: https://doi.org/10.1007/978-3-540-24775-3_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22064-0
Online ISBN: 978-3-540-24775-3
eBook Packages: Springer Book Archive