Febrl – A Parallel Open Source Data Linkage System

Christen, Peter; Churches, Tim; Hegland, Markus

doi:10.1007/978-3-540-24775-3_75

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3056)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

In many data mining projects, information from multiple data sources needs to be integrated, combined or linked to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. In most cases no common unique entity identifier is available, which makes the linkage process non-trivial. Linking today's large data collections is increasingly difficult with traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach that is implemented transparently to the user, and a data set generator that allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.
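
To make the idea of linking records without a shared identifier concrete, the following is a minimal Python sketch of indexing (blocking) followed by approximate string comparison. It is not Febrl's code or API: the function names, the record fields (`id`, `given`, `surname`, `postcode`), the blocking key and the 0.85 threshold are all assumptions chosen for illustration, and the averaged similarity score is a simple stand-in, not the probabilistic approach the paper describes.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(record):
    # Hypothetical blocking key: first letter of the surname plus the postcode.
    return (record["surname"][:1].lower(), record["postcode"])


def similarity(a, b):
    # Approximate string similarity in [0, 1], using only the standard library.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(records_a, records_b, threshold=0.85):
    # Index data set B by blocking key so that only records sharing a block
    # are compared, instead of comparing every pair across both data sets.
    blocks = defaultdict(list)
    for rec in records_b:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for rec_a in records_a:
        for rec_b in blocks.get(blocking_key(rec_a), []):
            # Illustrative score: mean of given-name and surname similarity.
            score = (similarity(rec_a["given"], rec_b["given"])
                     + similarity(rec_a["surname"], rec_b["surname"])) / 2
            if score >= threshold:
                matches.append((rec_a["id"], rec_b["id"], round(score, 2)))
    return matches
```

Calling `link(records_a, records_b)` on two lists of dictionaries with the assumed fields returns `(id_a, id_b, score)` tuples for candidate pairs whose score reaches the threshold. Blocking keeps the number of comparisons manageable for large data sets, at the cost of missing true matches whose blocking keys differ.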

Author information

Authors and Affiliations

Department of Computer Science, Australian National University, Canberra, ACT, 0200, Australia
Peter Christen
New South Wales Department of Health, Centre for Epidemiology and Research, Locked Mail Bag 961, North Sydney, NSW, 2059, Australia
Tim Churches
Centre for Mathematics and its Applications, Mathematical Sciences Institute, Australian National University, Canberra, ACT, 0200, Australia
Markus Hegland

Editor information

Editors and Affiliations

School of Engineering and Information Technology, Deakin University, VIC 3125, Australia
Honghua Dai
University of Illinois at Urbana-Champaign, 61801, Urbana, IL, USA
Ramakrishnan Srikant
Faculty of Engineering and Information Technology, Centre for Quantum Computation and Intelligent Systems, and Australian ACS National Committee for Artificial Intelligence, University of Technology, Sydney, Australia
Chengqi Zhang

About this paper

Cite this paper

Christen, P., Churches, T., Hegland, M. (2004). Febrl – A Parallel Open Source Data Linkage System. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_75

DOI: https://doi.org/10.1007/978-3-540-24775-3_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22064-0
Online ISBN: 978-3-540-24775-3
eBook Packages: Springer Book Archive
