Febrl – A Parallel Open Source Data Linkage System

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3056)

Abstract

In many data mining projects, information from multiple data sources needs to be integrated, combined or linked to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. In most cases the linkage process is complicated by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today's large data collections is increasingly difficult with traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach which is implemented transparently to the user, and a data set generator which allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.
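To make the linkage problem concrete, the following is a minimal Python sketch of linking two record collections that share no unique identifier. It is not Febrl's code or API. The field names, the blocking key (first letter of the surname plus postcode), the averaged similarity score and the 0.85 threshold are all assumptions chosen for illustration, and the standard library's difflib stands in for the approximate string comparators a real linkage system would use.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(record):
    """Hypothetical blocking key: first letter of the surname plus the postcode."""
    return (record["surname"][:1].lower(), record["postcode"])


def similarity(a, b):
    """Approximate string similarity in [0, 1], computed with the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(records_a, records_b, threshold=0.85):
    """Return (id_a, id_b, score) for pairs scoring at or above the threshold."""
    # Index collection B by blocking key so that only records sharing a key
    # are compared, instead of every possible pair.
    index = defaultdict(list)
    for rec in records_b:
        index[blocking_key(rec)].append(rec)

    matches = []
    for rec_a in records_a:
        for rec_b in index.get(blocking_key(rec_a), []):
            # Average the field similarities (an arbitrary, illustrative choice).
            score = (similarity(rec_a["given"], rec_b["given"])
                     + similarity(rec_a["surname"], rec_b["surname"])) / 2
            if score >= threshold:
                matches.append((rec_a["id"], rec_b["id"], score))
    return matches


# Made-up toy records, for illustration only.
RECORDS_A = [
    {"id": "a1", "given": "Jonathan", "surname": "Smith", "postcode": "2600"},
    {"id": "a2", "given": "Mary", "surname": "Jones", "postcode": "2000"},
]
RECORDS_B = [
    {"id": "b1", "given": "Jonathon", "surname": "Smith", "postcode": "2600"},
    {"id": "b2", "given": "Maria", "surname": "Brown", "postcode": "2000"},
    {"id": "b3", "given": "Jon", "surname": "Smyth", "postcode": "2601"},
]

if __name__ == "__main__":
    for id_a, id_b, score in link(RECORDS_A, RECORDS_B):
        print(id_a, id_b, round(score, 4))
```

The index restricts comparisons to records that share a blocking key, which is the general idea behind indexing in record linkage. The trade-off is that a true match whose key fields differ (for example, a mistyped postcode) is never compared at all.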

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Christen, P., Churches, T., Hegland, M. (2004). Febrl – A Parallel Open Source Data Linkage System. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_75

  • DOI: https://doi.org/10.1007/978-3-540-24775-3_75

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22064-0

  • Online ISBN: 978-3-540-24775-3

  • eBook Packages: Springer Book Archive
