Automatic Training Example Selection for Scalable Unsupervised Record Linkage

  • Peter Christen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5012)


Linking records from two or more databases is an increasingly important data preparation step in many data mining projects, as linked data can enable studies that are not feasible otherwise, or that would require expensive collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record linkage is the accurate classification of record pairs into matches and non-matches. Many modern classification techniques are based on supervised machine learning and thus require training data, which is often not available in real-world situations. A novel two-step approach to unsupervised record pair classification is presented in this paper. In the first step, training examples are selected automatically, and they are then used in the second step to train a binary classifier. An experimental evaluation shows that this approach can outperform k-means clustering and also be much faster than other classification techniques.
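The two-step idea can be illustrated with a minimal sketch. The abstract does not say how training examples are selected or which classifier is trained, so everything below is an assumption made for illustration: each record pair is taken to be a vector of per-field similarities in [0, 1], pairs close to all-ones or all-zeros are used as seed examples, and a simple nearest-centroid rule stands in for the binary classifier. The names `select_seeds`, `train`, and `threshold` are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def select_seeds(vectors: List[Vector],
                 threshold: float = 0.3) -> Tuple[List[Vector], List[Vector]]:
    """Step 1: pick likely matches and non-matches without any labels.

    Assumption: every vector holds per-field similarities in [0, 1], so a
    pair near all-ones is a likely match and one near all-zeros is not.
    """
    matches, non_matches = [], []
    for v in vectors:
        dist_to_ones = sum(1.0 - x for x in v) / len(v)
        dist_to_zeros = sum(v) / len(v)
        if dist_to_ones <= threshold:
            matches.append(v)
        elif dist_to_zeros <= threshold:
            non_matches.append(v)
        # Ambiguous pairs are left out of the training set.
    return matches, non_matches


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train(matches: List[Vector],
          non_matches: List[Vector]) -> Callable[[Vector], bool]:
    """Step 2: fit a binary classifier on the seeds.

    A nearest-centroid rule is used here only as a stand-in classifier.
    """
    if not matches or not non_matches:
        raise ValueError("need at least one seed example of each class")
    c_match, c_non = centroid(matches), centroid(non_matches)

    def classify(v: Vector) -> bool:
        d_match = sum((a - b) ** 2 for a, b in zip(v, c_match))
        d_non = sum((a - b) ** 2 for a, b in zip(v, c_non))
        return d_match < d_non  # True means "match"

    return classify


# Toy similarity vectors (made-up values, three compared fields per pair).
pairs = [
    (1.0, 0.9, 1.0),
    (0.9, 1.0, 0.8),
    (0.0, 0.1, 0.0),
    (0.1, 0.0, 0.2),
    (0.7, 0.6, 0.4),  # ambiguous: not used for training
    (0.4, 0.5, 0.3),  # ambiguous: not used for training
]
matches, non_matches = select_seeds(pairs)
classify = train(matches, non_matches)
labels = [classify(v) for v in pairs]
```

In this toy run the first four pairs become seed examples, and the trained rule then labels all six pairs, including the two ambiguous ones that were excluded from training.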


data linkage, entity resolution, clustering, support vector machines, data mining preprocessing





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Peter Christen (1)
  1. Department of Computer Science, The Australian National University, Canberra, Australia
