
Effective record linkage for mining campaign contribution data


Up to now, most campaign contribution data have been reported at the level of the donation. While donation-level data are interesting, one often needs information at the level of the donor. Obtaining it is difficult, as there is neither a single repository of donations nor any standard across existing repositories. To mine campaign contribution data more meaningfully, political scientists need an accurate way of grouping, or linking, donations made by the same donor. In this paper, we describe a record linkage technique that is applicable to various sources and across large geographical areas. We show how it may be effectively applied to nationwide donation data and report new, previously unattainable results about campaign contributors in the 2007–2008 US election cycle.



  1. See

  2. See

  3. The privacy concern may actually be quite prevalent as these same individuals (found in our linkage but who did not report giving to any candidates in the survey) are also more than twice as likely as others not to report their income.

  4. Even when the “offending” individuals are not removed, FPR does not exceed 0.039 and precision does not go below 0.71 for any of the candidates.

  5. See




Our thanks to Yao Huang, Weston Rowley, David Wilcox and David Lassen for research assistance and computer code. We are also grateful to David Magleby and Joseph Olson for their support, encouragement, and advice. Finally, we thank the anonymous reviewers for their very useful comments and suggestions.

Author information


Corresponding author

Correspondence to C. Giraud-Carrier.


About this article


Cite this article

Giraud-Carrier, C., Goodliffe, J., Jones, B.M. et al. Effective record linkage for mining campaign contribution data. Knowl Inf Syst 45, 389–416 (2015).



  • Record linkage
  • Multiset distance
  • Domain knowledge
  • Campaign contributions
  • Political data