Knowledge and Information Systems, Volume 45, Issue 2, pp 389–416

Effective record linkage for mining campaign contribution data

  • C. Giraud-Carrier
  • J. Goodliffe
  • B. M. Jones
  • S. Cueva
Regular Paper

Abstract

Up to now, most campaign contribution data have been reported at the level of the donation. While donation-level data are interesting, one often needs information at the level of the donor. Obtaining information at that level is difficult, as there is neither a single repository of donations nor any standard across existing repositories. To mine campaign contribution data more meaningfully, political scientists need an accurate way of grouping, or linking, together donations made by the same donor. In this paper, we describe a record linkage technique that is applicable to various sources and across large geographical areas. We show how it may be effectively applied in the context of nationwide donation data and report on new, previously unattainable results about campaign contributors in the 2007–2008 US election cycle.

Keywords

  • Record linkage
  • Multiset distance
  • Domain knowledge
  • Campaign contributions
  • Political data

Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • C. Giraud-Carrier (1)
  • J. Goodliffe (2)
  • B. M. Jones (2)
  • S. Cueva (1)
  1. Department of Computer Science, Brigham Young University, Provo, USA
  2. Department of Political Science, Brigham Young University, Provo, USA
