Scalable Automated Linking Technology for Big Data Computing

Middleton, Anthony M.; Bayliss, David; Foreman, Bob

doi:10.1007/978-3-319-44550-2_7

Abstract

The massive amount of data being collected at many organizations has led to what is now called the “Big Data” problem, which limits the ability of organizations to process and use their data effectively and makes record linkage even more challenging [3, 13]. New high-performance, data-intensive computing architectures that support scalable parallel processing, such as Hadoop MapReduce and HPCC, allow government, commercial, and research organizations to process massive amounts of data and solve complex data processing problems, including record linkage.

This chapter has been developed by Anthony M. Middleton, David Bayliss, and Bob Foreman from LexisNexis.

References

Christen P. Automatic record linkage using seeded nearest neighbor and support vector machine classification. In: Proceedings of the KDD ’08 14th ACM SIGKDD international conference on knowledge discovery and data mining, Las Vegas, NV; 2008. p. 151–9.
Herzog TN, Scheuren FJ, Winkler WE. Data quality and record linkage techniques. New York: Springer Science and Business Media LLC; 2007.
Middleton AM. Data-intensive technologies for cloud computing. In: Furht B, Escalante A, editors. Handbook of cloud computing. New York: Springer; 2010. p. 83–136.
Winkler WE. Record linkage software and methods for merging administrative lists (No. Statistical Research Report Series No. RR/2001/03). Washington, DC: US Bureau of the Census; 2001.
Cohen W, Richman J. Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the KDD ’02 eighth ACM SIGKDD international conference on knowledge discovery and data mining, Edmonton, Alberta, Canada; 2002.
Cochinwala M, Dalal S, Elmagarmid AK, Verykios VV. Record matching: past, present and future (No. Technical Report CSD-TR #01–013): Department of Computer Sciences, Purdue University; 2001.
Gravano L, Ipeirotis PG, Koudas N, Srivastava D. Text joins in an RDBMS for web data integration. In: Proceedings of the WWW ’03 12th international conference on world wide web, Budapest, Hungary, 20–24 May; 2003.
Cohen WW. Data integration using similarity joins and a word-based information representation language. ACM Trans Inf Syst. 2000;18(3).
Gu L, Baxter R, Vickers D, Rainsford C. Record linkage: current practice and future directions (No. CMIS Technical Report No. 03/83): CSIRO Mathematical and Information Sciences; 2003.
Winkler WE. Advanced methods for record linkage. In: Proceedings of the section on survey research methods, American Statistical Association; 1994. p. 274–9.
Winkler WE. Matching and record linkage. In: Cox BG, Binder DA, Chinnappa BN, Christianson MJ, Colledge MJ, Kott PS, editors. Business survey methods. New York: Wiley; 1995.
Jones KS. A statistical interpretation of term specificity and its application in information retrieval. J Doc. 1972;28(1):11–21.
Robertson S. Understanding inverse document frequency: on theoretical arguments for IDF. J Doc. 2004;60(5):503–20.
Newcombe HB, Kennedy JM, Axford SJ, James AP. Automatic linkage of vital records. Science. 1959;130:954–9.
Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64(328):1183–210.
Winkler WE. Frequency-based matching in Fellegi-Sunter model of record linkage. In: Proceedings of the section on survey research methods, American Statistical Association; 1989. p. 778–8.
Cohen WW, Ravikumar P, Fienberg SE. A comparison of string distance metrics for name matching tasks. In: Proceedings of the IJCAI-03 workshop on information integration, Acapulco, Mexico, August; 2003. p. 73–8.
Koudas N, Marathe A, Srivastava D. Flexible string matching against large databases in practice. In: Proceedings of the 30th VLDB conference, Toronto, Canada; 2004. p. 1078–86.
Bilenko M, Mooney RJ. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the KDD ’03 ninth ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, 24–27 August; 2003. p. 39–48.
Branting LK. A comparative evaluation of name-matching algorithms. In: Proceedings of the ICAIL ’03 9th international conference on artificial intelligence and law, Edinburgh, Scotland; 2003. p. 224–32.
Cohen W, Richman J. Learning to match and cluster entity names. In: Proceedings of the ACM SIGIR’01 workshop on mathematical/formal methods in IR; 2001.
Dunn HL. Record linkage. Am J Public Health. 1946;36:1412–5.
Maggi F. A survey of probabilistic record matching models, techniques and tools (No. Advanced Topics in Information Systems B, Cycle XXII, Scientific Report TR-2008-22): DEI, Politecnico di Milano; 2008.
Newcombe HB, Kennedy JM. Record linkage. Commun ACM. 1962;5(11):563–6.
Winkler WE. The state of record linkage and current research problems. U.S. Bureau of the Census Statistical Research Division; 1999.

Author information

Authors and Affiliations

LexisNexis Risk Solutions, Alpharetta, GA, USA
Anthony M. Middleton, David Bayliss & Bob Foreman

About this chapter

Cite this chapter

Middleton, A.M., Bayliss, D., Foreman, B. (2016). Scalable Automated Linking Technology for Big Data Computing. In: Big Data Technologies and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-44550-2_7

DOI: https://doi.org/10.1007/978-3-319-44550-2_7
Published: 17 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44548-9
Online ISBN: 978-3-319-44550-2
eBook Packages: Computer Science, Computer Science (R0)
