Towards a Record Linkage Layer to Support Big Data Integration

  • Conference paper

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 373)

Abstract

Record linkage is a crucial step in big data integration (BDI). It is also one of its major challenges, given the increasing number of structured data sources that need to be linked and do not share common attributes. Our research in progress aims to develop a record linkage layer that assists data scientists in integrating a variety of data sources. A structured literature review of 68 papers reveals (1) key data sets, (2) available classification algorithms (match or no match), and (3) similarity measures to consider in BDI projects. The results highlight foundational requirements for the development of the record linkage layer, such as processing unstructured attributes. As BDI emerges as a priority for industry, our work proposes a record linkage layer that provides similarity measures and integration algorithms while assisting in their selection. A record linkage layer can contribute to big data adoption in industry settings and improve the quality of big data integration processes to effectively support business decision-making.
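
To make the two building blocks named in the abstract concrete, the following is a minimal sketch of a similarity measure combined with a match / no match decision. It is not the layer proposed in the paper: the attribute names, the example records, the averaging rule, and the threshold of 0.75 are all hypothetical, and the string similarity comes from Python's standard `difflib` module rather than from any measure identified in the literature review. The sketch also assumes both records expose the same attributes, which is precisely the case the abstract says cannot be taken for granted.

```python
from difflib import SequenceMatcher
from typing import Mapping, Sequence


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1], computed by difflib on normalized input."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_similarity(rec_a: Mapping[str, str], rec_b: Mapping[str, str],
                      fields: Sequence[str]) -> float:
    """Unweighted mean of the per-attribute similarities (an assumed rule)."""
    scores = [similarity(rec_a[f], rec_b[f]) for f in fields]
    return sum(scores) / len(scores)


def classify(rec_a: Mapping[str, str], rec_b: Mapping[str, str],
             fields: Sequence[str], threshold: float = 0.75) -> str:
    """Threshold-based decision; the threshold value is illustrative only."""
    score = record_similarity(rec_a, rec_b, fields)
    return "match" if score >= threshold else "no match"


# Hypothetical company records from two different sources.
source_a = {"name": "ACME Corporation", "city": "Oldenburg"}
source_b = {"name": "Acme Corp.", "city": "Oldenburg"}
source_c = {"name": "Globex GmbH", "city": "Berlin"}

FIELDS = ["name", "city"]
print(classify(source_a, source_b, FIELDS))
print(classify(source_a, source_c, FIELDS))
```

A fixed threshold is the simplest possible classifier. In practice the decision step could just as well be a trained model, and choosing the measure and classifier for a given pair of sources is the selection problem the proposed layer is meant to assist with.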

Notes

  1. wwww.projekt-trace.de

  2. https://ieeexplore.ieee.org/Xplore/home.jsp

  3. https://www.sciencedirect.com/

  4. https://dl.acm.org/dl.cfm

  5. DBLP-ACM, DBLP-Scholar, Abt-Buy, and Amazon-Google: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution

  6. DBLP: https://dblp.org/

  7. Restaurant, Census, and Cora: https://hpi.de/naumann/projects/repeatability/datasets/

  8. FEBRL: https://recordlinkage.readthedocs.io/en/latest/ref-datasets.html

  9. Crunchbase: https://data.crunchbase.com/docs/; Standard and Poors 500: https://datahub.io/core/s-and-p-500-companies-financials; GLEIF: https://www.gleif.org/en/lei-data/gleif-concatenated-file/; UPC Database: https://www.upcitemdb.com/

Author information

Correspondence to Felix Kruse.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this paper

Kruse, F. (2019). Towards a Record Linkage Layer to Support Big Data Integration. In: Abramowicz, W., Corchuelo, R. (eds) Business Information Systems Workshops. BIS 2019. Lecture Notes in Business Information Processing, vol 373. Springer, Cham. https://doi.org/10.1007/978-3-030-36691-9_52

  • DOI: https://doi.org/10.1007/978-3-030-36691-9_52

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-36690-2

  • Online ISBN: 978-3-030-36691-9

  • eBook Packages: Computer Science, Computer Science (R0)
