The Case for Holistic Data Integration

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9809)

Abstract

Current data integration approaches are mostly limited to few data sources, partly due to the use of binary match approaches between pairs of sources. We thus advocate for the development of more holistic, clustering-based data integration approaches that scale to many data sources. We outline different use cases and provide an overview of initial approaches for holistic schema/ontology integration and entity clustering. The discussion also considers open data repositories and so-called knowledge graphs.

References

  1. 1.
    Arasu, A., Chaudhuri, S., Chen, Z., Ganjam, K., Kaushik, R., Narasayya, V.R.: Experiences with using data cleaning technology for Bing services. IEEE Data Eng. Bull. 35(2), 14–23 (2012)Google Scholar
  2. 2.
    Arnold, P., Rahm, E.: SemRep: A repository for semantic mapping. In: Proceedings of the BTW, pp. 177–194 (2015)Google Scholar
  3. 3.
    Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: A nucleus for a web of open data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  4. 4.
    Balakrishnan, S., Halevy, A.Y., Harb, B., Lee, H., Madhavan, J., Rostamizadeh, A., Shen, W., Wilder, K., Wu, F., Yu, C.: Applying web tables in practice. In: Proceedings of the CIDR (2015)Google Scholar
  5. 5.
    Barbosa, L., Freire, J., Silva, A.: Organizing hidden-web databases by clustering visible web documents. In: Proceedings of the ICDE, pp. 326–335 (2007)Google Scholar
  6. 6.
    Batini, C., Lenzerini, M., Navathe, S.B.: A comparative analysis of methodologies for database schema integration. ACM Comput. Surv. 18(4), 323–364 (1986)CrossRefGoogle Scholar
  7. 7.
    Bellahsene, Z., Bonifati, A., Rahm, E. (eds.): Schema Matching and Mapping. Data-Centric Systems and Applications. Springer, Heidelberg (2011)MATHGoogle Scholar
  8. 8.
    Bellare, K., Curino, C., Machanavajihala, A., Mika, P., Rahurkar, M., Sane, A.: WOO: A scalable and multi-tenant platform for continuous knowledge base synthesis. PVLDB 6(11), 1114–1125 (2013)Google Scholar
  9. 9.
    Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1), 1 (2009)CrossRefGoogle Scholar
  10. 10.
    Bodenreider, O.: The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32(suppl 1), D267–D270 (2004)CrossRefGoogle Scholar
  11. 11.
    Böhm, C., de Melo, G., Naumann, F., Weikum, G.: LINDA: distributed Web-of-Data-scale entity matching. In: Proceedings of the CIKM, pp. 2104–2108 (2012)Google Scholar
  12. 12.
    Chang, K.C.-C., He, B., Zhang, Z.: Toward large scale integration: Building a MetaQuerier over databases on the web. In: Proceedings of the CIDR (2005)Google Scholar
  13. 13.
    Christen, P.: Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012)Google Scholar
  14. 14.
    Sarma, A.D. Dong, X., Halevy, A.: Bootstrapping pay-as-you-go data integration systems. In: Proceedings of the SIGMOD, pp. 861–874 (2008)Google Scholar
  15. 15.
    Deng, D., Jiang, Y., Li, G., Li, J., Yu, C.: Scalable column concept determination for web tables using large knowledge bases. PVLDB 6(13), 1606–1617 (2013)Google Scholar
  16. 16.
    Do, H.-H., Rahm, E.: COMA: A system for flexible combination of schema matching approaches. In: Proceedings of the VLDB, pp. 610–621 (2002)Google Scholar
  17. 17.
    Doan, A., Halevy, A.Y., Ives, Z.G.: Principles of Data Integration. Morgan Kaufmann, San Francisco (2012)Google Scholar
  18. 18.
    Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In: Proceedings of the SIGKDD, pp. 601–610 (2014)Google Scholar
  19. 19.
    Eberius, J., Damme, P., Braunschweig, K., Thiele, M., Lehner, W.: Publish-time data integration for open data platforms. In: Proceedings of the ACM Workshop on Open Data (2013)Google Scholar
  20. 20.
    Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE TKDE 19(1), 1–16 (2007)Google Scholar
  21. 21.
    Euzenat, J., Shvaiko, P., et al.: Ontology Matching. Springer, Heidelberg (2007)MATHGoogle Scholar
  22. 22.
    Galkin, M., Auer, S., Scerri, S.: Enterprise knowledge graphs: A survey. Technical report (2016). http://www.researchgate.net
  23. 23.
    Gross, A., Hartung, M., Kirsten, T., Rahm, E.: Mapping composition for matching large life science ontologies. In: Proceedings of the ICBO (2011)Google Scholar
  24. 24.
    Gruenheid, A., Dong, X.L., Srivastava, D.: Incremental record linkage. PVLDB 7(9), 697–708 (2014)Google Scholar
  25. 25.
    Gruetze, T., Böhm, C., Naumann, F.: Holistic and scalable ontology alignment for linked open data. In: Proceedings of the LDOW (2012)Google Scholar
  26. 26.
    Gupta, R., Halevy, A., Wang, X., Whang, S.E., Wu, F.: Biperpedia: An ontology for search applications. PVLDB 7(7), 505–516 (2014)Google Scholar
  27. 27.
    Hai, R., Geisler, S., Quix, C.: Constance: An intelligent data lake system. In: Proceedings of the SIGMOD (2016)Google Scholar
  28. 28.
    Hartung, M., Groß, A., Rahm, E.: Composition methods for link discovery. In: Proceedings of the BTW Conference (2013)Google Scholar
  29. 29.
    Hassanzadeh, O., Chiang, F., Lee, H.C., Miller, R.J.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1), 1282–1293 (2009)Google Scholar
  30. 30.
    Hassanzadeh, O., Ward, M.J., Rodriguez-Muro, M., Srinivas, K.: Understanding a large corpus of web tables through matching with knowledge bases-an empirical study. In: Proceedings of the Ontology Matching Workshop (2015)Google Scholar
  31. 31.
    He, B., Chang, K.C.-C.: Statistical schema matching across web query interfaces. In: Proceedings of the SIGMOD, pp. 217–228 (2003)Google Scholar
  32. 32.
    He, B., Tao, T., Chang, KC.-C.: Organizing structured web sources by query schemas: A clustering approach. In: Proceedings of the CIKM, pp. 22–31 (2004)Google Scholar
  33. 33.
    He, H., Meng, W., Yu, C., Wu, Z.: WISE-Integrator: An automatic integrator of web search interfaces for E-commerce. In: Proceedings of the 29th VLDB Conference (2003)Google Scholar
  34. 34.
    Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. ACM SIGMOD Rec. 24(2), 127–138 (1995)CrossRefGoogle Scholar
  35. 35.
    Hu, W., Chen, J., Zhang, H., Qu, Y.: How matchable are four thousand ontologies on the semantic web. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I. LNCS, vol. 6643, pp. 290–304. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  36. 36.
    Jain, P., Hitzler, P., Sheth, A.P., Verma, K., Yeh, P.Z.: Ontology alignment for linked open data. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496, pp. 402–417. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  37. 37.
    Kolb, L., Thor, A., Rahm, E.: Dedoop: Efficient deduplication with hadoop. PVLDB 5(12), 1878–1881 (2012)Google Scholar
  38. 38.
    Köpcke, H., Rahm, E.: Frameworks for entity matching: A comparison. Data Knowl. Eng. 69(2), 197–210 (2010)CrossRefGoogle Scholar
  39. 39.
    Köpcke, H., Thor, A., Thomas, S., Rahm, E.: Tailoring entity resolution for matching product offers. In: Proceedings of the EDBT, pp. 545–550 (2012)Google Scholar
  40. 40.
    Lee, T., Wang, Z., Wang, H., Hwang, S.-W.: Web scale taxonomy cleansing. PVLDB 4(12), 1295–1306 (2011)Google Scholar
  41. 41.
    Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., et al.: DBpedia-a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web J. 6(2), 167–195 (2015)Google Scholar
  42. 42.
    Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. PVLDB 3(1–2), 1338–1347 (2010)Google Scholar
  43. 43.
    Madhavan, J., Bernstein, P.A., Doan, A., Halevy, A.: Corpus-based schema matching. In: ICDE, pp. 57–68 (2005)Google Scholar
  44. 44.
    Mahmoud, H.A., Aboulnaga, A.: Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems. In: Proceedings of the SIGMOD (2010)Google Scholar
  45. 45.
    Mungall, C.J., Torniai, C., Gkoutos, G.V., Lewis, S.E., Haendel, M.A., et al.: Uberon, an integrative multi-species anatomy ontology. Genome Biol. 13(1), R5 (2012)CrossRefGoogle Scholar
  46. 46.
    Naumann, F., Herschel, M.: An introduction to duplicate detection. Synthesis Lectures on Data Management 2(1), 1–87 (2010)CrossRefMATHGoogle Scholar
  47. 47.
    Nentwig, M., Groß, A., Rahm, E.: Holistic entity clustering for linked data. University of Leipzig, Technical report (2016)Google Scholar
  48. 48.
    Nentwig, M. Hartung, M., Ngomo, A.-C.N., Rahm, E.: A survey of current link discovery frameworks. Semant. Web J. (2016)Google Scholar
  49. 49.
    Nentwig, M., Soru, T., Ngomo, A.-C.N., Rahm, E.: LinkLion: A link repository for the web of data. In: Presutti, V., Blomqvist, E., Troncy, R., Sack, H., Papadakis, I., Tordai, A. (eds.) ESWC Satellite Events 2014. LNCS, vol. 8798, pp. 439–443. Springer, Heidelberg (2014)Google Scholar
  50. 50.
    Ngomo, A.-C.N., Auer, S.: LIMES - A time-efficient approach for large-scale link discovery on the web of data. In: Proceedings of the IJCAI, pp. 2312–2317 (2011)Google Scholar
  51. 51.
    Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine learning for knowledge graphs. Proc. IEEE 104(1), 11–33 (2016)CrossRefGoogle Scholar
  52. 52.
    Noy, N., et al.: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res. 37, W170–W173 (2009)CrossRefGoogle Scholar
  53. 53.
    Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., Nejdl, W.: Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. In: Proceedings of the ACM Conference Web search and data mining, pp. 53–62 (2012)Google Scholar
  54. 54.
    Papadimitriou, P., Tsaparas, P., Fuxman, A., Getoor, L.: TACI: Taxonomy-aware catalog integration. IEEE TKDE 25(7), 1643–1655 (2013)Google Scholar
  55. 55.
    Pasupuleti, P., Purra, B.S.: Data Lake Development with Big Data. Packt Publishing Ltd., Birmingham (2015)Google Scholar
  56. 56.
    Paulheim, H.: Knowledge graph refinement: A survey of approaches and evaluation methods. Semant. Web J. (2016)Google Scholar
  57. 57.
    Pershina, M., Yakout, M., Chakrabarti, K.: Holistic entity matching across knowledge graphs. In: IEEE International Conference on Big Data, pp. 1585–1590 (2015)Google Scholar
  58. 58.
    Pottinger, R.A., Bernstein, P.A.: Merging models based on given correspondences. In: Proceedings of the VLDB, pp. 862–873 (2003)Google Scholar
  59. 59.
    Radwan, A., Popa, L., Stanoi, I.R., Younis, A.: Top-k generation of integrated schemas based on directed and weighted correspondences. In: Proceedings of the SIGMOD, pp. 641–654 (2009)Google Scholar
  60. 60.
    Rahm, E.: Towards large-scale schema and ontology matching. In: Bellahsene, Z., Bonifati, A., Rahm, E. (eds.) Schema Matching and Mapping. Data-Centric Systems and Applications, pp. 3–27. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  61. 61.
    Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10, 334–350 (2001)CrossRefMATHGoogle Scholar
  62. 62.
    Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)Google Scholar
  63. 63.
    Rakhmawati, N.A., Umbrich, J., Karnstedt, M., Hasnain, A., Hausenblas, M.: A Comparison of Federation over SPARQL Endpoints Frameworks. In: Klinov, P., Mouromtsev, D. (eds.) KESW 2013. CCIS, vol. 394, pp. 132–146. Springer, Heidelberg (2013)CrossRefGoogle Scholar
  64. 64.
    Raunich, S., Rahm, E.: Target-driven merging of taxonomies with ATOM. Inf. Syst. 42, 1–14 (2014)CrossRefGoogle Scholar
  65. 65.
    Saha, B., Stanoi, I., Clarkson, K.L.: Schema covering: a step towards enabling reuse in information integration. In: ICDE, pp. 285–296 (2010)Google Scholar
  66. 66.
    Saleem, K., Bellahsene, Z., Hunt, E.: Porsche: Performance oriented schema mediation. Inf. Syst. 33(7), 637–657 (2008)CrossRefGoogle Scholar
  67. 67.
    Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: FedX: Optimization techniques for federated query processing on linked data. In: Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 601–616. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  68. 68.
    Shen, W., Wang, J., Han, J.: Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE TKDE 27(2), 443–460 (2015)Google Scholar
  69. 69.
    Suchanek, F., Weikum, G.: Knowledge harvesting in the big-data era. In: Proceedings of the SIGMOD, pp. 933–938 (2013)Google Scholar
  70. 70.
    Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A large ontology from wikipedia and wordnet. Web Semant. Sci. Serv. Agents World Wide Web 6(3), 203–217 (2008)CrossRefGoogle Scholar
  71. 71.
    Sun, C., Rampalli, N., Yang, F., Doan, A.: Chimera: Large-scale classification using machine learning, rules, and crowdsourcing. PVLDB 7(13), 1529–1540 (2014)Google Scholar
  72. 72.
    Venetis, P., Halevy, A., Madhavan, J., Paşca, M., Shen, W., Wu, F., Miao, G., Wu, C.: Recovering semantics of tables on the web. PVLDB 4(9), 528–538 (2011)Google Scholar
  73. 73.
    Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. CACM 57(10), 78–85 (2014)CrossRefGoogle Scholar
  74. 74.
    Wang, J., Wang, H., Wang, Z., Zhu, K.Q.: Understanding tables on the web. In: Atzeni, P., Cheung, D., Ram, S. (eds.) ER 2012 Main Conference 2012. LNCS, vol. 7532, pp. 141–155. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  75. 75.
    Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Proceedings of the SIGMOD, pp. 219–232 (2009)Google Scholar
  76. 76.
    Yakout, M., Ganjam, K., Chakrabarti, K., Chaudhuri, S.: Infogather: entity augmentation and attribute discovery by holistic matching with web tables. In: Proceedings of the SIGMOD, pp. 97–108, (2012)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. 1.University of LeipzigLeipzigGermany

Personalised recommendations