Abstract
Current data integration approaches are mostly limited to a few data sources, partly because they rely on binary match approaches between pairs of sources. We thus advocate the development of more holistic, clustering-based data integration approaches that scale to many data sources. We outline different use cases and provide an overview of initial approaches for holistic schema/ontology integration and entity clustering. The discussion also considers open data repositories and so-called knowledge graphs.
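The scaling argument can be illustrated with a minimal sketch. It is not the method of this paper: it assumes that match links between records are already given and that entity clusters are simply the connected components of those links. The function names `pairwise_mapping_count` and `cluster_entities` and the record identifiers are hypothetical.

```python
def pairwise_mapping_count(num_sources: int) -> int:
    """Number of binary mappings needed to match every pair of sources."""
    return num_sources * (num_sources - 1) // 2


def cluster_entities(records, match_links):
    """Group records into clusters using union-find over given match links.

    Assumption: a cluster is a connected component of the match links.
    Real clustering approaches may use more refined criteria.
    """
    parent = {r: r for r in records}

    def find(x):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in match_links:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    # Pairwise mappings grow quadratically with the number of sources.
    for n in (2, 10, 100):
        print(n, "sources:", pairwise_mapping_count(n), "pairwise mappings")

    # Hypothetical records from three sources a, b, c.
    records = ["a1", "b1", "c1", "a2", "b2"]
    links = [("a1", "b1"), ("b1", "c1"), ("a2", "b2")]
    print(cluster_entities(records, links))
```

With clusters, each record from a new source only has to be assigned to one cluster, rather than being mapped separately to every existing source.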
Notes
1. In this paper, we are only concerned with metadata in the form of schemas and ontologies and their components, such as attributes or concepts. We thus do not consider the wide range of additional metadata (e.g., provenance information, creator, creation time), despite its importance, e.g., for data quality.
2. To be more precise, we can only find matching records referring to the same real-world object. For simplicity, we use the term “entity” to refer to both the records and the real-world objects they describe.
Acknowledgments
I’d like to thank Sören Auer, Phil Bernstein, Peter Christen, Victor Christen, Anika Groß, Sebastian Hellmann, Dinusha Vatsalan, Qing Wang and Gerhard Weikum for helpful comments and feedback on an earlier version of this paper.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Rahm, E. (2016). The Case for Holistic Data Integration. In: Pokorný, J., Ivanović, M., Thalheim, B., Šaloun, P. (eds) Advances in Databases and Information Systems. ADBIS 2016. Lecture Notes in Computer Science, vol 9809. Springer, Cham. https://doi.org/10.1007/978-3-319-44039-2_2
DOI: https://doi.org/10.1007/978-3-319-44039-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44038-5
Online ISBN: 978-3-319-44039-2
eBook Packages: Computer Science (R0)