Abstract
The Web of Data consists of numerous Linked Data (LD) sources from many largely independent publishers, giving rise to the need for data integration at scale. To meet this need, automation can provide candidate integrations that underpin a pay-as-you-go approach. However, automated approaches need: (i) to operate across several data integration steps; (ii) to build on diverse sources of evidence; and (iii) to contend with uncertainty. This paper describes the construction of probabilistic models that yield degrees of belief both on the equivalence of real-world concepts and on the ability of mapping expressions to return correct results. The paper shows how such models can underpin a Bayesian approach to assimilating different forms of evidence: syntactic (in the form of similarity scores derived by string-based matchers), semantic (in the form of semantic annotations stemming from LD vocabularies), and internal (in the form of fitness values for candidate mappings). The paper presents an empirical evaluation of the described methodology with respect to equivalence and correctness judgements made by human experts. The evaluation confirms that the proposed Bayesian methodology is suitable as a generic, principled approach for quantifying and assimilating different pieces of evidence throughout the various phases of an automated data integration process.
Keywords
- Probabilistic modelling
- Bayesian updating
- Data integration
- Linked Data
Notes
- 1. One well-known example portal is the so-called Linked Open Data (LOD) cloud, at https://lod-cloud.net/.
- 2. For schema-less sources (e.g., Linked Data sources), schema extraction techniques can be used to infer schemas (e.g., [5]).
- 3.
- 4.
- 5.
- 6. A Gaussian kernel was used due to its mathematical convenience. Any other kernel could be applied, although the shape of the estimated distribution may differ depending on the kernel's characteristics.
- 7.
- 8. Informally, the d.o.b. in the hypothesis given the evidence (the so-called posterior d.o.b.) is equal to the product of the d.o.b. in the evidence given the hypothesis (which we call the likelihood in Sect. 3) and the d.o.b. in the hypothesis (the so-called prior d.o.b.), divided by the d.o.b. in the evidence.
- 9.
- 10.
- 11. BLOOMS was configured with a high threshold, viz., >0.8.
- 12. We observe once more that the experiments in this paper used only LD datasets, although dataspaces are meant to be model-agnostic and DSToolkit, in particular, is. DSToolkit is no longer being actively developed, but requests for access to the sources can be sent to the second author. The datasets used are publicly available in the LOD cloud.
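As an illustration of the notes on the Gaussian kernel and on Bayes' rule, the following is a minimal sketch in Python and not the authors' implementation. It assumes that likelihoods are estimated from matcher similarity scores by kernel density estimation, and that successive pieces of evidence are conditionally independent given the hypothesis, so that each posterior can serve as the next prior. The sample scores, the bandwidth of 0.1, and all function names are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from samples with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def bayes_update(prior, lik_h, lik_not_h):
    """Posterior d.o.b. = likelihood * prior / d.o.b. in the evidence."""
    evidence = lik_h * prior + lik_not_h * (1.0 - prior)
    if evidence == 0.0:
        # The evidence is uninformative under both hypotheses: keep the prior.
        return prior
    return lik_h * prior / evidence


# Hypothetical training data: similarity scores observed for pairs that
# experts judged equivalent and non-equivalent, respectively.
scores_equiv = [0.82, 0.90, 0.75, 0.95, 0.88]
scores_non_equiv = [0.10, 0.25, 0.40, 0.15, 0.30]

lik_equiv = gaussian_kde(scores_equiv, bandwidth=0.1)
lik_non_equiv = gaussian_kde(scores_non_equiv, bandwidth=0.1)


def assimilate(prior, observed_scores):
    """Apply Bayes' rule iteratively, one similarity score at a time."""
    belief = prior
    for score in observed_scores:
        belief = bayes_update(belief, lik_equiv(score), lik_non_equiv(score))
    return belief


if __name__ == "__main__":
    # Two matchers report high similarity for the same pair of concepts.
    print(assimilate(0.5, [0.85, 0.9]))
```

Under these assumptions, a high similarity score raises the degree of belief in equivalence above the prior and a low score lowers it. A different kernel or bandwidth would change the estimated likelihoods and hence the size of each update.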
References
Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: SIGMOD Conference, pp. 906–908 (2005)
Belhajjame, K., Paton, N.W., Embury, S.M., Fernandes, A.A.A., Hedeler, C.: Incrementally improving dataspaces based on user feedback. Inf. Syst. 38(5), 656–687 (2013)
Bernstein, P., Madhavan, J., Rahm, E.: Generic schema matching, ten years later. Proc. VLDB Endow. 4(11), 695–701 (2011)
Bowman, A.W., Azzalini, A.: Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. OUP, Oxford (1997)
Christodoulou, K., Paton, N.W., Fernandes, A.A.A.: Structure inference for linked data sources using clustering. In: Hameurlain, A., Küng, J., Wagner, R., Bianchini, D., De Antonellis, V., De Virgilio, R. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems XIX. LNCS, vol. 8990, pp. 1–25. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46562-2_1
de Vaus, D.: Surveys in Social Research: Research Methods/Sociology. Taylor & Francis, London (2002)
Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. VLDB J. 18(2), 469–500 (2009)
Guo, C., Hedeler, C., Paton, N.W., Fernandes, A.A.A.: EvoMatch: an evolutionary algorithm for inferring schematic correspondences. In: Hameurlain, A., Küng, J., Wagner, R. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems XII. LNCS, vol. 8320, pp. 1–26. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-45315-1_1
Guo, C., Hedeler, C., Paton, N.W., Fernandes, A.A.A.: MatchBench: benchmarking schema matching algorithms for schematic correspondences. In: Gottlob, G., Grasso, G., Olteanu, D., Schallhart, C. (eds.) BNCOD 2013. LNCS, vol. 7968, pp. 92–106. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39467-6_11
Halevy, A.Y.: Why your data won’t mix: semantic heterogeneity. ACM Queue 3(8), 50–58 (2005)
Halevy, A.Y., Franklin, M.J., Maier, D.: Principles of dataspace systems. In: PODS, pp. 1–9 (2006)
Halevy, A.Y., Rajaraman, A., Ordille, J.J.: Data integration: the teenage years. In: VLDB, pp. 9–16 (2006)
Hedeler, C., et al.: DSToolkit: an architecture for flexible dataspace management. In: Hameurlain, A., Küng, J., Wagner, R. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems V. LNCS, vol. 7100, pp. 126–157. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28148-8_6
Hedeler, C., Belhajjame, K., Paton, N.W., Campi, A., Fernandes, A.A.A., Embury, S.M.: Chapter 7: dataspaces. In: Ceri, S., Brambilla, M. (eds.) Search Computing. LNCS, vol. 5950, pp. 114–134. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12310-8_7
Hyndman, R.J., Koehler, A.B.: Another look at measures of forecast accuracy. Int. J. Forecast. 22(4), 679–688 (2006)
Jain, P., Hitzler, P., Sheth, A.P., Verma, K., Yeh, P.Z.: Ontology alignment for linked open data. In: Patel-Schneider, P.F., et al. (eds.) ISWC 2010. LNCS, vol. 6496, pp. 402–417. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-17746-0_26
Kim, W., Seo, J.: Classifying schematic and data heterogeneity in multidatabase systems. IEEE Comput. 24(12), 12–18 (1991)
Kuicheu, N.C., Wang, N., Fanzou Tchuissang, G.N., Xu, D., Dai, G., Siewe, F.: Managing uncertain mediated schema and semantic mappings automatically in dataspace support platforms. Comput. Inform. 32(1), 175–202 (2013)
Lenzerini, M.: Data integration: a theoretical perspective. In: PODS, pp. 233–246 (2002)
Madhavan, J., et al.: Web-scale data integration: you can only afford to pay as you go. In: CIDR, pp. 342–350 (2007)
Magnani, M., Montesi, D.: Uncertainty in data integration: current approaches and open problems. In: Proceedings of the First International VLDB Workshop on Management of Uncertain Data in Conjunction with VLDB 2007, Vienna, Austria, 24 September 2007, pp. 18–32 (2007)
Marie, A., Gal, A.: Managing uncertainty in schema matcher ensembles. In: Prade, H., Subrahmanian, V.S. (eds.) SUM 2007. LNCS (LNAI), vol. 4772, pp. 60–73. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75410-7_5
Papoulis, A.: Probability, Random Variables and Stochastic Processes, 3rd edn. McGraw-Hill Companies, New York (1991)
Paton, N.W., Belhajjame, K., Embury, S.M., Fernandes, A.A.A., Maskat, R.: Pay-as-you-go data integration: experiences and recurring themes. In: Freivalds, R.M., Engels, G., Catania, B. (eds.) SOFSEM 2016. LNCS, vol. 9587, pp. 81–92. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49192-8_7
Peukert, E., Maßmann, S., König, K.: Comparing similarity combination methods for schema matching. In: GI Jahrestagung, no. 1, pp. 692–701 (2010)
Polleres, A., Hogan, A., Harth, A., Decker, S.: Can we ever catch up with the web? Semant. Web 1(1–2), 45–52 (2010)
Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)
Sabou, M., d’Aquin, M., Motta, E.: Exploring the semantic web as background knowledge for ontology matching. J. Data Semant. 11, 156–190 (2008)
Sabou, M., d’Aquin, M., Motta, E.: SCARLET: Semantic relation discovery by harvesting online ontologies. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 854–858. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-68234-9_72
Das Sarma, A., Dong, X., Halevy, A.Y.: Bootstrapping pay-as-you-go data integration systems. In: SIGMOD Conference, pp. 861–874 (2008)
Das Sarma, A., Dong, X.L., Halevy, A.Y.: Uncertainty in data integration and dataspace support platforms. In: Bellahsene, Z., Bonifati, A., Rahm, E. (eds.) Schema Matching and Mapping. DCSA, pp. 75–108. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-16518-4_4
Shvaiko, P., Euzenat, J.: Ontology matching: state of the art and future challenges. IEEE Trans. Knowl. Data Eng. 25(1), 158–176 (2013)
Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall, London (1986)
Spragins, J.: A note on the iterative application of Bayes’ rule. IEEE Trans. Inf. Theory 11(4), 544–549 (1965)
van Keulen, M.: Managing uncertainty: the road towards better data interoperability. IT - Inf. Technol. 54(3), 138–146 (2012)
Acknowledgments
Fernando R. Sanchez S. is supported by a grant from the Mexican National Council for Science and Technology (CONACyT).
Copyright information
© 2018 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Christodoulou, K., Serrano, F.R.S., Fernandes, A.A.A., Paton, N.W. (2018). Quantifying and Propagating Uncertainty in Automated Linked Data Integration. In: Hameurlain, A., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVII. Lecture Notes in Computer Science, vol 10940. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-57932-9_3
DOI: https://doi.org/10.1007/978-3-662-57932-9_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-57931-2
Online ISBN: 978-3-662-57932-9
eBook Packages: Computer Science, Computer Science (R0)