Quantifying and Propagating Uncertainty in Automated Linked Data Integration

Part of the Lecture Notes in Computer Science book series (TLDKS, volume 10940)

Abstract

The Web of Data consists of numerous Linked Data (LD) sources from many largely independent publishers, giving rise to the need for data integration at scale. At this scale, automation can provide candidate integrations that underpin a pay-as-you-go approach. However, automated approaches need: (i) to operate across several data integration steps; (ii) to build on diverse sources of evidence; and (iii) to contend with uncertainty. This paper describes the construction of probabilistic models that yield degrees of belief both on the equivalence of real-world concepts and on the ability of mapping expressions to return correct results. The paper shows how such models can underpin a Bayesian approach to assimilating different forms of evidence: syntactic (in the form of similarity scores derived by string-based matchers), semantic (in the form of semantic annotations stemming from LD vocabularies), and internal (in the form of fitness values for candidate mappings). The paper presents an empirical evaluation of the methodology with respect to equivalence and correctness judgements made by human experts. Experimental evaluation confirms that the proposed Bayesian methodology is suitable as a generic, principled approach for quantifying and assimilating different pieces of evidence throughout the various phases of an automated data integration process.
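
As a minimal illustration of the Bayesian assimilation of evidence described in the abstract, the following Python sketch updates a degree of belief in the equivalence of two constructs as two pieces of evidence arrive in turn. The function names and all numeric values are hypothetical, and the sketch assumes that the pieces of evidence are conditionally independent given the hypothesis. It is not the paper's actual model.

```python
def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H) via Bayes' rule."""
    # Total probability of the evidence under both hypotheses.
    evidence = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return p_e_given_h * prior / evidence


def assimilate(prior, likelihood_pairs):
    """Apply Bayes' rule once per piece of evidence.

    Each posterior becomes the prior for the next update, which is only
    valid if the pieces of evidence are conditionally independent given H
    (an assumption of this sketch).
    """
    belief = prior
    for p_e_given_h, p_e_given_not_h in likelihood_pairs:
        belief = bayes_update(belief, p_e_given_h, p_e_given_not_h)
    return belief


if __name__ == "__main__":
    # Hypothetical likelihoods: first for syntactic evidence (a similarity
    # score), then for semantic evidence (a vocabulary annotation).
    evidence = [(0.8, 0.3), (0.6, 0.2)]
    print(assimilate(0.5, evidence))
```

Starting from an uninformative prior of 0.5, each piece of evidence that is more likely under equivalence than under non-equivalence raises the degree of belief.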

Keywords

  • Probabilistic modelling
  • Bayesian updating
  • Data integration
  • Linked Data

Notes

  1. One well-known example portal is the so-called Linked Open Data (LOD) cloud, at https://lod-cloud.net/.

  2. For schema-less sources (e.g., Linked Data sources), schema extraction techniques can be used to infer schemas (e.g., [5]).

  3. https://www.jamendo.com/.

  4. http://magnatune.com/.

  5. http://oaei.ontologymatching.org.

  6. A Gaussian kernel was used due to its mathematical convenience. Note that any other kernel can be applied, although the shape of the distribution may differ depending on the kernel characteristics.

  7. http://lov.okfn.org/dataset/lov/.

  8. Informally, the degree of belief (d.o.b.) in the hypothesis given the evidence (the so-called posterior d.o.b.) is equal to the product of the d.o.b. in the evidence given the hypothesis (which we call likelihood in Sect. 3) and the d.o.b. in the hypothesis (the so-called prior d.o.b.), divided by the d.o.b. in the evidence.

  9. http://dbtune.org/jamendo/.

  10. http://dbtune.org/magnatune/.

  11. BLOOMS was configured with a high threshold, viz., >0.8.

  12. We observe once more that, in this paper, the experiments have only used LD datasets, but dataspaces are meant to be model-agnostic and, in particular, DSToolkit is. DSToolkit is no longer being actively developed, but requests for access to the sources can be sent to the second author. The datasets used are publicly available in the LOD cloud.
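
Note 6 mentions a Gaussian kernel. Assuming it refers to kernel density estimation in the sense of [4, 33], a minimal Python sketch of such an estimator follows. The names `gaussian_kernel` and `kde`, the sample scores, and the bandwidth are illustrative assumptions rather than values taken from the paper.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(samples, bandwidth):
    """Return a density function estimated from samples.

    The estimate at x is the average of kernels centred on each sample,
    scaled by the bandwidth: (1 / (n * h)) * sum(K((x - x_i) / h)).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(samples)

    def density(x):
        total = sum(gaussian_kernel((x - s) / bandwidth) for s in samples)
        return total / (n * bandwidth)

    return density


if __name__ == "__main__":
    # Hypothetical similarity scores observed for known-equivalent pairs.
    scores = [0.2, 0.4, 0.9]
    density = kde(scores, bandwidth=0.1)
    print(density(0.4))
```

Swapping `gaussian_kernel` for another kernel function changes the shape of the estimate, which is the point made in note 6.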

References

  1. Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: SIGMOD Conference, pp. 906–908 (2005)

  2. Belhajjame, K., Paton, N.W., Embury, S.M., Fernandes, A.A.A., Hedeler, C.: Incrementally improving dataspaces based on user feedback. Inf. Syst. 38(5), 656–687 (2013)

  3. Bernstein, P., Madhavan, J., Rahm, E.: Generic schema matching, ten years later. Proc. VLDB Endow. 4(11), 695–701 (2011)

  4. Bowman, A.W., Azzalini, A.: Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. OUP, Oxford (1997)

  5. Christodoulou, K., Paton, N.W., Fernandes, A.A.A.: Structure inference for linked data sources using clustering. In: Hameurlain, A., Küng, J., Wagner, R., Bianchini, D., De Antonellis, V., De Virgilio, R. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems XIX. LNCS, vol. 8990, pp. 1–25. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46562-2_1

  6. de Vaus, D.: Surveys in Social Research: Research Methods/Sociology. Taylor & Francis, London (2002)

  7. Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. VLDB J. 18(2), 469–500 (2009)

  8. Guo, C., Hedeler, C., Paton, N.W., Fernandes, A.A.A.: EvoMatch: an evolutionary algorithm for inferring schematic correspondences. In: Hameurlain, A., Küng, J., Wagner, R. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems XII. LNCS, vol. 8320, pp. 1–26. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-45315-1_1

  9. Guo, C., Hedeler, C., Paton, N.W., Fernandes, A.A.A.: MatchBench: benchmarking schema matching algorithms for schematic correspondences. In: Gottlob, G., Grasso, G., Olteanu, D., Schallhart, C. (eds.) BNCOD 2013. LNCS, vol. 7968, pp. 92–106. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39467-6_11

  10. Halevy, A.Y.: Why your data won’t mix: semantic heterogeneity. ACM Queue 3(8), 50–58 (2005)

  11. Halevy, A.Y., Franklin, M.J., Maier, D.: Principles of dataspace systems. In: PODS, pp. 1–9 (2006)

  12. Halevy, A.Y., Rajaraman, A., Ordille, J.J.: Data integration: the teenage years. In: VLDB, pp. 9–16 (2006)

  13. Hedeler, C., et al.: DSToolkit: an architecture for flexible dataspace management. In: Hameurlain, A., Küng, J., Wagner, R. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems V. LNCS, vol. 7100, pp. 126–157. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28148-8_6

  14. Hedeler, C., Belhajjame, K., Paton, N.W., Campi, A., Fernandes, A.A.A., Embury, S.M.: Chapter 7: dataspaces. In: Ceri, S., Brambilla, M. (eds.) Search Computing. LNCS, vol. 5950, pp. 114–134. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12310-8_7

  15. Hyndman, R.J., Koehler, A.B.: Another look at measures of forecast accuracy. Int. J. Forecast. 22(4), 679–688 (2006)

  16. Jain, P., Hitzler, P., Sheth, A.P., Verma, K., Yeh, P.Z.: Ontology alignment for linked open data. In: Patel-Schneider, P.F., et al. (eds.) ISWC 2010. LNCS, vol. 6496, pp. 402–417. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-17746-0_26

  17. Kim, W., Seo, J.: Classifying schematic and data heterogeneity in multidatabase systems. IEEE Comput. 24(12), 12–18 (1991)

  18. Kuicheu, N.C., Wang, N., Fanzou Tchuissang, G.N., Xu, D., Dai, G., Siewe, F.: Managing uncertain mediated schema and semantic mappings automatically in dataspace support platforms. Comput. Inform. 32(1), 175–202 (2013)

  19. Lenzerini, M.: Data integration: a theoretical perspective. In: PODS, pp. 233–246 (2002)

  20. Madhavan, J., et al.: Web-scale data integration: you can only afford to pay as you go. In: CIDR, pp. 342–350 (2007)

  21. Magnani, M., Montesi, D.: Uncertainty in data integration: current approaches and open problems. In: Proceedings of the First International VLDB Workshop on Management of Uncertain Data in Conjunction with VLDB 2007, Vienna, Austria, 24 September 2007, pp. 18–32 (2007)

  22. Marie, A., Gal, A.: Managing uncertainty in schema matcher ensembles. In: Prade, H., Subrahmanian, V.S. (eds.) SUM 2007. LNCS (LNAI), vol. 4772, pp. 60–73. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75410-7_5

  23. Papoulis, A.: Probability, Random Variables and Stochastic Processes, 3rd edn. McGraw-Hill Companies, New York (1991)

  24. Paton, N.W., Belhajjame, K., Embury, S.M., Fernandes, A.A.A., Maskat, R.: Pay-as-you-go data integration: experiences and recurring themes. In: Freivalds, R.M., Engels, G., Catania, B. (eds.) SOFSEM 2016. LNCS, vol. 9587, pp. 81–92. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49192-8_7

  25. Peukert, E., Maßmann, S., König, K.: Comparing similarity combination methods for schema matching. In: GI Jahrestagung, no. 1, pp. 692–701 (2010)

  26. Polleres, A., Hogan, A., Harth, A., Decker, S.: Can we ever catch up with the web? Semant. Web 1(1–2), 45–52 (2010)

  27. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)

  28. Sabou, M., d’Aquin, M., Motta, E.: Exploring the semantic web as background knowledge for ontology matching. J. Data Semant. 11, 156–190 (2008)

  29. Sabou, M., d’Aquin, M., Motta, E.: SCARLET: Semantic relation discovery by harvesting online ontologies. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 854–858. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-68234-9_72

  30. Das Sarma, A., Dong, X., Halevy, A.Y.: Bootstrapping pay-as-you-go data integration systems. In: SIGMOD Conference, pp. 861–874 (2008)

  31. Das Sarma, A., Dong, X.L., Halevy, A.Y.: Uncertainty in data integration and dataspace support platforms. In: Bellahsene, Z., Bonifati, A., Rahm, E. (eds.) Schema Matching and Mapping. DCSA, pp. 75–108. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-16518-4_4

  32. Shvaiko, P., Euzenat, J.: Ontology matching: state of the art and future challenges. IEEE Trans. Knowl. Data Eng. 25(1), 158–176 (2013)

  33. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall, London (1986)

  34. Spragins, J.: A note on the iterative application of Bayes’ rule. IEEE Trans. Inf. Theory 11(4), 544–549 (1965)

  35. van Keulen, M.: Managing uncertainty: the road towards better data interoperability. IT - Inf. Technol. 54(3), 138–146 (2012)

Acknowledgments

Fernando R. Sanchez S. is supported by a grant from the Mexican National Council for Science and Technology (CONACyT).

Corresponding author

Correspondence to Klitos Christodoulou.

Copyright information

© 2018 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Christodoulou, K., Serrano, F.R.S., Fernandes, A.A.A., Paton, N.W. (2018). Quantifying and Propagating Uncertainty in Automated Linked Data Integration. In: Hameurlain, A., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVII. Lecture Notes in Computer Science, vol 10940. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-57932-9_3

  • DOI: https://doi.org/10.1007/978-3-662-57932-9_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-57931-2

  • Online ISBN: 978-3-662-57932-9

  • eBook Packages: Computer Science, Computer Science (R0)