Encyclopedia of Big Data Technologies

2019 Edition
| Editors: Sherif Sakr, Albert Y. Zomaya

Probabilistic Data Integration

  • Maurice Van Keulen
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_18



Probabilistic data integration (PDI) is a specific kind of data integration where integration problems such as inconsistency and uncertainty are handled by means of a probabilistic data representation. The approach is based on the view that data quality problems (as they occur in an integration process) can be modeled as uncertainty (van Keulen 2012), and this uncertainty is considered an important result of the integration process (Magnani and Montesi 2010).
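As a minimal sketch of what "modeling a data quality problem as uncertainty" can look like, the following Python fragment keeps both conflicting values of an attribute, each with a probability, instead of resolving the conflict at integration time. The record, the attribute names, the probabilities, and the helper functions `possible_worlds` and `query_probability` are all hypothetical illustrations and are not taken from the entry or from any particular probabilistic database system. The sketch also assumes that attributes are independent, which real systems do not require.

```python
from itertools import product

# Hypothetical integrated record: two sources disagree on the city.
# Rather than choosing one value, both alternatives are kept with
# made-up probabilities that sum to 1 per attribute.
uncertain_record = {
    "name": [("J. Smith", 1.0)],
    "city": [("Enschede", 0.7), ("Hengelo", 0.3)],
}


def possible_worlds(record):
    """Yield (world, probability) for every combination of alternatives.

    Assumes the attributes are independent, so the probability of a
    world is the product of the probabilities of its chosen values.
    """
    attrs = list(record)
    for combo in product(*(record[a] for a in attrs)):
        world = {a: value for a, (value, _) in zip(attrs, combo)}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob


def query_probability(record, predicate):
    """Probability that the predicate holds, summed over possible worlds."""
    return sum(p for world, p in possible_worlds(record) if predicate(world))


in_enschede = query_probability(
    uncertain_record, lambda w: w["city"] == "Enschede"
)
```

Here `in_enschede` is the probability mass of the worlds in which the person lives in Enschede, so a query returns an answer qualified by a probability rather than a single certain value. Enumerating all worlds is only feasible for toy data; it is used here purely to make the semantics visible.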

The PDI process contains two phases (see Fig. 1): (i) a quick partial integration where certain data quality problems are not solved immediately, but explicitly represented as uncertainty in the resulting integrated data stored in a probabilistic database; (ii) continuous improvement by using the data – a probabilistic database can be queried directly resulting in possible or approximate answers (Dalvi et al. 2009) – and gathering evidence (e.g., user feedback) for improving the...


References

  1. Abiteboul S, Kimelfeld B, Sagiv Y, Senellart P (2009) On the expressiveness of probabilistic XML models. VLDB J 18(5):1041–1064. https://doi.org/10.1007/s00778-009-0146-1
  2. Antova L, Jansen T, Koch C, Olteanu D (2008) Fast and simple relational processing of uncertain data. In: Proceedings of ICDE, pp 983–992
  3. Antova L, Koch C, Olteanu D (2009) \({10^{(10^{6})}}\) worlds and beyond: efficient representation and processing of incomplete information. VLDB J 18(5):1021–1040. https://doi.org/10.1007/s00778-009-0149-y
  4. Arumugam S, Xu F, Jampani R, Jermaine C, Perez LL, Haas PJ (2010) MCDB-R: risk analysis in the database. Proc VLDB Endow 3(1–2):782–793. https://doi.org/10.14778/1920841.1920941
  5. Dalvi N, Ré C, Suciu D (2009) Probabilistic databases: diamonds in the dirt. Commun ACM 52(7):86–94. https://doi.org/10.1145/1538788.1538810
  6. De Raedt L, Kimmig A (2015) Probabilistic (logic) programming concepts. Mach Learn 100(1):5–47. https://doi.org/10.1007/s10994-015-5494-z
  7. Fuhr N (2000) Probabilistic datalog: implementing logical information retrieval for advanced applications. J Am Soc Inf Sci 51(2):95–110
  8. Haas D, Krishnan S, Wang J, Franklin M, Wu E (2015) Wisteria: nurturing scalable data cleaning infrastructure. Proc VLDB Endow 8(12):2004–2007. https://doi.org/10.14778/2824032.2824122
  9. Huijbrechts B, Velikova M, Michels S, Scheepens R (2015) Metis1: an integrated reference architecture for addressing uncertainty in decision-support systems. Proc Comput Sci 44(Supplement C):476–485. https://doi.org/10.1016/j.procs.2015.03.007
  10. Jampani R, Xu F, Wu M, Perez LL, Jermaine C, Haas PJ (2008) MCDB: a Monte Carlo approach to managing uncertain data. In: Proceeding of SIGMOD. ACM, pp 687–700
  11. Jundt O, van Keulen M (2013) Sample-based XPath ranking for web information extraction. In: Proceeding of EUSFLAT. Advances in intelligent systems research. Atlantis Press. https://doi.org/10.2991/eusflat.2013.27
  12. Koch C (2009) MayBMS: a system for managing large probabilistic databases. In: Aggarwal CC (ed) Managing and mining uncertain data. Advances in database systems, vol 35. Springer. https://doi.org/10.1007/978-0-387-09690-2_6
  13. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceeding of PODS. ACM, pp 233–246. https://doi.org/10.1145/543613.543644
  14. Magnani M, Montesi D (2010) A survey on uncertainty management in data integration. JDIQ 2(1):5:1–5:33. https://doi.org/10.1145/1805286.1805291
  15. Naumann F, Herschel M (2010) An introduction to duplicate detection. Synthesis lectures on data management. Morgan & Claypool. https://doi.org/10.2200/S00262ED1V01Y201003DTM003
  16. Panse F (2015) Duplicate detection in probabilistic relational databases. PhD thesis, University of Hamburg
  17. Panse F, van Keulen M, Ritter N (2013) Indeterministic handling of uncertain decisions in deduplication. JDIQ 4(2):9:1–9:25. https://doi.org/10.1145/2435221.2435225
  18. Trieschnigg R, Tjin-Kam-Jet K, Hiemstra D (2012) Ranking XPaths for extracting search result records. Technical report TR-CTIT-12-08, Centre for telematics and information technology (CTIT)
  19. van Keulen M (2012) Managing uncertainty: the road towards better data interoperability. IT – Inf Technol 54(3):138–146. https://doi.org/10.1524/itit.2012.0674
  20. van Keulen M, de Keijzer A (2009) Qualitative effects of knowledge rules and user feedback in probabilistic data integration. VLDB J 18(5):1191–1217
  21. Wanders B, van Keulen M (2015) Revisiting the formal foundation of probabilistic databases. In: Proceeding of IFSA-EUSFLAT. Atlantis Press, p 47. https://doi.org/10.2991/ifsa-eusflat-15.2015.43
  22. Wanders B, van Keulen M, van der Vet P (2015) Uncertain groupings: probabilistic combination of grouping data. In: Proceeding of DEXA. LNCS, vol 9261. Springer, pp 236–250. https://doi.org/10.1007/978-3-319-22849-5_17
  23. Wanders B, van Keulen M, Flokstra J (2016) Judged: a probabilistic datalog with dependencies. In: Proceeding of DeLBP. AAAI Press
  24. Widom J (2004) Trio: a system for integrated management of data, accuracy, and lineage. Technical report 2004-40, Stanford InfoLab. http://ilpubs.stanford.edu:8090/658/
  25. Wijsen J (2005) Database repairing using updates. ACM TODS 30(3):722–768. https://doi.org/10.1145/1093382.1093385

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Faculty of EEMCS, University of Twente, Enschede, The Netherlands