Qualitative effects of knowledge rules and user feedback in probabilistic data integration

  • Special Issue Paper
The VLDB Journal

Abstract

In data integration efforts, portal development in particular, much development time is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates or solve other semantic conflicts. It proves impossible, however, to automatically get rid of all semantic problems. An often-used rule of thumb states that about 90% of the development effort is devoted to semi-automatically resolving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that strives for a ‘good enough’ initial integration which stores any remaining semantic uncertainty and conflicts in a probabilistic database. The remaining cases are to be resolved with user feedback during query time. The main contribution of this paper is an experimental investigation of the effects and sensitivity of rule definition, threshold tuning, and user feedback on the integration quality. We claim that our approach indeed reduces development effort—and not merely shifts the effort—by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ initial integration that can be meaningfully used, and that user feedback is effective in gradually improving the integration quality.

Author information

Corresponding author

Correspondence to Maurice van Keulen.

About this article

Cite this article

van Keulen, M., de Keijzer, A. Qualitative effects of knowledge rules and user feedback in probabilistic data integration. The VLDB Journal 18, 1191–1217 (2009). https://doi.org/10.1007/s00778-009-0156-z

  • DOI: https://doi.org/10.1007/s00778-009-0156-z
