The VLDB Journal, 18:1191

Qualitative effects of knowledge rules and user feedback in probabilistic data integration

  • Maurice van Keulen
  • Ander de Keijzer
Special Issue Paper

Abstract

In data integration efforts, portal development in particular, much development time is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates or solve other semantic conflicts. It proves impossible, however, to automatically get rid of all semantic problems. An often-used rule of thumb states that about 90% of the development effort is devoted to semi-automatically resolving the remaining 10% of hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that strives for a ‘good enough’ initial integration which stores any remaining semantic uncertainty and conflicts in a probabilistic database. The remaining cases are to be resolved with user feedback during query time. The main contribution of this paper is an experimental investigation of the effects and sensitivity of rule definition, threshold tuning, and user feedback on the integration quality. We claim that our approach indeed reduces development effort, rather than merely shifting it, by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ initial integration that can be meaningfully used, and that user feedback is effective in gradually improving the integration quality.
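
The idea of rough thresholds plus query-time feedback can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Candidate` type, the function names, the threshold values 0.25 and 0.75, and the use of the similarity score as a match probability are all assumptions made for illustration. Pairs that clearly match or clearly differ are decided at integration time, the rest are stored with a probability, and user feedback later resolves them one at a time.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A pair of records that may refer to the same real-world entity (hypothetical type)."""
    left: str
    right: str
    similarity: float  # assumed similarity score in [0, 1]


def integrate(candidates, low=0.25, high=0.75):
    """Return {pair: P(same entity)} using two rough thresholds (assumed values).

    Pairs at or above `high` are merged, pairs at or below `low` are kept apart,
    and everything in between stays uncertain.
    """
    result = {}
    for c in candidates:
        if c.similarity >= high:
            p = 1.0
        elif c.similarity <= low:
            p = 0.0
        else:
            # Assumption: reuse the similarity score as the match probability.
            p = c.similarity
        result[(c.left, c.right)] = p
    return result


def uncertain_pairs(integration):
    """Pairs whose status is still undecided."""
    return [pair for pair, p in integration.items() if 0.0 < p < 1.0]


def apply_feedback(integration, pair, is_same):
    """Resolve one uncertain pair from user feedback given at query time."""
    updated = dict(integration)
    updated[pair] = 1.0 if is_same else 0.0
    return updated


candidates = [
    Candidate("rec1", "rec2", 0.9),
    Candidate("rec3", "rec4", 0.5),
    Candidate("rec5", "rec6", 0.1),
]
initial = integrate(candidates)
resolved = apply_feedback(initial, ("rec3", "rec4"), is_same=True)
```

In this sketch only the middle pair remains uncertain after the initial integration, and a single piece of feedback removes that uncertainty. Widening the gap between `low` and `high` makes the thresholds safer at the cost of more stored uncertainty.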

Keywords

Data integration, Entity resolution, Uncertain databases, Data quality, User feedback

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Faculty of EEMCS, University of Twente, Enschede, The Netherlands
  2. Institute of Technical Medicine, Faculty of Science and Technology, University of Twente, Enschede, The Netherlands
