Skip to main content
Log in

Creating probabilistic databases from duplicated data

  • Special Issue Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

A major source of uncertainty in databases is the presence of duplicate items, i.e., records that refer to the same real-world entity. However, accurate deduplication is a difficult task and imperfect data cleaning may result in loss of valuable information. A reasonable alternative approach is to keep duplicates when the correct cleaning strategy is not certain, and utilize an efficient probabilistic query-answering technique to return query results along with probabilities of each answer being correct. In this paper, we present a flexible modular framework for scalably creating a probabilistic database out of a dirty relation of duplicated data and overview the challenges raised in utilizing this framework for large relations of string data. We study the problem of associating probabilities with duplicates that are detected using state-of-the-art scalable approximate join methods. We argue that standard thresholding techniques are not sufficiently robust for this task, and propose new clustering algorithms suitable for inferring duplicates and their associated probabilities. We show that the inferred probabilities accurately reflect the error in duplicate records.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Ailon, N., Charikar, M., Newman, A.: Aggregating inconsistent information: ranking and clustering. In: ACM Symp. on Theory of Computing (STOC), pp. 684–693 (2005)

  2. Andritsos, P., Fuxman, A., Miller, R.J.: Clean answers over dirty databases: a probabilistic approach. In: IEEE Proc. of the Int’l Conf. on Data Eng., p. 30 (2006)

  3. Antova, L., Koch, C., Olteanu, D.: Fast and simple relational processing of uncertain data. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 983–992 (2008)

  4. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 918–929 (2006)

  5. Arasu, A., Ré, C., Suciu, D.: Large-scale deduplication with constraints using dedupalog. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 952–963 (2009)

  6. Aslam J.A., Pelekhov E., Rus D.: The star clustering algorithm for static and dynamic information organization. J. Graph Algorithm. Appl. 8(1), 95–129 (2004)

    MATH  MathSciNet  Google Scholar 

  7. Bansal N., Blum A., Chawla S.: Correlation clustering. Mach. Learn. 56(1–3), 89–113 (2004)

    Article  MATH  Google Scholar 

  8. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Int’l World Wide Web Conference (WWW), pp. 131–140, Banff (2007)

  9. Benjelloun O., Garcia-Molina H., Menestrina D., Su Q., Whang S.E., Widom J.: Swoosh: a generic approach to entity resolution. Int. J. Very Large Data Bases 18(1), 255–276 (2009)

    Article  Google Scholar 

  10. Beskales, G., Soliman, M.A., Ilyas, I.F., Ben-David, S.: Modeling and querying possible repairs in duplicate detection. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), 2009 (To Appear). Available as University of Waterloo, Tech. Report CS-2009-15, 2009

  11. Bezdek J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Dordrecht (1981)

    MATH  Google Scholar 

  12. Bhattacharya, I., Getoor, L.: A latent dirichlet model for unsupervised entity resolution. In: Proc. of the SIAM International Conference on Data Mining (SDM), pp. 47–58, Bethesda (2006)

  13. Bhattacharya I., Getoor L.: Collective entity resolution in relational data. IEEE Data Eng. Bull. 29(2), 4–12 (2006)

    Google Scholar 

  14. Blei D.M., Ng A.Y., Jordan M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

    Article  MATH  Google Scholar 

  15. Boulos, J., Dalvi, N., Mandhani, B., Mathur, S., Re, C., Suciu, D.: MYSTIQ: a system for finding more answers by using probabilities. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 891–893 (2005)

  16. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 313–324 (2003)

  17. Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 865–876, Washington (2005)

  18. Chaudhuri, S., Das Sarma, A., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 437–448 (2007)

  19. Cheng R., Chen J., Xie X.: Cleaning uncertain data with quality guarantees. Proc. VLDB Endow. (PVLDB) 1(1), 722–735 (2008)

    Google Scholar 

  20. Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pp. 73–78, Acapulco (2003)

  21. Dalvi N., Suciu D.: Efficient query evaluation on probabilistic databases. Int. J. Very Large Data Bases 16(4), 523–544 (2007)

    Article  Google Scholar 

  22. Dalvi, N., Suciu, D.: Management of probabilistic data: foundations and challenges. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 1–12 (2007)

  23. Demaine E.D., Emanuel D., Fiat A., Immorlica N.: Correlation clustering in general weighted graphs. Theor. Comput. Sci. 361(2), 172–187 (2006)

    Article  MATH  MathSciNet  Google Scholar 

  24. Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 687–698 (2007)

  25. Elmagarmid A.K., Ipeirotis P.G., Verykios V.S.: Duplicate record detection: a survey. IEEE Trans. Know. Data Eng. 19(1), 1–16 (2007)

    Article  Google Scholar 

  26. Fellegi I.P., Sunter A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)

    Article  Google Scholar 

  27. Flake G.W., Tarjan R.E., Tsioutsiouliklis K.: Graph clustering and minimum cut trees. Internet Math. 1(4), 385–408 (2004)

    MATH  MathSciNet  Google Scholar 

  28. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (Almost) for free. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 491–500 (2001)

  29. Gupta, R., Sarawagi, S.: Creating probabilistic databases from information extraction models. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 965–976 (2006)

  30. Gusfield, D.: Algorithms on strings, trees, and sequences. In: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)

  31. Hassanzadeh, O.: Benchmarking Declarative Approximate Selection Predicates. Master’s thesis, University of Toronto, February 2007

  32. Hassanzadeh, O., Chiang, F., Lee, H.C., Miller, R.J.: Framework for evaluating clustering algorithms in duplicate detection. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), 2009

  33. Hassanzadeh, O., Sadoghi, M., Miller, R.J.: Accuracy of approximate string joins using grams. In: Proc. of the International Workshop on Quality in Databases (QDB), pp. 11–18, Vienna (2007)

  34. Haveliwala, T.H., Gionis, A., Indyk, P.: Scalable techniques for clustering the web. In: Proc. of the Int’l Workshop on the Web and Databases (WebDB), pp. 129–134, Dallas (2000)

  35. Hernández M.A., Stolfo S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. Data Min. Know. Discov. 2(1), 9–37 (1998)

    Article  Google Scholar 

  36. Indyk, P., Motwani, R., Raghavan, P., Vempala, S.: Locality-preserving hashing in multidimensional spaces. In: ACM Symp. on Theory of Computing (STOC), pp. 618–625 (1997)

  37. Kukich K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24(4), 377–439 (1992)

    Article  Google Scholar 

  38. Li, C., Wang, B., Yang, X.: VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In: Proc. of the Int’l Conf. on Very Large Data Bases (VLDB), pp. 303–314, Vienna (2007)

  39. McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proc. of the Int’l Conf. on Knowledge Discovery & Data Mining, pp. 169–178 (2000)

  40. Miller, D.R.H., Leek, T., Schwartz, R.M.: A hidden Markov model information retrieval system. In: ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 214–221 (1999)

  41. Monge, A.E., Elkan, C.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proc. of SIGMOD Workshop on Data Mining and Knowledge Discovery (DMKD) (1997)

  42. Rahm E., Hai Do H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)

    Google Scholar 

  43. Re, C., Dalvi, N., Suciu, D.: Efficient Top-k query evaluation on probabilistic data. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 886–895 (2007)

  44. Robertson S.: Understanding inverse document frequency: on theoretical arguments for IDF. J. Doc. 60(5), 503–520 (2004)

    Article  Google Scholar 

  45. Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: ACM SIGMOD Int’l Conf. on the Mgmt. of Data, pp. 743–754, Paris (2004)

  46. Sen, P., Deshpande, A., Getoor, L.: Representing tuple and attribute uncertainty in probabilistic databases. In: ICDM Workshops, pp. 507–512 (2007)

  47. Slonim, N.: The Information Bottleneck: Theory and Applications. PhD thesis, The Hebrew University (2003)

  48. Soliman, M.A., Ilyas, I.F., Chang, K.C.: Top-k query processing in uncertain databases. In: IEEE Proc. of the Int’l Conf. on Data Eng., pp. 896–905 (2007)

  49. Soliman M.A., Ilyas I.F., Chang K.C.: Probabilistic top-k and ranking-aggregate queries. ACM Trans. Database Syst. (TODS) 33(3), 1–54 (2008)

    Article  Google Scholar 

  50. van Dongen, S.: Graph Clustering By Flow Simulation. PhD thesis, University of Utrecht (2000)

  51. Widom J.: Trio: a system for integrated management of data, accuracy, and lineage. In: Proc. of the Conference on Innovative Data Systems Research (CIDR), pp. 262–276 (2005)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Oktie Hassanzadeh.

Additional information

Work supported in part by NSERC.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Hassanzadeh, O., Miller, R.J. Creating probabilistic databases from duplicated data. The VLDB Journal 18, 1141–1166 (2009). https://doi.org/10.1007/s00778-009-0161-2

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-009-0161-2

Keywords

Navigation