Skip to main content

Incorporating Clustering into Set Similarity Join Algorithms: The SjClust Framework

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9827))

Abstract

Data cleaning and integration found on duplicate record identification, which aims at detecting duplicate records that represent the same real-world entity. Similarity join is largely used in order to detect pairs of similar records in combination with a subsequent clustering algorithm meant for grouping together records that refer to the same entity. Unfortunately, the clustering algorithm is strictly used as a post-processing step, which slows down the overall performance, and final results are produced at the end of the whole process only. Inspired by this critical evidence, in this paper we propose and experimentally assess SjClust, a framework to integrate similarity join and clustering into a single operation. The basic idea of our proposal consists in introducing a variety of cluster representations that are smoothly merged during the set similarity task, carried out by the join algorithm. An optimization task is further applied on top of such framework. Experimental results, which are derived from an extensive experimental campaign, we retrieve are really surprising, as we are able to outperform the original set similarity join algorithm by an order of magnitude in most settings.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

  1. 1.

    For ease of notation, the parameter \(\tau \) is omitted.

  2. 2.

    A secondary ordering is used to break ties consistently (e.g., the lexicographic ordering).

  3. 3.

    This definition can be made consistent when the input is exhausted by defining a conceptual probe set of infinite weight after the last input set.

  4. 4.

    http://dblab.cs.toronto.edu/project/stringer/clustering/.

References

  1. Altwaijry, H., Kalashnikov, D.V., Mehrotra, S.: Query-driven approach to entity resolution. PVLDB 6(14), 1846–1857 (2013)

    Google Scholar 

  2. Altwaijry, H., Mehrotra, S., Kalashnikov, D.V.: Query: a framework for integrating entity resolution with query processing. PVLDB 9(3), 120–131 (2015)

    Google Scholar 

  3. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the WWW Conference, pp. 131–140 (2007)

    Google Scholar 

  4. Benjelloun, O., Garcia-Molina, H., Menestrina, D., Qi, S., Whang, S.E., Widom, S.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)

    Article  Google Scholar 

  5. Cannataro, M., Cuzzocrea, A., Mastroianni, C., Ortale, R., Pugliese, A.: Modeling adaptive hypermedia with an object-oriented approach and xml. In: WebDyn 2002 (2002)

    Google Scholar 

  6. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data Engineering, p. 5 (2006)

    Google Scholar 

  7. Doan, A., Halevy, A.Y., Ives, Z.G.: Principles of Data Integration. Morgan Kaufmann, Burlington (2012)

    Google Scholar 

  8. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. TKDE 19(1), 1–16 (2007)

    Google Scholar 

  9. Hassanzadeh, O., Chiang, F., Miller, R.J., Lee, H.C.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1), 1282–1293 (2009)

    Google Scholar 

  10. Idreos, S., Papaemmanouil, O., Chaudhuri, S.: Overview of data exploration techniques. In: Proceedings of the SIGMOD Conference, pp. 277–281 (2015)

    Google Scholar 

  11. Kazimianec, M., Augsten, N.: PG-Skip: proximity graph based clustering of long strings. In: Yu, J.X., Kim, M.H., Unland, R. (eds.) DASFAA 2011, Part II. LNCS, vol. 6588, pp. 31–46. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  12. Koudas, N., Sarawagi, S., Srivastava, D., Record linkage: similarity measures and algorithms. In: Proceedings of the SIGMOD Conference, pp. 802–803 (2006)

    Google Scholar 

  13. Leung, C.K.-S., Cuzzocrea, A., Jiang, F.: Discovering frequent patterns from uncertain data streams with time-fading and landmark models. In: Küng, J., Wagner, R., Cuzzocrea, A., Dayal, U., Hameurlain, A. (eds.) TLDKS VIII. LNCS, vol. 7790, pp. 174–196. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  14. Liu, H., Ashwin Kumar, T.K, Thomas, J.P.: Cleaning framework for big data - object identification and linkage. In: Proceedings of the Big Data Congress, pp. 215–221 (2015)

    Google Scholar 

  15. Mazeika, A., Böhlen, M.H.: Cleansing databases of misspelled proper nouns. In: Proceedings of the VLDB Workshop on Clean Databases (2006)

    Google Scholar 

  16. Menestrina, D., Whang, S., Garcia-Molina, H.: Evaluating entity resolution results. PVLDB 3(1), 208–219 (2010)

    Google Scholar 

  17. Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B.: SJClust: Towards a framework for integrating similarity join algorithms and clustering. In: Proceedings of the ICEIS Conference, pp. 75–80 (2016)

    Google Scholar 

  18. Ribeiro, L.A., Härder, T.: Generalizing prefix filtering to improve set similarity joins. Inf. Syst. 36(1), 62–78 (2011)

    Article  Google Scholar 

  19. Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In Proceedings of the SIGMOD Conference, pp. 743–754 (2004)

    Google Scholar 

  20. Schneider, N.C., Ribeiro, L.A., de Souza, A., Inácio, H.M., Wagner, A., von Wangenheim. SimDataMapper: An architectural pattern to integrate declarative similarity matching into database applications. In: Proceedings of the SBBD Conference, pp. 967–972 (2015)

    Google Scholar 

  21. Sidney, C.F., Mendes, D.S., Ribeiro, L.A., Härder, T.: Performance prediction for set similarity joins. In: Proceedings of the SAC Conference, pp. 967–972 (2015)

    Google Scholar 

  22. Tang, N.: Big RDF data cleaning. In: Proceedings of the ICDE Conference Workshops, pp. 77–79 2015)

    Google Scholar 

  23. Wang, J., Li, G., Feng, J.: Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: Proceedings of the SIGMOD Conference, pp. 85–96 (2012)

    Google Scholar 

  24. Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. TODS 36(3), 15 (2011)

    Article  Google Scholar 

  25. Zhang, F., Xue, H.-F., Xu, D.-S., Zhang, Y.-H., You, F.: Big data cleaning algorithms in cloud computing. iJOE 9(3), 77–81 (2013)

    Google Scholar 

Download references

Acknowledgments

This research was partially supported by the Brazilian agencies CNPq and CAPES.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Leonardo Andrade Ribeiro .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B. (2016). Incorporating Clustering into Set Similarity Join Algorithms: The SjClust Framework. In: Hartmann, S., Ma, H. (eds) Database and Expert Systems Applications. DEXA 2016. Lecture Notes in Computer Science(), vol 9827. Springer, Cham. https://doi.org/10.1007/978-3-319-44403-1_12

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-44403-1_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44402-4

  • Online ISBN: 978-3-319-44403-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics