Incorporating Clustering into Set Similarity Join Algorithms: The SjClust Framework

Ribeiro, Leonardo Andrade; Cuzzocrea, Alfredo; Bezerra, Karen Aline Alves; do Nascimento, Ben Hur Bahia

doi:10.1007/978-3-319-44403-1_12

Conference paper
First Online: 06 August 2016

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9827)

Abstract

Data cleaning and integration are founded on duplicate record identification, which aims at detecting records that represent the same real-world entity. Similarity join is widely used to detect pairs of similar records, in combination with a subsequent clustering algorithm that groups together records referring to the same entity. Unfortunately, the clustering algorithm is used strictly as a post-processing step, which slows down overall performance, and final results are produced only at the end of the whole process. Motivated by this observation, in this paper we propose and experimentally assess SjClust, a framework that integrates similarity join and clustering into a single operation. The basic idea of our proposal is to introduce a variety of cluster representations that are smoothly merged during the set similarity task carried out by the join algorithm. An optimization task is further applied on top of this framework. The experimental results, derived from an extensive experimental campaign, are surprising: we outperform the original set similarity join algorithm by an order of magnitude in most settings.
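
To make the idea of folding clustering into the join more concrete, the following is a minimal illustrative sketch and not the SjClust algorithm itself. The abstract does not specify the cluster representations or the underlying join algorithm, so the sketch assumes Jaccard similarity, a naive all-pairs join, and transitive (union-find) clustering; the names `jaccard`, `DisjointSets`, and `join_and_cluster` are hypothetical. It only shows how maintaining clusters during the join lets some pair comparisons be skipped.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 for two empty sets)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class DisjointSets:
    """Union-find over record ids 0..n-1 (assumed cluster bookkeeping)."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # Keep the smaller id as the cluster representative.
            self.parent[max(rx, ry)] = min(rx, ry)


def join_and_cluster(records, tau):
    """Naive similarity join that merges clusters as matches are found.

    Returns (clusters as sorted lists of record ids, number of
    similarity computations actually performed).
    """
    sets = [frozenset(r) for r in records]
    clusters = DisjointSets(len(sets))
    comparisons = 0
    for i, j in combinations(range(len(sets)), 2):
        if clusters.find(i) == clusters.find(j):
            continue  # already in the same cluster: no comparison needed
        comparisons += 1
        if jaccard(sets[i], sets[j]) >= tau:
            clusters.union(i, j)
    groups = {}
    for i in range(len(sets)):
        groups.setdefault(clusters.find(i), []).append(i)
    return sorted(groups.values()), comparisons


records = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"x", "y"},
    {"x", "y", "z"},
]
groups, comparisons = join_and_cluster(records, tau=0.6)
print(groups, comparisons)
```

In this toy input, records 1 and 2 are both merged with record 0 before the pair (1, 2) is reached, so that pair is never compared. Under the transitive clustering assumed here, skipping such pairs does not change the final clusters.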

Notes

1. For ease of notation, the parameter \(\tau\) is omitted.
2. A secondary ordering is used to break ties consistently (e.g., the lexicographic ordering).
3. This definition can be made consistent when the input is exhausted by defining a conceptual probe set of infinite weight after the last input set.
4. http://dblab.cs.toronto.edu/project/stringer/clustering/

Acknowledgments

This research was partially supported by the Brazilian agencies CNPq and CAPES.

Author information

Authors and Affiliations

Instituto de Informática, Universidade Federal de Goiás, Goiânia, Goiás, Brazil
Leonardo Andrade Ribeiro
DIA Department, University of Trieste and ICAR-CNR, Trieste, Italy
Alfredo Cuzzocrea
Departamento de Ciência da Computação, Universidade Federal de Lavras, Lavras, Brazil
Karen Aline Alves Bezerra & Ben Hur Bahia do Nascimento

Corresponding author

Correspondence to Leonardo Andrade Ribeiro.

Editor information

Editors and Affiliations

Clausthal University of Technology, Clausthal-Zellerfeld, Germany
Sven Hartmann
Victoria University of Wellington, Wellington, New Zealand
Hui Ma

Cite this paper

Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B. (2016). Incorporating Clustering into Set Similarity Join Algorithms: The SjClust Framework. In: Hartmann, S., Ma, H. (eds) Database and Expert Systems Applications. DEXA 2016. Lecture Notes in Computer Science, vol 9827. Springer, Cham. https://doi.org/10.1007/978-3-319-44403-1_12

DOI: https://doi.org/10.1007/978-3-319-44403-1_12
Published: 06 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44402-4
Online ISBN: 978-3-319-44403-1
eBook Packages: Computer Science, Computer Science (R0)