PG-Skip: Proximity Graph Based Clustering of Long Strings

Kazimianec, Michail; Augsten, Nikolaus

doi:10.1007/978-3-642-20152-3_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6588)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

String data is omnipresent and appears in a wide range of applications. Often string data must be partitioned into clusters of similar strings, for example, for cleansing noisy data. A promising string clustering approach is the recently proposed Graph Proximity Cleansing (GPC). A distinguishing feature of GPC is that it automatically detects the cluster borders without knowledge about the underlying data, using the so-called proximity graph. Unfortunately, the computation of the proximity graph is expensive. In particular, the runtime is high for long strings, thus limiting the application of the state-of-the-art GPC algorithm to short strings.

In this work we present two algorithms, PG-Skip and PG-Binary, that efficiently compute the GPC cluster borders and scale to long strings. PG-Skip follows a prefix pruning strategy and does not need to compute the full proximity graph to detect the cluster border. PG-Skip is much faster than the state-of-the-art algorithm, especially for long strings, and computes the exact GPC borders. We show the optimality of PG-Skip among all prefix pruning algorithms. PG-Binary is an efficient approximation algorithm, which uses a binary search strategy to detect the cluster border. Our extensive experiments on synthetic and real-world data confirm the scalability of PG-Skip and show that PG-Binary approximates the GPC clusters very effectively.
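The abstract does not define the proximity graph or the border rule, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' GPC, PG-Skip, or PG-Binary algorithms. It assumes that similarity is measured by the number of shared padded q-grams, that the graph maps each threshold (from strictest to loosest) to the number of strings in the neighborhood of a center string, and that the border is taken at the end of the longest horizontal run. The names `qgrams`, `proximity_graph`, and `longest_plateau_end` are hypothetical.

```python
from typing import List, Set


def qgrams(s: str, q: int = 2) -> Set[str]:
    """Set of q-grams of s, padded with q-1 '#' characters on each side."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def proximity_graph(center: str, data: List[str], q: int = 2) -> List[int]:
    """Toy proximity graph (an assumption, not the paper's definition).

    Entry k holds the number of strings that share at least
    len(profile) - k q-grams with the center, so thresholds run
    from the strictest (all q-grams shared) down to 1.
    """
    profile = qgrams(center, q)
    overlaps = [len(profile & qgrams(s, q)) for s in data]
    return [
        sum(1 for o in overlaps if o >= tau)
        for tau in range(len(profile), 0, -1)
    ]


def longest_plateau_end(graph: List[int]) -> int:
    """Index of the last point of the longest run of equal values.

    On ties the first such run wins. Assumed border rule for this sketch.
    """
    if not graph:
        raise ValueError("graph must be non-empty")
    best_start, best_len = 0, 1
    start = 0
    for i in range(1, len(graph) + 1):
        # A run ends at the end of the list or when the value changes.
        if i == len(graph) or graph[i] != graph[start]:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = i
    return best_start + best_len - 1


data = ["maria", "marai", "mario", "xavier"]
graph = proximity_graph("maria", data)
border = longest_plateau_end(graph)
print(graph, border)  # [1, 1, 2, 3, 3, 3] 5
```

In this toy run the longest plateau ends at the loosest threshold, so the cluster around "maria" would contain "maria", "marai", and "mario" but not "xavier". Note that the sketch evaluates every threshold; the point of a pruning strategy such as the one the abstract attributes to PG-Skip is to avoid computing the full graph.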

Author information

Authors and Affiliations

Faculty of Computer Science, Free University of Bozen-Bolzano, Dominikanerplatz 3, 39100, Bozen, Italy
Michail Kazimianec & Nikolaus Augsten

Editor information

Editors and Affiliations

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
Jeffrey Xu Yu
Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro (373-1 Guseong-don), 305-701, Yuseong-gu, Daejeon, Korea
Myoung Ho Kim
Institute for Computer Science and Business Information Systems (ICB), University of Duisburg-Essen, Schützenbahn 70, 45117, Essen, Germany
Rainer Unland

About this paper

Cite this paper

Kazimianec, M., Augsten, N. (2011). PG-Skip: Proximity Graph Based Clustering of Long Strings. In: Yu, J.X., Kim, M.H., Unland, R. (eds) Database Systems for Advanced Applications. DASFAA 2011. Lecture Notes in Computer Science, vol 6588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20152-3_3

DOI: https://doi.org/10.1007/978-3-642-20152-3_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20151-6
Online ISBN: 978-3-642-20152-3
eBook Packages: Computer Science, Computer Science (R0)