PG-Skip: Proximity Graph Based Clustering of Long Strings

  • Michail Kazimianec
  • Nikolaus Augsten
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6588)


String data is omnipresent and appears in a wide range of applications. Often string data must be partitioned into clusters of similar strings, for example, for cleansing noisy data. A promising string clustering approach is the recently proposed Graph Proximity Cleansing (GPC). A distinguishing feature of GPC is that it automatically detects the cluster borders without knowledge about the underlying data, using the so-called proximity graph. Unfortunately, the computation of the proximity graph is expensive. In particular, the runtime is high for long strings, thus limiting the application of the state-of-the-art GPC algorithm to short strings.

In this work we present two algorithms, PG-Skip and PG-Binary, that efficiently compute the GPC cluster borders and scale to long strings. PG-Skip follows a prefix pruning strategy and does not need to compute the full proximity graph to detect the cluster border. PG-Skip is much faster than the state-of-the-art algorithm, especially for long strings, and computes the exact GPC borders. We show the optimality of PG-Skip among all prefix pruning algorithms. PG-Binary is an efficient approximation algorithm, which uses a binary search strategy to detect the cluster border. Our extensive experiments on synthetic and real-world data confirm the scalability of PG-Skip and show that PG-Binary approximates the GPC clusters very effectively.
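As a rough illustration of what a proximity graph looks like, the sketch below counts, for each similarity threshold, how many strings share at least that many q-grams with a center string, and then finds the longest horizontal run in the resulting curve. It is a simplified stand-in and not the authors' method: the q-gram overlap measure, the direct (non-iterative) neighborhood, and the helper names `qgrams`, `overlap`, `neighborhood_sizes`, and `longest_plateau` are all assumptions made for this example. GPC's actual neighborhood definition and border rule, as well as the pruning logic of PG-Skip and PG-Binary, are given in the paper and are not reproduced here.

```python
from collections import Counter


def qgrams(s, q=2):
    """Multiset of q-grams of s, padded with '#' (assumed padding scheme)."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def overlap(a, b, q=2):
    """Number of q-grams shared by a and b (multiset intersection size)."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def neighborhood_sizes(center, strings, q=2):
    """Map each threshold tau to the number of strings sharing >= tau q-grams.

    Simplification: only direct neighbors of `center` are counted.
    """
    max_tau = sum(qgrams(center, q).values())
    overlaps = [overlap(center, s, q) for s in strings]
    return {tau: sum(o >= tau for o in overlaps)
            for tau in range(1, max_tau + 1)}


def longest_plateau(sizes):
    """Return (first_tau, last_tau, size) of the longest constant run."""
    taus = sorted(sizes)
    best = (taus[0], taus[0])
    start = taus[0]
    for prev, cur in zip(taus, taus[1:]):
        if sizes[cur] != sizes[prev]:
            start = cur  # a new horizontal run begins here
        if cur - start > best[1] - best[0]:
            best = (start, cur)
    return best[0], best[1], sizes[best[0]]


data = ["maria", "marie", "mario", "peter"]
sizes = neighborhood_sizes("maria", data)
print(sizes)
print(longest_plateau(sizes))
```

Computing every point of this curve requires one overlap computation per string and one count per threshold, and the number of thresholds grows with the string length. That cost is what makes avoiding the full graph attractive for long strings.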


Horizontal Line · Similarity Threshold · String Length · Pruning Strategy · Pruning Algorithm
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Michail Kazimianec (1)
  • Nikolaus Augsten (1)
  1. Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen, Italy
