An Effective Data Processing Method for Fast Clustering

  • Hyun-Joo Moon
  • Sangheon Kim
  • Jongbae Moon
  • Eun-Ser Lee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5073)


The spread of Internet usage, heterogeneous computing platforms, and ubiquitous computing technologies has caused an explosive increase in Web data, which are usually written in XML. As Web data grow and clustering them becomes more important, a similarity detection method is needed, because it is a fundamental technology for efficient document management. In this paper, we introduce a similarity detection method that checks both the semantic similarity and the structural similarity between XML DTDs. For semantic checking we adopt ontology technology, and for structural checking we apply the longest common string and longest nesting common string methods. Our method uses multi-tag sequences instead of traversing XML schema trees, so it obtains reasonable results quickly.
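
To make the idea of comparing tag sequences rather than schema trees concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that "longest common string" can be read as a longest common subsequence over element names, that a tag sequence is simply the element names in declaration order, and that similarity is normalized by the longer sequence. The function names `tag_sequence`, `lcs_length`, and `structural_similarity` are hypothetical, and the ontology-based semantic check and the nesting variant are not covered.

```python
import re


def tag_sequence(dtd_text):
    """Return element names from <!ELEMENT ...> declarations, in order.

    Assumption: the declaration order stands in for the paper's tag sequence.
    """
    return re.findall(r"<!ELEMENT\s+([\w.:-]+)", dtd_text)


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (standard DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def structural_similarity(seq_a, seq_b):
    """Hypothetical score: LCS length divided by the longer sequence length."""
    if not seq_a or not seq_b:
        return 0.0
    return lcs_length(seq_a, seq_b) / max(len(seq_a), len(seq_b))


# Two small illustrative DTD fragments (made up for this sketch).
dtd_a = """
<!ELEMENT book (title, author, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT price (#PCDATA)>
"""
dtd_b = """
<!ELEMENT book (title, publisher, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT price (#PCDATA)>
"""

score = structural_similarity(tag_sequence(dtd_a), tag_sequence(dtd_b))
```

Working on flat sequences keeps the comparison at quadratic cost in the number of tags, which is one plausible reason a sequence-based approach can be faster than tree traversal or tree edit distance.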


XML DTD, Similarity Detection, Ontology, Tag Sequences





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Hyun-Joo Moon (1)
  • Sangheon Kim (1)
  • Jongbae Moon (2)
  • Eun-Ser Lee (3)

  1. Dept. of Cultural Contents, Hankuk University of Foreign Studies, Seoul, Korea
  2. Korea Institute of Science and Technology Information, Daejeon, Korea
  3. Dept. of Computer Engineering, Andong National University, Andong-city, Korea
