Document Clustering Using Incremental and Pairwise Approaches

  • Tien Tran
  • Richi Nayak
  • Peter Bruza
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4862)

Abstract

This paper presents the experiments and results of a clustering approach applied to the large Wikipedia dataset in the INEX 2007 Document Mining Challenge. The approach combines an incremental clustering method with a pairwise clustering method. The incremental method first reduces the dimension of the dataset by grouping the documents into a number of clusters that is not fixed in advance. The pairwise method then clusters this lower-dimension dataset into the required number of clusters. In this way, the large number of documents is clustered successfully without giving up the accuracy of the clustering solution.
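The two-stage idea described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the threshold, or the merging criterion, so everything here is an assumption made for illustration: documents are plain numeric vectors, similarity is cosine similarity to a cluster centroid, and the names `incremental_stage`, `pairwise_stage`, and `threshold` are hypothetical rather than taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def incremental_stage(docs, threshold):
    """Single pass over the documents (assumed scheme, not the paper's).

    Each document joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it opens a new cluster. The number of
    clusters is therefore not fixed in advance.
    """
    clusters, centroids = [], []
    for i, vec in enumerate(docs):
        best, best_sim = None, threshold
        for c, cen in enumerate(centroids):
            sim = cosine(vec, cen)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
            centroids.append(list(vec))
        else:
            clusters[best].append(i)
            centroids[best] = centroid([docs[j] for j in clusters[best]])
    return clusters

def pairwise_stage(docs, clusters, k):
    """Repeatedly merge the two most similar clusters until k remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        cens = [centroid([docs[j] for j in c]) for c in clusters]
        best_pair, best_sim = None, -1.0
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(cens[a], cens[b])
                if sim > best_sim:
                    best_pair, best_sim = (a, b), sim
        a, b = best_pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Toy data: six made-up document vectors, not drawn from the INEX collection.
docs = [[1, 0, 0], [1, 0.1, 0], [0, 1, 0], [0, 1, 0.1], [0, 0.5, 1], [0, 0.4, 1]]
coarse = incremental_stage(docs, threshold=0.9)   # cluster count not fixed in advance
final = pairwise_stage(docs, coarse, k=2)         # reduced to the required count
print(coarse)
print(final)
```

The point of the split is cost: the pairwise stage compares every pair of clusters, so running it on the coarse clusters rather than on individual documents keeps the number of comparisons manageable for a large collection.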

Keywords

Clustering, structure, content, XML, INEX 2007

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Tien Tran (1)
  • Richi Nayak (1)
  • Peter Bruza (1)
  1. Information Technology, Queensland University of Technology, Brisbane, Australia
