XML Document Clustering by Independent Component Analysis

  • Tong Wang
  • Da-Xin Liu
  • Xuan-Zuo Lin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3915)


Clustering XML documents runs into the high-dimensionality problem. Independent Component Analysis (ICA) can reduce dimensionality while also uncovering the latent variables underlying XML structures, which improves clustering quality. This paper proposes a novel strategy for clustering XML documents based on ICA. Each document is first represented in the Vector Space Model (VSM) according to the D_path extracted from its XML tree. ICA is then applied to reduce the dimensionality of the document vectors. Finally, the document vectors are clustered in the reduced Euclidean space spanned by the independent components. The experiments show that ICA can enhance the accuracy of the clustering with stable performance.
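The first step of this pipeline, turning XML trees into document vectors, can be sketched as follows. The abstract does not define D_path, so this sketch assumes it can be approximated by root-to-node tag paths, with each distinct path acting as one VSM dimension and its occurrence count as the weight. The function names `tag_paths` and `build_vsm` and the two sample documents are hypothetical and used only for illustration; the ICA and clustering steps are not shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Count root-to-node tag paths in one XML document.

    Assumption: these paths stand in for the paper's D_path,
    which the abstract does not define.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


def build_vsm(xml_docs):
    """Map documents to count vectors over a shared, sorted path vocabulary."""
    per_doc = [tag_paths(doc) for doc in xml_docs]
    vocabulary = sorted(set().union(*per_doc))
    vectors = [[counts[p] for p in vocabulary] for counts in per_doc]
    return vocabulary, vectors


# Hypothetical sample documents, not taken from the paper's data set.
docs = [
    "<article><title/><author/><author/></article>",
    "<book><title/><publisher/></book>",
]
vocabulary, vectors = build_vsm(docs)
```

Each row of `vectors` is one document in the path vocabulary. Even for small collections the number of distinct paths grows quickly, which is the high-dimensionality problem that the ICA step is meant to address.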


Keywords: Independent Component Analysis, Vector Space Model, Latent Semantic Indexing, Document Vector
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Tong Wang (1)
  • Da-Xin Liu (1)
  • Xuan-Zuo Lin (2)
  1. Department of Computer Science and Technology, Harbin Engineering University, China
  2. Northeast Agriculture University, Harbin, China
