KDXD 2006: Knowledge Discovery from XML Documents, pp. 13-21
XML Document Clustering by Independent Component Analysis
Abstract
Clustering XML documents runs into the high-dimensionality problem. Independent Component Analysis (ICA) can reduce dimensionality and, at the same time, uncover the latent variables underlying XML structures, which improves clustering quality. This paper proposes a novel strategy for clustering XML documents based on ICA. Each document is first represented in the Vector Space Model (VSM) using the D_path features extracted from its XML tree. ICA is then applied to reduce the dimensionality of the document vectors. Finally, the document vectors are clustered in the reduced Euclidean space spanned by the independent components. Experiments show that ICA improves clustering accuracy with stable performance.
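The first step of the strategy, turning XML trees into document vectors, can be illustrated with a minimal sketch. The abstract does not define D_path, so this sketch assumes a path feature is simply the sequence of tag names from the root to a node, and that each vector entry is the raw count of that path in the document. The function names `extract_paths` and `build_vsm` are hypothetical, and the ICA and clustering steps are not shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def extract_paths(xml_text):
    """Count root-to-node tag paths in one XML document.

    Assumption: a path feature is the '/'-joined tag names from the
    root down to a node; the paper's exact D_path definition may differ.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        counts[path] += 1
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return counts


def build_vsm(xml_docs):
    """Build a document-by-path count matrix (a simple Vector Space Model)."""
    per_doc = [extract_paths(doc) for doc in xml_docs]
    # One dimension per distinct path seen anywhere in the collection.
    vocabulary = sorted(set().union(*per_doc))
    matrix = [[counts.get(path, 0) for path in vocabulary] for counts in per_doc]
    return vocabulary, matrix


docs = [
    "<article><title/><author/><author/></article>",
    "<book><title/><publisher/></book>",
]
vocabulary, matrix = build_vsm(docs)
for path in vocabulary:
    print(path)
for row in matrix:
    print(row)
```

Even in this toy collection every distinct path adds a dimension, which is how the high-dimensionality problem arises on real corpora. A dimensionality-reduction step such as ICA would then operate on the rows of `matrix` before clustering.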
Keywords
Independent Component Analysis, Vector Space Model, Latent Semantic Indexing, Document Vector