A Flexible Structured-Based Representation for XML Document Mining

  • Conference paper
Advances in XML Information Retrieval and Evaluation (INEX 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3977)

Abstract

This paper reports on the INRIA group's approach to XML mining as a participant in the INEX 2005 XML Mining track. We use a flexible representation of XML documents that can take into account either the structure alone or both the structure and the content. Our approach represents each XML document by a set of its sub-paths, selected according to some criteria (length, beginning at the root, ending at a leaf). By treating those sub-paths as words, we can use standard methods for vocabulary reduction and simple clustering methods such as k-means. We use an implementation of the clustering algorithm known as dynamic clouds, which can work with distinct groups of independent modalities placed in separate variables. This is useful in our model because embedded sub-paths are not independent: we split potentially dependent paths into separate variables, so that each variable contains only independent paths. Experiments with the INEX collections show good results for the structure-only collections, but our approach did not scale well to large structure-and-content collections.
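To make the sub-path representation concrete, here is a minimal Python sketch of how a document's tag structure might be turned into a bag of sub-paths filtered by the three criteria named in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the function name `extract_subpaths`, its parameters, and the `/` separator are hypothetical, only element tags are used (structure only), and the vocabulary reduction and dynamic clouds clustering steps are not shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def extract_subpaths(xml_text, min_len=1, max_len=3,
                     root_only=False, leaf_only=False):
    """Count tag sub-paths of an XML document (structure only).

    Hypothetical helper: a sub-path is a downward chain of tags.
    root_only keeps sub-paths that begin at the document root;
    leaf_only keeps sub-paths that end at an element with no children.
    """
    counts = Counter()
    root = ET.fromstring(xml_text)

    def visit(node, chain):
        chain = chain + [node.tag]          # tags from the root down to node
        is_leaf = len(node) == 0            # no child elements
        if not leaf_only or is_leaf:
            # Every suffix of the chain is a sub-path ending at this node.
            for start in range(len(chain)):
                length = len(chain) - start
                if not (min_len <= length <= max_len):
                    continue
                if root_only and start != 0:
                    continue
                counts["/".join(chain[start:])] += 1
        for child in node:
            visit(child, chain)

    visit(root, [])
    return counts


doc = ("<article><title/><body><sec><p/></sec>"
       "<sec><p/><p/></sec></body></article>")

# All sub-paths of length exactly 2, treated as "words" of the document.
pairs = extract_subpaths(doc, min_len=2, max_len=2)

# Only complete root-to-leaf paths, whatever their length.
full = extract_subpaths(doc, min_len=1, max_len=10,
                        root_only=True, leaf_only=True)
```

Each resulting `Counter` can be read as a word-frequency vector, which is what makes standard text-mining tools applicable. Note that overlapping sub-paths such as `body/sec` and `sec/p` are clearly not independent of one another, which is the issue the abstract addresses by splitting paths into separate variables.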

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Vercoustre, AM., Fegas, M., Gul, S., Lechevallier, Y. (2006). A Flexible Structured-Based Representation for XML Document Mining. In: Fuhr, N., Lalmas, M., Malik, S., Kazai, G. (eds) Advances in XML Information Retrieval and Evaluation. INEX 2005. Lecture Notes in Computer Science, vol 3977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-34963-1_34

  • DOI: https://doi.org/10.1007/978-3-540-34963-1_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-34962-4

  • Online ISBN: 978-3-540-34963-1

  • eBook Packages: Computer Science (R0)
