XML Document Clustering Using Structure-Preserving Flat Representation of XML Content and Structure

  • Fedja Hadzic
  • Michael Hecker
  • Andrea Tagarelli
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7121)

Abstract

With the increasing use of XML in many domains, XML document clustering has been a central research topic in semistructured data management and mining. Due to the semistructured nature of XML data, the clustering problem becomes particularly challenging, mainly because structural similarity measures specifically designed to deal with tree/graph-shaped data can be quite expensive. Specialized clustering techniques are being developed to account for this difficulty; however, most of them still assume that XML documents are represented using a semistructured data model. In this paper, we take a simpler approach in which XML structural aspects are extracted from the documents to generate a flat data format, to which well-established clustering methods can be directly applied. Hence, the expensive process of tree/graph data mining is avoided, while the structural properties are still preserved. Our experimental evaluation, which uses a number of real-world datasets and compares against existing structural clustering methods, demonstrates the significance of our approach.
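The general idea of flattening structure can be illustrated with a minimal sketch. The abstract does not specify the paper's actual flat format, so the code below uses a simplified stand-in: each document becomes a 0/1 row over the set of root-to-node tag paths found in the collection. The function names `label_paths` and `flatten` and the sample documents are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def label_paths(xml_text):
    """Return the set of root-to-node tag paths in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def flatten(docs):
    """Map each document to a 0/1 row over the sorted union of all paths.

    Assumption: a bag-of-paths encoding, not the format used in the paper.
    """
    per_doc = [label_paths(d) for d in docs]
    columns = sorted(set().union(*per_doc))
    rows = [[1 if c in p else 0 for c in columns] for p in per_doc]
    return columns, rows


# Hypothetical sample documents.
docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<article><title/><journal/></article>",
]
columns, rows = flatten(docs)
for row in rows:
    print(row)
```

Once the documents are rows in a fixed-width table like this, any standard vector-based clustering method can be run on them without tree or graph matching. A path-presence encoding of this kind discards sibling order and repetition counts, which a genuinely structure-preserving format would need to retain.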

Keywords

Minority Class, Minimum Support Threshold, Tree Instance, Semistructured Data, Tree Edit Distance
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Fedja Hadzic (1)
  • Michael Hecker (1)
  • Andrea Tagarelli (2)
  1. Digital Ecosystems and Business Intelligence Institute, Curtin University, Australia
  2. Dept. of Electronics, Computer and Systems Sciences, University of Calabria, Italy
