Information Retrieval Journal, Volume 21, Issue 1, pp 24–55

Machine learning techniques for XML (co-)clustering by structure-constrained phrases

Abstract

A new method is proposed for clustering XML documents by structure-constrained phrases. It is implemented through three machine-learning approaches previously unexplored in the XML domain, namely non-negative matrix (tri-)factorization, co-clustering and automatic transactional clustering. A novel class of XML features approximately captures structure-constrained phrases as n-grams contextualized by root-to-leaf paths. Experiments over real-world benchmark XML corpora show that the effectiveness of the three approaches improves with contextualized n-grams of suitable length. This confirms the validity of the devised method from multiple clustering perspectives. Two of the approaches outperform several state-of-the-art competitors in effectiveness. The scalability of the three approaches is also investigated.
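To make the notion of an n-gram contextualized by a root-to-leaf path more concrete, the following minimal sketch counts (path, n-gram) pairs in a single XML document. It is only an illustration, not the authors' implementation: the function name `contextualized_ngrams`, the lowercase whitespace tokenization and the sample document are all assumptions, and any further text preprocessing the article may apply is omitted.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def contextualized_ngrams(xml_text, n=2):
    """Count (root-to-leaf path, n-gram) pairs in one XML document.

    Hypothetical helper: tokenization is plain lowercase whitespace
    splitting, which is an assumption made for illustration only.
    """
    root = ET.fromstring(xml_text)
    features = Counter()

    def visit(node, path):
        path = path + [node.tag]
        tokens = (node.text or "").lower().split()
        if len(node) == 0 and tokens:
            # Leaf element with text: slide a window of n tokens over it
            # and pair each n-gram with the path leading to the leaf.
            for i in range(len(tokens) - n + 1):
                ngram = " ".join(tokens[i:i + n])
                features[("/".join(path), ngram)] += 1
        for child in node:
            visit(child, path)

    visit(root, [])
    return features


# Made-up sample document: the same bigram occurs under two different paths.
doc = (
    "<paper><title>XML clustering methods</title>"
    "<body><sec>XML clustering</sec></body></paper>"
)
features = contextualized_ngrams(doc, n=2)
for (path, ngram), count in sorted(features.items()):
    print(path, "|", ngram, "|", count)
```

In this sketch the bigram "xml clustering" yields two distinct features, one under `paper/title` and one under `paper/body/sec`, which is the sense in which the structural context constrains the phrase. Counts of this kind could then fill the rows of a document-by-feature matrix for a clustering method to work on.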

Keywords

XML; Semi-structured data analysis; XML (co-)clustering by structure and nested text; Structure-constrained phrases; Contextualized n-grams

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. ICAR-CNR, Rende, Italy
