Skip to main content

High-Order Co-clustering Text Data on Semantics-Based Representation Model

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2011)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 6634))

Included in the following conference series:

Abstract

The language modeling approach is widely used to improve the performance of text mining in recent years because of its solid theoretical foundation and empirical effectiveness. In essence, this approach centers on the issue of estimating an accurate model by choosing appropriate language models as well as smooth techniques. Semantic smoothing, which incorporates semantic and contextual information into the language models, is effective and potentially significant to improve the performance of text mining. In this paper, we proposed a high-order structure to represent text data by incorporating background knowledge, Wikipedia. The proposed structure consists of three types of objects, term, document and concept. Moreover, we firstly combined the high-order co-clustering algorithm with the proposed model to simultaneously cluster documents, terms and concepts. Experimental results on benchmark data sets (20Newsgroups and Reuters-21578) have shown that our proposed high-order co-clustering on high-order structure outperforms the general co-clustering algorithm on bipartite text data, such as document-term, document-concept and document-(term+concept).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Yates, R., Neto, B.: Modern information retrieval. Addison-Wesley Longman, Amsterdam (1999)

    Google Scholar 

  2. Banerjee, S., Ramanathan, K., Gupta, A.: Clustering short texts using wikipedia. In: Proc. of the 30th ACM SIGIR, pp. 787–788 (2007)

    Google Scholar 

  3. Blei, D., Ng, A., Jordan, M.: Latent dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)

    MATH  Google Scholar 

  4. Busygin, S., Prokopyev, O., Pardalos, P.: Bi-clustering in data mining. Computers & Operations Research 35, 2964–2967 (2008)

    Article  MathSciNet  MATH  Google Scholar 

  5. Dai, W., Yang, Q., Xue, G., Yu, Y.: Self-taught clustering. In: Proc. of the 25th ICML (2008)

    Google Scholar 

  6. Dhillon, I., Mallela, S., Modha, D.: Information-theoretic co-clustering. In: Proc. of the 9th ACM SIGKDD, pp. 89–98 (2003)

    Google Scholar 

  7. Fred, A.: Finding consistent clusters in data partitions. In: Kittler, J., Roli, F. (eds.) MCS 2001. LNCS, vol. 2096, pp. 309–318. Springer, Heidelberg (2001)

    Chapter  Google Scholar 

  8. Gao, B., Liu, T., Cheng, Q., Ma, W.: Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In: Proc. of ACM SIGKDD, pp. 41–50 (2005)

    Google Scholar 

  9. Gao, B., Liu, T., Ma, W.: Star-structured high-order heterogeneous data co-clustering based on consistent information theory. In: Proc. of the 6th ICDM (2006)

    Google Scholar 

  10. Hofmann, T.: Probabilistic latent semantic indexing. In: Proc. of ACM SIGIR, pp. 50–57 (1999)

    Google Scholar 

  11. Hotho, A., Staab, S., Stumme, G.: Wordnet improves text document clustering. In: Proc. of the Semantic Web Workshop at the 26th ACM SIGIR, pp. 541–544 (2003)

    Google Scholar 

  12. Hu, J., Fang, L., Cao, Y., Zeng, H., Li, H., Yang, Q., Chen, Z.: Enhancing text clustering by leveraging wikipedia semantics. In: Proc. of the 31st ACM SIGIR, pp. 179–186 (2008)

    Google Scholar 

  13. Hu, X., Zhang, X., Lu, C., Park, E., Zhou, X.: Exploiting wikipedia as external knowledge for document clustering. In: Proc. of the 15th ACM SIGKDD, pp. 389–396 (2009)

    Google Scholar 

  14. Huang, A., Milne, D., Frank, E., Witten, I.: Clustering documents using a wikipedia-based concept representation. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 628–636. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  15. Jing, L., Ng, M., Huang, J.: Knowledge-based vector space model for text clustering. Knowledge and Information Systems 25, 35–55 (2010)

    Article  Google Scholar 

  16. Kittur, A., Chi, E., Suh, B.: What’s in wikipedia? mapping topics and conflict using socially annotated category structure. In: Proc. of the 27th CHI, pp. 1509–1512 (2009)

    Google Scholar 

  17. Medelyan, O., Witten, I., Milne, D.: Topic indexing with wikipedia. In: Proc. of AAAI (2008)

    Google Scholar 

  18. Milne, D., Witten, I.: An effective, low-cost measure of semantic relatedness obtained from wikipedia links. In: Proc. of the Workshop on Wikipedia and Artificial Intelligence at AAAI, pp. 25–30 (2008)

    Google Scholar 

  19. Wang, P., Domeniconi, C.: Building semantic kernels for text classification using wikipedia. In: Proc. of the 14th ACM SIGKDD, New York, NY, USA, pp. 713–721 (2008)

    Google Scholar 

  20. Wang, P., Domeniconi, C., Hu, J.: Using wikipedia for co-clustering based cross-domain text classification. In: Proc. of the 8th ICDM, pp. 1085–1090 (2008)

    Google Scholar 

  21. Zhao, Y., Karypis, G.: Comparison of agglomerative and partitional document clustering algorithms. Technical Report 02-014, University of Minnesota (2002)

    Google Scholar 

  22. Zhong, S., Ghosh, J.: A comparative study of generative models for document clustering. In: Proc. of SDW Workshop on Clustering High Dimensional Data and its Applications, San Francisco, CA (2003)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jing, L., Yun, J., Yu, J., Huang, J. (2011). High-Order Co-clustering Text Data on Semantics-Based Representation Model. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science(), vol 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-20841-6_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20840-9

  • Online ISBN: 978-3-642-20841-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics