Enhancing Document Clustering Using Reweighting Terms Based on Semantic Features

  • Sun Park
  • Jin Gwan Park
  • Min A. Jeong
  • Jong Geun Jeong
  • Yeonwoo Lee
  • Seong Ro Lee
Chapter
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 235)

Abstract

This paper proposes a document clustering method that reweights terms based on semantic features. The method uses sample documents that the user supplies for each cluster to reduce the semantic gap between the user’s requirements and the clustering results produced by the machine. Because the reweighted terms better represent the inherent structure of the document set that is relevant to the user’s requirements, the method can improve clustering quality. The experimental results demonstrate that the proposed method achieves better performance than related document clustering methods.

Keywords

Document clustering; Reweighting term; Semantic feature; Non-negative matrix factorization (NMF)

Notes

Acknowledgments

This work was supported by the Priority Research Centers Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2009-0093828). This research was also supported by the Ministry of Knowledge Economy (MKE), Korea, under the Information Technology Research Center (ITRC) support program supervised by the National IT Industry Promotion Agency (NIPA) (NIPA-2012-H0301-12-2005).

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Sun Park (1)
  • Jin Gwan Park (1)
  • Min A. Jeong (1)
  • Jong Geun Jeong (2)
  • Yeonwoo Lee (1)
  • Seong Ro Lee (1)

  1. Mokpo National University, Mokpo, South Korea
  2. National Research Foundation of Korea, Seoul, South Korea
