Knowledge-based vector space model for text clustering

  • Regular Paper
  • Knowledge and Information Systems

Abstract

This paper presents a new knowledge-based vector space model (VSM) for text clustering. In the new model, semantic relationships between terms (e.g., words or concepts) are incorporated into the vector representation of text documents, so that the dissimilarity between two documents can be computed more effectively and clustering results improved. The semantic relationship between two terms is defined as their similarity, which is used to re-weight term frequencies in the VSM. We study two similarity measures, each based on a different approach. The first approach relies on existing ontologies such as WordNet and MeSH: we define a new similarity measure that combines the edge-counting technique, the average distance and the position weighting method to compute the similarity of two terms from an ontology hierarchy. The second approach uses text corpora to construct the relationships between terms and then calculates their semantic similarities. Three clustering algorithms (bisecting k-means, feature weighting k-means and a hierarchical clustering algorithm) were used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that clustering performance with the new model was much better than with the traditional term-based VSM.
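
As a rough illustration of the re-weighting idea described in the abstract, the sketch below spreads each term's frequency onto semantically related terms before comparing documents. The abstract does not give the exact re-weighting formula, so the linear form used here (each new weight is a similarity-weighted sum of the raw frequencies), the cosine dissimilarity, the three-term vocabulary and the similarity values in `SIM` are all assumptions made for illustration, not the authors' method or data.

```python
import math


def reweight(tf, sim):
    """Re-weight raw term frequencies with a term-term similarity matrix.

    tf  : list of raw term frequencies for one document
    sim : square matrix with sim[i][j] in [0, 1] and sim[i][i] == 1
    Assumed form: new_tf[i] = sum_j sim[i][j] * tf[j]
    """
    n = len(tf)
    return [sum(sim[i][j] * tf[j] for j in range(n)) for i in range(n)]


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Hypothetical vocabulary: ["car", "automobile", "banana"]
# Hypothetical similarities: "car" and "automobile" are near-synonyms.
SIM = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [2, 0, 0]  # mentions only "car"
doc_b = [0, 3, 0]  # mentions only "automobile"

# Term-based VSM: the documents share no terms.
plain = cosine_dissimilarity(doc_a, doc_b)

# Knowledge-based VSM (assumed form): related terms now overlap.
semantic = cosine_dissimilarity(reweight(doc_a, SIM), reweight(doc_b, SIM))

print(f"term-based dissimilarity:      {plain:.4f}")
print(f"knowledge-based dissimilarity: {semantic:.4f}")
```

In this toy case the two documents are maximally dissimilar under the term-based VSM because they share no terms, while the re-weighted vectors are close, which is the effect the abstract attributes to including term similarity.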

Author information

Correspondence to Liping Jing.

Additional information

Part of this work was done at The University of Hong Kong and Hong Kong Baptist University.

About this article

Cite this article

Jing, L., Ng, M.K. & Huang, J.Z. Knowledge-based vector space model for text clustering. Knowl Inf Syst 25, 35–55 (2010). https://doi.org/10.1007/s10115-009-0256-5

  • DOI: https://doi.org/10.1007/s10115-009-0256-5
