Abstract
This paper presents a new knowledge-based vector space model (VSM) for text clustering. In the new model, semantic relationships between terms (e.g., words or concepts) are incorporated into the vector representation of text documents. The idea is to calculate the dissimilarity between two documents more effectively so that text clustering results can be improved. In this paper, the semantic relationship between two terms is defined by their similarity, which is used to re-weight term frequency in the VSM. We study two similarity measures for computing the semantic relationship between two terms, each based on a different approach. The first approach relies on existing ontologies such as WordNet and MeSH. We define a new similarity measure that combines the edge-counting technique, the average distance and the position weighting method to compute the similarity of two terms from an ontology hierarchy. The second approach uses text corpora to construct the relationships between terms and then calculates their semantic similarities. Three clustering algorithms, bisecting k-means, feature weighting k-means and a hierarchical clustering algorithm, were used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that the clustering performance based on the new model was much better than that based on the traditional term-based VSM.
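The re-weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the linear form below (each term's new weight is the similarity-weighted sum of all term frequencies) is an assumption, as are the function names `reweight` and `cosine_dissimilarity`, the toy vocabulary, and the similarity values in `SIM`. Cosine dissimilarity is likewise one plausible choice, not necessarily the measure used in the paper.

```python
import math


def reweight(tf, sim):
    """Re-weight term frequencies with a term-term similarity matrix.

    Assumed form (not taken from the paper): new[j] = sum_i sim[i][j] * tf[i].
    """
    n = len(tf)
    return [sum(sim[i][j] * tf[i] for i in range(n)) for j in range(n)]


def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical vocabulary: ["car", "automobile", "banana"].
# Similarity values are invented for illustration only.
SIM = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [2, 0, 0]  # mentions only "car"
doc_b = [0, 3, 0]  # mentions only "automobile"

# Term-based VSM: the documents share no terms, so they look unrelated.
plain = cosine_dissimilarity(doc_a, doc_b)

# Knowledge-based VSM: related terms contribute to each other's weights.
knowledge = cosine_dissimilarity(reweight(doc_a, SIM), reweight(doc_b, SIM))
```

In this toy case the two documents share no terms, so the term-based dissimilarity is maximal, while the re-weighted vectors are nearly parallel because "car" and "automobile" are treated as similar. The similarity matrix could come from either source described in the abstract, an ontology hierarchy or a text corpus.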
Additional information
Part of this work was done in The University of Hong Kong and Hong Kong Baptist University.
About this article
Cite this article
Jing, L., Ng, M.K. & Huang, J.Z. Knowledge-based vector space model for text clustering. Knowl Inf Syst 25, 35–55 (2010). https://doi.org/10.1007/s10115-009-0256-5
DOI: https://doi.org/10.1007/s10115-009-0256-5