Knowledge-based vector space model for text clustering

Jing, Liping; Ng, Michael K.; Huang, Joshua Z.

doi:10.1007/s10115-009-0256-5

Regular Paper
Published: 02 October 2009

Volume 25, pages 35–55 (2010)

Knowledge and Information Systems

Liping Jing, Michael K. Ng & Joshua Z. Huang

Abstract

This paper presents a new knowledge-based vector space model (VSM) for text clustering. In the new model, semantic relationships between terms (e.g., words or concepts) are incorporated when text documents are represented as a set of vectors. The idea is to calculate the dissimilarity between two documents more effectively so that text clustering results can be improved. The semantic relationship between two terms is defined by their similarity, and this similarity is used to re-weight term frequency in the VSM. We study two similarity measures for computing the semantic relationship between two terms, each based on a different approach. The first approach relies on existing ontologies such as WordNet and MeSH. We define a new similarity measure that combines the edge-counting technique, the average distance and the position weighting method to compute the similarity of two terms from an ontology hierarchy. The second approach uses text corpora to construct the relationships between terms and then calculates their semantic similarities. Three clustering algorithms (bisecting k-means, feature weighting k-means and a hierarchical clustering algorithm) were used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that clustering performance based on the new model was much better than that based on the traditional term-based VSM.
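
The re-weighting idea in the abstract can be illustrated with a minimal sketch. It is not the paper's exact formulation, which the abstract does not give. It assumes that each term's weight absorbs the frequencies of other terms in proportion to their similarity, and that documents are compared with cosine dissimilarity. The vocabulary, the similarity value 0.9 and the helper names (`similarity`, `reweight`, `cosine_dissimilarity`) are hypothetical and chosen only for illustration.

```python
from math import sqrt

# Toy vocabulary (hypothetical); each document is a term-frequency vector over it.
VOCAB = ["car", "automobile", "banana"]

# Hypothetical term-term similarities in [0, 1]. In the paper such values come
# from an ontology (e.g. WordNet, MeSH) or from a text corpus; 0.9 is made up.
SIM = {("car", "automobile"): 0.9}


def similarity(a, b):
    """Symmetric lookup; a term is fully similar to itself, unknown pairs get 0."""
    if a == b:
        return 1.0
    return SIM.get((a, b), SIM.get((b, a), 0.0))


def reweight(tf):
    """Assumed re-weighting: each term adds similar terms' frequencies,
    scaled by the term-term similarity."""
    return [
        sum(similarity(t, u) * tf[j] for j, u in enumerate(VOCAB))
        for t in VOCAB
    ]


def cosine_dissimilarity(x, y):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm else 1.0


doc_a = [2, 0, 0]  # mentions only "car"
doc_b = [0, 3, 0]  # mentions only "automobile"

plain = cosine_dissimilarity(doc_a, doc_b)
semantic = cosine_dissimilarity(reweight(doc_a), reweight(doc_b))
print(plain, semantic)
```

In the term-based VSM the two documents share no terms and are maximally dissimilar. After re-weighting with the assumed similarity they become close, which is the effect the knowledge-based model aims for before clustering.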

Author information

Authors and Affiliations

School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
Liping Jing
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
Michael K. Ng
E-Business Technology Institute, The University of Hong Kong, Pokfulam Road, Hong Kong, China
Joshua Z. Huang

Corresponding author

Correspondence to Liping Jing.

Additional information

Part of this work was done in The University of Hong Kong and Hong Kong Baptist University.

About this article

Cite this article

Jing, L., Ng, M.K. & Huang, J.Z. Knowledge-based vector space model for text clustering. Knowl Inf Syst 25, 35–55 (2010). https://doi.org/10.1007/s10115-009-0256-5

Received: 02 October 2008
Revised: 07 July 2009
Accepted: 11 September 2009
Published: 02 October 2009
Issue Date: October 2010
DOI: https://doi.org/10.1007/s10115-009-0256-5
