Text Similarity Measurement Using Concept Representation of Texts

  • Abhinay Pandya
  • Pushpak Bhattacharyya
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3776)


Measuring the semantic nearness of documents is important for accurate information retrieval and for automated text categorization and classification. Motivated by the observation that a text document contains a semantically coherent set of ideas or topics, this paper presents the design and experimental evaluation of a method that represents a text document as a set of concepts. Based on this representation, we propose a method to measure the semantic nearness of texts. Our method uses WordNet, a lexico-semantic network of words, and bypasses word sense disambiguation. To show the effectiveness of our representation, we compare experimental results of text classification and clustering against those obtained with standard techniques.
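The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a tiny hand-made sense inventory (`TOY_SENSES`) and hypernym table (`TOY_HYPERNYMS`) in place of WordNet, keeps every sense of every word so that no word sense disambiguation is needed, and uses Jaccard overlap as an assumed nearness measure. All names and data in it are hypothetical.

```python
# Toy stand-in for WordNet: each word maps to the IDs of ALL its senses.
# These entries are invented for illustration; they are not real WordNet data.
TOY_SENSES = {
    "bank": {"financial_institution", "river_side"},
    "money": {"currency"},
    "loan": {"debt"},
    "deposit": {"bank_deposit", "sediment"},
    "river": {"stream"},
    "shore": {"river_side"},
    "water": {"liquid"},
}

# Toy hypernym links (concept -> more general concept), also invented.
TOY_HYPERNYMS = {
    "financial_institution": "finance",
    "currency": "finance",
    "debt": "finance",
    "bank_deposit": "finance",
    "river_side": "geography",
    "stream": "geography",
    "sediment": "geography",
    "liquid": "substance",
}


def concepts(text):
    """Map a text to a set of concepts, keeping every sense of every word.

    No sense is selected or discarded, so disambiguation is bypassed.
    """
    result = set()
    for word in text.lower().split():
        for sense in TOY_SENSES.get(word, ()):
            result.add(sense)
            parent = TOY_HYPERNYMS.get(sense)
            if parent is not None:
                result.add(parent)
    return result


def nearness(text_a, text_b):
    """Jaccard overlap of two concept sets (an assumed measure)."""
    a, b = concepts(text_a), concepts(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(nearness("bank loan money", "deposit money"))
    print(nearness("bank loan money", "river water"))
```

In this toy setting the first pair shares more concepts than the second, even though "bank" is never disambiguated: the shared general concepts of the surrounding words are what pull the two finance texts together.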



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Abhinay Pandya, DA-IICT, Gandhinagar
  • Pushpak Bhattacharyya, Dept. of CSE, IIT Bombay
