Term Similarity and Weighting Framework for Text Representation

  • Sadiq Sani
  • Nirmalie Wiratunga
  • Stewart Massie
  • Robert Lothian
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6880)

Abstract

The expressiveness of natural language is a challenge for text representation, since the same idea can be expressed in many different ways. Terms in a document should therefore not be treated independently of one another, since together they help to disambiguate and establish meaning. Term-similarity measures are often used to improve representation by capturing semantic relationships between terms. Another consideration for representation is the importance of terms. Feature selection techniques address this by using statistical measures to quantify term usefulness for retrieval. In this paper we present a framework that combines term similarity and term weighting for text representation. This allows us to study comparatively the impact of term similarity, term weighting and any synergistic effect that may exist between them. Our study of term similarity is based on approaches that exploit term co-occurrences within document and sentence contexts, whilst term weighting uses the popular Chi-squared test. Our results on text classification tasks show that the combined effect of similarity and weighting is superior to that of each technique used independently, and that this synergistic effect is obtained regardless of co-occurrence context granularity.
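The abstract does not give the framework's formulas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a document-by-term count matrix, derives term similarity as the cosine between document-level co-occurrence vectors, computes a Chi-squared score per term from term presence and class labels, and combines the two by first spreading term counts over similar terms and then scaling by the weights. The function names, the order of combination and the toy data are all assumptions made for illustration.

```python
import numpy as np

def term_similarity(doc_term):
    """Cosine similarity between terms from document-level co-occurrence (assumed measure)."""
    term_vecs = doc_term.T.astype(float)            # one row per term
    norms = np.linalg.norm(term_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                         # guard against terms that never occur
    unit = term_vecs / norms
    return unit @ unit.T                            # shape: (n_terms, n_terms)

def chi_squared_weights(doc_term, labels):
    """Chi-squared statistic per term, taking the maximum over classes (assumed aggregation)."""
    present = doc_term > 0
    labels = np.asarray(labels)
    n_docs = doc_term.shape[0]
    scores = np.zeros(doc_term.shape[1])
    for cls in np.unique(labels):
        in_cls = labels == cls
        a = present[in_cls].sum(axis=0).astype(float)    # term present, class cls
        b = present[~in_cls].sum(axis=0).astype(float)   # term present, other classes
        c = in_cls.sum() - a                             # term absent, class cls
        d = (~in_cls).sum() - b                          # term absent, other classes
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        safe = np.where(denom > 0, denom, 1.0)           # avoid division by zero
        chi = np.where(denom > 0, n_docs * (a * d - c * b) ** 2 / safe, 0.0)
        scores = np.maximum(scores, chi)
    return scores

def represent(doc_term, similarity, weights):
    """Combine similarity and weighting: generalise counts, then scale each term (assumed order)."""
    return (doc_term @ similarity) * weights

# Hypothetical toy corpus: 4 documents, 3 terms, 2 classes.
doc_term = np.array([[2, 0, 1],
                     [1, 0, 1],
                     [0, 3, 1],
                     [0, 1, 1]])
labels = [0, 0, 1, 1]

S = term_similarity(doc_term)
w = chi_squared_weights(doc_term, labels)
X = represent(doc_term, S, w)
```

In this toy corpus the third term occurs in every document, so its Chi-squared score is zero and it is removed from the final representation, while the two class-specific terms receive equal weight. Sentence-level co-occurrence could be sketched the same way by passing a sentence-by-term matrix to `term_similarity`.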

Keywords

Semantic Relatedness, Cosine Similarity, Term Similarity, Sentence Context, Vector Space Model
These keywords were added by machine and not by the authors.


Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Sadiq Sani (1)
  • Nirmalie Wiratunga (1)
  • Stewart Massie (1)
  • Robert Lothian (1)
  1. School of Computing, The Robert Gordon University, Aberdeen, Scotland, UK
