Abstract
The use of semantics in information retrieval tasks has become a vast field of research in recent years. In supervised text classification, the main interest of this work, semantics can be involved at different steps of text processing: indexing, training, and class prediction. At the class prediction step, new text-to-text semantic similarity measures can replace the classical similarity measures that some classification methods traditionally use for decision-making. In this paper we propose a new measure for assessing semantic similarity between texts. It is based on TF/IDF and uses a new function that aggregates the pairwise semantic similarities between the concepts representing the compared text documents. Experimental results demonstrate that our measure significantly outperforms other semantic and classical measures.
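To make the general idea concrete, the following is a minimal sketch of a TF/IDF-weighted aggregation of pairwise concept similarities. It is not the aggregation function proposed in the paper, which the abstract does not specify. It assumes a soft-cosine-style normalization, a smoothed IDF, and a toy concept similarity table; the names `tfidf_vectors`, `aggregate`, `semantic_similarity`, `toy_concept_sim`, and `RELATED` are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn lists of concept ids into {concept: weight} dicts.

    Assumption: relative term frequency times a smoothed IDF, so that
    every concept present in a document gets a strictly positive weight.
    """
    n = len(docs)
    df = Counter(c for doc in docs for c in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            c: (count / len(doc)) * (math.log((1 + n) / (1 + df[c])) + 1.0)
            for c, count in tf.items()
        })
    return vectors


def aggregate(u, v, concept_sim):
    """Sum concept similarities over all concept pairs, weighted by TF/IDF."""
    return sum(
        wu * wv * concept_sim(a, b)
        for a, wu in u.items()
        for b, wv in v.items()
    )


def semantic_similarity(u, v, concept_sim):
    """Soft-cosine-style normalization (an assumed choice, not the paper's)."""
    denom = math.sqrt(aggregate(u, u, concept_sim) * aggregate(v, v, concept_sim))
    return aggregate(u, v, concept_sim) / denom if denom > 0 else 0.0


# Hypothetical similarity table standing in for an ontology-based measure.
RELATED = {
    frozenset(("heart", "cardiac")): 0.9,
    frozenset(("attack", "arrest")): 0.6,
}


def toy_concept_sim(a, b):
    """Return 1.0 for identical concepts, a table value for related ones, else 0.0."""
    if a == b:
        return 1.0
    return RELATED.get(frozenset((a, b)), 0.0)


docs = [["heart", "attack"], ["cardiac", "arrest"], ["tax", "form"]]
vectors = tfidf_vectors(docs)
print(semantic_similarity(vectors[0], vectors[1], toy_concept_sim))
print(semantic_similarity(vectors[0], vectors[2], toy_concept_sim))
```

With an identity-only concept similarity, this reduces to the classical cosine measure over TF/IDF vectors, so the first two toy documents, which share no concept, would score zero. The semantic table is what lets them be recognized as related.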
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Albitar, S., Fournier, S., Espinasse, B. (2014). An Effective TF/IDF-Based Text-to-Text Semantic Similarity Measure for Text Classification. In: Benatallah, B., Bestavros, A., Manolopoulos, Y., Vakali, A., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2014. WISE 2014. Lecture Notes in Computer Science, vol 8786. Springer, Cham. https://doi.org/10.1007/978-3-319-11749-2_8
DOI: https://doi.org/10.1007/978-3-319-11749-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11748-5
Online ISBN: 978-3-319-11749-2
eBook Packages: Computer Science, Computer Science (R0)