Abstract
Variation in natural language vocabulary remains a challenge for text representation, as the same idea can be expressed in many different ways. Document representations therefore often rely on generalisation, mapping low-level lexical expressions to higher-level concepts in order to capture the inherent semantics of the documents. Term-relatedness measures are often used to generalise document representations by capturing semantic relationships between terms. In this work we conduct a comparative study of common term-relatedness metrics on 43 datasets and find that generalisation is not always beneficial. Hence, given the computational overhead of generalisation, it is important to be able to predict whether or not to generalise the indexing vocabulary of a dataset. Accordingly, we present a case-based approach that predicts, given a text dataset, whether or not using generalisation will improve text retrieval performance. The evaluation shows that our approach correctly predicts which datasets are likely to benefit from generalisation with over 90% accuracy.
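To make the idea of generalisation concrete, the following is a minimal sketch of one common way a term-relatedness matrix can be applied to a bag-of-words vector. It is not taken from the paper: the three-word vocabulary, the relatedness values in `T`, and the helper names `generalise` and `cosine` are all illustrative assumptions.

```python
from math import sqrt

# Hypothetical toy vocabulary and term-relatedness matrix (values are made up).
VOCAB = ["car", "automobile", "banana"]
T = [
    [1.0, 0.8, 0.0],  # car is strongly related to automobile
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],  # banana is unrelated to both
]


def generalise(doc, relatedness):
    """Spread each term's weight onto related terms: d'_j = sum_i d_i * T[i][j]."""
    n = len(doc)
    return [sum(doc[i] * relatedness[i][j] for i in range(n)) for j in range(n)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Two documents that express the same idea with different words.
doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"

gen_a = generalise(doc_a, T)
gen_b = generalise(doc_b, T)

print("similarity without generalisation:", cosine(doc_a, doc_b))
print("similarity with generalisation:   ", round(cosine(gen_a, gen_b), 4))
```

In this toy setting the two documents share no terms, so their similarity is zero until the relatedness matrix spreads weight between "car" and "automobile". The extra matrix product is also where the computational overhead mentioned above comes from, which is why predicting whether a dataset benefits matters.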
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sani, S., Wiratunga, N., Massie, S., Lothian, R. (2013). Should Term-Relatedness Be Used in Text Representation? In: Delany, S.J., Ontañón, S. (eds) Case-Based Reasoning Research and Development. ICCBR 2013. Lecture Notes in Computer Science, vol 7969. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39056-2_21
DOI: https://doi.org/10.1007/978-3-642-39056-2_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39055-5
Online ISBN: 978-3-642-39056-2
eBook Packages: Computer Science, Computer Science (R0)