Abstract
Indexing of textual cases is commonly affected by the problem of variation in vocabulary. Semantic indexing is commonly used to address this problem by discovering semantic or conceptual relatedness between individual terms and using this to improve textual case representation. However, representations produced using this approach are not optimal for supervised tasks because standard semantic indexing approaches do not take into account class membership of these textual cases. Supervised semantic indexing approaches e.g. sprinkled Latent Semantic Indexing (SpLSI) and supervised Latent Dirichlet Allocation (sLDA) have been proposed for addressing this limitation. However, both SpLSI and sLDA are computationally expensive and require parameter tuning. In this work, we present an approach called Supervised Sub-Spacing (S3) for supervised semantic indexing of documents. S3 works by creating a separate sub-space for each class within which class-specific term relations and term weights are extracted. The power of S3 lies in its ability to modify document representations such that documents that belong to the same class are made more similar to one another while, at the same time, reducing their similarity to documents of other classes. In addition, S3 is flexible enough to work with a variety of semantic relatedness metrics and yet, powerful enough that it leads to significant improvements in text classification accuracy. We evaluate our approach on a number of supervised datasets and results show classification performance on S3-based representations to significantly outperform both a supervised version of Latent Semantic Indexing (LSI) called Sprinkled LSI, and supervised LDA.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Aggarwal, C.C., Zhai, C. (eds.): Mining Text Data. Springer (2012)
Blei, D., McAuliffe, J.: Supervised topic models. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds.) Advances in Neural Information Processing Systems 20, pp. 121–128. MIT Press, Cambridge (2008)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Chakraborti, S., Lothian, R., Wiratunga, N., Watt, S.: Sprinkling: Supervised latent semantic indexing. In: Lalmas, M., MacFarlane, A., Rüger, S.M., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. LNCS, vol. 3936, pp. 510–514. Springer, Heidelberg (2006)
Chakraborti, S., Mukras, R., Lothian, R., Wiratunga, N., Watt, S., Harper, D.: Supervised latent semantic indexing using adaptive sprinkling. In: Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI 2007, pp. 1582–1587 (2007)
Chakraborti, S., Wiratunga, N., Lothian, R., Watt, S.: Acquiring word similarities with higher order association mining. In: Weber, R.O., Richter, M.M. (eds.) ICCBR 2007. LNCS (LNAI), vol. 4626, pp. 61–76. Springer, Heidelberg (2007)
Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22–29 (1990)
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)
Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL 2002 Conference on Empirical Methods in Natural Language Processing, EMNLP 2002, vol. 10, pp. 79–86. Association for Computational Linguistics, Stroudsburg (2002)
Rohde, D.L.T., Gonnerman, L.M., Plaut, D.C.: An improved model of semantic similarity based on lexical co-occurence. Communications of the ACM 8, 627–633 (2006)
Sun, J.T., Chen, Z., Zeng, H.J., Lu, Y.C., Shi, C.Y., Ma, W.Y.: Supervised latent semantic indexing for document categorization. In: IEEE International Conference on Data Mining, pp. 535–538 (2004)
Tsatsaronis, G., Panagiotopoulou, V.: A generalized vector space model for text retrieval based on semantic relatedness. In: Proceedings of the Student Research Workshop at EACL 2009, pp. 70–78 (2009)
Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37, 141–188 (2010)
Wang, C., Blei, D., Fei-fei, L.: Simultaneous image classification and annotation. In: Proceedings of Computer Vision and Pattern Recognition (2009)
Weber, R.O., Ashley, K.D., Bruninghaus, S.: Textual case-based reasoning. Knowledge Engineering Review 20(3), 255–260 (2005)
Wong, S.K., Ziarko, W., Raghavan, V.V., Wong, P.C.: On modeling of information retrieval concepts in vector spaces. ACM Trans. Database Syst. 12(2), 299–321 (1987)
Xu, Z.E., Chen, M., Weinberger, K.Q., Sha, F.: An alternative text representation to tf-idf and bag-of-words. In: Proceedings of the 21st ACM Conferece of Information and Knowledge Management, CIKM (2012)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning, ICML 1997, pp. 412–420 (1997)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Sani, S., Wiratunga, N., Massie, S., Lothian, R. (2014). Supervised Semantic Indexing Using Sub-spacing. In: Lamontagne, L., Plaza, E. (eds) Case-Based Reasoning Research and Development. ICCBR 2014. Lecture Notes in Computer Science(), vol 8765. Springer, Cham. https://doi.org/10.1007/978-3-319-11209-1_30
Download citation
DOI: https://doi.org/10.1007/978-3-319-11209-1_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11208-4
Online ISBN: 978-3-319-11209-1
eBook Packages: Computer ScienceComputer Science (R0)