Abstract
This paper proposes a projection-based symmetrical factorisation method for extracting semantic features from collections of text documents stored in a Latent Semantic space. Preliminary experimental results demonstrate that the method yields a representation comparable to that provided by a novel probabilistic approach, which reconsiders the entire problem of indexing text documents and works directly in the original high-dimensional vector-space representation of text. The projection index employed is derived from the a priori constraints on the problem. The principal advantage of the approach is computational efficiency, obtained by exploiting Latent Semantic Indexing as a preprocessing stage. Simulation results on subsets of the 20-Newsgroups text corpus in various settings are provided.
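As a point of orientation, the Latent Semantic Indexing preprocessing stage mentioned above is commonly realised as a truncated singular value decomposition of a term-by-document matrix. The following is a minimal sketch of that standard step only, not of the paper's projection-based factorisation. The function name `lsi_project`, the toy matrix `X`, and the choice `k=2` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Map documents into a k-dimensional latent semantic space.

    term_doc: array of shape (n_terms, n_docs), e.g. raw term counts.
    Returns the top-k left singular vectors, the top-k singular values,
    and the k x n_docs document coordinates S_k @ V_k^T.
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]
    s_k = s[:k]
    # Document coordinates in the latent space (one column per document).
    docs_k = s_k[:, None] * Vt[:k, :]
    return U_k, s_k, docs_k

# Hypothetical toy corpus: 4 terms (rows) by 4 documents (columns).
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

U_k, s_k, docs_k = lsi_project(X, k=2)
```

Any subsequent feature-extraction step would then operate on the small `k x n_docs` matrix `docs_k` rather than on the full high-dimensional term space, which is where the efficiency gain described in the abstract would come from.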
Cite this article
Kabán, A., Girolami, M.A. Fast Extraction of Semantic Features from a Latent Semantic Indexed Text Corpus. Neural Processing Letters 15, 31–43 (2002). https://doi.org/10.1023/A:1013801028884