Abstract
Multiple correspondence analysis, or homogeneity analysis, typically operates on a set of objects characterised by categorical attributes. Its aims are visualisation, dimension reduction and, to some extent, factor analysis, with the solution computed by alternating least squares. With respect to dimension reduction, there are strong parallels to vector-based methods in Information Retrieval (IR) such as the Vector Space Model (VSM) or Latent Semantic Analysis (LSA). LSA applies a singular value decomposition (SVD) and discards a number of the smallest singular values, thereby generating a lower-dimensional retrieval space. In this paper, the HOMALS technique is adapted for use in IR by categorising the metric term frequencies in term-document matrices. In this context, dimension reduction is achieved by minimising the difference between the distances among objects in the reduced space and the corresponding distances in the full-dimensional space. An exemplary set of documents is submitted to the process and subsequently used for retrieval.
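The two preprocessing ideas mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy term-document matrix, the cut points `[1, 2]` and the helper names `truncated_svd` and `categorise` are assumptions chosen for illustration only. The first helper performs the LSA-style reduction by keeping the `k` largest singular values; the second turns metric term frequencies into ordered categories, which is the kind of input a homogeneity analysis expects.

```python
import numpy as np


def truncated_svd(A, k):
    """Return the factors of the rank-k SVD truncation of A (LSA-style)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Singular values come sorted in descending order, so slicing
    # keeps the k largest and discards the smallest ones.
    return U[:, :k], s[:k], Vt[:k, :]


def categorise(A, edges):
    """Map metric term frequencies to ordered category codes.

    The cut points in `edges` are an assumption of this sketch.
    """
    return np.digitize(A, edges)


# Hypothetical toy term-document matrix: rows are terms, columns are documents.
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 0, 2, 0],
], dtype=float)

k = 2
U_k, s_k, Vt_k = truncated_svd(A, k)

# Rank-k approximation of the original matrix.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Document coordinates in the k-dimensional retrieval space (one row per document).
doc_coords = (np.diag(s_k) @ Vt_k).T

# Categories: 0 = term absent, 1 = occurs once, 2 = occurs twice or more.
categories = categorise(A, edges=[1, 2])
```

In this sketch the rows of `doc_coords` could be compared with a cosine measure for retrieval, while `categories` would be the starting point for a categorical analysis. How the paper actually chooses the categories is not stated in the abstract.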
Notes
- 1. Here, words that do not discriminate between documents and carry no content, such as "a" or "and", are removed.
- 2. In stemming, certain word endings are removed or merged so that words with identical stems map to the same item.
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hildebrand, K.F., Müller-Funk, U. (2012). HOMALS for Dimension Reduction in Information Retrieval. In: Gaul, W., Geyer-Schulz, A., Schmidt-Thieme, L., Kunze, J. (eds) Challenges at the Interface of Data Analysis, Computer Science, and Optimization. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24466-7_36
DOI: https://doi.org/10.1007/978-3-642-24466-7_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24465-0
Online ISBN: 978-3-642-24466-7
eBook Packages: Mathematics and Statistics (R0)