Dimensionality Reduction

Abstract

We closed the previous chapter by introducing the "curse of dimensionality", which refers to the sparseness problem that typically affects models involving a very large number of variables, i.e. high-dimensional spaces. This problem is alleviated in practice by the use of dimensionality reduction techniques, which aim at reducing the sparseness of the data representation by projecting the original model into a new space of lower dimensionality. There exist several different approaches to dimensionality reduction, and it constitutes a very common practice in data mining applications. Indeed, almost every standard data mining method or procedure involves some sort of dimensionality reduction. In this chapter we will focus our attention on three basic methods for dimensionality reduction. First, in Sect. 9.1, vocabulary pruning and merging methods will be presented. Then, in Sect. 9.2, the linear transformation approach to dimensionality reduction will be introduced and discussed. In Sect. 9.3, the use of non-linear projection methods for dimensionality reduction will be described. Finally, some relevant references to other important and commonly used methods are provided in the Further Reading section at the end of the chapter.
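
As a small preview of the linear transformation approach covered in Sect. 9.2, the following MATLAB sketch projects a toy term-document matrix into a lower-dimensional space by means of a truncated singular value decomposition. The matrix X and the target dimensionality k are illustrative assumptions, not the chapter's own worked example.

    % Toy term-document matrix: 6 terms (rows) by 4 documents (columns).
    % In practice this matrix would be built with the indexing procedures
    % described in the previous chapters (illustrative values only).
    X = [1 0 2 0;
         0 1 0 1;
         3 0 1 0;
         0 2 0 2;
         1 1 1 1;
         0 0 1 2];

    k = 2;   % target dimensionality (deliberately small; see Note 2)

    % Singular value decomposition of X, truncated to its k largest
    % singular values, so that X is approximated by Uk*Sk*Vk'
    [U, S, V] = svd(X, 'econ');
    Uk = U(:, 1:k);  Sk = S(1:k, 1:k);  Vk = V(:, 1:k);

    % Coordinates of the documents and the terms in the reduced space
    docs_k  = Sk * Vk';   % k-by-4 matrix: one column per document
    terms_k = Uk * Sk;    % 6-by-k matrix: one row per term

Similarities computed among the columns of docs_k (or the rows of terms_k) then play the role that document and term similarities played in the original high-dimensional space.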


Notes

  1.

    Several stop word lists for many different languages can easily be found on the web just by searching for "stop words". For instance, for a general-purpose English stop word list, you can check: http://www.textfixer.com/resources/common-english-words.txt. Accessed 16 July 2011. A minimal pruning sketch using such a list is given after these notes.

  2.

    In practical applications, with probably a few exceptions, it is not customary to implement dimensionality reductions like this one. Generally, reduced-space dimensionalities will be around 100–500 dimensions. Here, we are using an exaggeratedly low dimensionality for illustrative purposes.

  3.

    Although this can vary significantly from word to word, the general trend is to observe a more direct dependency on pair-wise co-occurrences in the high-dimensional space. On the other hand, in low-dimensional spaces, similarity scores among terms depend more on context similarities than on simple word co-occurrences.

  4.

    A similar example is also presented in the documentation page of the MATLAB® multidimensional scaling function: http://www.mathworks.com/help/toolbox/stats/briu08r-1.html. Accessed 16 September 2011.

  5.

    Different versions and/or initializations of the mdscale algorithm can produce different rotations, offsets and scaling factors; if you are not able to reproduce the results shown in Fig. 9.5, you will have to experiment for a while with the values of the variables rotang, scale1, scale2, offset1 and offset2 until you get an appropriate fit of the cities into the map. An illustrative alignment sketch is also given after these notes.
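
Following up on Note 1, the sketch below removes stop words from a short piece of text. The file name stopwords.txt (assumed to contain one stop word per line, for instance a list such as the one referred to in Note 1) and the sample sentence are hypothetical and only meant to illustrate the pruning step.

    % Load a stop word list, assumed to be stored one word per line
    % in a local file called stopwords.txt (hypothetical file name)
    fid = fopen('stopwords.txt');
    tmp = textscan(fid, '%s');
    fclose(fid);
    stopList = lower(tmp{1});

    % A toy sentence to be pruned (illustrative only)
    sentence = 'The quick brown fox jumps over the lazy dog';
    tokens   = regexp(lower(sentence), '[a-z]+', 'match');

    % Vocabulary pruning: discard every token found in the stop word list
    pruned = tokens(~ismember(tokens, stopList));
    disp(pruned)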
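
Regarding Notes 4 and 5, the following sketch illustrates the kind of adjustment those notes describe: an inter-city distance matrix is embedded into two dimensions with mdscale, and the resulting configuration is rotated, scaled and shifted so that it can be overlaid on a reference map. The distance values, the city names and the particular settings of rotang, scale1, scale2, offset1 and offset2 are all illustrative assumptions and will need to be tuned by hand, as Note 5 points out.

    % Illustrative inter-city distance matrix (symmetric, zero diagonal)
    cities = {'City A', 'City B', 'City C', 'City D'};
    D = [   0  450  800  600;
          450    0  400  350;
          800  400    0  500;
          600  350  500    0];

    % Two-dimensional configuration computed by multidimensional scaling
    Y = mdscale(D, 2);

    % Manual alignment of the configuration to a reference map (Note 5):
    % rotation angle, per-axis scaling factors and per-axis offsets
    rotang  = 30 * pi / 180;
    R       = [cos(rotang) -sin(rotang); sin(rotang) cos(rotang)];
    scale1  = 1;  scale2  = 1;
    offset1 = 0;  offset2 = 0;

    Yfit = (Y * R) * diag([scale1 scale2]) + ...
           repmat([offset1 offset2], size(Y, 1), 1);

    % Plot the aligned city locations and label each point
    plot(Yfit(:, 1), Yfit(:, 2), 'o')
    text(Yfit(:, 1) + 10, Yfit(:, 2), cities)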


Author information


Correspondence to Rafael E. Banchs.


Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Banchs, R.E. (2013). Dimensionality Reduction. In: Text Mining with MATLAB®. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4151-9_9

  • DOI: https://doi.org/10.1007/978-1-4614-4151-9_9

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-4150-2

  • Online ISBN: 978-1-4614-4151-9
