Dimensionality Reduction

Abstract

We closed the previous chapter by introducing the "curse of dimensionality", which refers to the sparseness problem that typically affects models involving a very large number of variables, i.e. high-dimensional spaces. This problem is alleviated in practice by the use of dimensionality reduction techniques, which aim at reducing the sparseness of the data representation by projecting the original model into a new space of lower dimensionality. There exist several different approaches to dimensionality reduction, and it constitutes a very common practice in data mining applications. Indeed, almost every standard data mining method or procedure involves some sort of dimensionality reduction. In this chapter we will focus our attention on three basic methods for dimensionality reduction. First, in Sect. 9.1, vocabulary pruning and merging methods will be presented. Then, in Sect. 9.2, the linear transformation approach to dimensionality reduction will be introduced and discussed. In Sect. 9.3, the use of non-linear projection methods for dimensionality reduction will be described. Finally, some relevant references to other important and commonly used methods are provided in the Further Reading section at the end of the chapter.
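
As a small preview of the linear transformation approach covered in Sect. 9.2, the following MATLAB sketch projects a toy term-document matrix into a lower-dimensional space by means of a truncated singular value decomposition. The matrix X and the target dimensionality k are illustrative assumptions, not the chapter's own worked example.

    % Toy term-document matrix: 6 terms (rows) by 4 documents (columns).
    % In practice this matrix would be built with the indexing procedures
    % described in the previous chapters (illustrative values only).
    X = [1 0 2 0;
         0 1 0 1;
         3 0 1 0;
         0 2 0 2;
         1 1 1 1;
         0 0 1 2];

    k = 2;   % target dimensionality (deliberately small; see Note 2)

    % Singular value decomposition of X, truncated to its k largest
    % singular values, so that X is approximated by Uk*Sk*Vk'
    [U, S, V] = svd(X, 'econ');
    Uk = U(:, 1:k);  Sk = S(1:k, 1:k);  Vk = V(:, 1:k);

    % Coordinates of the documents and the terms in the reduced space
    docs_k  = Sk * Vk';   % k-by-4 matrix: one column per document
    terms_k = Uk * Sk;    % 6-by-k matrix: one row per term

Similarities computed among the columns of docs_k (or the rows of terms_k) then play the role that document and term similarities played in the original high-dimensional space.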


Notes

  1.

    Several stop word lists for many different languages can easily be found on the web just by searching for "stop words". For instance, for a general-purpose English stop word list, you can check: http://www.textfixer.com/resources/common-english-words.txt. Accessed 16 July 2011. A minimal pruning sketch using such a list is given after these notes.

  2.

    In practical applications, with probably a few exceptions, it is not customary to implement dimensionality reductions like this one. Generally, reduced-space dimensionalities will be around 100–500 dimensions. Here, we are using an exaggeratedly low dimensionality for illustrative purposes.

  3.

    Although this can vary significantly from word to word, the general trend is to observe a more direct dependency on pair-wise co-occurrences in the high-dimensional space. On the other hand, in low-dimensional spaces, similarity scores among terms depend more on context similarities than on simple word co-occurrences.

  4.

    A similar example is also presented in the documentation page of the MATLAB® multidimensional scaling function: http://www.mathworks.com/help/toolbox/stats/briu08r-1.html. Accessed 16 September 2011.

  5.

    Different versions and/or initializations of the mdscale algorithm can produce different rotations, offsets and scaling factors; if you are not able to reproduce the results shown in Fig. 9.5, you will have to experiment for a while with the values of the variables rotang, scale1, scale2, offset1 and offset2 until you get an appropriate fit of the cities into the map. An illustrative alignment sketch is also given after these notes.
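
Following up on Note 1, the sketch below removes stop words from a short piece of text. The file name stopwords.txt (assumed to contain one stop word per line, for instance a list such as the one referred to in Note 1) and the sample sentence are hypothetical and only meant to illustrate the pruning step.

    % Load a stop word list, assumed to be stored one word per line
    % in a local file called stopwords.txt (hypothetical file name)
    fid = fopen('stopwords.txt');
    tmp = textscan(fid, '%s');
    fclose(fid);
    stopList = lower(tmp{1});

    % A toy sentence to be pruned (illustrative only)
    sentence = 'The quick brown fox jumps over the lazy dog';
    tokens   = regexp(lower(sentence), '[a-z]+', 'match');

    % Vocabulary pruning: discard every token found in the stop word list
    pruned = tokens(~ismember(tokens, stopList));
    disp(pruned)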
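
Regarding Notes 4 and 5, the following sketch illustrates the kind of adjustment those notes describe: an inter-city distance matrix is embedded into two dimensions with mdscale, and the resulting configuration is rotated, scaled and shifted so that it can be overlaid on a reference map. The distance values, the city names and the particular settings of rotang, scale1, scale2, offset1 and offset2 are all illustrative assumptions and will need to be tuned by hand, as Note 5 points out.

    % Illustrative inter-city distance matrix (symmetric, zero diagonal)
    cities = {'City A', 'City B', 'City C', 'City D'};
    D = [   0  450  800  600;
          450    0  400  350;
          800  400    0  500;
          600  350  500    0];

    % Two-dimensional configuration computed by multidimensional scaling
    Y = mdscale(D, 2);

    % Manual alignment of the configuration to a reference map (Note 5):
    % rotation angle, per-axis scaling factors and per-axis offsets
    rotang  = 30 * pi / 180;
    R       = [cos(rotang) -sin(rotang); sin(rotang) cos(rotang)];
    scale1  = 1;  scale2  = 1;
    offset1 = 0;  offset2 = 0;

    Yfit = (Y * R) * diag([scale1 scale2]) + ...
           repmat([offset1 offset2], size(Y, 1), 1);

    % Plot the aligned city locations and label each point
    plot(Yfit(:, 1), Yfit(:, 2), 'o')
    text(Yfit(:, 1) + 10, Yfit(:, 2), cities)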


Author information


Correspondence to Rafael E. Banchs.


Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Banchs, R.E. (2013). Dimensionality Reduction. In: Text Mining with MATLAB®. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4151-9_9

  • DOI: https://doi.org/10.1007/978-1-4614-4151-9_9

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-4150-2

  • Online ISBN: 978-1-4614-4151-9
