A New Semi-supervised Dimension Reduction Technique for Textual Data Analysis

  • Manuel Martín-Merino
  • Jesus Román
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4224)


Dimension reduction techniques are important preprocessing algorithms for high dimensional applications: they reduce noise while preserving the main structure of the dataset. They have been successfully applied to a large variety of problems, particularly in text mining.

However, the algorithms proposed in the literature often suffer from low discriminant power due to their unsupervised nature and to the ‘curse of dimensionality’. Fortunately, several search engines such as Yahoo provide a manually created classification of a subset of documents that may be exploited to overcome this problem.

In this paper we propose a semi-supervised version of a PCA-like algorithm for textual data analysis. The new method reduces the dimensionality of the term space by taking advantage of this document classification. The proposed algorithm has been evaluated on a text mining problem, where it outperforms well-known unsupervised techniques.
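The abstract does not spell out the algorithm, so the following is only a generic sketch of how a partial document classification can bias a PCA-like projection. It is not the authors' method. It assumes a document-by-term weight matrix, a label of -1 for unclassified documents, and a hypothetical mixing weight `lam` that blends the ordinary PCA scatter of all documents with the between-class scatter of the labeled subset. The function name, the parameters, and the toy data are all illustrative assumptions.

```python
import numpy as np


def semi_supervised_projection(X, labels, n_components=2, lam=0.5):
    """Project documents onto a PCA-like basis biased toward known classes.

    X      : (n_docs, n_terms) array of term weights.
    labels : length n_docs sequence; class id for labeled docs, -1 otherwise.
    lam    : hypothetical mixing weight in [0, 1]; lam=0 gives plain PCA.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Xc = X - X.mean(axis=0)

    # Unsupervised part: total scatter of all documents (plain PCA).
    total = Xc.T @ Xc / X.shape[0]

    # Supervised part: between-class scatter of the labeled subset only.
    between = np.zeros_like(total)
    labeled = labels >= 0
    n_labeled = int(labeled.sum())
    if n_labeled > 0:
        labeled_mean = X[labeled].mean(axis=0)
        for c in np.unique(labels[labeled]):
            members = X[labels == c]
            diff = (members.mean(axis=0) - labeled_mean)[:, None]
            between += members.shape[0] * (diff @ diff.T)
        between /= n_labeled

    mixed = (1.0 - lam) * total + lam * between
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    _, eigvecs = np.linalg.eigh(mixed)
    basis = eigvecs[:, ::-1][:, :n_components]
    return Xc @ basis, basis


# Toy collection: two labeled documents per class and one unlabeled document.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [1, 1, 1, 1],
], dtype=float)
labels = [0, 0, 1, 1, -1]
Z, basis = semi_supervised_projection(X, labels, n_components=1, lam=0.5)
```

With `lam` set to 0 the sketch reduces to ordinary PCA on the centered matrix. Larger values pull the leading directions toward those that separate the known classes, which is the general intuition behind using a manually created classification to improve discriminant power.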


Partial Little Square; Cosine Similarity; Latent Semantic Indexing; Textual Collection; Dimension Reduction Technique
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Manuel Martín-Merino (1)
  • Jesus Román (1)
  1. Universidad Pontificia de Salamanca, Salamanca, Spain
