A method for selecting the relevant dimensions for high-dimensional classification in singular vector spaces

  • Regular Article
Advances in Data Analysis and Classification

Abstract

In this paper, we propose a new feature selection algorithm for binary classification in sparse high-dimensional spaces. Singular value decomposition (SVD) is a popular dimension reduction method in high-dimensional classification. The traditional SVD approach ranks the singular dimensions (SDs) from the largest singular value to the smallest. However, when the signal features are fewer than the noise features, the top-ranked SDs are not necessarily the best for classification. We demonstrate, theoretically and empirically, that our method efficiently selects the SDs most appropriate for classification and significantly reduces the misclassification error. We also apply our method to real-data text mining applications.
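The contrast between ranking SDs by singular value and ranking them by class relevance can be illustrated with a small sketch. The abstract does not state the selection criterion the authors use, so the Welch two-sample t-statistic below is a stand-in assumption, and the function names `make_demo_data` and `rank_singular_dimensions` and the synthetic data are hypothetical. The sketch only shows why a high-variance SD need not separate the classes.

```python
import numpy as np


def make_demo_data(n_per_class=100, p=50, seed=0):
    """Hypothetical synthetic data: a weak class signal plus two strong nuisance factors."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_class
    y = np.repeat([0, 1], n_per_class)
    X = rng.standard_normal((n, p))
    # Class signal: a mean shift of 2 in the first five features only.
    X[y == 1, :5] += 2.0
    # Nuisance factors: high variance, generated independently of the class label.
    w1 = np.zeros(p)
    w1[10:20] = 1.0 / np.sqrt(10)
    w2 = np.zeros(p)
    w2[20:30] = 1.0 / np.sqrt(10)
    X += 10.0 * rng.standard_normal((n, 1)) * w1
    X += 8.0 * rng.standard_normal((n, 1)) * w2
    return X, y


def rank_singular_dimensions(X, y):
    """Return (order by singular value, order by |t|, |t| per SD).

    The t-statistic criterion is an illustrative assumption, not the
    selection rule of the paper.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > 1e-10 * s[0]          # drop numerically null dimensions
    scores = (U * s)[:, keep]        # projections of the rows onto each SD
    a, b = scores[y == 0], scores[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.maximum(se, 1e-12)
    by_sv = np.arange(scores.shape[1])   # SVD already sorts by singular value
    by_t = np.argsort(-t)                # most class-separating SD first
    return by_sv, by_t, t


if __name__ == "__main__":
    X, y = make_demo_data()
    by_sv, by_t, t = rank_singular_dimensions(X, y)
    print("top SDs by singular value:", by_sv[:3])
    print("top SDs by |t|:", by_t[:3])
```

In this constructed example the two largest singular values belong to the nuisance factors, so a classifier built on the leading SDs alone would mostly see variation unrelated to the label.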



Acknowledgements

The authors would like to thank the associate editor and referees for their helpful comments and suggestions.

Author information

Corresponding author

Correspondence to Dawit G. Tadesse.

Cite this article

Tadesse, D.G., Carpenter, M. A method for selecting the relevant dimensions for high-dimensional classification in singular vector spaces. Adv Data Anal Classif 13, 405–426 (2019). https://doi.org/10.1007/s11634-018-0311-8

  • DOI: https://doi.org/10.1007/s11634-018-0311-8