Abstract
We propose a new feature selection algorithm for binary classification in sparse high-dimensional spaces. Singular value decomposition (SVD) is a popular dimension reduction method in high-dimensional classification. The traditional SVD approach begins by ranking the singular dimensions (SDs) from the largest singular value to the smallest. However, when the signal features are outnumbered by the noise features, the top-ranked SDs are not necessarily the best for classification. We demonstrate, theoretically and empirically, that our method efficiently selects the SDs most appropriate for classification and significantly reduces the misclassification error. We also apply our method to real-data text mining applications.
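To make the contrast between the two rankings concrete, the following is a minimal Python sketch on synthetic data. It is not the selection rule developed in the paper: as a stand-in criterion it assumes the absolute Welch two-sample t-statistic of each SD's sample scores, and the function name `rank_singular_dimensions`, the data sizes, and the effect sizes are all hypothetical choices made for illustration.

```python
import numpy as np


def rank_singular_dimensions(X, y, k):
    """Rank singular dimensions (SDs) by class separation, not singular value.

    Assumption: the criterion is the absolute Welch two-sample t-statistic of
    the sample scores. This is an illustrative stand-in, not the paper's rule.
    """
    Xc = X - X.mean(axis=0)                       # center each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                                # sample coordinates in each SD
    a, b = scores[y == 0], scores[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / np.maximum(se, 1e-12)
    t[s <= 1e-10 * s[0]] = 0.0                    # ignore numerically null SDs
    order = np.argsort(-np.abs(t))                # most discriminative SD first
    return order[:k], t


# Synthetic data (hypothetical sizes): 60 samples, 200 features.
rng = np.random.default_rng(0)
n, p = 60, 200
y = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
X[:, 100:] += 3.0 * rng.standard_normal((n, 1))   # strong factor unrelated to class
X[y == 1, :10] += 3.0                             # class signal in only 10 features

selected, t = rank_singular_dimensions(X, y, k=3)
print("first SDs by singular value:", [0, 1, 2])
print("first SDs by |t|:", selected.tolist())
```

In this construction the largest singular value belongs to the class-unrelated factor, so a ranking by singular value puts a noise direction first, while the t-statistic ranking favors the SD that carries the class difference.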
Acknowledgements
The authors would like to thank the associate editor and referees for their helpful comments and suggestions.
Cite this article
Tadesse, D.G., Carpenter, M. A method for selecting the relevant dimensions for high-dimensional classification in singular vector spaces. Adv Data Anal Classif 13, 405–426 (2019). https://doi.org/10.1007/s11634-018-0311-8
DOI: https://doi.org/10.1007/s11634-018-0311-8