Sparse large-margin nearest neighbor embedding via greedy dyad functional optimization

Published in Applied Intelligence

Abstract

We consider the sparse subspace learning problem, where the intrinsic subspace is assumed to be low-dimensional and formed by sparse basis vectors. Confined to a few sparse bases, projecting data onto the learned subspace essentially has the effect of feature selection: a small number of the most salient features are retained while the rest are suppressed as noise. Unlike existing sparse dimensionality reduction methods, however, we exploit the class labels to impose maximal-margin data separation in the subspace, which has previously been shown to yield improved prediction accuracy in non-sparse models. We first formulate an optimization problem with constraints on the matrix rank and the sparseness of the basis vectors. Instead of the computationally demanding gradient-based learning strategies used in previous large-margin embedding methods, we propose an efficient greedy functional optimization algorithm over the infinite set of sparse dyadic products. Each iteration of the proposed algorithm, after some shifting operations, effectively reduces to the well-known sparse eigenvalue problem and can be solved quickly by the recent truncated power method. We demonstrate the improved prediction performance of the proposed approach on several image/text classification datasets, in particular those characterized by high-dimensional noisy data samples with small training sets.
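
The full greedy dyad optimization procedure is described in the paper itself; as a rough illustration of the inner subproblem mentioned above, the following is a minimal sketch of the truncated power method for the sparse eigenvalue problem (Yuan and Zhang [38]). It assumes the input matrix has already been shifted to be positive semidefinite; the variable names, random initialization, and convergence test are illustrative choices, not the paper's implementation.

```python
import numpy as np

def truncated_power_method(Sigma, k, n_iter=200, tol=1e-8, seed=0):
    """Approximate the leading k-sparse eigenvector of a symmetric PSD matrix:
    repeat a power step, hard-truncate to the k largest-magnitude entries,
    and renormalize (sketch of Yuan & Zhang's truncated power method)."""
    rng = np.random.default_rng(seed)
    d = Sigma.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        v = Sigma @ u                      # power step
        keep = np.argsort(np.abs(v))[-k:]  # indices of the k largest |entries|
        w = np.zeros(d)
        w[keep] = v[keep]                  # hard truncation to a k-sparse vector
        w /= np.linalg.norm(w)
        if np.linalg.norm(w - u) < tol:    # stop when the iterate stabilizes
            return w
        u = w
    return u
```

The returned k-sparse unit vector approximately maximizes \(\mathbf{u}^{\top} \Sigma \mathbf{u}\) under the sparseness constraint, which is the form of subproblem each greedy stage reduces to according to the abstract.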


Notes

  1. Although semi-supervised learning could be incorporated to exploit a relatively larger number of unlabeled data, we do not consider it in this paper. It remains future work, but the proposed approach can easily be extended to semi-supervised setups using manifold regularization [3, 42] or related approaches, and could potentially benefit from them.

  2. In LMNN [37], the rank constraint was ignored, since the main goal there was learning the metric rather than finding a low-dimensional embedding as in our approach.

  3. When this happens, there are three possibilities: i) we have found a good embedding solution and it is appropriate to stop; ii) the sparseness constraint is too harsh (i.e., r is so small that there is no decent direction in the feasible space); iii) the maximal allowable rank was chosen too small (i.e., the rank penalty constant μ is so large that it overwhelms the maximum \(\mathbf{u}^{\top} \Sigma_{\mathbf{A}} \mathbf{u}\), which would otherwise yield a positive derivative). The latter two situations may also occur together, in which case one needs to tune the constants appropriately.

  4. Although we did not do this in our implementation, one can reduce the overhead by a mini-batch-type approximation under the stochastic gradient framework: when computing a sum/expectation over the data, it is approximated by the expectation over a small subset (batch) of the data (a minimal sketch is given after these notes).

  5. Our greedy approach usually takes a small number of stages, since each stage tends to increase the rank of A by one (see the schematic sketch following these notes).
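
As an aside on note 4, the following is a minimal sketch of the mini-batch-type approximation it describes (not used in the paper's implementation): an exact average over all data samples is replaced by an average over a small random subset. The per-sample function `per_sample_term` is a hypothetical placeholder standing in for whatever term the sum/expectation ranges over.

```python
import numpy as np

def exact_average(per_sample_term, X):
    """Exact sum/expectation over the whole dataset; cost grows with len(X)."""
    return np.mean([per_sample_term(x) for x in X], axis=0)

def minibatch_average(per_sample_term, X, batch_size=64, rng=None):
    """Mini-batch approximation: the expectation over all data is replaced by
    the mean over a small random batch, as in stochastic gradient methods."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
    return np.mean([per_sample_term(X[i]) for i in idx], axis=0)
```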

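To illustrate note 5, here is a schematic of a greedy stage-wise construction in which each stage adds one rank-one dyad \(\mathbf{u}\mathbf{u}^{\top}\) to the matrix A, so its rank typically grows by one per stage. This is only the structural skeleton; `sparse_direction` and `step_size` are hypothetical placeholders, not the paper's actual update rules.

```python
import numpy as np

def greedy_dyad_stages(d, n_stages, sparse_direction, step_size=1.0):
    """Schematic greedy construction: starting from A = 0, each stage picks a
    sparse unit vector u (via the placeholder `sparse_direction`) and adds the
    rank-one dyad u u^T, so rank(A) typically increases by one per stage."""
    A = np.zeros((d, d))
    for _ in range(n_stages):
        u = sparse_direction(A)             # e.g. a truncated-power-method solution
        A += step_size * np.outer(u, u)     # rank-one (dyad) update
    return A
```
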
References

  1. Anguita D, Ghio A, Oneto L, Parra X, Reyes-Ortiz JL (2012) Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. International Workshop of Ambient Assisted Living (IWAAL 2012), Vitoria-Gasteiz, Spain

  2. Bache K, Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml

  3. Belkin M, Niyogi P, Sindhwani V (2005) On manifold regularization. Artificial Intelligence and Statistics

  4. Blei D, McAuliffe J (2007) Supervised topic models. Neural Information Processing Systems

  5. Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  6. Clemmensen L, Hastie T, Witten D, Ersbøll B (2011) Sparse discriminant analysis. Technometrics 53(4):406–413

  7. Crammer K, Singer Y (2002) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2:265–292

  8. d’Aspremont A, Bach F, Ghaoui LE (2008) Optimal solutions for sparse principal component analysis. J Mach Learn Res 9:1269–1294

  9. d’Aspremont A, Ghaoui LE, Jordan M, Lanckriet G (2007) A direct formulation of sparse PCA using semidefinite programming. SIAM Rev 49(3):434–448

  10. d’Aspremont A, Ghaoui LE, Jordan M, Lanckriet GRG (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Rev 49:434–448

  11. Friedman J (1999) Greedy function approximation: a gradient boosting machine. Technical Report, Department of Statistics, Stanford University

  12. Fukumizu K, Bach F, Jordan M (2004) Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research

  13. Gang P, Zhen W, Zeng W, Gordienko Y, Kochura Y, Alienin O, Rokovyi O, Stirenko S (2018) Dimensionality reduction in deep learning for chest x-ray analysis of lung cancer. International Conference on Advanced Computational Intelligence (ICACI)

  14. Harchaoui Z, Douze M, Paulin M, Dudik M, Malick J (2012) Large-scale classification with trace-norm regularization. IEEE Conference on Computer Vision and Pattern Recognition

  15. He X, Niyogi P (2003) Locality preserving projections. In Advances in Neural Information Processing Systems

  16. Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, Mohamed S, Lerchner A (2017) beta-VAE: Learning basic visual concepts with a constrained variational framework. In: International conference on learning representations

  17. Hofmann T (1999) Probabilistic latent semantic analysis. Uncertainty in Artificial Intelligence

  18. Hollander M, Wolfe DA (1973) Nonparametric statistical methods. Wiley, New York

  19. Journée M, Nesterov Y, Richtárik P, Sepulchre R (2010) Generalized power method for sparse principal component analysis. J Mach Learn Res 11:517–553

  20. Kim H, Mnih A (2018) Disentangling by factorising. International Conference on Machine Learning

  21. Kim M, Pavlovic V (2007) A recursive method for discriminative mixture learning. International Conference on Machine Learning

  22. Kim M, Pavlovic V (2008) Dimensionality reduction using covariance operator inverse regression. IEEE Conference on Computer Vision and Pattern Recognition

  23. Kingma DP, Welling M (2014) Auto-encoding variational Bayes. In: Proceedings of the Second International Conference on Learning Representations, ICLR

  24. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems

  25. Lecun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11): 2278–2324

  26. LeCun Y, Jackel L, Bottou L, Brunot A, Cortes C, Denker J, Drucker H, Guyon I, Muller U, Sackinger E, Simard P, Vapnik V (1995) Comparison of learning algorithms for handwritten digit recognition. International Conference on Artificial Neural Networks

  27. Li KC (1991) Sliced inverse regression for dimension reduction. Journal of the American Statistical Association

  28. Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y (2017) Efficient algorithms for t-distributed stochastic neighborhood embedding. arXiv:1712.09005

  29. van der Maaten L (2014) Accelerating t-SNE using tree-based algorithms. J Mach Learn Res 15:3221–3245

  30. Moghaddam B, Weiss Y, Avidan S (2006) Generalized spectral bounds for sparse LDA. International Conference on Machine Learning

  31. Nilsson J, Sha F, Jordan M (2007) Regression on manifolds using kernel dimension reduction. International Conference on Machine Learning

  32. Pavlovic V (2004) Model-based motion clustering using boosted mixture modeling. Computer Vision and Pattern Recognition

  33. Roweis S, Saul L (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326

  34. Seung HS, Lee DD (2000) The manifold ways of perception. Science 290(5500):2268–2269

  35. Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323

  36. Wang C, Blei DM, Fei-Fei L (2009) Simultaneous image classification and annotation. IEEE International Conference on Computer Vision and Pattern Recognition

  37. Weinberger K, Saul L (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244

  38. Yuan XT, Zhang T (2013) Truncated power method for sparse eigenvalue problems. J Mach Learn Res 14:899–925

  39. Zhang C, Bi J, Xu S, Ramentol E, Fan G, Qiao B, Fujita H (2019) Multi-Imbalance: An open-source software for multi-class imbalance learning. Knowledge-Based Systems (Available online: https://doi.org/10.1016/j.knosys.2019.03.001)

  40. Zhang C, Liu C, Zhang X, Almpanidis G (2017) An up-to-date comparison of state-of-the-art classification algorithms. Expert Syst Appl 82(1):128–150

  41. Zhu J, Rosset S, Hastie T, Tibshirani R (2003) 1-norm support vector machines. In Advances in Neural Information Processing Systems

  42. Zhu X, Ghahramani Z, Lafferty J (2003) Semi-supervised learning using Gaussian fields and harmonic functions. International Conference on Machine Learning

  43. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc, Ser B 67:301–320

  44. Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Stat 15(2):265–286


Author information

Corresponding author

Correspondence to Minyoung Kim.

Ethics declarations

This study was supported by the Research Program funded by SeoulTech (Seoul National University of Science & Technology).

Conflict of interest

The authors have no conflict of interest. This research does not involve human participants or animals. Consent to submit this manuscript was received tacitly from the authors’ institution, Seoul National University of Science & Technology.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Kim, M. Sparse large-margin nearest neighbor embedding via greedy dyad functional optimization. Appl Intell 49, 3628–3640 (2019). https://doi.org/10.1007/s10489-019-01472-x
