Frontiers of Computer Science, Volume 12, Issue 4, pp 694–713

Dropout training for SVMs with data augmentation

  • Ning Chen
  • Jun Zhu
  • Jianfei Chen
  • Ting Chen
Research Article

Abstract

Dropout and other feature noising schemes have shown promise in controlling over-fitting by artificially corrupting the training data. Though extensive studies have been performed for generalized linear models, little has been done for support vector machines (SVMs), one of the most successful approaches for supervised learning. This paper presents dropout training for both linear SVMs and a nonlinear extension with latent representation learning. For linear SVMs, to deal with the intractable expectation of the non-smooth hinge loss under corrupting distributions, we develop an iteratively re-weighted least squares (IRLS) algorithm by exploring data augmentation techniques. Our algorithm iteratively minimizes the expectation of a re-weighted least squares problem, where the re-weights are analytically updated. For nonlinear latent SVMs, we consider learning one layer of latent representations in SVMs and extend the data augmentation technique, in conjunction with a first-order Taylor expansion, to deal with the intractable expected hinge loss and the nonlinearity of latent representations. Finally, we apply similar data augmentation ideas to develop a new IRLS algorithm for the expected logistic loss under corrupting distributions, and we further develop a nonlinear extension of logistic regression by incorporating one layer of latent representations. Our algorithms offer insights into the connections and differences between the hinge loss and the logistic loss in dropout training. Empirical results on several real datasets demonstrate the effectiveness of dropout training in significantly boosting the classification accuracy of both linear and nonlinear SVMs.
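To make the linear IRLS idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes unbiased dropout, where each feature is zeroed with probability q and the survivors are rescaled by 1/(1 - q), so that the mean and variance of the corrupted margin have closed forms. It then minimizes an upper bound on the expected hinge loss obtained from max(0, z) = (z + |z|)/2 and E|z| <= (E[z^2]/gamma + gamma)/2, which gives an analytic re-weight update gamma_n = sqrt(E[z_n^2]). The function name, the parameters `q`, `reg`, `n_iter`, `eps`, and the exact scaling of the objective are illustrative assumptions and may differ from the paper.

```python
import numpy as np


def dropout_svm_irls(X, y, q=0.5, reg=1.0, n_iter=50, eps=1e-8):
    """IRLS sketch for a linear SVM trained under unbiased dropout noise.

    Minimizes an upper bound on
        (reg / 2) * ||w||^2 + sum_n E[max(0, 1 - y_n * w @ x_tilde_n)],
    where x_tilde_n is x_n with each feature zeroed with probability q and
    the survivors rescaled by 1 / (1 - q). Labels y must be in {-1, +1}.
    Returns the weights and the bound value recorded before each update.
    """
    _, d = X.shape
    r = q / (1.0 - q)  # Var[x_tilde_d] = r * x_d**2 under unbiased dropout
    X_sq = X ** 2
    w = np.zeros(d)
    bound_values = []
    for _ in range(n_iter):
        # First and second moments of zeta_n = 1 - y_n * w @ x_tilde_n.
        mean_zeta = 1.0 - y * (X @ w)
        second_moment = mean_zeta ** 2 + r * (X_sq @ (w ** 2))
        # Analytic re-weight update: gamma_n = sqrt(E[zeta_n^2]).
        gamma = np.sqrt(np.maximum(second_moment, eps))
        # Value of the (tight) bound at the current w, for monitoring.
        bound_values.append(0.5 * reg * (w @ w) + 0.5 * np.sum(mean_zeta + gamma))
        # Re-weighted least-squares step: solve the normal equations A w = b.
        inv_gamma = 1.0 / gamma
        A = (2.0 * reg * np.eye(d)
             + (X.T * inv_gamma) @ X
             + r * np.diag(X_sq.T @ inv_gamma))
        b = X.T @ ((1.0 + inv_gamma) * y)
        w = np.linalg.solve(A, b)
    return w, np.array(bound_values)
```

Calling `dropout_svm_irls(X, y)` with a feature matrix `X` of shape (n, d) and labels `y` in {-1, +1} returns the learned weights together with the sequence of bound values. Because each iteration is a majorize-then-minimize step, that sequence should be non-increasing, which is a convenient sanity check.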

Keywords

dropout, SVMs, logistic regression, data augmentation, iteratively reweighted least squares

Supplementary material

11704_2018_7314_MOESM1_ESM.ppt (230 kb)

Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. MOE Key Lab of Bioinformatics, Bioinformatics Division and Center for Synthetic and Systems Biology, TNLIST, Tsinghua University, Beijing, China
  2. State Key Lab of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, China
