Abstract
Learning a compact predictive model in an online setting has recently attracted a great deal of attention. Combining online learning with sparsity-inducing regularization enables faster learning with a smaller memory footprint than previous learning frameworks. Many optimization methods and learning algorithms have been developed on the basis of online learning with L1-regularization. However, L1-regularization tends to truncate certain types of parameters, such as those corresponding to features that rarely occur or have a small range of values, unless those features are emphasized in advance, and introducing such a pre-processing step would forfeit much of the advantage of online learning. We propose a new regularization framework for sparse online learning. We focus on the regularization term and enhance the state-of-the-art regularization approach by integrating information on all previous subgradients of the loss function into that term. The resulting algorithms allow online learning to adjust the intensity of each feature's truncation without pre-processing and eventually eliminate the bias of L1-regularization. We show theoretical properties of our framework, namely its computational complexity and an upper bound on the regret. Experiments demonstrated that our algorithms outperformed previous methods on many classification tasks.
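To make the idea concrete, the following is a minimal sketch of what a subgradient-weighted truncation could look like, assuming the base learner is L1-regularized dual averaging (RDA) with hinge loss. The function name feature_aware_rda and the specific per-feature weighting (each feature's average absolute subgradient scales its truncation threshold, so rarely occurring features are truncated less aggressively) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def feature_aware_rda(stream, dim, lam=0.01, gamma=1.0):
    """Illustrative sketch: L1-regularized dual averaging (RDA) with a
    per-feature truncation threshold weighted by past subgradients.

    The weighting below (average absolute subgradient per feature) is an
    assumed form chosen for illustration; it shrinks the L1 penalty on
    rarely active features so they survive truncation.
    `stream` yields (x, y) pairs, x a dense np.ndarray, y in {-1, +1}.
    """
    w = np.zeros(dim)        # current weight vector
    g_sum = np.zeros(dim)    # running sum of loss subgradients
    g_abs = np.zeros(dim)    # running sum of |g_i|, a per-feature activity measure

    for t, (x, y) in enumerate(stream, start=1):
        # Subgradient of the hinge loss at the current weights.
        g = -y * x if y * w.dot(x) < 1.0 else np.zeros(dim)
        g_sum += g
        g_abs += np.abs(g)

        g_bar = g_sum / t
        # Feature-aware threshold (assumed form): features with little
        # accumulated subgradient mass get a small threshold and are
        # therefore truncated less often.
        lam_i = lam * g_abs / t

        # Standard RDA closed-form update via per-feature soft-thresholding.
        shrunk = np.maximum(np.abs(g_bar) - lam_i, 0.0)
        w = -(np.sqrt(t) / gamma) * np.sign(g_bar) * shrunk
    return w
```

With a uniform threshold (lam_i = lam for all i) this reduces to plain L1-RDA; the per-feature threshold is the only change, which matches the abstract's claim that the framework acts purely through the regularization term and needs no pre-processing pass over the data.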
Cite this article
Oiwa, H., Matsushima, S. & Nakagawa, H. Feature-aware regularization for sparse online learning. Sci. China Inf. Sci. 57, 1–21 (2014). https://doi.org/10.1007/s11432-014-5082-z