Asynchronous Feature Extraction for Large-Scale Linear Predictors

  • Shin Matsushima
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9851)


Learning from datasets with a massive number of candidate features in order to obtain more accurate predictors is being studied intensively. In this paper, we aim to solve L1-regularized risk minimization problems efficiently with respect to both time and space. This is accomplished by concentrating on the few effective features among a large number of unnecessary ones. To achieve this, we propose a multithreaded scheme that simultaneously runs two processes: one extracts seemingly important features into main memory, and the other updates the parameters of only those features. Computational experiments show that the proposed scheme can handle terabyte-scale optimization problems on a single machine.
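The two-process idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the squared loss, the screening rule (keep a feature if its gradient magnitude at w = 0 exceeds the regularization weight), and the names `train`, `loader`, `updater`, `columns`, and `final_sweeps` are all assumptions made for illustration. One thread moves seemingly important features into an in-memory active set while the other runs L1-regularized coordinate descent over only the features loaded so far.

```python
import queue
import threading


def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def train(columns, y, lam, final_sweeps=50):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 over loaded features.

    columns: dict feature_id -> list of n floats. It stands in for a
    feature store that is too large to hold in memory (assumption).
    """
    n = len(y)
    inbox = queue.Queue()      # features handed from loader to updater
    done = threading.Event()   # set once the loader has seen every feature
    active = {}                # feature_id -> column held "in memory"
    w = {}                     # weights of active features only
    residual = list(y)         # y - Xw for the current w

    def loader():
        # Hypothetical screening rule: load a feature only if the gradient
        # magnitude at w = 0 exceeds lam. This is a heuristic, not a safe rule.
        try:
            for j, col in columns.items():
                score = abs(sum(v * t for v, t in zip(col, y))) / n
                if score > lam:
                    inbox.put((j, col))
        finally:
            done.set()

    def drain():
        # Move every feature currently waiting in the queue into the active set.
        while True:
            try:
                j, col = inbox.get_nowait()
            except queue.Empty:
                return
            active[j] = col
            w[j] = 0.0

    def sweep():
        # One pass of coordinate descent over the active features.
        for j, col in active.items():
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue
            rho = sum(v * (r + v * w[j]) for v, r in zip(col, residual)) / n
            new = soft_threshold(rho, lam) / norm
            delta = new - w[j]
            if delta != 0.0:
                for i, v in enumerate(col):
                    residual[i] -= delta * v
                w[j] = new

    def updater():
        while True:
            finished = done.is_set()  # read before draining so no put is missed
            drain()
            sweep()
            if finished:
                break
        for _ in range(final_sweeps):
            sweep()

    threads = [threading.Thread(target=loader), threading.Thread(target=updater)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

In this sketch only the updater thread mutates the active set, the weights, and the residual, so the two threads communicate solely through the queue and the event. Features that never pass the screen never occupy memory or receive updates, which is the space and time saving the abstract describes.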


Keywords (added by machine, not by the author): Relative Entropy; Coordinate Descent; Stochastic Gradient Descent; Empirical Risk Minimization; Coordinate Descent Method



The author thanks S.V.N. Vishwanathan for the illuminating discussion to initiate the basic idea and Hiroshi Iida and Hiroshi Nakagawa for their suggestive trials and arguments on preceding schemes. This work is partially supported by MEXT KAKENHI Grant Number 23240019 and JST-CREST.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. The University of Tokyo, Tokyo, Japan
