Knowledge and Information Systems, Volume 39, Issue 2, pp 305–328

A sparse \(L_{2}\)-regularized support vector machines for efficient natural language learning

Regular Paper

Abstract

Linear-kernel support vector machines (SVMs) using either the \(L_{1}\)-norm or the \(L_{2}\)-norm have emerged as an important and widely used classification algorithm for many applications such as text chunking, part-of-speech tagging, information retrieval, and dependency parsing. \(L_{2}\)-norm SVMs provide slightly better accuracy than \(L_{1}\)-norm SVMs in most tasks. However, \(L_{2}\)-norm SVMs produce many near-zero but nonzero feature weights, and computing with these nonsignificant weights is highly time-consuming. In this paper, we present a cutting-weight algorithm that guides the optimization process of \(L_{2}\)-SVMs toward a sparse solution. Before checking optimality, our method automatically discards a set of near-zero but nonzero feature weights. The final solution is reached when the objective function is satisfied by the remaining features and hypothesis. One characteristic of our cutting-weight algorithm is that it requires no changes to the original learning objective. To verify this concept, we conduct experiments on three well-known benchmarks: CoNLL-2000 text chunking, SIGHAN-3 Chinese word segmentation, and Chinese word dependency parsing. Compared with the original \(L_{2}\)-SVMs, our method achieves feature parameter reduction rates of 1–10 times, with slightly better accuracy and a lower training time cost. In terms of run-time efficiency, our method is reasonably faster than the original \(L_{2}\)-regularized SVMs. For example, our sparse \(L_{2}\)-SVM is 2.55 times faster than the original \(L_{2}\)-SVMs with the same accuracy.
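
The abstract does not spell out the cutting-weight algorithm itself, so the following is only a minimal sketch of the general idea it describes: optimize an unchanged \(L_{2}\)-regularized objective, and discard near-zero weights before each optimality check. Everything specific here is an assumption made for illustration and is not taken from the paper: the squared hinge loss, the plain gradient-descent solver, the fixed threshold `cut`, the `warmup` period, and the names `gradient`, `train_sparse_l2_svm`, and `predict`.

```python
def gradient(w, X, y, C, active):
    """Gradient of 0.5*||w||^2 + C*sum_i max(0, 1 - y_i*w.x_i)^2,
    restricted to the features that are still active."""
    grad = {j: w[j] for j in active}            # derivative of the L2 regularizer
    for xi, yi in zip(X, y):
        margin = yi * sum(w[j] * xi[j] for j in active)
        if margin < 1.0:                        # sample violates the margin
            coef = -2.0 * C * (1.0 - margin) * yi
            for j in active:
                grad[j] += coef * xi[j]
    return grad


def train_sparse_l2_svm(X, y, C=1.0, lr=0.05, epochs=200,
                        cut=1e-2, warmup=5, tol=1e-6):
    """Gradient descent with a weight-cutting step (illustrative only)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    active = set(range(n_features))
    for epoch in range(epochs):
        # Cutting step (assumed rule): after a short warm-up, discard weights
        # whose magnitude is below `cut`; they stay at zero from then on.
        if epoch >= warmup:
            for j in [j for j in active if abs(w[j]) < cut]:
                w[j] = 0.0
                active.discard(j)
        # Optimality check on the surviving features only; the objective
        # itself is the same as for the dense model.
        grad = gradient(w, X, y, C, active)
        if max((abs(g) for g in grad.values()), default=0.0) < tol:
            break
        for j in active:
            w[j] -= lr * grad[j]
    return w, active


def predict(w, x):
    """Sign of the linear score, using only nonzero weights."""
    score = sum(wj * xj for wj, xj in zip(w, x) if wj != 0.0)
    return 1 if score >= 0.0 else -1


# Toy data: feature 0 carries the label, feature 1 is a very weak copy of it,
# and feature 2 is uninformative.
X = [[1.0, 0.001, 0.5], [1.0, 0.001, -0.5],
     [-1.0, -0.001, 0.5], [-1.0, -0.001, -0.5]]
y = [1, 1, -1, -1]
w, active = train_sparse_l2_svm(X, y)
print("nonzero weights:", sum(1 for wj in w if wj != 0.0), "of", len(w))
```

In this sketch the speed-up at prediction time comes from `predict` skipping zeroed weights; a real implementation would more likely store the surviving weights in a sparse structure. How the threshold is chosen, and when the cut is applied, are exactly the details that the full paper would have to supply.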

Keywords

L2-regularization, Support vector machines, Machine learning, Text chunking, Dependency parsing, Chinese word segmentation

Notes

Acknowledgments

The authors acknowledge support under NSC Grants NSC 101-2221-E-130-027- and NSC 101-2622-E-130-006-CC3.

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  1. Department of Communication and Management, Ming Chuan University, Taipei, Taiwan
