A sparse \({\varvec{L}}_{2}\)-regularized support vector machines for efficient natural language learning
Abstract
Linear kernel support vector machines (SVMs) using either the \(L_{1}\)-norm or the \(L_{2}\)-norm have emerged as an important and widely used classification algorithm for many applications such as text chunking, part-of-speech tagging, information retrieval, and dependency parsing. \(L_{2}\)-norm SVMs usually provide slightly better accuracy than \(L_{1}\)-SVMs in most tasks. However, \(L_{2}\)-norm SVMs produce many near-zero but nonzero feature weights, and computing with these nonsignificant weights is highly time-consuming. In this paper, we present a cutting-weight algorithm to guide the optimization process of \(L_{2}\)-SVMs toward a sparse solution. Before checking optimality, our method automatically discards a set of near-zero but nonzero feature weights. The final objective is then achieved when the objective function is met by the remaining features and hypothesis. One characteristic of our cutting-weight algorithm is that it requires no changes to the original learning objective. To verify this concept, we conduct experiments on three well-known benchmarks: CoNLL-2000 text chunking, SIGHAN-3 Chinese word segmentation, and Chinese word dependency parsing. Compared with the original \(L_{2}\)-SVMs, our method reduces the number of feature parameters by a factor of 1–10 and achieves slightly better accuracy at a lower training time cost. In terms of run-time efficiency, our method is faster than the original \(L_{2}\)-regularized SVMs. For example, our sparse \(L_{2}\)-SVM is 2.55 times faster than the original \(L_{2}\)-SVMs with the same accuracy.
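The idea of cutting small weights before each optimality check can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details the abstract does not give. It assumes a plain gradient-descent solver for an \(L_{2}\)-regularized, squared-hinge-loss linear SVM and a fixed magnitude threshold. The names `prune_small_weights`, `train_sparse_l2_svm`, and the parameters `eps`, `lr`, and `tol` are hypothetical and chosen only for illustration.

```python
def prune_small_weights(w, eps):
    """Set every nonzero weight with magnitude below eps to zero (in place).

    Returns the number of weights that were cut. The threshold eps is an
    assumption of this sketch, not a value taken from the paper.
    """
    cut = 0
    for j, wj in enumerate(w):
        if wj != 0.0 and abs(wj) < eps:
            w[j] = 0.0
            cut += 1
    return cut


def objective(w, X, y, C):
    """0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i)^2 (unchanged by pruning)."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        loss += max(0.0, 1.0 - margin) ** 2
    return reg + C * loss


def train_sparse_l2_svm(X, y, C=1.0, eps=1e-3, lr=0.01, tol=1e-6, max_iter=1000):
    """Gradient descent on the objective above, pruning before each stopping check."""
    w = [0.0] * len(X[0])
    prev = objective(w, X, y, C)
    for _ in range(max_iter):
        # Gradient of the regularizer is w itself.
        grad = list(w)
        # Add the squared-hinge gradient of every margin-violating example.
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:
                coef = -2.0 * C * (1.0 - margin) * yi
                for j, xj in enumerate(xi):
                    grad[j] += coef * xj
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
        # Cut near-zero weights first, then test the stopping criterion
        # on the remaining (sparser) weight vector.
        prune_small_weights(w, eps)
        cur = objective(w, X, y, C)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return w


# Toy data (illustrative only): feature 0 is informative, feature 1 is nearly useless.
X = [[1.0, 0.01], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]]
y = [1, 1, -1, -1]

sparse_w = train_sparse_l2_svm(X, y, eps=1e-3)
dense_w = train_sparse_l2_svm(X, y, eps=0.0)  # eps=0 disables pruning
print("with pruning:   ", sparse_w)
print("without pruning:", dense_w)
```

In this sketch the objective function is the same with and without pruning; only the weight vector it is evaluated on changes. On the toy data, the weight of the weak feature stays exactly zero with pruning and remains small but nonzero without it.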
Keywords
L2-regularization, Support vector machines, Machine learning, Text chunking, Dependency parsing, Chinese word segmentation
Acknowledgments
The authors acknowledge support under NSC Grants NSC 101-2221-E-130-027- and NSC 101-2622-E-130-006-CC3.