Advances in Data Analysis and Classification

Volume 1, Issue 3, pp 221–239

Lambda pruning: an approximation of the string subsequence kernel for practical SVM classification and redundancy clustering

Regular Article

Abstract

The support vector machine (SVM) is a powerful learning algorithm, e.g., for classification and clustering tasks, that works even for complex data structures such as strings, trees, lists and general graphs. It is based on the use of a kernel function for measuring scalar products between data units. For analyzing string data, Lodhi et al. (J Mach Learn Res 2:419–444, 2002) introduced the String Subsequence kernel (SSK). In this paper we propose an approximation to SSK based on dropping higher-order terms (i.e., subsequences which are spread out more than a certain threshold), which reduces the computational burden of SSK. As we are also concerned with the practical application of complex kernels with high computational complexity and memory consumption, we provide an empirical model that predicts the runtime and memory of both the approximation and the original SSK from easily measurable properties of the input data. We report extensive results on the properties of the proposed approximation, SSK-LP, with respect to prediction accuracy, runtime and memory consumption. Using several real-life text mining datasets, we show that models based on SSK and SSK-LP perform similarly on a set of real-life learning tasks, and that the empirical runtime model is also useful for roughly determining the total learning time of an SVM using either kernel.
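The pruning idea from the abstract can be sketched in a few lines of Python. The sketch below is a deliberately naive, brute-force version of the (unnormalised) order-n SSK, not the efficient dynamic-programming recursion of Lodhi et al.: each length-n subsequence occurrence contributes lam**span, where span is the distance it covers in the string, and lambda pruning simply discards occurrences whose span exceeds a threshold, since for lam < 1 their contribution decays geometrically. The function and parameter names (`ssk`, `max_span`) are illustrative choices, not the paper's notation.

```python
from itertools import combinations

def subseq_weights(s, n, lam, max_span=None):
    """Map each length-n subsequence u of s to the sum of lam**span over
    its occurrences, where span = last_index - first_index + 1.
    If max_span is set, wider occurrences are dropped (lambda pruning):
    for lam < 1 their weight lam**span is negligible anyway."""
    weights = {}
    for idx in combinations(range(len(s)), n):
        span = idx[-1] - idx[0] + 1
        if max_span is not None and span > max_span:
            continue  # pruned higher-order term
        u = ''.join(s[i] for i in idx)
        weights[u] = weights.get(u, 0.0) + lam ** span
    return weights

def ssk(s, t, n, lam, max_span=None):
    """Naive, unnormalised string subsequence kernel of order n."""
    ws = subseq_weights(s, n, lam, max_span)
    wt = subseq_weights(t, n, lam, max_span)
    return sum(w * wt[u] for u, w in ws.items() if u in wt)

# Classic example: the only shared 2-subsequence of "cat" and "car"
# is "ca", with span 2 in each string, so K2 = lam**2 * lam**2 = lam**4.
print(ssk("cat", "car", 2, 0.5))  # 0.0625 = 0.5**4
```

Pruning makes the kernel value a lower bound on the exact SSK (only nonnegative terms are dropped), e.g. `ssk("abc", "abc", 2, 0.5, max_span=2)` omits the span-3 subsequence "ac" and is therefore smaller than the unpruned value.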

Keywords

Machine learning · Kernel methods · String kernels · Runtime estimation · Memory consumption estimation

Mathematics Subject Classification (2000)

68T05 · 90C59 · 93A30


References

  1. Collins M, Duffy N (2002) Convolution kernels for natural language. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14 (2001). MIT Press, Cambridge, pp 625–632
  2. Cortes C, Haffner P, Mohri M (2004) Rational kernels: theory and algorithms. J Mach Learn Res 5:1035–1062
  3. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines. Cambridge University Press, Cambridge
  4. Frank E, Witten IH (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann/Elsevier, San Francisco/Amsterdam. http://www.cs.waikato.ac.nz/~ml/weka/
  5. Haussler D (1999) Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Baskin School of Engineering, University of California, Santa Cruz
  6. Joachims T (2002) Learning to classify text using support vector machines. Kluwer, Dordrecht
  7. Leslie C, Eskin E, Noble WS (2002) The spectrum kernel: a string kernel for SVM protein classification. In: Proceedings of the Pacific symposium on biocomputing 2002, pp 564–575
  8. Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C (2002) Text classification using string kernels. J Mach Learn Res 2:419–444
  9. Minsky M, Papert SA (1969) Perceptrons: an introduction to computational geometry. MIT Press, Cambridge (expanded edition reprinted 1988)
  10. Pillet V, Zehnder M, Seewald AK, Veuthey A-L, Petrak J (2005) GPSDB: a new database for synonyms expansion of gene and protein names. Bioinformatics 21:1743–1744
  11. Platt J (1998) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods: support vector learning. MIT Press, Cambridge
  12. Rousu J, Shawe-Taylor J (2005) Efficient computation of gapped substring kernels on large alphabets. J Mach Learn Res 6:1323–1344
  13. Seewald AK (2003) Recognizing domain and species from MEDLINE proteomics publications. In: Workshop on data mining and text mining for bioinformatics, 14th European conference on machine learning (ECML-2003), Dubrovnik-Cavtat, Croatia
  14. Seewald AK (2004) Ranking for medical annotation: investigating performance, local search and homonymy recognition. In: Proceedings of the symposium on knowledge exploration in life science informatics (KELSI 2004), Milano, Italy
  15. Seewald AK (2007) An evaluation of naive Bayes variants in content-based learning for spam filtering. Intell Data Anal 11(5):497–524

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  1. Seewald Solutions, Vienna, Austria
  2. Research Studios Austria, Smart Agent Techn., Vienna, Austria
