Abstract
Several algorithms for nonparametric distribution analysis based on Maximum Mean Discrepancy (MMD) measures have recently been introduced. These algorithms operate in a Hilbert space and can be used for nonparametric two-sample tests. Coupled with recent advances in string kernels, they extend the scope of kernel-based methods in text mining. We review these kernel-based two-sample tests with a focus on text mining, propose novel applications, and present an efficient implementation in the kernlab package. We also present an efficient, integrated environment for applying modern machine learning methods to complex text mining problems through the combined use of the R packages tm (for text mining) and kernlab (for kernel-based learning).
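To make the idea of a kernel two-sample test on text concrete, the following is a minimal sketch in Python, not the kernlab implementation. It assumes a normalized k-spectrum kernel (shared substrings of length k) as a simple stand-in for the string kernels mentioned above, uses the biased estimate of the squared MMD, and substitutes a plain permutation test for the test thresholds of the reviewed methods. All function names here are hypothetical and chosen for illustration only.

```python
import math
import random
from collections import Counter


def spectrum_features(text, k=3):
    """Count every length-k substring (k-mer) of text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Normalized k-spectrum kernel: cosine similarity of k-mer counts."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    dot = sum(count * fb[kmer] for kmer, count in fa.items())
    norm = math.sqrt(sum(c * c for c in fa.values())
                     * sum(c * c for c in fb.values()))
    # Strings shorter than k have no k-mers; treat their similarity as 0.
    return dot / norm if norm > 0 else 0.0


def mmd2_biased(xs, ys, kernel=spectrum_kernel):
    """Biased estimate of the squared MMD between two samples of strings."""
    m, n = len(xs), len(ys)
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Share of random relabelings whose MMD^2 is at least the observed one."""
    rng = random.Random(seed)
    observed = mmd2_biased(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2_biased(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy documents, invented for illustration only.
    sample_a = ["the cat sat on the mat", "the cat ate the rat"]
    sample_b = ["stock prices fell sharply", "bond yields rose sharply"]
    print(mmd2_biased(sample_a, sample_b))
    print(permutation_p_value(sample_a, sample_b))
```

A small p-value suggests that the two collections of documents were not drawn from the same distribution. The sketch recomputes every kernel value on each permutation for clarity; precomputing the pooled kernel matrix once would avoid that repeated work.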
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Karatzoglou, A., Feinerer, I., Hornik, K. (2009). Nonparametric Distribution Analysis for Text Mining. In: Fink, A., Lausen, B., Seidel, W., Ultsch, A. (eds) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01044-6_27
DOI: https://doi.org/10.1007/978-3-642-01044-6_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01043-9
Online ISBN: 978-3-642-01044-6
eBook Packages: Mathematics and Statistics (R0)