Nonparametric Distribution Analysis for Text Mining

Karatzoglou, Alexandros; Feinerer, Ingo; Hornik, Kurt

doi:10.1007/978-3-642-01044-6_27

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

A number of new algorithms for nonparametric distribution analysis based on Maximum Mean Discrepancy measures have recently been introduced. These algorithms operate in Hilbert space and can be used for nonparametric two-sample tests. Coupled with recent advances in string kernels, they extend the scope of kernel-based methods in text mining. We review these kernel-based two-sample tests with a focus on text mining, propose novel applications, and present an efficient implementation in the kernlab package. We also present an efficient, integrated environment for applying modern machine learning methods to complex text mining problems through the combined use of the R packages tm (for text mining) and kernlab (for kernel-based learning).
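
As a rough illustration of the kind of kernel-based two-sample test the abstract describes, the following plain-Python sketch computes a biased estimate of the squared Maximum Mean Discrepancy between two sets of documents under a simple spectrum (substring-count) string kernel. It is not the kernlab implementation: the function names, the substring length n=3, and the permutation test used to obtain a p-value are all assumptions made here for illustration.

```python
import random
from collections import Counter


def spectrum_features(text, n=3):
    """Count every contiguous substring of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a, b):
    """Inner product of two substring-count vectors (spectrum kernel)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(count * b[sub] for sub, count in a.items())


def gram_matrix(docs, n=3):
    """Pairwise kernel values for a list of documents."""
    feats = [spectrum_features(d, n) for d in docs]
    return [[spectrum_kernel(a, b) for b in feats] for a in feats]


def mmd2_biased(K, idx_x, idx_y):
    """Biased squared-MMD estimate from a pooled Gram matrix K.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    def mean_block(rows, cols):
        total = sum(K[i][j] for i in rows for j in cols)
        return total / (len(rows) * len(cols))

    return (mean_block(idx_x, idx_x)
            + mean_block(idx_y, idx_y)
            - 2 * mean_block(idx_x, idx_y))


def mmd_permutation_test(x_docs, y_docs, n=3, permutations=200, seed=0):
    """Return (observed MMD^2, permutation p-value).

    Assumption: a permutation test stands in for whatever rejection
    threshold the chapter's own implementation uses.
    """
    K = gram_matrix(x_docs + y_docs, n)
    m = len(x_docs)
    idx = list(range(len(K)))
    observed = mmd2_biased(K, idx[:m], idx[m:])

    rng = random.Random(seed)
    exceed = 0
    for _ in range(permutations):
        rng.shuffle(idx)  # random relabelling of the pooled sample
        if mmd2_biased(K, idx[:m], idx[m:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (permutations + 1)


if __name__ == "__main__":
    # Toy documents, invented for this example only.
    sample_a = ["the cat sat on the mat", "the cat ate the rat"]
    sample_b = ["stock prices rose sharply", "bond yields fell sharply"]
    stat, p_value = mmd_permutation_test(sample_a, sample_b)
    print(stat, p_value)
```

A small p-value suggests the two document sets were not drawn from the same distribution. With only a handful of documents per sample, as in the toy data, the test has very little power, so realistic use needs far larger samples.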

Author information

Authors and Affiliations

INSA de Rouen, LITIS, Rouen, France
Alexandros Karatzoglou

Corresponding author

Correspondence to Alexandros Karatzoglou.

Editor information

Editors and Affiliations

Universität der Bundeswehr, Fak. Wirtschafts-/Sozialwissenschaften, Helmut-Schmidt-Universität, Holstenhofweg 85, Hamburg, 22043, Germany
Andreas Fink
Dept. Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, United Kingdom
Berthold Lausen
Universität der Bundeswehr, Fak. Wirtschafts-/Sozialwissenschaften, Helmut-Schmidt-Universität, Holstenhofweg 85, Hamburg, 22043, Germany
Wilfried Seidel
FB 12 Mathematik und Informatik, Datenbionik AG, Universität Marburg, Hans-Meerwein-Straße, Marburg, 35032, Germany
Alfred Ultsch

About this paper

Cite this paper

Karatzoglou, A., Feinerer, I., Hornik, K. (2009). Nonparametric Distribution Analysis for Text Mining. In: Fink, A., Lausen, B., Seidel, W., Ultsch, A. (eds) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01044-6_27

DOI: https://doi.org/10.1007/978-3-642-01044-6_27
Published: 31 July 2009
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01043-9
Online ISBN: 978-3-642-01044-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)