Abstract
In this investigation, we discuss how to classify Japanese documents very quickly by focusing on the Part-Of-Speech (POS) distribution rather than the word distribution. The investigation makes two main contributions: a linear regression approach that models POS behavior in Japanese documents very well for classification, and a new, efficient classification method based on the Gaussian probability distribution, called the Gaussian classifier.
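To make the idea of classifying on POS ratios with a Gaussian model concrete, here is a minimal sketch. It is not the authors' method: the abstract gives no formulas, so the sketch assumes one independent Gaussian per POS ratio and per class, with equal class priors. The tag set `POS_TAGS`, the labels, and all function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical POS tag set; a real Japanese tagger defines its own inventory.
POS_TAGS = ["noun", "verb", "adjective", "particle"]


def pos_ratios(tags):
    """Convert the POS tags of one non-empty document into relative frequencies."""
    total = len(tags)
    return [tags.count(t) / total for t in POS_TAGS]


def fit(samples):
    """Estimate a per-class mean and variance for each POS ratio.

    samples: list of (feature_vector, label) pairs.
    Assumption: features are independent given the class (diagonal covariance).
    """
    by_label = defaultdict(list)
    for x, y in samples:
        by_label[y].append(x)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-6
            for col, m in zip(columns, means)
        ]
        model[label] = (means, variances)
    return model


def log_likelihood(x, means, variances):
    """Sum of log Gaussian densities over all POS ratios."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        for v, m, var in zip(x, means, variances)
    )


def classify(model, x):
    """Pick the class whose Gaussians give x the highest log-likelihood."""
    return max(model, key=lambda label: log_likelihood(x, *model[label]))


# Toy training data, invented for illustration only.
train = [
    (pos_ratios(["noun", "noun", "noun", "verb", "particle"]), "news"),
    (pos_ratios(["noun", "noun", "verb", "particle", "noun", "noun"]), "news"),
    (pos_ratios(["verb", "adjective", "adjective", "particle", "noun"]), "essay"),
    (pos_ratios(["adjective", "verb", "verb", "particle", "adjective", "noun"]), "essay"),
]
model = fit(train)
predicted = classify(model, pos_ratios(["noun", "noun", "noun", "particle", "verb", "noun"]))
```

Because each document is reduced to a handful of POS ratios rather than a vocabulary-sized word vector, both fitting and classification in this sketch touch only a few numbers per document, which is one plausible reading of why a POS-based approach can be fast.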
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shirai, M., Miura, T. (2012). Document Classification Using POS Distribution. In: Morzy, T., Härder, T., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2012. Lecture Notes in Computer Science, vol 7503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33074-2_26
DOI: https://doi.org/10.1007/978-3-642-33074-2_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33073-5
Online ISBN: 978-3-642-33074-2
eBook Packages: Computer Science, Computer Science (R0)