Abstract
This paper introduces Hidden Markov Models with N-gram observations based on the bound morphemes (affixes) of words, for use in natural language text processing with a focus on syntactic classification. In general, truncating consecutive grams to their affixes decreases the accuracy of the observation but reveals statistically significant dependencies, so a considerably smaller training data set is required. The impact of affix observation on knowledge generalization, and the improved word mapping associated with it, is therefore also described. The focal point of the paper is the evaluation of HMMs in syntactic analysis of English and Polish, based on the Penn and Składnica treebanks. In total, 10 HMMs differing in the structure of their observations are compared. The experimental results show the advantages of particular configurations.
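The effect of truncating grams to affixes can be illustrated with a minimal sketch. It is not the paper's implementation: the fixed-length suffix rule, the functions `affix` and `ngram_observations`, the parameter `k`, and the toy sentence are all assumptions made for illustration. The sketch only shows how affix-based bigram observations collapse distinct word bigrams into fewer, more frequent symbols.

```python
from collections import Counter


def affix(word, k=3):
    """Return the last k characters of the lowercased word.

    Hypothetical affix rule: a fixed-length suffix stands in for
    whatever bound-morpheme extraction the paper actually uses.
    """
    w = word.lower()
    return w[-k:] if len(w) > k else w


def ngram_observations(tokens, n=2, k=3):
    """Build n-gram observations from the affixes of consecutive words."""
    affixes = [affix(t, k) for t in tokens]
    return [tuple(affixes[i:i + n]) for i in range(len(affixes) - n + 1)]


# Toy sentence (assumed data, not taken from either treebank).
tokens = "the walking dog was barking and the talking man was working".split()

# Observation symbols built from full words versus from affixes.
word_bigrams = Counter(zip(tokens, tokens[1:]))
affix_bigrams = Counter(ngram_observations(tokens, n=2, k=3))

print(len(word_bigrams), "distinct word bigrams")
print(len(affix_bigrams), "distinct affix bigrams")
```

On this toy input the affix bigrams form a smaller symbol set with repeated counts, which is the kind of generalization the abstract attributes to affix observation; the real models would estimate HMM emission probabilities over such symbols.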
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Pietras, M. (2017). Hidden Markov Models with Affix Based Observation in the Field of Syntactic Analysis. In: Kobayashi, Sy., Piegat, A., Pejaś, J., El Fray, I., Kacprzyk, J. (eds) Hard and Soft Computing for Artificial Intelligence, Multimedia and Security. ACS 2016. Advances in Intelligent Systems and Computing, vol 534. Springer, Cham. https://doi.org/10.1007/978-3-319-48429-7_2
DOI: https://doi.org/10.1007/978-3-319-48429-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48428-0
Online ISBN: 978-3-319-48429-7
eBook Packages: Engineering, Engineering (R0)