Abstract
Syntactically annotated data, such as treebanks, are used to train statistical parsers. A central concern in developing statistical parsers is their sensitivity to the training data. Because data sparsity is the biggest challenge in data-oriented analyses, parsers perform poorly when they are trained on a small data set or when the genres of the training and test data differ. In this paper, we propose a word-clustering approach based on the Brown algorithm to overcome these problems. The proposed class-based model yields a lexicon that is coarser than the word level. In addition, we propose an extension to the clustering approach in which the POS tags of the words are also taken into consideration during clustering. We show that adding this information improves clustering performance, especially for homographs. Standard word clustering treats homographs as a single item, whereas the proposed extended model treats them as distinct, so that they can be assigned to different clusters. The experimental results show that the class-based approach outperforms word-based parsing in general. Moreover, we show that the proposed extension of class-based parsing is superior to the model that uses only words for clustering.
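The difference between word-only clustering and the POS-extended variant can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the `|` separator, the English example sentence, and the toy cluster identifiers are all hypothetical, and the Brown clustering step itself is assumed to be run by an external tool on the prepared tokens.

```python
from typing import Dict, List, Tuple

# A sentence as (word, POS tag) pairs; the tag names are illustrative only.
TaggedSentence = List[Tuple[str, str]]


def to_clustering_tokens(sentence: TaggedSentence,
                         use_pos: bool = True,
                         sep: str = "|") -> List[str]:
    """Build the tokens handed to a clustering tool.

    With use_pos=False, homographs collapse to one token (word-only model).
    With use_pos=True, each word is joined with its POS tag, so homographs
    with different tags become distinct tokens (assumed encoding).
    """
    return [f"{word}{sep}{tag}" if use_pos else word for word, tag in sentence]


def replace_with_clusters(tokens: List[str],
                          clusters: Dict[str, str],
                          unknown: str = "UNK") -> List[str]:
    """Map each token to its cluster ID; unseen tokens get a placeholder."""
    return [clusters.get(token, unknown) for token in tokens]


sentence: TaggedSentence = [
    ("they", "PRON"), ("can", "AUX"), ("open", "VERB"),
    ("the", "DET"), ("can", "NOUN"),
]

word_tokens = to_clustering_tokens(sentence, use_pos=False)
pos_tokens = to_clustering_tokens(sentence, use_pos=True)

# Hypothetical cluster IDs, standing in for the output of a clustering run.
toy_clusters = {"can|AUX": "0110", "can|NOUN": "1011"}
class_sequence = replace_with_clusters(pos_tokens, toy_clusters)
```

In the word-only tokens, both occurrences of "can" are identical and must share a cluster; in the POS-extended tokens they differ, so a clustering tool is free to place them in different clusters before the parser is trained on the class sequence.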
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ghayoomi, M. (2012). Word Clustering for Persian Statistical Parsing. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_13
DOI: https://doi.org/10.1007/978-3-642-33983-7_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33982-0
Online ISBN: 978-3-642-33983-7
eBook Packages: Computer Science, Computer Science (R0)