Word Clustering for Persian Statistical Parsing

  • Masood Ghayoomi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7614)


Syntactically annotated data such as treebanks are used to train statistical parsers. A key issue in developing statistical parsers is their sensitivity to the training data. Since data sparsity is the biggest challenge in data-oriented analyses, parsers perform poorly when they are trained on a small data set or when the genres of the training and test data differ. In this paper, we propose a word-clustering approach using the Brown algorithm to overcome these problems. The proposed class-based model yields a coarser level of lexical representation than the words themselves. In addition, we propose an extension to the clustering approach in which the POS tags of the words are also taken into consideration during clustering. We show that adding this information improves clustering performance, especially for homographs. In standard word clustering, homographs are treated as the same item, whereas the proposed extended model treats them as distinct and assigns them to different clusters. The experimental results show that class-based parsing outperforms word-based parsing in general. Moreover, we show that the proposed extension of class-based parsing is superior to the model that uses only words for clustering.
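
The idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it is a naive greedy variant of Brown-style clustering that merges classes so as to keep the average mutual information between adjacent classes as high as possible, and it is only practical for toy vocabularies. The function names (`attach_pos`, `avg_mutual_information`, `brown_cluster`), the `word_POS` token format, and the English toy sentences with Penn-style tags are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log


def attach_pos(tagged_sentences):
    """Turn (word, POS) pairs into 'word_POS' tokens so homographs differ."""
    return [[f"{w}_{t}" for w, t in sent] for sent in tagged_sentences]


def avg_mutual_information(sentences, assign):
    """Average mutual information between adjacent classes."""
    bigrams = Counter()
    for sent in sentences:
        classes = [assign[tok] for tok in sent]
        bigrams.update(zip(classes, classes[1:]))
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    left, right = Counter(), Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n
    # p(c1,c2) * log(p(c1,c2) / (p(c1) * p(c2))), written with raw counts
    return sum(
        (n / total) * log(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in bigrams.items()
    )


def brown_cluster(sentences, k):
    """Start with one class per token type and greedily merge down to k classes."""
    if k < 1:
        raise ValueError("k must be at least 1")
    vocab = sorted({tok for sent in sentences for tok in sent})
    assign = {tok: i for i, tok in enumerate(vocab)}
    while len(set(assign.values())) > k:
        best = None
        for a, b in combinations(sorted(set(assign.values())), 2):
            # Tentatively merge class b into class a and score the result.
            trial = {tok: (a if c == b else c) for tok, c in assign.items()}
            score = avg_mutual_information(sentences, trial)
            if best is None or score > best[0]:
                best = (score, trial)
        assign = best[1]
    return assign


# Hypothetical toy data: "can" is a homograph (modal verb vs. noun).
tagged = [
    [("she", "PRP"), ("can", "MD"), ("open", "VB"), ("the", "DT"), ("can", "NN")],
    [("he", "PRP"), ("can", "MD"), ("close", "VB"), ("the", "DT"), ("box", "NN")],
]
tokens = attach_pos(tagged)          # POS-extended input: can_MD and can_NN are separate types
clusters = brown_cluster(tokens, 4)  # token type -> class id
class_sentences = [[clusters[t] for t in sent] for sent in tokens]
```

With plain words as input, both uses of "can" would be a single vocabulary item and would necessarily share one class. With the `word_POS` tokens they are separate items, so the clustering is free to place them in different classes. The `class_sentences` list shows the class-based view of the data, in which each word is replaced by its class id.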


Statistical Parsing, Word Clustering, the Persian Language





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Masood Ghayoomi, German Grammar Group, Freie Universität Berlin, Germany
