Lessons from building a Persian written corpus: Peykare
This paper discusses lessons learned while building ‘Peykare’, a written language resource for contemporary Persian. After defining five linguistic varieties and 24 registers based on them, we collected texts for Peykare to support linguistic analysis, including the study of cross-register differences. For the tokenization of Persian, we propose a descriptive generalization to normalize the orthographic variations found in texts. To annotate Peykare, we follow the EAGLES guidelines, which yield a hierarchy in the part-of-speech tags, and we apply a semi-automatic approach to annotation. We also give special attention to the Ezafe construction and to homographs, both of which are important in Persian text analysis.
Keywords: Contemporary Persian, Corpus, EAGLES-based tagset, Ezafe construction, Homographs
This project was funded by the Higher Council for Informatics of Iran and the University of Tehran under the contract number 190/3554. Masood Ghayoomi was funded by the German research council DFG under the contract number MU 2822/3-1. Our special gratitude also goes to Dr. Ali Darzi at the University of Tehran, who cooperated with us in the project, and to the anonymous reviewers for their helpful comments. However, the responsibility for the content of this study lies with the authors alone.