This paper addresses some of the issues learned during the course of building a written language resource, called ‘Peykare’, for the contemporary Persian. After defining five linguistic varieties and 24 different registers based on these linguistic varieties, we collected texts for Peykare to do a linguistic analysis, including cross-register differences. For tokenization of Persian, we propose a descriptive generalization to normalize orthographic variations existing in texts. To annotate Peykare, we use EAGLES guidelines which result to have a hierarchy in the part-of-speech tags. To this aim, we apply a semi-automatic approach for the annotation methodology. In the paper, we also give a special attention to the Ezafe construction and homographs which are important in Persian text analyses.
This is a preview of subscription content,to check access.
Access this article
Al-Sulaiti, L., & Atwell, E. (2006). The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics, 11(2), 135–171.
Assi, M., & Abdolhosseini, M. H. (2000). Grammatical tagging of a Persian corpus. International Journal of Corpus Linguistics, 5(1), 69–81.
Atkins, S., Clear, J., & Ostler, N. (1992). Corpus design criteria. Literary and Linguistic Computing, 7(1), 1–16.
Biber, D. (1992). Representativeness in corpus design. In G. Sampson & D. McCarthy (Eds.), Corpus linguistics: Readings in a widening discipline (pp. 174–197). New York, USA: Continuum.
Biber, D. (1993). Using register-diversified corpora for general language studies. Computational Linguistics, 19(2), 221–241.
Bijankhan, M. et al. (1994). Farsi spoken language database: FARSDAT. In Proceedings of the 5th international conference on speech sciences and technology (ICSST), Perth (Vol. 2, pp. 826–829).
Bijankhan, M. et al. (2003). TFARSDAT: Telephone Farsi spoken language database. EuroSpeech, Geneva (3), pp. 1525–1528.
Bijankhan, M. et al. (2004). The large Persian speech database. In Proceedings of the 1st workshop on Persian language and computer, the University of Tehran, Tehran, Iran (pp. 149–150).
Buckwalter, T. (2005). Issues in Arabic orthography and morphology analysis. In Proceedings of the workshop on computational approaches to arabic script-based languages in conjunction with COLING 2004, Switzerland.
Cloeren, J. (1999). Tagsets. In H. V. Halteren (Ed.), Syntactic wordclass tagging. Dordrecht, The Netherlands: Kluwer.
Douglas, F. M. (2003). The Scottish corpus of texts and speech: Problems of corpus design. Literary and Linguistic Computing, 18(1), 23–37.
Ghayoomi, M., & Momtazi, S. (2009). Challenges in developing Persian corpora from online resources. In Proceedingss of IEEE international conference on Asian language processing, Singapore.
Ghayoomi, M., Momtazi, S., & Bijankhan, M. (2010). A study of corpus development for Persian. International Journal on Asian Language Processing, 20(1), 17–33.
Ghomeshi, J. (1996). Projection and inflection: A study of persian phrase structure. Ph.D. thesis, University of Toronto, Toronto, ON.
Hajič, J. (2000). Morphological tagging: Data vs. dictionaries. In Proceedings of the 6th applied natural language processing conference, Washington (pp. 94–101).
Hearst, M. A. (1991). Noun homograph disambiguation using local context in large text corpora. In Proceedings of the 7th annual conference of the University of Waterloo, Center for the new OED and text research, Oxford.
Hodge, C. T. (1957). Some aspects of Persian style. Language, 33(3) Part 1, 355–369.
Hudson, R. (1994). About 37% word-tokens are nouns. Language, 70(2), 331–339.
Hussain, S., & Gul, S. (2005). Road map for localization. Lahore, Pakistan: Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences.
Kawata, Y. (2001). Towards a reference tagset for Japanese. In Proceedings of the 6th natural language processing Pacific rim symposium post-conference workshop, Tokyo (pp. 55–62).
Khoja, S., Garside, R., & Knowles, G. (2001). A tagset for the morpho-syntactic tagging of Arabic. Lancaster University, Computing Department. http://archimedes.fas.harvard.edu/mdh/arabic/CL2001.pdf.
Kralik, J., & Šulc, M. (2005). The representativeness of Czeck corpora. International Journal of Corpus Linguistics, 10(3), 357–366.
Kučera, K. (2002). The Czech national corpus: Principles, design, and results. Literary and Linguistic Computing, 17(2), 245–247.
Leech, G. (2002). The importance of reference corpora. Donostia, 2002-10-24/25. www.corpus4u.org/upload/forum/2005060301260076.pdf.
Leech, G., & Wilson, A. (1999). Standards for tagsets. In H. V. Halteren (Ed.), Syntactic wordclass tagging (pp. 55–81). Dordrecht, The Netherlands: Kluwer.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge: The MIT press.
Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of english: The penn treebank. http://citeseer.comp.nus.edu.sg/587575.html.
Megerdoomian, K. (2000). Persian computational morphology: A unification-based approach. NMSU, CRL, Memoranda in Computer and Cognitive Science (MCCS-00-320).
Mosavi-Miangah, T. (2006). Automatic lemmatization of Persian words: Project report. Journal of Quantitative Linguistics, 13(1), 1–15.
Muthusamy, Y. K., Cole, R. A., & Oshika, B. T. (1992). The OGI multi-language telephone Speech Corpus. In Proceedings of the 2nd international conference on spoken language processing (ICSLP), Banff (pp. 895–898).
Samvelian, P. (2007). A (phrasal) affix analysis of the Persian Ezafe. Journal of Linguistics, 43, 605–645.
Sheykhzadegan, J., & Bijankhan, M. (2006). The speech databases of Persian language. In Proceedings of the 2nd workshop on Persian language and computing, the University of Tehran, Tehran, Iran (pp. 247–261).
Sinclair, J. (1987). Corpus creation. In G. Sampson and D. McCarthy (Eds.), Corpus linguistics: Readings in a widening discipline, 2004 (pp. 78–84). New York: Continuum.
Voutilainen, A. (1999). A short history of tagging. In H. V. Halteren (Ed.), Syntactic wordclass tagging (pp. 9–19). Dordrecht, The Netherlands: Kluwer.
This project was funded by the Higher Council for Informatics of Iran and the University of Tehran under the contract number 190/3554. Masood Ghayoomi was funded by the German research council DFG under the contract number MU 2822/3-1. Our special gratitude also goes to Dr. Ali Darzi at the University of Tehran who cooperated with us in the project and the anonymous reviewers for their helpful comments. However, the responsibility for the content of this study lies with the authors alone.
About this article
Cite this article
Bijankhan, M., Sheykhzadegan, J., Bahrani, M. et al. Lessons from building a Persian written corpus: Peykare. Lang Resources & Evaluation 45, 143–164 (2011). https://doi.org/10.1007/s10579-010-9132-x