Language Resources and Evaluation, Volume 45, Issue 2, pp 143–164

Lessons from building a Persian written corpus: Peykare

  • Mahmood Bijankhan
  • Javad Sheykhzadegan
  • Mohammad Bahrani
  • Masood Ghayoomi
Original Paper


This paper presents lessons learned while building ‘Peykare’, a written language resource for contemporary Persian. After defining five linguistic varieties and 24 registers based on them, we collected texts for Peykare to support linguistic analysis, including the study of cross-register differences. For the tokenization of Persian, we propose a descriptive generalization that normalizes the orthographic variation found in texts. To annotate Peykare, we follow the EAGLES guidelines, which yield a hierarchical part-of-speech tagset, and we annotate the corpus with a semi-automatic approach. We also pay special attention to the Ezafe construction and to homographs, both of which are important in Persian text analysis.


Contemporary Persian, Corpus, EAGLES-based tagset, Ezafe construction, Homographs



This project was funded by the Higher Council for Informatics of Iran and the University of Tehran under the contract number 190/3554. Masood Ghayoomi was funded by the German research council DFG under the contract number MU 2822/3-1. Our special gratitude also goes to Dr. Ali Darzi at the University of Tehran who cooperated with us in the project and the anonymous reviewers for their helpful comments. However, the responsibility for the content of this study lies with the authors alone.



Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  • Mahmood Bijankhan (1)
  • Javad Sheykhzadegan (2)
  • Mohammad Bahrani (3)
  • Masood Ghayoomi (4)
  1. Department of Linguistics, The University of Tehran, Tehran, Iran
  2. Research Center for Intelligent Signal Processing, Tehran, Iran
  3. Computer Engineering Department, Sharif University of Technology, Tehran, Iran
  4. German Grammar Group, Freie Universität Berlin, Berlin, Germany
