Lessons from building a Persian written corpus: Peykare

  • Original Paper
  • Language Resources and Evaluation

Abstract

This paper presents lessons learned while building ‘Peykare’, a written language resource for contemporary Persian. After defining five linguistic varieties and 24 registers based on these varieties, we collected texts for Peykare to support linguistic analysis, including the study of cross-register differences. For the tokenization of Persian, we propose a descriptive generalization to normalize the orthographic variations found in texts. To annotate Peykare, we follow the EAGLES guidelines, which results in a hierarchy of part-of-speech tags, and we apply a semi-automatic approach to annotation. We also pay special attention to the Ezafe construction and to homographs, both of which are important in Persian text analysis.

Acknowledgments

This project was funded by the Higher Council for Informatics of Iran and the University of Tehran under the contract number 190/3554. Masood Ghayoomi was funded by the German research council DFG under the contract number MU 2822/3-1. Our special gratitude also goes to Dr. Ali Darzi at the University of Tehran who cooperated with us in the project and the anonymous reviewers for their helpful comments. However, the responsibility for the content of this study lies with the authors alone.

Author information

Corresponding author

Correspondence to Mahmood Bijankhan.

About this article

Cite this article

Bijankhan, M., Sheykhzadegan, J., Bahrani, M. et al. Lessons from building a Persian written corpus: Peykare. Lang Resources & Evaluation 45, 143–164 (2011). https://doi.org/10.1007/s10579-010-9132-x

  • DOI: https://doi.org/10.1007/s10579-010-9132-x
