Creating a Persian-English Comparable Corpus

  • Conference paper
Multilingual and Multimodal Information Access Evaluation (CLEF 2010)

Abstract

Multilingual corpora are valuable resources for cross-language information retrieval and are available for many language pairs. However, Persian lacks rich multilingual resources, owing to some special features of the language and to difficulties in constructing such corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian. We use the similarity of document topics and publication dates to align the documents in these sets. We tried several alternatives for constructing the comparable corpus and assessed the quality of the resulting corpora using different criteria. Evaluation results show that the aligned documents are of high quality, and using the Persian-English comparable corpus to extract translation knowledge seems promising.
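The alignment idea in the abstract (pair documents whose topics are similar and whose publication dates are close) can be illustrated with a minimal sketch. The abstract does not give the paper's similarity measure, date window, or thresholds, so everything below is an assumption made for illustration: the `Doc` type, the `align` function, cosine similarity over term counts, the `max_days` and `min_sim` parameters, and the premise that terms from both languages have already been mapped into one shared vocabulary (for example, by dictionary-based translation).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass
class Doc:
    doc_id: str
    published: date
    terms: list[str]  # assumption: terms already mapped into a shared vocabulary


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align(source, target, max_days=1, min_sim=0.3):
    """Pair each source doc with its most similar target doc published
    within max_days of it, keeping only pairs scoring at least min_sim.
    Both thresholds are hypothetical values, not taken from the paper."""
    pairs = []
    for s in source:
        s_vec = Counter(s.terms)
        best = None
        for t in target:
            # Date filter: skip candidates published too far apart.
            if abs((s.published - t.published).days) > max_days:
                continue
            sim = cosine(s_vec, Counter(t.terms))
            if sim >= min_sim and (best is None or sim > best[1]):
                best = (t.doc_id, sim)
        if best is not None:
            pairs.append((s.doc_id, best[0], best[1]))
    return pairs


# Toy data, invented purely for this example.
english = [
    Doc("en1", date(2005, 3, 1), ["election", "vote", "president"]),
    Doc("en2", date(2005, 3, 1), ["football", "match", "goal"]),
]
persian = [
    Doc("fa1", date(2005, 3, 2), ["election", "president", "candidate"]),
    Doc("fa2", date(2005, 6, 1), ["football", "match", "goal"]),
]
print(align(english, persian))
```

In this toy run only `en1` is paired (with `fa1`): `fa2` matches `en2` on topic but falls outside the date window, which shows how the two criteria act together. A real pipeline would need proper Persian text normalization and a retrieval model suited to the collections, which this sketch leaves out.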

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Baradaran Hashemi, H., Shakery, A., Faili, H. (2010). Creating a Persian-English Comparable Corpus. In: Agosti, M., Ferro, N., Peters, C., de Rijke, M., Smeaton, A. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, vol 6360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15998-5_5

  • DOI: https://doi.org/10.1007/978-3-642-15998-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15997-8

  • Online ISBN: 978-3-642-15998-5

  • eBook Packages: Computer Science, Computer Science (R0)