Creating a Persian-English Comparable Corpus

Baradaran Hashemi, Homa; Shakery, Azadeh; Faili, Heshaam

doi:10.1007/978-3-642-15998-5_5

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6360)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

Multilingual corpora are valuable resources for cross-language information retrieval and are available for many language pairs. However, Persian lacks rich multilingual resources, owing to some of its special features and the difficulty of constructing such corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian. We align the documents in these collections using the similarity of their topics and their publication dates. We tried several alternatives for constructing the comparable corpus and assessed the quality of each resulting corpus using different criteria. Evaluation results show that the aligned documents are of high quality, and using the Persian-English comparable corpus to extract translation knowledge seems promising.
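
The alignment idea in the abstract (pair documents whose topics are similar and whose publication dates are close) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Doc` type, the `align` function, the cosine scoring, and the `max_days` and `min_score` parameters are all hypothetical, and the sketch assumes that each document's topic terms have already been mapped into a shared vocabulary by some earlier step that the abstract does not describe.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass
class Doc:
    doc_id: str
    published: date
    # Assumption: topic terms are already in a shared vocabulary
    # (for example, Persian keywords translated into English).
    terms: list


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align(source_docs, target_docs, max_days=1, min_score=0.3):
    """Pair each source document with its best-scoring target document.

    A target is a candidate only if it was published within `max_days`
    of the source; a pair is kept only if its score reaches `min_score`.
    Both thresholds are illustrative, not values from the paper.
    """
    pairs = []
    for s in source_docs:
        s_vec = Counter(s.terms)
        best = None
        for t in target_docs:
            # Date filter: skip targets published too far from the source.
            if abs((s.published - t.published).days) > max_days:
                continue
            score = cosine(s_vec, Counter(t.terms))
            if score >= min_score and (best is None or score > best[1]):
                best = (t.doc_id, score)
        if best is not None:
            pairs.append((s.doc_id, best[0], best[1]))
    return pairs
```

Applying the date filter before scoring keeps the number of similarity computations small, and the score threshold controls the trade-off between corpus size and alignment quality.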

Author information

Authors and Affiliations

School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Iran
Homa Baradaran Hashemi, Azadeh Shakery & Heshaam Faili

Editor information

Editors and Affiliations

Department of Information Engineering, University of Padua, Via Gradenigo 6/a, 35131, Padova, Italy
Maristella Agosti
University of Padua, Padua, Italy
Nicola Ferro
ISTI-CNR, Area Ricerca CNR, Via Moruzzi, 1, 56124, Pisa, Italy
Carol Peters
ISLA, University of Amsterdam, Amsterdam, The Netherlands
Maarten de Rijke
Dublin City University, Dublin, Ireland
Alan Smeaton

Cite this paper

Baradaran Hashemi, H., Shakery, A., Faili, H. (2010). Creating a Persian-English Comparable Corpus. In: Agosti, M., Ferro, N., Peters, C., de Rijke, M., Smeaton, A. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, vol 6360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15998-5_5

DOI: https://doi.org/10.1007/978-3-642-15998-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15997-8
Online ISBN: 978-3-642-15998-5
eBook Packages: Computer Science, Computer Science (R0)
