Overviewing Important Aspects of the Last Twenty Years of Research in Comparable Corpora

  • Serge Sharoff
  • Reinhard Rapp
  • Pierre Zweigenbaum
Chapter

Abstract

The beginning of the 1990s marked a radical turn in various NLP applications towards using large collections of texts.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Serge Sharoff (1)
  • Reinhard Rapp (2)
  • Pierre Zweigenbaum (3)
  1. University of Leeds, West Yorkshire, United Kingdom
  2. University of Mainz, Mainz, Germany
  3. LIMSI, CNRS and ERTIM, INALCO, Paris, France