Abstract
The beginning of the 1990s marked a radical turn in various NLP applications towards using large collections of texts.
Keywords
- Comparable Corpora
- Parallel Corpus
- Bilingual Lexicon Extraction
- Statistical Machine Translation (SMT)
- Identifying Word Translations
“That is not said right,” said the Caterpillar. “Not quite right, I’m afraid,” said Alice, timidly: “some of the words have got altered.” Lewis Carroll, Alice’s Adventures in Wonderland.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Sharoff, S., Rapp, R., Zweigenbaum, P. (2013). Overviewing Important Aspects of the Last Twenty Years of Research in Comparable Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_1
DOI: https://doi.org/10.1007/978-3-642-20128-8_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science (R0)