Abstract
Comparable corpora may contain various single-word and multi-word phrases that can be regarded as reciprocal translations, and these can benefit many natural language processing tasks. This chapter describes methods and tools developed within the ACCURAT project that use comparable corpora to (1) identify terms, named entities (NEs), and other lexical units, and (2) cross-lingually map the identified single-word and multi-word phrases to create automatically extracted bilingual dictionaries. These dictionaries can be further used in machine translation, question answering, indexing, and other areas where bilingual dictionaries are useful.
Notes
1. Published as part of TildeNER in the ‘Toolkit for multi-level alignment and information extraction from comparable corpora’, public deliverable of the project ACCURAT, 2011.
2. A detailed list of available workflows is given in deliverable D2.6 of the ACCURAT project.
3. StanfordNER English model from Stanford University: ‘conll.distsim.iob2.crf.ser.gz’, available for download from: http://nlp.stanford.edu/software/crf-faq.shtml (point 11).
4. As reported by Stanford University at: http://nlp.stanford.edu/software/crf-faq.shtml (point 11).
5. If many L2 candidates were correct translations of the L1 lexeme, it would be more reasonable to use mean average precision (MAP).
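The last note mentions mean average precision without defining it. As a minimal sketch, assuming each L1 lexeme comes with a ranked list of L2 candidates and a set of gold translations, MAP could be computed as below. The function names and the toy data are illustrative assumptions and are not taken from the chapter or the ACCURAT tools.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], gold: Set[str]) -> float:
    """Mean of precision@k over the ranks k where a gold translation occurs.

    Assumes the ranked list contains no duplicate candidates.
    """
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
            precision_sum += hits / rank
    # Divide by the number of gold translations so that missed ones lower the score.
    return precision_sum / len(gold)


def mean_average_precision(
    rankings: Dict[str, List[str]], gold: Dict[str, Set[str]]
) -> float:
    """Average the per-lexeme average precision over all L1 lexemes."""
    if not rankings:
        return 0.0
    total = sum(
        average_precision(ranked, gold.get(lexeme, set()))
        for lexeme, ranked in rankings.items()
    )
    return total / len(rankings)


# Hypothetical toy data: placeholder labels, not real dictionary entries.
rankings = {
    "src1": ["t1", "t2", "t3"],  # gold translations found at ranks 1 and 3
    "src2": ["u1", "u2"],        # gold translation found at rank 2
}
gold = {"src1": {"t1", "t3"}, "src2": {"u2"}}

score = mean_average_precision(rankings, gold)
print(round(score, 4))
```

With exactly one gold translation per lexeme, average precision reduces to the reciprocal rank of that translation, so the two measures only differ when several candidates are correct.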
Additional information
Chapter editors: Mārcis Pinnis and Nikola Ljubešić
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Pinnis, M. et al. (2019). Extracting Data from Comparable Corpora. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_4
DOI: https://doi.org/10.1007/978-3-319-99004-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science, Computer Science (R0)