A Two-Level Structure for Compressing Aligned Bitexts
A bitext, or bilingual parallel corpus, consists of two texts, each in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they serve as a source of knowledge for many purposes. In this paper we propose a strategy to compress and use bitexts efficiently, saving not only space but also processing time when exploiting them. Our strategy is based on a two-level structure for the vocabularies and on the use of biwords (pairs of associated words, one from each language) as the basic symbols to be encoded with an ETDC compressor. The resulting compressed bitext needs around 20% of the original space and allows more efficient implementations of the different types of searches and operations that linguistic engineering applications need to perform on bitexts. We discuss and provide results for compression, decompression, different types of searches, and bilingual snippet extraction.
Keywords: Compression Ratio, Machine Translation, Hash Table, Statistical Machine Translation, Parallel Corpus
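The idea of encoding biwords with ETDC can be illustrated with a small sketch. It assumes that ETDC stands for End-Tagged Dense Code, a byte-oriented code in which each byte carries 7 payload bits and the high bit marks the last byte of a codeword, so that more frequent symbols receive shorter codewords. The two-level layout shown here (one word vocabulary per language, plus a biword vocabulary that stores pairs of word identifiers) is only one plausible reading of the abstract, not the paper's actual data structure. The function names, the one-to-one word alignment, and the toy data are all hypothetical.

```python
from collections import Counter


def etdc_encode(rank):
    """Encode a frequency rank (0 = most frequent) as an End-Tagged Dense Code."""
    out = [128 + rank % 128]          # last byte: high bit set (the "end tag")
    rank = rank // 128 - 1
    while rank >= 0:                  # preceding bytes: high bit clear
        out.append(rank % 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def etdc_decode(data):
    """Decode a concatenation of ETDC codewords back into a list of ranks."""
    ranks, acc = [], -1
    for byte in data:
        acc = (acc + 1) * 128 + (byte & 0x7F)
        if byte >= 128:               # end tag reached: one codeword is complete
            ranks.append(acc)
            acc = -1
    return ranks


def compress(biwords):
    """Compress a list of (source_word, target_word) pairs (assumed 1:1 aligned)."""
    # Level 1 (assumption): one plain word vocabulary per language.
    src_vocab = sorted({s for s, _ in biwords})
    tgt_vocab = sorted({t for _, t in biwords})
    src_id = {w: i for i, w in enumerate(src_vocab)}
    tgt_id = {w: i for i, w in enumerate(tgt_vocab)}
    # Level 2 (assumption): biwords stored as id pairs, most frequent first.
    counts = Counter((src_id[s], tgt_id[t]) for s, t in biwords)
    biword_vocab = [pair for pair, _ in counts.most_common()]
    rank = {pair: r for r, pair in enumerate(biword_vocab)}
    stream = b"".join(etdc_encode(rank[(src_id[s], tgt_id[t])]) for s, t in biwords)
    return src_vocab, tgt_vocab, biword_vocab, stream


def decompress(src_vocab, tgt_vocab, biword_vocab, stream):
    """Rebuild the original sequence of biwords from the compressed parts."""
    pairs = (biword_vocab[r] for r in etdc_decode(stream))
    return [(src_vocab[s], tgt_vocab[t]) for s, t in pairs]


# Toy English-Spanish bitext, already aligned word by word (hypothetical data).
toy_bitext = [("the", "la"), ("house", "casa"), ("the", "la"), ("green", "verde")]
compressed_parts = compress(toy_bitext)
assert decompress(*compressed_parts) == toy_bitext
```

Because every codeword ends in a tagged byte, codeword boundaries can be found in the compressed stream without decoding it from the start, which is the kind of property that makes searching directly on compressed text feasible. A real bitext would also need to handle words with no aligned counterpart and many-to-one alignments, which this sketch leaves out.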