Boosting Bitext Compression

  • Joaquín Adiego
  • Miguel A. Martínez-Prieto
  • Javier E. Hoyos-Torío
  • Felipe Sánchez-Martínez
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 90)

Abstract

Bilingual parallel corpora, also known as bitexts, convey the same information in two different languages. When modelling bitexts, one can therefore take advantage of the relation that exists between both texts; the text alignment task makes it possible to establish this relationship. In this paper we propose different approaches that use words and biwords (pairs made of two words, each one from a different text) as symbolic representation units. The properties of these approaches are analysed from a statistical point of view and tested as a preprocessing step to general-purpose compressors. The results obtained suggest interesting conclusions concerning the use of both words and biwords. When the encoded models are used as compression boosters, we improve on the compression ratios of state-of-the-art compressors by up to 6.5 percentage points, while being up to 40% faster.
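
The idea of a biword stream used as a preprocessing step can be illustrated with a minimal Python sketch. It is not the authors' encoder: the function names (`to_biwords`, `to_ids`, `boost`), the alignment format (a list of source-index/target-index pairs), the fixed 2-byte serialisation, and the choice of zlib as the general-purpose compressor are all assumptions made for illustration.

```python
import zlib

def to_biwords(src_words, tgt_words, alignment):
    """Pair each source word with its aligned target word.

    `alignment` is a hypothetical list of (source_index, target_index)
    pairs, at most one per source word. Unaligned words are paired with None.
    """
    src_to_tgt = dict(alignment)
    used_targets = set(src_to_tgt.values())
    biwords = []
    for i, word in enumerate(src_words):
        j = src_to_tgt.get(i)
        biwords.append((word, tgt_words[j] if j is not None else None))
    # Simplification: unaligned target words are appended at the end.
    # A reversible scheme would also have to record target word order.
    biwords.extend((None, w) for j, w in enumerate(tgt_words)
                   if j not in used_targets)
    return biwords

def to_ids(symbols):
    """Replace each distinct symbol by a small integer, in order of first use."""
    vocab = {}
    ids = [vocab.setdefault(sym, len(vocab)) for sym in symbols]
    return vocab, ids

def boost(ids):
    """Serialise the ids as 2-byte integers and hand them to zlib."""
    payload = b"".join(i.to_bytes(2, "big") for i in ids)
    return payload, zlib.compress(payload, 9)

# Toy English-Spanish sentence pair with a hand-written word alignment.
src = "the green house".split()
tgt = "la casa verde".split()
alignment = [(0, 0), (1, 2), (2, 1)]

biwords = to_biwords(src, tgt, alignment)
vocab, ids = to_ids(biwords * 2)   # repeat the sentence to show symbol reuse
payload, compressed = boost(ids)
```

In this sketch each aligned pair such as ("green", "verde") becomes a single symbol, so a translation that recurs in the bitext costs one identifier rather than two words. The toy input is far too small for the compressed size to be meaningful.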

Keywords

Compression Boosting Bitext Compression 

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Joaquín Adiego (1)
  • Miguel A. Martínez-Prieto (1)
  • Javier E. Hoyos-Torío (1)
  • Felipe Sánchez-Martínez (2)
  1. Dpto. de Informática, Universidad de Valladolid, Spain
  2. Dpto. de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, Spain
