A Two-Level Structure for Compressing Aligned Bitexts

  • Joaquín Adiego
  • Nieves R. Brisaboa
  • Miguel A. Martínez-Prieto
  • Felipe Sánchez-Martínez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5721)

Abstract

A bitext, or bilingual parallel corpus, consists of two texts, each in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they serve as a source of knowledge for many purposes. In this paper we propose a strategy to compress and use bitexts efficiently, saving not only space but also processing time when exploiting them. Our strategy is based on a two-level structure for the vocabularies and on the use of biwords (pairs of associated words, one from each language) as the basic symbols to be encoded with an ETDC [2] compressor. The resulting compressed bitext needs around 20% of the original space and allows more efficient implementations of the different types of searches and operations that linguistic engineers need to perform on it. We discuss and provide results for compression, decompression, different types of searches, and bilingual snippet extraction.
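To make the idea of encoding biwords with ETDC more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the bitext has already been aligned into one-to-one word pairs, and it uses a single flat biword vocabulary rather than the two-level vocabulary structure proposed in the paper, whose details are not given in this abstract. The function names (`etdc_encode`, `etdc_decode`, `build_vocabulary`, `compress`, `decompress`) are hypothetical. The byte layout follows End-Tagged Dense Code as described in [2]: symbols are ranked by decreasing frequency, and each codeword is a byte sequence whose last byte has its high bit set.

```python
from collections import Counter


def etdc_encode(rank):
    """Return the ETDC codeword (bytes) for a 0-based frequency rank."""
    # The last byte carries the end tag: its value lies in 128..255.
    out = [128 + rank % 128]
    rank //= 128
    # Earlier bytes lie in 0..127; the "- 1" keeps the code dense,
    # so no codeword of a given length is left unused.
    while rank > 0:
        rank -= 1
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(data):
    """Decode a concatenation of ETDC codewords into a list of ranks."""
    ranks, value = [], 0
    for b in data:
        if b < 128:            # continuation byte
            value = (value + b + 1) * 128
        else:                  # tagged byte closes the codeword
            ranks.append(value + b - 128)
            value = 0
    return ranks


def build_vocabulary(biwords):
    """Sort distinct biwords by decreasing frequency (ties broken lexically)."""
    counts = Counter(biwords)
    return sorted(counts, key=lambda bw: (-counts[bw], bw))


def compress(biwords):
    """Assumption: one flat vocabulary of (source, target) pairs."""
    vocab = build_vocabulary(biwords)
    rank_of = {bw: r for r, bw in enumerate(vocab)}
    payload = b"".join(etdc_encode(rank_of[bw]) for bw in biwords)
    return vocab, payload


def decompress(vocab, payload):
    return [vocab[r] for r in etdc_decode(payload)]


# Toy aligned bitext (English, Spanish), invented for illustration only.
sample = [("the", "la"), ("house", "casa"), ("the", "la"), ("green", "verde")]
vocab, payload = compress(sample)
restored = decompress(vocab, payload)
```

Because every codeword ends in a tagged byte, codeword boundaries can be recognised from any position in the compressed stream, which is the property that makes direct searching over ETDC-compressed text practical.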

References

  1. Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Comm. of ACM 20(10), 762–772 (1977)
  2. Brisaboa, N.R., Fariña, A., Navarro, G., Paramá, J.R.: Lightweight natural language text compression. Information Retrieval 10(1), 1–33 (2007)
  3. Cleary, J.G., Witten, I.H.: Data Compression Using Adaptive Coding and Partial String Matching. IEEE Trans. on Communications COM-32(4), 396–402 (1984)
  4. Conley, E.S., Klein, S.T.: Using alignment for multilingual text compression. Intl. J. of Foundations of Computer Science 19(1), 89–101 (2008)
  5. Heaps, H.S.: Inf. Retrieval - Computational and Theoretical Aspects. Academic Press, London (1978)
  6. Horspool, R.N.: Practical fast searching in strings. Softw. Pract. & Exper. 10, 501–506 (1980)
  7. Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. of Computing 6(2), 323–350 (1977)
  8. Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: Proc. of the 10th Machine Translation Summit, pp. 79–86 (2005), http://www.statmt.org/europarl/
  9. Martínez-Prieto, M.A., Adiego, J., Sánchez-Martínez, F., de la Fuente, P., Carrasco, R.C.: On the use of word alignments to enhance bitext compression. In: Data Compres. Conf., p. 459 (2009)
  10. Melamed, I.D.: Empirical methods for exploiting parallel texts. MIT Press, Cambridge (2001)
  11. Mihalcea, R., Simard, M.: Parallel texts. Natural Language Eng. 11(3), 239–246 (2005)
  12. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comp. Surv. 39(1) (2007)
  13. Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge (2002)
  14. Nevill-Manning, C.G., Bell, T.C.: Compression of parallel texts. Information Processing & Management 28(6), 781–794 (1992)
  15. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comp. Linguistics 29(1), 19–51 (2003)
  16. Shkarin, D.: PPM: One Step to Practicality. In: Data Compres. Conf., pp. 202–211 (2002)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Joaquín Adiego
    • 1
  • Nieves R. Brisaboa
    • 2
  • Miguel A. Martínez-Prieto
    • 1
  • Felipe Sánchez-Martínez
    • 3
  1. Dept. de Informática, Universidad de Valladolid, Spain
  2. Database Lab, Universidade da Coruña, Spain
  3. Dept. de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, Spain
