Bilingual Text Classification

  • Jorge Civera
  • Elsa Cubel
  • Enrique Vidal
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4477)


Bilingual documentation has become common in official institutions and private companies, which makes the categorization of bilingual text a useful tool. This paper proposes several approaches to this bilingual classification task. On the one hand, three finite-state transducer algorithms from the grammatical inference framework are presented. On the other hand, a naive combination of smoothed n-gram models is introduced. To evaluate the bilingual classifiers, two categorized bilingual corpora of different complexity were considered. Experiments on a limited-domain task show that all the models obtain similar results. On a more open-domain task, however, the results indicate the superiority of the naive approach.





Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Jorge Civera (1)
  • Elsa Cubel (1)
  • Enrique Vidal (1)
  1. Instituto Tecnológico de Informática, Universidad Politécnica de Valencia
