Using Information from the Target Language to Improve Crosslingual Text Classification

  • Gabriela Ramírez-de-la-Rosa
  • Manuel Montes-y-Gómez
  • Luis Villaseñor-Pineda
  • David Pinto-Avendaño
  • Thamar Solorio
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6233)


Crosslingual text classification exploits labeled documents in a source language to classify documents in a different target language. Beyond the evident translation problem, the task is complicated by cultural discrepancies between the two languages, which show up as different topic distributions. These discrepancies make the resulting classifier unreliable for the categorization task. To tackle this problem, we propose improving classification performance by using information embedded in the target dataset itself. The central idea of the proposed approach is that similar documents should belong to the same category. It therefore classifies each document by considering not only its own content but also the categories assigned to other similar documents from the same target dataset. Experimental results on three different languages support the appropriateness of the proposed approach.
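The central idea can be illustrated with a minimal sketch. The abstract does not give the exact combination rule, so everything specific here is an assumption made for illustration: the function name `reclassify`, the neighborhood size `k`, the mixing weight `lam`, the use of cosine similarity, and the toy data. The initial per-category scores stand in for the output of any classifier trained on the source language.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def reclassify(docs, scores, k=2, lam=0.5):
    """Re-label target documents using their neighbors (hypothetical rule).

    docs   : list of dicts mapping term -> weight (target-language documents)
    scores : list of dicts mapping category -> initial classifier score
    k      : number of most similar target documents to consult (assumed)
    lam    : weight of the document's own score versus its neighbors (assumed)
    """
    labels = []
    for i, doc in enumerate(docs):
        # Rank the other target documents by similarity and keep the top k.
        sims = sorted(
            ((cosine(doc, other), j) for j, other in enumerate(docs) if j != i),
            reverse=True,
        )[:k]
        total = sum(s for s, _ in sims)
        combined = {}
        for cat, own in scores[i].items():
            # Similarity-weighted average of the neighbors' scores;
            # fall back to the document's own score if it has no similar neighbor.
            neigh = sum(s * scores[j].get(cat, 0.0) for s, j in sims) / total if total else own
            combined[cat] = lam * own + (1 - lam) * neigh
        labels.append(max(combined, key=combined.get))
    return labels

# Toy target dataset: three sports-like and two politics-like documents.
docs = [
    {"goal": 1, "match": 1, "team": 1},
    {"goal": 1, "team": 1, "league": 1},
    {"goal": 1, "match": 1, "league": 1},
    {"vote": 1, "senate": 1, "law": 1},
    {"vote": 1, "law": 1, "election": 1},
]
# Initial scores; the third document is narrowly (and wrongly) scored as politics.
scores = [
    {"sports": 0.9, "politics": 0.1},
    {"sports": 0.8, "politics": 0.2},
    {"sports": 0.45, "politics": 0.55},
    {"sports": 0.1, "politics": 0.9},
    {"sports": 0.2, "politics": 0.8},
]

labels = reclassify(docs, scores)
print(labels)
```

In this toy example the third document, which the initial scores narrowly assign to politics, is moved to sports because its two most similar target documents are confidently scored as sports. Setting `lam=1.0` ignores the neighbors and reproduces the initial decisions.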


Support Vector Machine; Target Language; Text Categorization; Source Language; Similar Document
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Gabriela Ramírez-de-la-Rosa (1)
  • Manuel Montes-y-Gómez (1)
  • Luis Villaseñor-Pineda (1)
  • David Pinto-Avendaño (2)
  • Thamar Solorio (3)
  1. Laboratory of Language Technologies, National Institute for Astrophysics, Optics and Electronics
  2. Faculty of Computer Science, Autonomous University of Puebla
  3. Department of Computer and Information Sciences, University of Alabama at Birmingham
