Experiments with Google News for Filtering Newswire Articles

  • Arturo Montejo-Ráez
  • José M. Perea-Ortega
  • Manuel Carlos Díaz-Galiano
  • L. Alfonso Ureña-López
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6241)

Abstract

This paper describes an approach that uses Google News as a source of information to generate a learning corpus for an information filtering task. The INFILE (INformation FILtering Evaluation) track of the CLEF (Cross-Language Evaluation Forum) 2009 campaign serves as the framework. Because information filtering can be treated as a document classification task, a supervised learning scheme is followed. Two learning corpora were tested: one that uses the text of the topics as training data for a classifier, and another whose training data were generated from Google News pages retrieved by using the topic keywords as queries. The results show that learning data generated from Google News does not improve on the results obtained by using only the topic descriptions as the learning corpus.
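The idea of treating filtering as classification, with the topic text as the only learning data, can be sketched roughly as follows. This is a minimal illustration and not the authors' system: the abstract does not name the classifier, so a plain cosine-similarity profile per topic is swapped in here, and the topic texts, the identifiers `T1` and `T2`, the function names, and the threshold value are all assumptions made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text):
    """Build a sparse term-frequency vector (token -> count)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(topics):
    """Build one profile per topic from that topic's learning text."""
    return {topic_id: vectorize(text) for topic_id, text in topics.items()}


def filter_document(profiles, document, threshold=0.2):
    """Return the ids of topics whose profile is similar enough to the document.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    doc_vector = vectorize(document)
    return sorted(
        topic_id
        for topic_id, profile in profiles.items()
        if cosine(profile, doc_vector) >= threshold
    )


# Hypothetical topics, invented for illustration only.
TOPICS = {
    "T1": "doping in sport: athletes banned substances testing cycling",
    "T2": "renewable energy: wind turbines solar panels electricity grid",
}

profiles = train(TOPICS)
```

Calling `filter_document(profiles, some_article_text)` returns the topics an incoming article would be routed to, or an empty list when the article matches none. In the second setting the abstract describes, the values in `TOPICS` would instead hold text gathered from Google News pages for each topic's keywords; that retrieval step is not shown here.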



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Arturo Montejo-Ráez (1)
  • José M. Perea-Ortega (1)
  • Manuel Carlos Díaz-Galiano (1)
  • L. Alfonso Ureña-López (1)

  1. SINAI Research Group, Computer Science Department, University of Jaén, Jaén, Spain
