Extension of the Rocchio Classification Method to Multi-modal Categorization of Documents in Social Media

  • Amin Mantrach
  • Jean-Michel Renders
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7523)

Abstract

Most approaches to multi-view categorization use early fusion, late fusion or co-training strategies. We propose a novel classification method that efficiently captures the interactions across the different modes. The method is a multi-modal extension of the Rocchio classification algorithm, which is very popular in the Information Retrieval community. The extension consists of simultaneously maintaining different “centroid” representations for each class, in particular “cross-media” centroids that correspond to pairs of modes. To classify a new data point, different scores are derived from similarity measures between the new data point and these centroids; a global classification score is then obtained by suitably aggregating the individual scores. This method outperforms the multi-view logistic regression approach (using either the early fusion or the late fusion strategy) on a social media corpus, namely the ENRON email collection, on two very different categorization tasks (folder classification and recipient prediction).
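The abstract does not define the cross-media centroids or the aggregation rule, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that a cross-media centroid for the mode pair (a → b) of a class is the mode-b average of all training documents, weighted by their mode-a cosine similarity to that class's mode-a centroid, and that the global score is a weighted sum of cosine similarities (uniform weights by default). The function names (`fit`, `score`), the `alphas` parameter, the mode names and the toy emails are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_mean(vectors, weights):
    """Weighted average of sparse vectors; empty dict if all weights are zero."""
    total_weight = sum(weights)
    if total_weight == 0:
        return {}
    acc = defaultdict(float)
    for vec, w in zip(vectors, weights):
        for k, x in vec.items():
            acc[k] += w * x
    return {k: x / total_weight for k, x in acc.items()}


def fit(train, modes):
    """Build, for each class, one centroid per ordered pair of modes."""
    model = {}
    for label in {lab for _, lab in train}:
        docs = [views for views, lab in train if lab == label]
        cents = {}
        # Mono-modal centroids: plain Rocchio-style class means.
        for m in modes:
            cents[(m, m)] = weighted_mean([d[m] for d in docs], [1.0] * len(docs))
        # ASSUMPTION: cross-media centroid (a -> b) is the mode-b average of all
        # training documents, weighted by mode-a similarity to the class centroid.
        for a in modes:
            for b in modes:
                if a != b:
                    weights = [cosine(v[a], cents[(a, a)]) for v, _ in train]
                    cents[(a, b)] = weighted_mean([v[b] for v, _ in train], weights)
        model[label] = cents
    return model


def score(model, views, alphas=None):
    """Aggregate per-centroid similarities into one global score per class."""
    scores = {}
    for label, cents in model.items():
        # ASSUMPTION: weighted sum; uniform weights unless `alphas` is given.
        scores[label] = sum(
            (1.0 if alphas is None else alphas[pair]) * cosine(views[pair[1]], cent)
            for pair, cent in cents.items()
        )
    return scores


# Toy emails with two modes: bag of words ("text") and participants ("people").
modes = ["text", "people"]
train = [
    ({"text": {"budget": 1, "meeting": 1}, "people": {"alice": 1}}, "work"),
    ({"text": {"budget": 1, "report": 1}, "people": {"alice": 1, "carol": 1}}, "work"),
    ({"text": {"party": 1, "friday": 1}, "people": {"bob": 1}}, "social"),
    ({"text": {"party": 1, "dinner": 1}, "people": {"bob": 1, "dave": 1}}, "social"),
]
model = fit(train, modes)
query = {"text": {"budget": 1, "forecast": 1}, "people": {"alice": 1}}
scores = score(model, query)
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

In this sketch the per-pair weights are fixed; a real system would more plausibly tune them on held-out data, which the abstract leaves unspecified ("suitably aggregating").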

Keywords

Mean Average Precision; Late Fusion; Early Fusion; ENRON Corpus; Late Fusion Strategy
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Amin Mantrach (1)
  • Jean-Michel Renders (1)
  1. Yahoo! Research Barcelona; Xerox Research Centre Europe, France
