Text Categorization Using an Ensemble Classifier Based on a Mean Co-association Matrix

  • Luís Moreira-Matias
  • João Mendes-Moreira
  • João Gama
  • Pavel Brazdil
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7376)


Text Categorization (TC) has attracted the attention of the research community over the last decade. Algorithms such as Support Vector Machines, Naïve Bayes, and k-Nearest Neighbors have been used with good performance, as confirmed by several comparative studies. Recently, several ensemble classifiers have also been introduced in TC. However, many of them can only provide a category for a given new sample. In this paper, we instead propose a methodology, MECAC, to build an ensemble of classifiers that has two advantages over other ensemble methods: 1) it can be run using parallel computing, saving processing time, and 2) it can extract important statistics from the obtained clusters. It uses the mean co-association matrix to solve binary TC problems. Our experiments revealed that our framework performed, on average, 2.04% better than the best individual classifier on the tested datasets. These results were statistically validated at a significance level of 0.05 using the Friedman test.
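The abstract does not spell out how MECAC builds its matrix, so the following is only a minimal sketch of a mean co-association matrix in the usual consensus-clustering sense: entry (i, j) is the fraction of base classifiers that assign samples i and j to the same category. The function names and the toy predictions are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def co_association(labels: Sequence[int]) -> List[List[float]]:
    """Return the 0/1 matrix whose (i, j) entry is 1 when samples i and j share a label."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mean_co_association(labelings: Sequence[Sequence[int]]) -> List[List[float]]:
    """Average the co-association matrices of several labelings of the same samples."""
    if not labelings:
        raise ValueError("at least one labeling is required")
    n = len(labelings[0])
    if any(len(labels) != n for labels in labelings):
        raise ValueError("all labelings must cover the same samples")

    total = [[0.0] * n for _ in range(n)]
    # Each labeling contributes an independent matrix, so this loop is the
    # part that could be distributed across workers (an assumption about how
    # the parallelism mentioned in the abstract might be exploited).
    for labels in labelings:
        matrix = co_association(labels)
        for i in range(n):
            for j in range(n):
                total[i][j] += matrix[i][j]

    k = len(labelings)
    return [[value / k for value in row] for row in total]


# Hypothetical binary predictions of three base classifiers on four documents.
predictions = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
]
M = mean_co_association(predictions)
```

In this toy case, documents 0 and 1 are grouped together by two of the three classifiers, so their entry in `M` is 2/3, while documents 0 and 2 are never grouped together and their entry is 0.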


Text Categorization, Ensemble Classification, Consensus Clustering, Text Mining





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Luís Moreira-Matias (1, 2)
  • João Mendes-Moreira (1, 2)
  • João Gama (2, 3)
  • Pavel Brazdil (2, 3)
  1. Departamento de Engenharia Informática, Faculdade de Engenharia, Universidade do Porto, Porto, Portugal
  2. LIAAD-INESC Porto L.A., Porto, Portugal
  3. Faculdade de Economia, Universidade do Porto, Porto, Portugal
