Clustering Ensemble for Spam Filtering

  • Santiago Porras
  • Bruno Baruque
  • Belén Vaquerizo
  • Emilio Corchado
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6679)


One of the main problems that modern e-mail systems face is managing the large volume of spam or junk mail they receive. These systems are expected to distinguish between legitimate mail and spam in order to present the end user with as much relevant information as possible. This study presents a novel hybrid intelligent system that uses both unsupervised and supervised learning and can be easily adapted for use in an individual or collaborative system. The system divides the spam filtering problem into two stages. First, it partitions the input data space into groups of similar messages. Then it generates several simple classifiers, each of which classifies the messages contained in one of the previously determined groups. In this way the efficiency of each classifier increases, as each one can specialize in separating spam from a certain type of related messages. The hybrid system has been tested on a real e-mail database, and a comparison of its results with those obtained from other common classification methods is also included. The novel hybrid technique proves to be effective for the problem under study.
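
The two-stage idea described in the abstract (partition the input space, then train one simple classifier per partition) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: k-means with farthest-first initialisation stands in for the unsupervised stage, a nearest-class-mean rule stands in for the "simple classifiers", and the two-dimensional toy data and the names `ClusterThenClassify` and `NearestMeanClassifier` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of feature tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def nearest(p, centroids):
    """Index of the centroid closest to p (the first one wins on ties)."""
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))


def kmeans(points, k, iters=20):
    """Plain k-means; assumed stand-in for the paper's clustering stage."""
    # Deterministic farthest-first initialisation (an illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a group happens to be empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


class NearestMeanClassifier:
    """Hypothetical 'simple classifier': predicts the closest class mean."""

    def fit(self, X, y):
        self.means = {
            label: mean([x for x, l in zip(X, y) if l == label])
            for label in sorted(set(y))  # sorted for reproducible tie-breaks
        }
        return self

    def predict(self, x):
        return min(self.means, key=lambda label: dist(x, self.means[label]))


class ClusterThenClassify:
    """Stage 1: partition the inputs. Stage 2: one classifier per part."""

    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.centroids = kmeans(X, self.k)
        parts = {}
        for x, label in zip(X, y):
            xs, ys = parts.setdefault(nearest(x, self.centroids), ([], []))
            xs.append(x)
            ys.append(label)
        self.experts = {i: NearestMeanClassifier().fit(xs, ys)
                        for i, (xs, ys) in parts.items()}
        # Global model, used only if a message lands in an untrained part.
        self.fallback = NearestMeanClassifier().fit(X, y)
        return self

    def predict(self, x):
        part = nearest(x, self.centroids)
        return self.experts.get(part, self.fallback).predict(x)


def accuracy(model, X, y):
    return sum(model.predict(x) == label for x, label in zip(X, y)) / len(y)


# Toy data (invented): two groups of messages in which the feature that
# separates spam from ham points in opposite directions.
data = [
    ((0.0, 0.0), "ham"), ((1.0, 0.0), "ham"), ((0.5, 0.25), "ham"),
    ((0.0, 3.0), "spam"), ((1.0, 3.0), "spam"), ((0.5, 3.25), "spam"),
    ((10.0, 3.0), "ham"), ((11.0, 3.0), "ham"), ((10.5, 3.25), "ham"),
    ((10.0, 0.0), "spam"), ((11.0, 0.0), "spam"), ((10.5, 0.25), "spam"),
]
X = [x for x, _ in data]
y = [label for _, label in data]

hybrid = ClusterThenClassify(k=2).fit(X, y)
single = NearestMeanClassifier().fit(X, y)
print("hybrid accuracy:", accuracy(hybrid, X, y))
print("single accuracy:", accuracy(single, X, y))
```

On this deliberately constructed data, the single global classifier cannot separate the classes because both class means coincide, whereas each per-cluster classifier only has to handle one group of related messages. This says nothing about the results reported in the paper; it only shows why specialization can help.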


Concept Drift, Cluster Ensemble, Collaborative System, Hybrid Intelligent System, Apache Software Foundation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Santiago Porras (1)
  • Bruno Baruque (1)
  • Belén Vaquerizo (1)
  • Emilio Corchado (2)
  1. Civil Engineering Department, University of Burgos, Spain
  2. Departamento de Informática y Automática, Universidad de Salamanca, Spain
