Robust personalizable spam filtering via local and global discrimination modeling

Knowledge and Information Systems

Abstract

Content-based e-mail spam filtering continues to be a challenging machine learning problem. Usually, the joint distribution of e-mails and labels changes from user to user and over time, and the training data are poor representatives of the true distribution. E-mail service providers have two options for automatic spam filtering at the service side: a single global filter for all users or a personalized filter for each user. The practical usefulness of these options, however, depends upon the robustness and scalability of the filter. In this paper, we address these challenges by presenting a robust personalizable spam filter based on local and global discrimination modeling. Our filter exploits highly discriminating content terms, identified by their relative risk, to transform the input space into a two-dimensional feature space. This transformation is obtained by linearly pooling the discrimination information provided by each term for spam or non-spam classification. Following this local model, a linear discriminant is learned in the feature space for classification. We also present a strategy for personalizing the local and global models using unlabeled e-mails, without requiring user feedback. Experimental evaluations and comparisons are presented for global and personalized spam filtering, for varying distribution shift, for handling the problem of gray e-mails, on unseen e-mails, and with varying filter size. The results demonstrate the robustness and effectiveness of our filter and its suitability for global and personalized spam filtering at the service side.
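The pipeline outlined in the abstract (relative-risk term selection, pooling into two scores, then a linear discriminant) can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the smoothed document-frequency estimate of relative risk, the log weighting, the `min_risk` threshold, the perceptron used as the linear discriminant, and all function names. It is not the authors' implementation.

```python
import math
from collections import Counter

def term_weights(docs, labels, min_risk=1.5):
    """Select discriminating terms by relative risk (assumed estimator).

    Returns two dicts, term -> log relative risk, for spam-leaning and
    non-spam-leaning terms. Labels: 1 = spam, 0 = non-spam.
    """
    n_spam = sum(1 for y in labels if y == 1)
    n_ham = len(labels) - n_spam
    df_spam, df_ham = Counter(), Counter()
    for doc, y in zip(docs, labels):
        (df_spam if y == 1 else df_ham).update(set(doc))
    spam_terms, ham_terms = {}, {}
    for term in set(df_spam) | set(df_ham):
        # Smoothed estimates of p(term | class); the smoothing is an assumption.
        p_spam = (df_spam[term] + 1) / (n_spam + 2)
        p_ham = (df_ham[term] + 1) / (n_ham + 2)
        risk = p_spam / p_ham
        if risk >= min_risk:
            spam_terms[term] = math.log(risk)
        elif 1 / risk >= min_risk:
            ham_terms[term] = math.log(1 / risk)
    return spam_terms, ham_terms

def to_features(doc, spam_terms, ham_terms):
    """Linearly pool term weights into a (spam_score, ham_score) pair."""
    present = set(doc)
    spam_score = sum(w for t, w in spam_terms.items() if t in present)
    ham_score = sum(w for t, w in ham_terms.items() if t in present)
    return spam_score, ham_score

def fit_discriminant(features, labels, epochs=100, lr=0.1):
    """Perceptron stand-in for the linear discriminant in the 2-D space."""
    w1, w2, b = 1.0, -1.0, 0.0
    for _ in range(epochs):
        for (s, h), y in zip(features, labels):
            pred = 1 if w1 * s + w2 * h + b > 0 else 0
            err = y - pred  # zero when the prediction is already correct
            w1 += lr * err * s
            w2 += lr * err * h
            b += lr * err
    return w1, w2, b

def classify(doc, spam_terms, ham_terms, model):
    """Return 1 (spam) or 0 (non-spam) for a tokenized e-mail."""
    s, h = to_features(doc, spam_terms, ham_terms)
    w1, w2, b = model
    return 1 if w1 * s + w2 * h + b > 0 else 0

# Toy, made-up training data: tokenized e-mails with labels.
docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "watches"],
    ["meeting", "agenda", "monday"],
    ["project", "meeting", "notes"],
]
labels = [1, 1, 0, 0]

spam_terms, ham_terms = term_weights(docs, labels)
features = [to_features(d, spam_terms, ham_terms) for d in docs]
model = fit_discriminant(features, labels)
```

With these toy e-mails, `classify(["cheap", "buy", "now"], spam_terms, ham_terms, model)` falls on the spam side and `classify(["monday", "meeting"], spam_terms, ham_terms, model)` on the non-spam side. Because every e-mail is reduced to two pooled scores, the discriminant has only three parameters regardless of vocabulary size.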

Author information

Corresponding author

Correspondence to Asim Karim.

About this article

Cite this article

Junejo, K.N., Karim, A. Robust personalizable spam filtering via local and global discrimination modeling. Knowl Inf Syst 34, 299–334 (2013). https://doi.org/10.1007/s10115-012-0477-x

  • DOI: https://doi.org/10.1007/s10115-012-0477-x
