Abstract
This paper investigates a machine learning algorithm and the Bayesian classifier in the context of spam filtering. It shows the advantage of using Reverse Polish Notation (RPN) expressions with feature extraction over the traditional Naïve Bayesian classifier for spam detection, assuming the same features. The performance of the two approaches is evaluated on a public corpus and a recent private spam collection. The system based on RPN LGP (Linear Genetic Programming) gave better results than two widely used open-source Bayesian spam filters.
Keywords
- Reverse Polish notation (RPN)
- Naïve Bayesian classifier
- Spam detection
- Linear genetic programming (LGP)
- Genetic programming (GP)
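To make the RPN idea in the abstract concrete, the following is a minimal sketch of how a postfix expression over extracted message features could be evaluated with a stack and turned into a spam decision. It is an illustration under stated assumptions, not the authors' system: the operator set, the feature names (`num_links`, `caps_ratio`), the example rule, and the decision threshold are all hypothetical, and the evolutionary (LGP) search that would generate such rules is not shown.

```python
from typing import Callable, Dict, List

# Hypothetical operator set; the instruction set used in the paper is not given here.
BINARY_OPS: Dict[str, Callable[[float, float], float]] = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    ">": lambda a, b: 1.0 if a > b else 0.0,
}


def eval_rpn(tokens: List[str], features: Dict[str, float]) -> float:
    """Evaluate a postfix (RPN) expression with a stack.

    A token is an operator, a feature name, or a numeric constant.
    """
    stack: List[float] = []
    for tok in tokens:
        if tok in BINARY_OPS:
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} needs two operands")
            b = stack.pop()  # right operand is on top of the stack
            a = stack.pop()
            stack.append(BINARY_OPS[tok](a, b))
        elif tok in features:
            stack.append(features[tok])
        else:
            stack.append(float(tok))  # raises ValueError for unknown tokens
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]


def is_spam(tokens: List[str], features: Dict[str, float],
            threshold: float = 0.0) -> bool:
    """Classify as spam when the expression value exceeds an assumed threshold."""
    return eval_rpn(tokens, features) > threshold


# Hypothetical rule, infix form: (num_links * 2 + caps_ratio) > 3
rule = ["num_links", "2", "*", "caps_ratio", "+", "3", ">"]
spam_like = {"num_links": 2.0, "caps_ratio": 0.5}
ham_like = {"num_links": 0.0, "caps_ratio": 0.1}

print(is_spam(rule, spam_like))  # True: 2*2 + 0.5 = 4.5, and 4.5 > 3
print(is_spam(rule, ham_like))   # False: 0*2 + 0.1 = 0.1, and 0.1 > 3 fails
```

Because a postfix expression is a flat token list with no parentheses, it can be stored and manipulated as a linear sequence, which is one plausible reason for pairing RPN with a linear genetic programming representation.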
Acknowledgement
Acknowledgements go to my Ph.D. supervisor, Dr Vitezlav Nezval. Thanks also to Tom Fawcett, who answered my email query about Bayesian classifiers and RPN. This work was supported by the Grant Agency of the Czech Republic (GACR P103/15/06700S), by the Ministry of Education of the Czech Republic through research project NPU I No. MSMT-7778/2014, and by the European Regional Development Fund under the Project CEBIA-Tech No. CZ.1.05/2.1.00/03.0089.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Meli, C., Oplatkova, Z.K. (2016). SPAM Detection: Naïve Bayesian Classification and RPN Expression-Based LGP Approaches Compared. In: Silhavy, R., Senkerik, R., Oplatkova, Z., Silhavy, P., Prokopova, Z. (eds) Software Engineering Perspectives and Application in Intelligent Systems. CSOC 2016. Advances in Intelligent Systems and Computing, vol 465. Springer, Cham. https://doi.org/10.1007/978-3-319-33622-0_36
DOI: https://doi.org/10.1007/978-3-319-33622-0_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-33620-6
Online ISBN: 978-3-319-33622-0
eBook Packages: Engineering, Engineering (R0)