Spam E-Mail Classification Based on the IFWB Algorithm

Jou, Chichang

doi:10.1007/978-3-642-36546-1_33

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7802)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

Spam e-mail has been a recognized problem for some time, and most solutions rely on spam e-mail classification and filtering. However, the content of spam e-mails drifts as new concepts and social events emerge. As a result, several spam classifiers perform effectively when their models are first established but deteriorate over time, so a learning mechanism is required to adjust the classification parameters for new and old e-mails. In addition, because spam is so widespread, spam e-mails outnumber legitimate ones; consequently, most classifiers achieve high recall for spam e-mails but low recall for legitimate e-mails. Building on the Bayesian algorithm, we propose an incremental forgetting weighted algorithm with a misclassification cost mechanism, which extracts features by IGICF (Information Gain and Inverse Class Frequency), to address the problems of concept drift and data skew in spam e-mail classification. We implemented the algorithm and performed detailed tests of the mechanism's effectiveness.
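
The abstract names three ingredients (incremental Bayesian learning, forgetting, and a misclassification cost) but does not give their formulas. The following is therefore only a generic sketch of how such ingredients can fit together, not the paper's IFWB algorithm: a multinomial naive Bayes classifier whose counts are multiplied by a decay factor before each update, and whose decision compares cost-weighted posteriors. The class name `ForgettingNaiveBayes` and the parameters `decay`, `spam_cost`, `ham_cost`, and `alpha` are assumptions made for illustration, and IGICF feature selection is omitted because its definition is not stated here.

```python
import math
from collections import defaultdict


class ForgettingNaiveBayes:
    """Illustrative sketch only: naive Bayes with exponentially decayed counts."""

    def __init__(self, decay=0.99, spam_cost=1.0, ham_cost=5.0, alpha=1.0):
        self.decay = decay  # assumed forgetting factor in (0, 1]; 1.0 means no forgetting
        # cost[c] is the (assumed) cost of misclassifying an e-mail whose true class is c
        self.cost = {"spam": spam_cost, "ham": ham_cost}
        self.alpha = alpha  # Laplace smoothing constant
        self.class_weight = {"spam": 0.0, "ham": 0.0}
        self.token_weight = {"spam": defaultdict(float), "ham": defaultdict(float)}
        self.token_total = {"spam": 0.0, "ham": 0.0}
        self.vocab = set()

    def update(self, tokens, label):
        """Incrementally learn one labelled e-mail, fading older evidence first."""
        for c in self.class_weight:
            self.class_weight[c] *= self.decay
            self.token_total[c] *= self.decay
            for t in self.token_weight[c]:
                self.token_weight[c][t] *= self.decay
        self.class_weight[label] += 1.0
        for t in tokens:
            self.token_weight[label][t] += 1.0
            self.token_total[label] += 1.0
            self.vocab.add(t)

    def log_scores(self, tokens):
        """Return unnormalized log posteriors for both classes."""
        total = sum(self.class_weight.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.class_weight:
            score = math.log((self.class_weight[c] + self.alpha) / (total + 2 * self.alpha))
            denom = self.token_total[c] + self.alpha * vocab_size
            for t in tokens:
                if t in self.vocab:  # tokens never seen in training are ignored
                    weight = self.token_weight[c].get(t, 0.0)
                    score += math.log((weight + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, tokens):
        """Pick the label with the lower expected misclassification cost."""
        s = self.log_scores(tokens)
        risk_if_spam = math.log(self.cost["ham"]) + s["ham"]    # wrong when truly ham
        risk_if_ham = math.log(self.cost["spam"]) + s["spam"]   # wrong when truly spam
        return "spam" if risk_if_spam < risk_if_ham else "ham"


if __name__ == "__main__":
    clf = ForgettingNaiveBayes(decay=0.9)
    clf.update(["win", "cash", "now"], "spam")
    clf.update(["meeting", "agenda", "notes"], "ham")
    print(clf.predict(["win", "cash"]))
```

In this sketch, lowering `decay` makes the model track recent e-mails more closely, which is one simple way to respond to concept drift, while raising `ham_cost` relative to `spam_cost` makes the classifier more reluctant to label a message as spam, which is one simple way to protect recall on the minority legitimate class.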

Author information

Authors and Affiliations

Department of Information Management, Tamkang University, Tamsui, New Taipei City, Taiwan, 25137
Chichang Jou

Editor information

Editors and Affiliations

Faculty of Computer Science and Information Systems, Department of Software Engineering, Universiti Teknologi Malaysia, 81310, Johor Bahru, Johor, Malaysia
Ali Selamat & Habibollah Haron
Institute of Informatics, Division of Knowledge Management Systems, Wrocław University of Technology, Str. Wybrzeże Wyspiańskiego 27, 50-370, Wrocław, Poland
Ngoc Thanh Nguyen

About this paper

Cite this paper

Jou, C. (2013). Spam E-Mail Classification Based on the IFWB Algorithm. In: Selamat, A., Nguyen, N.T., Haron, H. (eds) Intelligent Information and Database Systems. ACIIDS 2013. Lecture Notes in Computer Science, vol 7802. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36546-1_33

DOI: https://doi.org/10.1007/978-3-642-36546-1_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36545-4
Online ISBN: 978-3-642-36546-1
eBook Packages: Computer Science, Computer Science (R0)
