Near-Duplicate Mail Detection Based on URL Information for Spam Filtering

Yeh, Chun-Chao; Lin, Chia-Hui

doi:10.1007/11919568_84

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 3961)

Included in the following conference series:

International Conference on Information Networking

Abstract

Because spammers change their techniques rapidly to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. In the literature, many proposed spam detection schemes use similar strategies based on supervised classification techniques such as naive Bayesian, SVM, and k-NN. Only a few works, however, address the strategy of detecting duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies to counter the tricks spammers may use to avoid detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. Using thousands of real mails collected as test data, our experimental results show that the proposed strategy outperforms the others. Excluding compulsory misses, over 97% of near-duplicate mails are detected correctly.
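The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the general idea of comparing mails by their URL information, not the authors' scheme. It assumes that each mail is reduced to the set of host names in its http(s) links and that two mails count as near duplicates when the Jaccard similarity of those sets reaches a threshold. The names `url_hosts`, `jaccard`, `is_near_duplicate` and the threshold value 0.5 are hypothetical choices made for illustration.

```python
import re
from urllib.parse import urlsplit

# Simple pattern for http(s) links in plain text (an illustrative assumption,
# not a full URL grammar).
URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def url_hosts(body: str) -> frozenset:
    """Return the set of lower-cased host names of the http(s) URLs in a mail body."""
    hosts = set()
    for candidate in URL_PATTERN.findall(body):
        try:
            host = urlsplit(candidate).hostname
        except ValueError:
            # Malformed URL (for example an unterminated IPv6 literal): skip it.
            continue
        if host:
            hosts.add(host.lower())
    return frozenset(hosts)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined here as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(body_a: str, body_b: str, threshold: float = 0.5) -> bool:
    """Flag two mails as near duplicates when their URL host sets overlap enough."""
    return jaccard(url_hosts(body_a), url_hosts(body_b)) >= threshold


if __name__ == "__main__":
    mail_a = "Cheap meds!!! visit http://pills.example.com/buy?id=123 now"
    mail_b = "Ch3ap m3ds... click http://pills.example.com/buy?id=987 today"
    mail_c = "Meeting notes are at https://wiki.example.org/notes"
    print(is_near_duplicate(mail_a, mail_b))  # same host despite altered text and query
    print(is_near_duplicate(mail_a, mail_c))  # unrelated hosts
```

Comparing host names rather than full URLs is one possible way to tolerate per-recipient query strings and obfuscated body text. In this sketch a mail that contains no URLs never matches anything, so it would have to be handled by a different detector.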

Author information

Authors and Affiliations

Department of Computer Science, National Taiwan Ocean University, Taiwan
Chun-Chao Yeh & Chia-Hui Lin

Editor information

Editors and Affiliations

Department of Information and Communications Engineering, Hankuk University of Foreign Studies, Imun-dong, Dongdaemun-Gu, 130-790, Seoul, Korea
Ilyoung Chong
Kyusyu Institute of Technology, 680-4, Kawazu, Japan
Kenji Kawahara

Cite this paper

Yeh, CC., Lin, CH. (2006). Near-Duplicate Mail Detection Based on URL Information for Spam Filtering. In: Chong, I., Kawahara, K. (eds) Information Networking. Advances in Data Communications and Wireless Networks. ICOIN 2006. Lecture Notes in Computer Science, vol 3961. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11919568_84

DOI: https://doi.org/10.1007/11919568_84
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-48563-6
Online ISBN: 978-3-540-48564-3
eBook Packages: Computer Science (R0)
