Near-Duplicate Mail Detection Based on URL Information for Spam Filtering
Because spam techniques change quickly to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. In the literature, many proposed spam detection schemes use similar strategies based on supervised classification techniques such as naive Bayesian, SVM, and k-NN. Only a few works, however, use a strategy based on detecting duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies to counter the tricks spammers may use to avoid detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. Using thousands of real mails that we collected as test data, our experimental results show that the proposed strategy outperforms the others. When compulsory misses are excluded, over 97% of near-duplicate mails are detected correctly.
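As a rough illustration of the general idea of comparing mails by their URL information, the following minimal sketch extracts URLs from two mail bodies and measures the overlap of the resulting sets. It is not the scheme proposed in the paper, whose details the abstract does not give. The function names, the choice of host plus path as the feature, the use of Jaccard similarity, and the 0.5 threshold are all assumptions made for this example.

```python
import re
from urllib.parse import urlsplit

# Simple pattern for http/https URLs in plain text (an assumption of this
# sketch; real mail parsing would also need to handle HTML and encodings).
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def url_features(mail_text):
    """Return the set of normalized URL features (host + path) in a mail body."""
    features = set()
    for raw in URL_RE.findall(mail_text):
        try:
            parts = urlsplit(raw.rstrip(".,;"))  # drop trailing punctuation
        except ValueError:
            continue  # skip malformed URLs
        host = (parts.hostname or "").lower()
        if host:
            # Assumption: query strings are ignored because they are easy
            # to randomize from one copy of a mail to the next.
            features.add(host + parts.path.rstrip("/"))
    return features


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(mail_a, mail_b, threshold=0.5):
    """Flag two mails as near duplicates when their URL sets overlap enough."""
    return jaccard(url_features(mail_a), url_features(mail_b)) >= threshold


if __name__ == "__main__":
    a = "Buy now at http://Example.com/offer?id=123 today!"
    b = "Hot deal!!! Visit http://example.com/offer?id=987."
    c = "Meeting notes are at https://intranet.example.org/notes"
    print(is_near_duplicate(a, b))
    print(is_near_duplicate(a, c))
```

In this sketch, two mails whose visible text differs but which advertise the same site are treated as near duplicates, while a mail with no URLs never matches anything. A practical system would have to decide separately how to handle mails without URLs.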
Keywords: Internet User, Internet Resource, Spam Detection, Spam Message, White List
- 3. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to filtering junk E-Mail. In: Proc. of AAAI Workshop on Learning for Text Categorization, Madison, Wisconsin, July 1998, pp. 55–62 (1998)
- 4. Graham, P.: A plan for spam (August 2002), http://www.paulgraham.com/spam.html
- 5. Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C., Stamatopoulos, P.: Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach. In: Proc. of the PKDD workshop on Machine Learning and Textual Information Access, pp. 1–13 (2000)
- 8. Carreras, X., Marquez, L.: Boosting trees for anti-Spam email filtering. In: Proc. of Euro Conference on Recent Advances in Natural Language Processing (RANLP 2001) (September 2001)
- 9. Hulten, G., Penta, A., Seshadrinathan, G., Mishra, M.: Trends in spam products and methods. In: Proc. of First Conference on Email and Anti-Spam (CEAS) (2004)
- 10. Machlis, S.: Uh-oh: spam’s getting more sophisticated. Computerworld (January 17, 2003), available at http://www.computerworld.com
- 11. Graham-Cumming, J.: How to beat an adaptive spam filter. In: Proc. of MIT Spam Conference (2004)
- 12. Wittel, G.L., Wu, S.F.: On attacking statistical spam filters. In: Proc. of First Conference on Email and Anti-Spam (CEAS) (2004)
- 13. Distributed Checksum Clearinghouse (DCC). Available at: http://www.rhyolite.com/anti-spam/dcc/
- 15. Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: local algorithms for document fingerprinting. In: Proc. of SIGMOD 2003, pp. 76–85 (2003)
- 16. Yeh, C.-C., Yeh, N.-W.: Octet histogram-based near duplicate mail detection for spam filtering. In: Proc. of IEEE-EEE05-MEM 2005, Hong Kong, pp. 14–20 (2005)