Near-Duplicate Mail Detection Based on URL Information for Spam Filtering

  • Chun-Chao Yeh
  • Chia-Hui Lin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3961)

Abstract

Because spammers rapidly change their techniques to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. In the literature, many proposed spam detection schemes use similar strategies based on supervised classification techniques such as naive Bayesian, SVM, and K-NN. Only a few works pursue the strategy of detecting duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies to counter the tricks spammers may use to avoid detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. Using thousands of real mails collected as test data, our experimental results show that the proposed strategy outperforms the others. Excluding compulsory misses, over 97% of near-duplicate mails are detected correctly.
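The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the general idea of comparing mails by their URL information. It assumes that each mail is reduced to the set of host names found in its body and that two mails count as near-duplicates when the Jaccard similarity of those sets reaches a threshold. The function names, the regular expression, the choice of host names as the feature, and the 0.5 threshold are all illustrative assumptions, not the authors' scheme.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: grabs http/https URLs up to whitespace or a quote/bracket.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_keys(body):
    """Return the set of lowercase host names of all URLs found in a mail body."""
    keys = set()
    for url in URL_RE.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URL (for example a broken IPv6 literal): skip it.
            continue
        if host:
            keys.add(host.rstrip(".").lower())
    return frozenset(keys)


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(mail_a, mail_b, threshold=0.5):
    """Assumed decision rule: URL host sets overlap at or above the threshold."""
    return jaccard(url_keys(mail_a), url_keys(mail_b)) >= threshold


if __name__ == "__main__":
    first = "Cheap pills at http://shop.example.com/buy?id=123 now!"
    second = "Ch3ap p1lls!!! visit HTTP://SHOP.EXAMPLE.COM/buy?id=987"
    print(is_near_duplicate(first, second))
```

In this sketch the two sample mails differ in wording and in the URL query string but share the same host, so they are treated as near-duplicates. Mails that contain no URLs never match, which a real system would need to handle by other means.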

Keywords

Internet User, Internet Resource, Spam Detection, Spam Message, White List
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Chun-Chao Yeh (1)
  • Chia-Hui Lin (1)
  1. Department of Computer Science, National Taiwan Ocean University, Taiwan
