Near-Duplicate Mail Detection Based on URL Information for Spam Filtering

  • Conference paper
Information Networking. Advances in Data Communications and Wireless Networks (ICOIN 2006)

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 3961)

Abstract

Because spam techniques change quickly to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. Many of the spam detection schemes proposed in the literature use similar strategies based on supervised classification techniques such as naive Bayesian, SVM, and K-NN, while only a few works address the detection of duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies for countering the tricks spammers may use to avoid detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. Using thousands of real mails that we collected as test data, our experimental results show that the proposed strategy outperforms the others. Excluding compulsory misses, over 97% of near-duplicate mails are detected correctly.
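
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of comparing mails by their URL information. It assumes, purely for illustration, that each mail is reduced to the set of host names in its URLs and that two mails count as near duplicates when the Jaccard similarity of those sets reaches a threshold. The names `url_signature`, `jaccard`, and `is_near_duplicate`, the regular expression, and the threshold value are all hypothetical and are not taken from the paper.

```python
import re
from urllib.parse import urlsplit

# Assumption: a simple pattern for http/https URLs in plain-text mail bodies.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def url_signature(mail_body: str) -> frozenset:
    """Return the set of host names of all URLs found in a mail body."""
    hosts = set()
    for url in URL_PATTERN.findall(mail_body):
        host = urlsplit(url).hostname  # already lowercased by urlsplit
        if host:
            hosts.add(host.rstrip("."))
    return frozenset(hosts)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(mail_a: str, mail_b: str, threshold: float = 0.5) -> bool:
    """Illustrative decision rule: compare URL host sets against a threshold."""
    return jaccard(url_signature(mail_a), url_signature(mail_b)) >= threshold


if __name__ == "__main__":
    spam_1 = "Buy now at http://shop.example.com/offer?id=123 today!"
    spam_2 = "B-u-y n0w: http://shop.example.com/offer?id=987 !!!"
    ham = "Meeting notes are at https://intranet.example.org/notes"
    print(is_near_duplicate(spam_1, spam_2))
    print(is_near_duplicate(spam_1, ham))
```

In this sketch the obfuscated body text of the second mail does not matter, because only the advertised host is compared. Mails without any URL are never matched, which is one plausible reading of why some misses are unavoidable for a URL-based scheme.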

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yeh, CC., Lin, CH. (2006). Near-Duplicate Mail Detection Based on URL Information for Spam Filtering. In: Chong, I., Kawahara, K. (eds) Information Networking. Advances in Data Communications and Wireless Networks. ICOIN 2006. Lecture Notes in Computer Science, vol 3961. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11919568_84

  • DOI: https://doi.org/10.1007/11919568_84

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-48563-6

  • Online ISBN: 978-3-540-48564-3

  • eBook Packages: Computer Science, Computer Science (R0)
