Reference Data Sets for Spam Detection: Creation, Analysis, Propagation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8073)

Abstract

A reference set is a collection of network traffic data whose form and content allow an event or a group of events to be detected. Realistic and representative data sets based on real traffic can improve research in intrusion and anomaly detection. Creating reference sets raises a number of issues, such as the collection and storage of large volumes of data, the privacy of information, and the relevance of the collected events. Moreover, rare events are hard to analyse among background traffic and require specialised detection tools. One common problem that can be detected in network traffic is spam. This paper presents a methodology for creating a network traffic reference set for spam detection. The methodology covers the selection of significant features, the collection and storage of data, the analysis of the collected data, the enrichment of the data with additional events, and the propagation of the set. In addition, a hybrid classifier that detects spam at a relatively high level is presented.
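The abstract does not say how significant features are selected. As a rough illustration only, the sketch below ranks the features of labelled flow records by the largest Gini impurity reduction a single threshold can achieve. The Gini criterion, the feature names (`packets`, `bytes`, `duration_s`), and the toy records are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_gain(values, labels):
    """Largest impurity reduction over all thresholds on one numeric feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Try every distinct value except the largest as a "<= threshold" split.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        best = max(best, parent - weighted)
    return best


def rank_features(flows, labels):
    """Return (feature, gain) pairs sorted from most to least discriminative."""
    scores = {
        name: best_split_gain([flow[name] for flow in flows], labels)
        for name in flows[0]
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy flow records; real reference sets would hold many more.
flows = [
    {"packets": 12, "bytes": 900, "duration_s": 1.0},
    {"packets": 11, "bytes": 5200, "duration_s": 2.0},
    {"packets": 14, "bytes": 850, "duration_s": 3.0},
    {"packets": 15, "bytes": 4800, "duration_s": 4.0},
]
labels = ["spam", "ham", "spam", "ham"]

ranking = rank_features(flows, labels)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

In this toy data only `bytes` separates the two classes cleanly, so it ranks first. A real selection step would have to handle far larger data volumes, and the anonymisation needed to address the privacy concerns the abstract raises.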

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Luckner, M., Filasiak, R. (2013). Reference Data Sets for Spam Detection: Creation, Analysis, Propagation. In: Pan, JS., Polycarpou, M.M., Woźniak, M., de Carvalho, A.C.P.L.F., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2013. Lecture Notes in Computer Science, vol 8073. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40846-5_22

  • DOI: https://doi.org/10.1007/978-3-642-40846-5_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40845-8

  • Online ISBN: 978-3-642-40846-5

  • eBook Packages: Computer Science, Computer Science (R0)
