Reference Data Sets for Spam Detection: Creation, Analysis, Propagation

  • Marcin Luckner
  • Robert Filasiak
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8073)


A reference set is a set of data of network traffic whose form and content allows detecting an event or a group of events. Realistic and representative datasets based on real traffic can improve research in the fields of intruders and anomaly detection. Creating reference sets tackles a number of issues such as the collection and storage of large volumes of data, the privacy of information and the relevance of collected events. Moreover, rare events are hard to analyse among background traffic and need specialist detection tools. One of the common problems that can be detected in network traffic is spam. This paper presents the methodology for creating a network traffic reference set for spam detection. The methodology concerns the selection of significant features, the collection and storage of data, the analysis of the collected data, the enrichment of the data with additional events and the propagation of the set. Moreover, a hybrid classifier that detects spam on relatively high level is presented.


Reference sets Spam detection Flow analysis Anomaly detection Hybrid classifiers 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Behera, G.: Privacy preserving c4.5 using gini index, pp. 1–4 (March 2011)Google Scholar
  2. 2.
    Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth and Brooks, Monterey (1984)zbMATHGoogle Scholar
  3. 3.
    Deri, L.: nprobe: an open source netflow probe for gigabit networks. In: Proc. of Terena TNC 2003 (2003)Google Scholar
  4. 4.
    Fomenkov, M., Claffy, K.: Internet measurement data management challenges. In: Workshop on Research Data Lifecycle Management, Princeton, NJ (July 2011)Google Scholar
  5. 5.
    Grzenda, M.: Towards the reduction of data used for the classification of network flows. In: Corchado, E., Snášel, V., Abraham, A., Woźniak, M., Graña, M., Cho, S.-B. (eds.) HAIS 2012, Part II. LNCS, vol. 7209, pp. 68–77. Springer, Heidelberg (2012), CrossRefGoogle Scholar
  6. 6.
    Kim, H., Claffy, K., Fomenkov, M., Barman, D., Faloutsos, M., Lee, K.: Internet traffic classification demystified: myths, caveats, and the best practices. In: Proceedings of the 2008 ACM CoNEXT Conference, CoNEXT 2008, pp. 11:1–11:12. ACM, New York (2008)Google Scholar
  7. 7.
    Kobiersky, P., Korenek, J., Polcak, L.: Packet header analysis and field extraction for multigigabit networks, pp. 96–101 (April 2009)Google Scholar
  8. 8.
    Limwiwatkul, L., Rungsawang, A.: Distributed denial of service detection using tcp/ip header and traffic measurement analysis, vol. 1, pp. 605–610 (October 2004)Google Scholar
  9. 9.
    Moore, A., Crogan, M., Moore, A.W., Mary, Q., Zuev, D., Zuev, D., Crogan, M.L.: Discriminators for use in flow-based classification. Tech. rep. (2005)Google Scholar
  10. 10.
    Ouyang, T., Ray, S., Rabinovich, M., Allman, M.: Can network characteristics detect spam effectively in a stand-alone enterprise? In: Spring, N., Riley, G.F. (eds.) PAM 2011. LNCS, vol. 6579, pp. 92–101. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  11. 11.
    Schatzmann, D., Burkhart, M., Spyropoulos, T.: Flow-level characteristics of spam and ham (291) (August 2008)Google Scholar
  12. 12.
    Žádník, M., Michlovský, Z.: Is spam visible in flow-level statistics? Tech. rep. (2009),

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Marcin Luckner
    • 1
  • Robert Filasiak
    • 2
  1. 1.Faculty of Mathematics and Information ScienceWarsaw University of TechnologyWarsawPoland
  2. 2.Orange Labs PolandWarszawaPoland

Personalised recommendations